Bare Metal on Linux: llama.cpp on the OptiPlex 3040
A companion to the HP ENVY bare-metal guide, this covers the same goal — compiling llama.cpp from source for maximum CPU inference performance — on a Dell OptiPlex 3040 running Debian 13. The difference: this machine is Skylake. No AVX-512. That changes the compilation approach.
Hardware:
- Machine: Dell OptiPlex 3040
- CPU: Intel Core i5-6500 (Skylake, 4 cores / 4 threads, 3.2–3.6GHz)
- RAM: 16GB
- Storage: SSD
- OS: Debian 13, minimal Openbox + tint2
How Do You Verify AVX2 and FMA Support on Skylake?
Before building, confirm the CPU supports AVX, AVX2, and FMA:
grep -m1 'flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx|fma'
Expected output includes: avx, avx2, fma
The i5-6500 confirmed all three. Note: AVX-512 is NOT present — desktop Skylake never got it; it arrived with Skylake-X and the server parts.
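The one-liner is fine interactively; for use in a script (say, a pre-flight check before an automated build), a fail-fast version of the same test might look like this sketch:

```shell
#!/bin/sh
# Abort early if any required SIMD flag is missing from /proc/cpuinfo
flags=$(grep -m1 '^flags' /proc/cpuinfo)
for f in avx avx2 fma; do
  if ! echo "$flags" | grep -qw "$f"; then
    echo "missing CPU flag: $f" >&2
    exit 1
  fi
done
echo "CPU supports avx avx2 fma"
```

`grep -qw` matches whole words, so `avx` does not falsely match `avx2` or `avx_vnni`.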
What Is FMA and Why Does It Matter for LLM Inference?
FMA (Fused Multiply-Add) combines a multiply and add into a single CPU instruction with one rounding step instead of two. LLM inference is almost entirely dot products — weight × input, summed repeatedly — so FMA gives roughly 2x throughput on that workload vs non-FMA. Combined with AVX2’s 256-bit wide registers (8 floats at once), this is the key SIMD combination for CPU inference on this machine.
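As a back-of-envelope illustration of what AVX2 + FMA buys, you can compute a rough theoretical FP32 peak for this CPU. The figures below are simplifying assumptions (one 256-bit FMA issued per core per cycle at base clock; Skylake can actually dual-issue FMAs), not measured numbers:

```shell
# 8 FP32 lanes per 256-bit register x 2 FLOPs per FMA
# x 4 cores x 3.2 GHz base clock (all assumed values)
awk 'BEGIN { printf "%.1f GFLOPS\n", 8 * 2 * 4 * 3.2 }'
# prints: 204.8 GFLOPS
```

Real inference throughput lands far below this peak — quantized models are mostly memory-bandwidth bound — but the 2x FMA factor is visible right in the arithmetic.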
How Do You Install the Build Dependencies?
sudo apt install build-essential cmake git -y
This installs: gcc, g++, make, cmake, git
How Do You Set Up the Directory Structure?
Keeping llama.cpp and models in /opt rather than the home directory:
sudo mkdir -p /opt/llama.cpp
sudo mkdir -p /opt/models
sudo chown $USER:$USER /opt/llama.cpp
sudo chown $USER:$USER /opt/models
How Do You Clone the llama.cpp Repository?
--depth=1 skips the full git history for a faster clone:
git clone https://github.com/ggml-org/llama.cpp /opt/llama.cpp --depth=1
How Do You Compile llama.cpp for Skylake with AVX2 and FMA?
On a machine with a known CPU, explicit flags are better than GGML_NATIVE=ON. Targeting Skylake directly tells GCC to select every supported SIMD extension for this architecture:
cd /opt/llama.cpp && cmake -B build \
-DGGML_AVX=ON \
-DGGML_AVX2=ON \
-DGGML_FMA=ON \
-DGGML_F16C=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS="-march=skylake" \
-DCMAKE_C_FLAGS="-march=skylake"
cmake --build build --config Release -j4
-j4 uses all 4 cores for the build. Takes a few minutes.
The binary will be at: /opt/llama.cpp/build/bin/llama-cli
Which Model Should You Use on a 16GB AVX2 Machine?
With 16GB RAM and no AVX-512, a 3B parameter model at Q4_K_M quantization is the right balance — fast enough to be practical, small enough to leave headroom, and Q4_K_M gives the best quality-to-size tradeoff for CPU inference.
Model: Qwen2.5 3B Instruct Q4_K_M (1.93GB, single file)
wget -O /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
https://huggingface.co/bartowski/Qwen2.5-3B-Instruct-GGUF/resolve/main/Qwen2.5-3B-Instruct-Q4_K_M.gguf
You can push to a 7B model on 16GB RAM, but on a 4-thread Skylake at Q4_K_M you’ll see tokens/sec drop noticeably. Start with 3B, confirm performance is acceptable, then decide whether the quality gain of 7B is worth the slower response.
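To sanity-check whether a given model fits, a rough estimate is enough: Q4_K_M works out to about 0.65 bytes per parameter (the 1.93GB file above divided by 3B parameters), plus roughly 1GB for context and runtime overhead. Both constants are rule-of-thumb assumptions, not exact figures:

```shell
# Estimate resident RAM for a Q4_K_M model (assumed constants:
# ~0.65 bytes/param for weights, ~1 GB KV-cache/runtime overhead)
for params_b in 3 7; do
  awk -v p="$params_b" 'BEGIN { printf "%dB model: ~%.1f GB\n", p, p * 0.65 + 1 }'
done
```

Both results land comfortably inside 16GB; the 3B choice here is about generation speed, not memory.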
How Do You Set Up a Shell Alias to Launch the Model?
Add to ~/.bashrc:
alias qwen='/opt/llama.cpp/build/bin/llama-cli -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf -n 512 -c 2048 --threads 4 --temp 0.7 --repeat-penalty 1.1'
Then reload:
source ~/.bashrc
How Do You Run llama.cpp from the Command Line?
Basic prompt:
/opt/llama.cpp/build/bin/llama-cli -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf -p "Your prompt here"
Interactive chat mode:
/opt/llama.cpp/build/bin/llama-cli -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf -cnv -p "You are a helpful assistant."
Server mode — OpenAI-compatible API on port 8080:
/opt/llama.cpp/build/bin/llama-server \
-m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
-c 4096 --threads 4 --host 0.0.0.0 --port 8080
Server mode exposes an OpenAI-compatible REST API. Any client that targets the OpenAI API can point at http://localhost:8080 instead — no code changes required.
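With the server running, any HTTP client can hit the chat endpoint. A minimal curl example (the `max_tokens` value is kept small here just for a quick reply):

```shell
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Say hello in one short sentence."}
        ],
        "max_tokens": 64
      }'
```

The response comes back in the standard OpenAI chat-completion JSON shape, so existing client libraries parse it unchanged.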
Flag Reference
| Flag | Value | Purpose |
|---|---|---|
| -m | path to .gguf | Model file |
| --threads | 4 | Match physical core count — not logical threads |
| -c | 2048 | Context window (push to 4096 on 16GB if needed) |
| -n | 512 | Max tokens to generate per response |
| --temp | 0.7 | Temperature — lower is more focused, higher more creative |
| --repeat-penalty | 1.1 | Reduces repetition loops in output |
| -cnv | — | Conversation/chat mode with correct instruct template |
| -p | "..." | System prompt or initial prompt |
What Performance Can You Expect on a Skylake i5?
With AVX2 + FMA on the i5-6500 at Q4_K_M quantization: approximately 8–15 tokens/sec on the Qwen2.5 3B model. Response latency is practical for interactive use — not fast, but not painful.
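Those rates translate directly into wall-clock time for a full response. At the alias defaults above (-n 512), the observed range works out to:

```shell
# Seconds to generate 512 tokens at 15 and 8 tokens/sec respectively
awk 'BEGIN { n = 512; printf "%.0f-%.0f seconds\n", n / 15, n / 8 }'
# prints: 34-64 seconds
```

In practice most replies stop well before the 512-token cap, so typical latency sits near the low end.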
For context, the HP ENVY (11th gen i7 with AVX-512 VNNI) running an 8B model at the same quantization is noticeably faster per token. The Skylake machine compensates with 16GB RAM, the smaller model, and the cleaner /opt install layout for a dedicated box.
FAQ
Q: Why use explicit AVX2/FMA flags instead of GGML_NATIVE=ON?
GGML_NATIVE=ON auto-detects your CPU at compile time, which works well when you don’t know the exact hardware. On a known CPU like the i5-6500, explicit flags combined with -march=skylake give the compiler full context to optimize for that specific architecture without relying on autodetection.
Q: What is the difference between AVX2 and AVX-512 for inference?
AVX2 operates on 256-bit wide registers — 8 single-precision floats at once. AVX-512 doubles that to 512 bits — 16 floats — and adds VNNI instructions designed specifically for the integer dot products used in quantized inference. On this machine, AVX2 + FMA is the ceiling. It’s capable, but a post-Skylake CPU will outperform it on the same model.
Q: What model size fits comfortably in 16GB RAM?
A Q4_K_M quantized 3B model uses roughly 2GB; a 7B model at Q4_K_M uses roughly 4–5GB. Both fit with room to spare on 16GB. The 3B is recommended here because of the Skylake’s thread count — keeping generation fast enough for interactive use matters more than squeezing out maximum quality.
Q: What is server mode and when would you use it?
llama-server starts an HTTP server exposing an OpenAI-compatible REST API. Use it when you want to query the model programmatically, connect a local client tool, or access the model from another machine on your network. The --host 0.0.0.0 flag makes it reachable on your LAN; use 127.0.0.1 to restrict it to localhost only.
Q: Does llama.cpp inference on this machine require root or sudo?
No. The build process writes to /opt which you chown to your user. The model download and all inference commands run as a normal user.
Ben Santora - March 2026