← Posts

Bare Metal on Linux: llama.cpp on the OptiPlex 3040

A companion to the HP ENVY bare-metal guide, this covers the same goal — compiling llama.cpp from source for maximum CPU inference performance — on a Dell OptiPlex 3040 running Debian 13. The difference: this machine is Skylake. No AVX-512. That changes the compilation approach.

Hardware: Dell OptiPlex 3040, Intel Core i5-6500 (Skylake, 4 cores / 4 threads), 16GB RAM, Debian 13.

How Do You Verify AVX2 and FMA Support on Skylake?

Before building, confirm the CPU supports AVX, AVX2, and FMA:

grep -m1 'flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx|fma'

Expected output includes: avx, avx2, fma

The i5-6500 confirmed all three. Note: AVX-512 is NOT present — that arrived post-Skylake.

What Is FMA and Why Does It Matter for LLM Inference?

FMA (Fused Multiply-Add) combines a multiply and add into a single CPU instruction with one rounding step instead of two. LLM inference is almost entirely dot products — weight × input, summed repeatedly — so FMA gives roughly 2x throughput on that workload vs non-FMA. Combined with AVX2’s 256-bit wide registers (8 floats at once), this is the key SIMD combination for CPU inference on this machine.

How Do You Install the Build Dependencies?

sudo apt install build-essential cmake git -y

This installs: gcc, g++, make, cmake, git

How Do You Set Up the Directory Structure?

Keeping llama.cpp and models in /opt rather than the home directory:

sudo mkdir -p /opt/llama.cpp
sudo mkdir -p /opt/models
sudo chown $USER:$USER /opt/llama.cpp
sudo chown $USER:$USER /opt/models

How Do You Clone the llama.cpp Repository?

--depth=1 skips the full git history for a faster clone:

git clone https://github.com/ggml-org/llama.cpp /opt/llama.cpp --depth=1

How Do You Compile llama.cpp for Skylake with AVX2 and FMA?

On a machine with a known CPU, explicit flags are better than GGML_NATIVE=ON. Targeting Skylake directly tells GCC to select every supported SIMD extension for this architecture:

cd /opt/llama.cpp && cmake -B build \
  -DGGML_AVX=ON \
  -DGGML_AVX2=ON \
  -DGGML_FMA=ON \
  -DGGML_F16C=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-march=skylake" \
  -DCMAKE_C_FLAGS="-march=skylake"

cmake --build build --config Release -j4

-j4 uses all 4 cores for the build. Takes a few minutes.

The binary will be at: /opt/llama.cpp/build/bin/llama-cli

Which Model Should You Use on a 16GB AVX2 Machine?

With 16GB RAM and no AVX-512, a 3B parameter model at Q4_K_M quantization is the right balance — fast enough to be practical, small enough to leave headroom, and Q4_K_M gives the best quality-to-size tradeoff for CPU inference.

Model: Qwen2.5 3B Instruct Q4_K_M (1.93GB, single file)

wget -O /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/bartowski/Qwen2.5-3B-Instruct-GGUF/resolve/main/Qwen2.5-3B-Instruct-Q4_K_M.gguf
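A failed or rate-limited download often leaves an HTML error page on disk instead of a model. GGUF files begin with the 4-byte magic string "GGUF", so a quick check catches that before you try to load it:

```shell
# Print the size and the first four bytes of the download.
# A valid model prints "GGUF"; anything else means a bad download.
ls -lh /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf
head -c4 /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf; echo
```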

You can push to a 7B model on 16GB RAM, but on a 4-thread Skylake at Q4_K_M you’ll see tokens/sec drop noticeably. Start with 3B, confirm performance is acceptable, then decide whether the quality gain of 7B is worth the slower response.

How Do You Set Up a Shell Alias to Launch the Model?

Add to ~/.bashrc:

alias qwen='/opt/llama.cpp/build/bin/llama-cli -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf -n 512 -c 2048 --threads 4 --temp 0.7 --repeat-penalty 1.1'

Then reload:

source ~/.bashrc

How Do You Run llama.cpp from the Command Line?

Basic prompt:

llama-cli -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf -p "Your prompt here"

Interactive chat mode:

llama-cli -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf -cnv -p "You are a helpful assistant."

Server mode — OpenAI-compatible API on port 8080:

/opt/llama.cpp/build/bin/llama-server \
  -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  -c 4096 --threads 4 --host 0.0.0.0 --port 8080

Server mode exposes an OpenAI-compatible REST API. Any client that targets the OpenAI API can point at http://localhost:8080 instead; typically only the base URL needs to change.
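A quick way to exercise the API from another terminal, assuming the server above is running (the prompt text here is just an example):

```shell
# Minimal OpenAI-style chat request against the local llama-server.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'
```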

Flag Reference

Flag              Value           Purpose
-m                path to .gguf   Model file
--threads         4               Match physical core count, not logical threads
-c                2048            Context window (push to 4096 on 16GB if needed)
-n                512             Max tokens to generate per response
--temp            0.7             Temperature: lower is more focused, higher more creative
--repeat-penalty  1.1             Reduces repetition loops in output
-cnv              (no value)      Conversation/chat mode with the correct instruct template
-p                "..."           System prompt or initial prompt

What Performance Can You Expect on a Skylake i5?

With AVX2 + FMA on the i5-6500 at Q4_K_M quantization: approximately 8–15 tokens/sec on the Qwen2.5 3B model. Response latency is practical for interactive use — not fast, but not painful.
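Rather than trusting estimates, you can measure your own box with llama-bench, which is built alongside llama-cli. A sketch of a run, using the paths set up above (the exact results table varies by build):

```shell
# Benchmark prompt processing and generation throughput on 4 threads.
/opt/llama.cpp/build/bin/llama-bench \
  -m /opt/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  -t 4
```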

For context, the HP ENVY (11th-gen i7 with AVX-512 VNNI) running an 8B model at the same quantization is noticeably faster per token. The Skylake machine compensates by running the smaller 3B model, which keeps interactive latency acceptable; the 16GB of RAM and the /opt layout simply make it a tidy dedicated box.

FAQ

Q: Why use explicit AVX2/FMA flags instead of GGML_NATIVE=ON? GGML_NATIVE=ON auto-detects your CPU at compile time, which works well when you don’t know the exact hardware. On a known CPU like the i5-6500, explicit flags combined with -march=skylake give the compiler full context to optimize for that specific architecture without relying on autodetection.

Q: What is the difference between AVX2 and AVX-512 for inference? AVX2 operates on 256-bit wide registers — 8 single-precision floats at once. AVX-512 doubles that to 512-bit — 16 floats — and adds VNNI instructions specifically designed for integer dot products used in quantized inference. On this machine, AVX2 + FMA is the ceiling. It’s capable, but a post-Skylake CPU will outperform it on the same model.

Q: What model size fits comfortably in 16GB RAM? A Q4_K_M quantized 3B model uses roughly 2GB. A 7B model at Q4_K_M uses roughly 4–5GB. Both fit with room to spare on 16GB. The 3B is recommended here for the Skylake’s thread count — keeping generation fast enough for interactive use is more practical than maximum quality.
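The "roughly 2GB" and "roughly 4–5GB" figures can be sanity-checked with a rule of thumb: Q4_K_M averages around 4.8 bits per weight (an approximation, not a spec value; real files add metadata and some higher-precision tensors):

```shell
# Back-of-envelope model size: params * bits-per-weight / 8.
# ~4.8 bits/weight for Q4_K_M is an approximation, so expect some slack.
awk 'BEGIN { printf "3B @ Q4_K_M ~ %.1f GB\n", 3e9 * 4.8 / 8 / 1e9 }'
awk 'BEGIN { printf "7B @ Q4_K_M ~ %.1f GB\n", 7e9 * 4.8 / 8 / 1e9 }'
```

The 3B estimate lands close to the 1.93GB file downloaded above.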

Q: What is server mode and when would you use it? llama-server starts an HTTP server exposing an OpenAI-compatible REST API. Use it when you want to query the model programmatically, connect a local client tool, or access the model from another machine on your network. The --host 0.0.0.0 flag makes it reachable on your LAN; use 127.0.0.1 to restrict it to localhost only.

Q: Does llama.cpp inference on this machine require root or sudo? No. The build process writes to /opt which you chown to your user. The model download and all inference commands run as a normal user.

Ben Santora - March 2026