Multi-token prediction (MTP) in llama.cpp speeds up local Qwen 3.6 generation by building speculative decoding into the model itself—Hugging Face CTO Julien Chaumond’s quickstart shows you only need a recent build, an MTP GGUF from ggml-org, and two flags on llama-server.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
CLI[llama-server + MTP GGUF] --> FLAGS["--spec-type draft-mtp"]
FLAGS --> DENSE[Dense 27B MTP]
FLAGS --> MOE[MoE 35B-A3B MTP]
DENSE --> OUT[Faster token stream]
MOE --> OUT
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
class CLI agent
class OUT agent
class FLAGS hook

Multi-token prediction bundles draft guesses inside the same model file so decode steps emit more accepted text.
What MTP changes
MTP is a draft head trained with the base model, not a separate small “speculator” you download and wire up by hand. At decode time the head proposes the next few tokens as a short draft; the main model verifies them in one pass. When draft tokens are accepted, you emit more text per forward step. Chaumond and the merged llama.cpp MTP PR (#22673) describe roughly 2× generation throughput in favourable setups, though real gains depend on hardware, quantisation, and how many draft tokens you allow.
The MTP weights ship in the same GGUF as the main checkpoint; llama.cpp loads a lightweight MTP context (extra KV cache, typically under 10% of the full model's memory). You opt in with flags; MTP does not run unless you ask for it.
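The draft-then-verify step can be illustrated with a toy shell function. This is a minimal sketch of greedy verification, not llama.cpp's implementation: `verify_draft` is a hypothetical name, tokens are plain words, and the second argument stands in for what the main model would choose at each position (one entry more than the draft).

```shell
# Toy greedy verification (illustrative only; not llama.cpp code).
# $1: space-separated draft tokens proposed by the MTP head
# $2: space-separated tokens the main model would pick at the same
#     positions, plus one extra token for the position after the draft
verify_draft() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    nd = split(d, draft, " ")
    split(t, target, " ")
    out = ""
    i = 1
    # Keep draft tokens while they match the main model
    while (i <= nd && draft[i] == target[i]) {
      out = out draft[i] " "
      i++
    }
    # Always emit one token from the main model: the correction on a
    # mismatch, or the bonus token when the whole draft was accepted
    print out target[i]
  }'
}

verify_draft "the cat sat" "the cat lay down"   # prints: the cat lay
verify_draft "the cat" "the cat sat"            # prints: the cat sat
```

In the first call two of three draft tokens survive, so one verification pass yields three tokens instead of one. In the second the whole draft is accepted and the main model's own next token comes along for free.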

Both checkpoints use the same MTP flags; pick the variant that matches your RAM and speed goals.
Prerequisites
| Requirement | Detail |
|---|---|
| llama.cpp build | MTP merged 16 May 2026; Chaumond suggests brew upgrade llama.cpp or brew install llama.cpp --HEAD until package managers ship build 9200+ |
| Model files | Qwen3.6-27B-MTP-GGUF (dense) or Qwen3.6-35B-A3B-MTP-GGUF (MoE) |
| Memory | ~48–64 GB RAM or VRAM is comfortable; ~36 GB may work with more aggressive quants (Q4/Q6, Unsloth-style packs) |
| Pull models | -hf ggml-org/… on llama-server downloads from the Hub automatically |
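A quick way to check the build requirement before downloading anything is to parse the server's version banner. This is a sketch under one loud assumption: that `llama-server --version` prints a line of the form `version: <build> (<commit>)`. Verify that against your own binary. `build_number` and `has_mtp_build` are hypothetical helper names.

```shell
# Assumption: the version banner contains a line like
#   version: 9312 (abc1234)
# build_number reads that banner on stdin and prints the build number.
build_number() {
  sed -n 's/^version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# has_mtp_build MIN: succeeds if the installed llama-server reports a
# build number of at least MIN (9200 per the table above).
has_mtp_build() {
  build=$(llama-server --version 2>&1 | build_number)
  [ -n "$build" ] && [ "$build" -ge "$1" ]
}

# Usage (requires llama-server on PATH):
#   has_mtp_build 9200 && echo "build is new enough for MTP"
```

If the banner format differs on your platform, `build_number` prints nothing and `has_mtp_build` fails closed, which is the safer default.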
Commands (copy-paste)
Install or refresh llama.cpp, then start the server with MTP enabled. Chaumond’s post uses --spec-draft-n-max 2 on dense and 3 on MoE; community benchmarks on the MoE often favour n-max 2 when acceptance rate drops at wider draft windows.
# Refresh llama.cpp (macOS example)
brew upgrade llama.cpp
# Or until stable packages catch up:
# brew install llama.cpp --HEAD
# Dense 27B — balanced quality (~30 tok/s on author’s box)
llama-server -hf ggml-org/Qwen3.6-27B-MTP-GGUF \
--spec-type draft-mtp --spec-draft-n-max 2
# MoE 35B-A3B — much faster when it fits (~100 tok/s in the post)
llama-server -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF \
--spec-type draft-mtp --spec-draft-n-max 3
Optional: add --no-mmproj if you do not need vision—saves memory. Advanced users can combine MTP with ngram drafting on supported builds; treat that as experimental.
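Once the server is up, decode speed can be read from a response rather than estimated by eye. The sketch below assumes the server listens on the default `http://127.0.0.1:8080`, that the `/completion` response carries a `timings` object with a `predicted_per_second` field, and that a flat numeric field can be extracted with `sed` (jq is the sturdier choice if installed). `timing_field`, `measure_decode_rate`, and the `LLAMA_URL` variable are hypothetical names.

```shell
# timing_field NAME: read JSON on stdin, print the first numeric value
# stored under "NAME". Only handles plain decimal numbers.
timing_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9.][0-9.]*\).*/\1/p" | head -n 1
}

# Send one short generation request and print the decode rate in tok/s.
# Assumes a running llama-server; override the address with LLAMA_URL.
measure_decode_rate() {
  curl -s "${LLAMA_URL:-http://127.0.0.1:8080}/completion" \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Explain speculative decoding in two sentences.", "n_predict": 128}' \
    | timing_field predicted_per_second
}
```

Run `measure_decode_rate` once with the MTP flags and once without them on the same prompt to get a like-for-like comparison on your own hardware.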
Dense vs MoE: which to pick
| Variant | When it fits | Draft depth (starting point) | Notes from the thread |
|---|---|---|---|
| Dense 27B MTP | Single-GPU rigs aiming for steady quality | --spec-draft-n-max 2 | Chaumond reports ~30 tok/s locally; PR benches show ~1.8–2× decode vs no MTP on RTX 3090-class setups |
| MoE 35B-A3B MTP | High RAM/VRAM, throughput-first coding/chat | Try 2 first, then 3 | Post claims ~100 tok/s; independent runs show +20–30% at n-max 2, shrinking or negative returns at n-max 4 when acceptance falls |
How to read speed-up claims
- Decode vs prefill: MTP mainly helps token generation; prompt processing can be slower because of extra embedding transfers (noted in the PR).
- Acceptance rate: Wider --spec-draft-n-max drafts more tokens per step but wastes work when guesses are wrong; measure predicted_per_second and draft acceptance, not prompt-processing rate.
- Quality: PR authors ran AIME-style evals; scores stayed in line with Qwen’s published benchmarks when MTP is enabled.
- Hardware spread: Reports from Strix Halo, RTX 4090/5090, and laptop setups with 6 GB VRAM plus system RAM range from modest (~1.2×) to nearly 2×, depending on quant and n-max.
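The acceptance-rate trade-off can be made concrete with a little arithmetic. Under the simplifying assumption that every draft token is accepted independently with the same probability a (real acceptance varies by position and prompt), a step with n draft tokens emits 1 + a + a^2 + ... + a^n tokens on average. The function name is hypothetical, and the model ignores the cost of producing the drafts.

```shell
# expected_tokens_per_step ACCEPT N
# ACCEPT: assumed per-token acceptance probability, between 0 and 1
# N:      draft depth, i.e. the value passed to --spec-draft-n-max
expected_tokens_per_step() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    total = 0
    term = 1
    # Sum a^0 + a^1 + ... + a^n; the a^0 term is the token the main
    # model always contributes on each verification pass
    for (k = 0; k <= n; k++) {
      total += term
      term *= a
    }
    printf "%.2f\n", total
  }'
}

expected_tokens_per_step 0.8 2   # prints: 2.44
expected_tokens_per_step 0.6 2   # prints: 1.96
```

Each extra draft position adds only a^n tokens on average, so at low acceptance a deeper draft buys little while still costing verification work. That matches the reports of diminishing or negative returns at wider n-max.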
Common confusion (answered)
| Question | Answer |
|---|---|
| Do I need a second GGUF for the draft model? | No for MTP—one MTP-tagged GGUF includes the head; classic speculative decoding still uses a separate small draft checkpoint. |
| Why does my MoE slow down with n-max 3? | Lower acceptance means rejected drafts cost extra compute; try 2 and watch acceptance in server logs. |
| Does MTP work with tensor parallel / vision? | Yes in principle per the PR; some backend combos (e.g. tensor split + MTP) were still being fixed—test your stack. |
| Is this the same as “sharing to the Hub”? | No—the LinkedIn slug is generic; this post is specifically about running Qwen 3.6 MTP locally in llama.cpp. |
Performance snapshot
| Scenario | Approximate effect | Source |
|---|---|---|
| 27B Q6_K, RTX 3090 decode | 22.4 → 42.5 tok/s (~1.9×) | PR comment benchmark, MTP on vs off |
| 35B-A3B MoE, 6 GB VRAM + 64 GB RAM | 22.9 → 29.4 tok/s at n-max 2 | Community bench in PR thread |
| Author machine (Chaumond) | ~30 tok/s dense, ~100 tok/s MoE | LinkedIn post (May 2026) |
| MoE MXFP4, RTX PRO 24 GB | 91 → 111 tok/s at n-max 2 (~+22%) | LinkedIn comment (not ~2×) |
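The ratios in the table can be reproduced from the before and after figures. A small helper (the name `speedup` is only illustrative) divides the MTP-on rate by the MTP-off rate:

```shell
# speedup BASELINE MTP: ratio of the MTP-on decode rate to the MTP-off rate
speedup() {
  awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2f\n", mtp / base }'
}

speedup 22.4 42.5   # 27B Q6_K, RTX 3090: prints 1.90
speedup 22.9 29.4   # 35B-A3B MoE, 6 GB VRAM + 64 GB RAM: prints 1.28
speedup 91 111      # MoE MXFP4, RTX PRO 24 GB: prints 1.22
```

Only the dense 27B row approaches 2×; both MoE rows land in the +20–30% range.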
MTP turns Qwen 3.6 local runs from “one token per heavy step” into “verify a short bundle of guesses”—with a single Hub pull and two CLI flags once llama.cpp is current. Start with the dense GGUF if memory is tight; reach for the MoE MTP pack when you have headroom and care about tokens per second for long coding or agent loops.
Research supplement
- MTP origins: Multi-Token Prediction as a training objective was formalised in Meta's 2024 paper showing that training models to predict multiple future tokens simultaneously improves both sample efficiency and downstream task performance, with the side effect of producing usable draft heads for inference-time speculation.
- DeepSeek precedent: DeepSeek models (notably DeepSeek-V3 and DeepSeek-R1) also shipped with MTP heads and demonstrated real-world inference speedups using them, establishing the pattern that Qwen 3.6 follows.
- llama.cpp PR #22673: The merged pull request is the authoritative reference for implementation details, accepted flags, and any caveats around quantization compatibility. Readers building from source should verify their commit is at or after this merge.
- ggml-org GGUF files: The Qwen3.6-27B-MTP-GGUF and Qwen3.6-35B-A3B-MTP-GGUF repositories on Hugging Face are the canonical download locations and include model cards with quantization options.