the trade off is not good (new update)

#28
by rosspanda0 - opened

In a coding agent like Claude Code, prompt processing speed is as important as token output speed. MTP cuts prompt processing speed in half and reduces the available VRAM, which forces a shorter context length. With those costs, a 1.5x speedup in output is not an improvement for me. It's a downgrade.
Update 03-06:
For multi-GPU users, the tricky part of MTP in llama.cpp is that llama.cpp offloads the additional VRAM requirement entirely to the last GPU! See the post below.
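
To make the trade-off in the opening post concrete, here is a rough sketch that models one request as prompt ingestion time plus generation time. The baseline speeds (400 and 20 tokens/s) and the request sizes are made-up illustrative values, not measurements from this thread; only the "PP halved, TG 1.5x" ratios come from the post.

```python
def total_seconds(prompt_tokens, output_tokens, pp_speed, tg_speed):
    """Wall-clock time for one request: prompt ingestion plus generation."""
    return prompt_tokens / pp_speed + output_tokens / tg_speed


# Hypothetical baseline speeds in tokens/s (illustrative only).
BASE_PP, BASE_TG = 400.0, 20.0
# Trade-off described in the post: prompt processing halved, generation 1.5x.
MTP_PP, MTP_TG = BASE_PP * 0.5, BASE_TG * 1.5


def mtp_speedup(prompt_tokens, output_tokens):
    """Ratio > 1 means the MTP setup finishes the request sooner."""
    base = total_seconds(prompt_tokens, output_tokens, BASE_PP, BASE_TG)
    mtp = total_seconds(prompt_tokens, output_tokens, MTP_PP, MTP_TG)
    return base / mtp


if __name__ == "__main__":
    # (prompt tokens, output tokens): chat-like, then two agent-like requests.
    for prompt, output in [(2_000, 1_000), (20_000, 500), (60_000, 500)]:
        print(prompt, output, round(mtp_speedup(prompt, output), 2))
```

Under these assumptions MTP only wins while prompt ingestion takes less than one third of the generation time. A short prompt with a long answer comes out ahead, while an agent-style request with a 20,000-token prompt and a 500-token answer comes out slower overall.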

What hardware are you running on? I'm seeing a ~9% drop in PP and a ~20% increase in TG on a single V100 (32GB). I have to be very careful with memory: MTP is effectively running two models, so it's easy for one of them to drop to the CPU without my noticing. I reduced the context on the MTP model to compensate. Otherwise the second (MTP) model drops to system memory and TG falls to ~16.

I also noticed that I cannot run spec-draft-n-max at the recommended value (6). For some reason it's very CPU intensive even when it fits in RAM. Running at 4 is no problem, however.

Qwen3.6-27B:Q6_K (100k context)
./llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q6_K --temp 0.6 --top-p 0.95 --top-k 20 --port 8001 --host 0.0.0.0 --reasoning off -c 102400 -b 2048 -ub 2048 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --split-mode none --main-gpu 0

Results: pp: 488, tg: 21

Qwen3.6-27B-MTP:Q6_K (60k context)
./llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K --temp 0.6 --top-p 0.95 --top-k 20 --port 8001 --host 0.0.0.0 --reasoning off -c 61440 -b 2048 -ub 2048 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --split-mode none --main-gpu 0 --spec-type draft-mtp --spec-draft-n-max 4

Results: pp: 447, tg: 26
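
For reference, a small sketch that turns the two result lines and the two -c values above into relative changes. The numbers are copied from the commands and results in this post. The units of pp and tg are presumably tokens per second, which the post does not state explicitly.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values copied from the two runs above.
baseline = {"pp": 488, "tg": 21, "ctx": 102400}  # Qwen3.6-27B:Q6_K
mtp = {"pp": 447, "tg": 26, "ctx": 61440}        # MTP, --spec-draft-n-max 4

changes = {key: percent_change(baseline[key], mtp[key]) for key in baseline}

for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
# pp: -8.4%
# tg: +23.8%
# ctx: -40.0%
```

So on this single V100 the MTP run gives up about 8% of prompt processing speed and 40% of the context window for roughly 24% faster generation.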

My hardware is 6 years old, but it works well with the vanilla version for productive work. I used to run a 200K context length. To be honest, when I said 1.5x faster, that was generous: on my machines the token generation speed improves only a little, if at all, while the prompt ingestion speed plunges and the whole system becomes unusable.

I've been noticing inconsistent results with prompt-processing. Sometimes it's fast. Sometimes it's slow.
I'm going to try reverting this KV cache reuse regression, present since 5/19: https://github.com/ggml-org/llama.cpp/issues/23589

It turns out llama.cpp offloads the additional VRAM requirement for MTP to the LAST GPU, so the -ts parameter has to be tuned very carefully, in particular by greatly reducing the VRAM share estimated for the last GPU. The earlier slow performance was mainly because the cache was not fully offloaded to the last GPU, which I hadn't noticed. I retested today and found this tricky detail. Now MTP is working as expected.
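
A minimal sketch of that tuning idea, for anyone trying the same thing: hold back some headroom on the last GPU before computing the -ts proportions. The helper names, the three 24 GiB cards, and the 6 GiB reserve are all hypothetical. The real MTP overhead depends on the model and context size and should be measured (for example by watching per-GPU memory use while the server loads). This is not part of llama.cpp, just arithmetic around its -ts (--tensor-split) flag, which takes a comma-separated list of proportions.

```python
def tensor_split(free_vram_gib, last_gpu_reserve_gib):
    """Per-GPU proportions for -ts after holding back VRAM on the last GPU.

    free_vram_gib: free VRAM per GPU, in device order.
    last_gpu_reserve_gib: assumed headroom for the extra MTP allocation,
        which the post above reports landing on the last GPU.
    """
    if not free_vram_gib:
        raise ValueError("need at least one GPU")
    usable = list(free_vram_gib)
    usable[-1] -= last_gpu_reserve_gib
    if usable[-1] <= 0:
        raise ValueError("reserve exceeds the VRAM of the last GPU")
    return usable


def ts_argument(split):
    """Format the proportions as a comma-separated -ts value."""
    return ",".join(f"{x:g}" for x in split)


if __name__ == "__main__":
    # Hypothetical rig: three 24 GiB cards, 6 GiB held back on the last one.
    split = tensor_split([24, 24, 24], last_gpu_reserve_gib=6)
    print("-ts", ts_argument(split))
```

The resulting value is only a starting point. If the last GPU still overflows, increase the reserve and check again.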

rosspanda0 changed discussion title from the trade off is not good to the trade off is not good (new update)
shimmyshimmer changed discussion status to closed
