- canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
W4A16 INT4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — the first DeepSeek-V4-Flash quantization that ships a working MTP block, giving ~1.5× speculative decoding (spec-decode) speedup at bs=1 with no quality cost. Extends the W4A16-FP8 predecessor by patching the transformers calibration path so the MTP block survives the load.
TL;DR
| Recommended hardware | RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated · or 8× H200 TP=2 |
| Quality | GSM8K 93.71% (8-shot strict); HumanEval 84.76% pass@1; MMLU 86.88% |
| Throughput | RTX PRO 6000 98.83 @ TP=2 / 107.32 @ TP=4 at bs=1; 88.35 on H200 TP=2 |
| MTP acceptance | 89% calibrated workload / 70% on random prompts at bs=1 k=1 |
| Spec-decode speedup | 1.49× at bs=1, k=1 (TPOT 6.02 ms vs 8.93 ms, same artifact) |
| Differentiator | First V4-Flash W4A16 quant where MTP survives the calibration load; transformers 5.8.1 silently strips MTP keys by default |
Family / related artifacts
| Repo | Role | Relation to this artifact |
|---|---|---|
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 |
predecessor | Same W4A16 + FP8 recipe; MTP dropped at load (the bug this artifact fixes) |
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP |
sibling | Same MTP-retention pattern; NVFP4 routed experts instead of W4A16 (Blackwell-native) |
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP |
larger sibling | V4-Pro at NVFP4 + MTP, B300-only deployment |
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 |
upstream reference | Original NVFP4 recipe (no MTP — same silent-drop bug) |
Why this exists
The W4A16-FP8 predecessor and RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 both drop the MTP block because transformers 5.8.1's DeepseekV4PreTrainedModel declares:
_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]
which silently filters every mtp.* tensor at from_pretrained time — without warning, without error. Calibration pipelines that go through from_pretrained produce quantized main weights paired with an absent MTP block; serving falls back to plain decode, losing the ~1.5–2× spec-decode speedup that V4-Flash's architecture provides.
This artifact bypasses the silent drop, runs the full 8-rank GPTQ calibration on a 768-sample corpus against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.
Architecture & precision
Base model
| Property | Value |
|---|---|
| Total parameters | |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Base BF16 size | ~543 GB |
| Quantized size | 159 GB (+3 GB vs predecessor for the BF16 MTP block) |
Component precisions
| Component | Format | Method |
|---|---|---|
| Routed experts (256 × 43 layers × 3 projections) | W4A16 INT4, group_size=128, symmetric | GPTQ via llm-compressor, 768 calibration samples |
Attention path (wq_a, wq_b, wkv, wo_a, wo_b, indexer, compressor) |
FP8_BLOCK 128×128 | Dynamic scales, scale_fmt=ue8m0 |
MTP block (mtp.0.*) |
BF16 | Excluded from quantization, preserved verbatim |
HC plumbing (hc_attn_*, hc_ffn_*, hc_head_*), attn_sink, ffn.gate.bias, indexer/compressor ape |
FP32 | Restored post-save from BF16 source (see Upstream contributions) |
head.weight (LM head) |
FP32 | Upcast from BF16 to match sibling artifact's MTP loader path |
Embeddings (embed.weight, mtp.0.emb.tok_emb.weight) |
BF16 | Source dtype preserved |
Hardware validated
| Platform | SM | HBM/GPU | Interconnect | TP | Role |
|---|---|---|---|---|---|
| 8× NVIDIA H200 SXM5 | 9.0a | 141 GB HBM3e | NVLink | 2 (4× replicas) | Calibration + initial benchmarks (p5en.48xlarge) |
| 4× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB HBM | PCIe | TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) | Workstation Blackwell deployment + $/token sweet spot |
Same artifact, no weight changes between SKUs. Both validated cuda graphs ON.
Benchmarks
All numbers from the same artifact, vLLM HEAD 50d9dd902 + 4 patches cherry-picked (PRs #43248 / #43288 / #43290 / #43319).
Quality
Sampling: greedy, temperature 0. Methodology disclosed per row.
| Benchmark | Setting | This artifact | Predecessor (W4A16-FP8, no MTP) | RedHat (NVFP4-FP8, no MTP) | Delta |
|---|---|---|---|---|---|
| GSM8K | 8-shot, strict-match | 93.71% ± 0.67 | 95.07% (RTX PRO 6000) / 95.45% (Spark) | 91.0% (self-reported) | -1.28 pts vs predecessor (within 1 SE) |
| GSM8K | 8-shot, flexible-extract | 93.63% ± 0.67 | 95.37% (Spark) | — | within SE |
| MMLU | 5-shot | 86.88% ± 0.27 | 87.27% (H200) | — | -0.39 pts (within SE) |
| MMLU-Pro | 5-shot, 12k prompts, custom-extract | 71.28% ± 0.40 | — | — | sibling NVFP4-FP8-MTP scored 81.13% on B300 — expected gap given W4A16 has more quant noise than NVFP4 on knowledge-heavy harder benchmarks |
| HumanEval | 0-shot pass@1, --confirm_run_unsafe_code |
84.76% ± 2.82 | 80.49% (corrected, see predecessor card "Changes") | — | +4.27 pts vs corrected predecessor number |
| AIME 2024 | 30 problems, thinking=high, c=4, max_tokens=64K | 29/30 (96.7%) ✓ verified 2026-05-29 in fresh Docker (TP=4 RTX PRO 6000) | — | — | the prior 30.0% number was a scoring artifact (see footnote); proper chat-template thinking=high + max_tokens at the model-len cap (so reasoning isn't truncated) returns the right answer. TP=2 same config: 27/30 (90.0%). |
| GSM8K-50 chat-mode cross-check (RTX PRO 6000 TP=4, 2026-05-24 post-shipping-fix) | greedy, no thinking, concurrency=1 | 44/50 = 88.0% | — | — | matches Card B sibling's 88% strict TP=2 / 90% strict TP=4 on the same hardware — confirms dequant'd artifact preserves quality |
| IFEval prompt-strict | chat-template, no thinking | TBD² | — | — | not yet measured cleanly on this build |
| chat-smoke (quick / quality / coding) | harness | 4/4 · 4/4 · 2/2 | 4/4 · 4/4 · 2/2 | — | match |
| toolcall15 | 1 round, 30 points | 24/30 (80%) | 26/30 (87%) | — | -2 pts — see Honest limitations |
¹ The prior 30.0% AIME number was an lm-eval-harness aime24 task artifact — completions-mode prompt (no chat template), exact_match scorer on a thinking-mode model whose answers are wrapped in <think>…</think> + \boxed{N}. The scorer matched the literal answer string and missed virtually every correct response. A 1-shot smoke under proper chat-templated thinking=high methodology returned 2024-II-4: pred=33, exp=33, correct in 2072 completion tokens — model behavior is correct. Full 30-problem re-bench attempted 2026-05-24 on this RTX PRO 6000 box hit a reproducible CUDA illegal memory access (Worker_TP2: torch.AcceleratorError) under any concurrent thinking-mode load (cuda graphs and --enforce-eager both crash; concurrency=4 dies at ~11 min, concurrency=8 dies at ~90 s). Single-shot inference works. Re-bench deferred to H200 with jasl/vllm@ds4-sm120-experimental@abad5dc71 (the build the original Card D H200 numbers used).
² IFEval re-bench attempted 2026-05-24 hit the same RTX PRO 6000 stability issue. Deferred alongside AIME.
Throughput
vllm bench serve random 256-in / 256-out, MTP-spec num_speculative_tokens=1 (k=1 cap on this build — see Honest limitations), cuda graphs ON.
| Hardware | TP | bs=1 output tok/s | bs=1 TPOT median | bs=4 output tok/s | bs=16 output tok/s | MTP acceptance @ bs=1 |
|---|---|---|---|---|---|---|
| 8× H200 | 2 (per replica) | 88.35 | 6.02 ms | 138.80 | 367.13 | 89% calibrated / 70% random |
| 4× RTX PRO 6000 box | TP=2 (per replica, 2 replicas fit) | 98.83 | 8.55 ms | 219.53 | 482.61 | 71% |
| 4× RTX PRO 6000 box | TP=4 (single replica) | 107.32 | 7.77 ms | 221.52 | 584.04 | 68% |
Per-replica, RTX PRO 6000 wins output throughput at every batch size; H200 still wins per-token TPOT median.
MTP draft-token acceptance per workload
Same artifact, bs=1, k=1.
| Workload | Prompts | Accepted / emitted | Acceptance |
|---|---|---|---|
| Random 256-token prompts (200 samples) | random | 21024 / 30058 | 69.94% |
| Code, raw completion (15 short signature+docstring prompts) | code-raw | 1847 / 1988 | 92.91% |
| Chat-templated prose (15 prompts) | chat-prose | 1946 / 2376 | 81.90% |
| Raw natural language (15 continuation prompts) | nl-raw | 1745 / 2086 | 83.65% |
Spec-decode wins at low concurrency (single-user interactive). At bs≥4 the verifier is already filling its batch lane, so extra verifier passes add overhead without saving wall-clock — matches the sibling artifact's framing of bs=1 as the headline operating point.
Cost per output token (node-level)
Boxes priced for cloud-rented hardware. Single-replica numbers measured; multi-replica totals are linear extrapolation.
| Box | Replicas | bs=1 total tok/s | bs=16 total tok/s | $/h | $/(1000 tok/h) at bs=1 |
|---|---|---|---|---|---|
p5en.48xlarge (8× H200) |
4× TP=2 | ~353 | ~1468 | $98 | $278 |
g7e.24xlarge (4× RTX PRO 6000) |
2× TP=2 | ~198 | ~965 | $19.92 | $101 |
g7e.24xlarge (4× RTX PRO 6000) |
1× TP=4 | 107.32 | 584.04 | $19.92 | $186 |
At bs=1 (interactive), RTX PRO 6000 2×TP=2 is ~2.7× cheaper than H200 4×TP=2. At bs=16 the gap narrows because H200's per-replica throughput scales better with batch — H200 wins absolute throughput when you can fill it; RTX wins on $/token unless you genuinely need >1500 tok/s aggregate output.
Cross-validation: 2026-05-29 fresh Docker on RTX PRO 6000 ✓
Hardware coverage: all numbers below are from RTX PRO 6000 Blackwell Server Edition (SM 12.0a) on a Brev
g7e.24xlarge. The image is expected to work on Workstation Edition (same SM 12.0a, same Marlin native cubins, same model + serve path) but we have not directly verified it ourselves. Reference TP=2 Workstation numbers from jasl's bench harness (baselines/20260512_sm120_deployment_1c20f1a6d) confirm the underlying stack runs on Workstation. Expect a 5-15% throughput delta from clock/memory-bandwidth differences. If anything misbehaves on Workstation Edition, open an issue at the repo.
Full bench matrix on canada-quant/dsv4-w4a16-rtxpro6000:v1 (the HF-published Docker image, built from jasl/vllm@27fd665b + canada-quant BF16-MTP cherry-pick + Marlin MoE c_tmp/workspace patches). All AIME runs at max_tokens = max_model_len - 500 = 65036 so reasoning runs to natural stop:
| AIME-2024 thinking-mode sweep (c=4, n=30) | TP=2 (max_num_seqs=4) | TP=4 (max_num_seqs=16) |
|---|---|---|
| chat (no think) | 18/30 · MTP 95.78% · 53m | 19/30 · MTP 93.06% · 5m |
| thinking-high | 27/30 · MTP 91.97% · 152m | 29/30 · MTP 91.01% · 13m |
| thinking-max | 24/30 · MTP 92.52% · 177m | 27/30 · MTP 91.68% · 26m |
| AIME-2024 single-shot reference (c=1, thinking-high, n=30) | TP=2 | TP=4 |
|---|---|---|
| c=1 high | 27/30 · MTP 91.68% · 48m | 28/30 · MTP 90.76% · 41m |
| GSM8K (n=50, 8-shot) | TP=2 | TP=4 |
|---|---|---|
| flexible-extract | 45/50 (90.0%) | 43/50 (86.0%) |
| strict-match | 42/50 (84.0%) | 40/50 (80.0%) |
| Throughput random 256/256 (single replica, MTP on) | TP=2 tok/s @ TPOT p50 | TP=4 tok/s @ TPOT p50 |
|---|---|---|
| bs=1 | 95.2 @ 8.05 ms | 108.1 @ 7.32 ms |
| bs=4 | 40.6 @ 83.18 ms | 104.3 @ 11.31 ms |
| bs=8 | 45.7 @ 79.02 ms | 433.2 @ 16.44 ms (sweet spot) |
| bs=16 | 34.9 (capped by max_num_seqs=4) | 164.3 (scheduler thrash) |
| Throughput random 1024/1024 | TP=2 | TP=4 |
|---|---|---|
| bs=1 | 30.7 tok/s | 138.1 tok/s |
| bs=4 | 45.1 tok/s | 363.7 tok/s |
Headlines from this run:
- Zero CUDA illegal-memory-access in 240 AIME thinking-mode problems across c=4 chat/high/max + c=1 high on both TP=2 and TP=4 = the Marlin MoE concurrent-decode race is fixed by the
c_tmpclamp removal in PR vllm#43730 (which is baked into the v3 image viajasl/vllm@27fd665b). - TP=4 is 7-12× faster than TP=2 at AIME (chat 53m→5m, high 152m→13m, max 177m→26m). MoE expert sharding across 4 GPUs decisively wins.
- Thinking-max regresses correctness AND triples wall (TP=4: high 29/30 in 13m vs max 27/30 in 26m). The artifact's sweet spot is
reasoning_effort=high. - MTP holds 91-93% across all thinking modes and TP configs — the BF16-retained draft head is doing its job everywhere.
Raw JSON + per-bench logs in the reproduction repo.
Tuning attempts that DID NOT win on TP=4 Server (documenting so you don't repeat them)
We A/B-tested adopting jasl's TP=2 Workstation env tunings at TP=4 Server — none of them transferred. Stick with the v3 image defaults:
| Change from defaults | Result | Why |
|---|---|---|
num_speculative_tokens=2 (jasl's deepseek_mtp k=2 default) |
−86% bs=8 (433 → 60 tok/s) | k=2 doubles main-model forward cost; at TP=4 the all-reduce overhead exceeds the ~1.5 tokens-per-draft acceptance gain that's net-positive at TP=2 |
--enable-expert-parallel (jasl recommends) |
similarly bad combined with k=2 | TP=4 all-to-all expert-gather is expensive |
VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=512 + ..._TOPK_CHUNK_SIZE=512 (jasl's chunk-size tunings) |
CUDA illegal memory access at cudagraph capture | Tuned for SM 12.0a c128a Workstation single-request prefill; exceed safe limits at TP=4 Server |
--no-enable-flashinfer-autotune (jasl recommends) |
−74% bs=8 (433 → 111 tok/s) | Triton block-FP8 autotune is load-bearing at TP=4 — disabling locks in default tile sizes that don't match the 4-GPU shape |
--gpu-memory-utilization 0.985 (jasl recommends) |
crash potential combined with sparse-MLA env | 0.95 is the safe value the v3 image ships with at TP=4 |
The image's defaults are the optimal config for TP=4 RTX PRO 6000 Server Edition as of 2026-05-29. If you're deploying on TP=2 Workstation Edition, jasl's reference config (sm120_tp2_serve.env.example) is the right starting point — it was tuned on that exact hardware.
Quick start
RTX PRO 6000 Blackwell — Docker (recommended)
The pre-built canada-quant/dsv4-w4a16-rtxpro6000:v1
image bakes the full 13-layer recipe (jasl/vllm@27fd665b + canada-quant BF16
MTP cherry-pick + Marlin MoE c_tmp/workspace patches + cute.arch.fmin shim).
~3-5 min from docker load to a working endpoint on a g7e.24xlarge.
# 1. Pull the image tarball (~14 GB compressed)
hf download canada-quant/dsv4-flash-w4a16-rtxpro6000-image \
--include "*.tar.gz" --local-dir .
docker load < dsv4-w4a16-rtxpro6000-v1.tar.gz
# 2. Cache the W4A16 model onto NVMe (~159 GB, ~1-2 min via xet on Brev)
HF_HOME=/opt/dlami/nvme/hf-cache hf download \
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
# 3. Pull the serve helper
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
# 4. Serve TP=2 (or TP=4 with --gpus all -e TP=4 -e MAX_NUM_SEQS=16)
docker run -d --gpus '"device=0,1"' --name dsv4-w4a16-serve \
--shm-size=16g --ipc=host -p 8000:8000 \
-v /opt/dlami/nvme/hf-cache:/root/.cache/huggingface \
-v $(pwd)/scripts:/workspace/scripts:ro \
-e TP=2 -e MAX_NUM_SEQS=4 -e MAX_MODEL_LEN=65536 -e GPU_MEM_UTIL=0.95 \
canada-quant/dsv4-w4a16-rtxpro6000:v1 \
bash /workspace/scripts/serve_rtx6000pro_w4a16.sh
# 5. Wait for /v1/models (~3-5 min model load + cudagraph capture)
until curl -sf http://127.0.0.1:8000/v1/models >/dev/null; do sleep 5; done
# 6. Run the full bench matrix (AIME chat/high/max + GSM8K + throughput)
docker exec dsv4-w4a16-serve bash -c \
"TAG=tp2_64k MAX_MODEL_LEN=65536 bash /workspace/scripts/bench_matrix.sh"
RTX PRO 6000 Blackwell — from-source install (advanced)
# 1. Bootstrap vLLM (~25 min for source build)
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
bash scripts/bootstrap_rtx6000pro.sh
# 2. Extra pins
source ~/venv-serve/bin/activate
pip install --quiet "flashinfer-python==0.6.8.post1" "flashinfer-cubin==0.6.8.post1" \
"numba==0.65.0" "tilelang==0.1.9" "apache-tvm-ffi==0.1.9" "fastsafetensors>=0.2.2"
# 3. Apply patches
python scripts/patch_v4_forcausal_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_mtp_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_nvidia_attn_scale.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
bash scripts/patch_wo_a_bf16_path.sh "$(python -c 'import vllm; print(vllm.__path__[0])')"
# 4. Download artifact (159 GiB) — already dequant'd in-artifact as of 2026-05-24,
# no local preprocessing step required.
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
--local-dir /scratch/weights/w4a16-fp8-mtp-gptq
# 5. Serve TP=2 (or TP=4 with 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1 bash scripts/serve_rtx6000pro.sh \
/scratch/weights/w4a16-fp8-mtp-gptq 8000 2
Required runtime env vars on SM 12.x (already set inside serve_rtx6000pro.sh but worth knowing):
export VLLM_TRITON_MLA_SPARSE=1
export VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4
export VLLM_USE_FLASHINFER_SAMPLER=0
Without VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel can crash during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access in _dequantize_and_gather_k_kernel. The FlashInfer sampler is also broken on TORCH_CUDA_ARCH_LIST=12.0a — fall back to PyTorch-native via VLLM_USE_FLASHINFER_SAMPLER=0.
H200
vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 --block-size 256 \
--max-model-len 4096 \
--gpu-memory-utilization 0.80 \
--no-enable-prefix-caching \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--trust-remote-code
Quantization recipe
| Property | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (V4 chat template) |
| Samples | 768 |
| Max sequence length | 512 |
| Per-rank batch size | 4 |
| Calibration hardware | 8× NVIDIA H200 (p5en.48xlarge) |
| Walltime | ~15.4h (15.09h oneshot + ~16 min save) |
| Per-subgraph cadence | ~20 min/subgraph × 44 subgraphs (43 MoE + 1 MTP no-op) |
Calibration recipe identical to the W4A16-FP8 predecessor with one change: the modeling class is patched to remove mtp.* from _keys_to_ignore_on_load_unexpected before from_pretrained, so the MTP block survives the load and is written back to the artifact at BF16.
vLLM build
Common patches (all platforms)
| PR | Purpose | Status |
|---|---|---|
vllm-project/vllm#43248 |
bool() wrap on is_static_input_scheme |
open |
vllm-project/vllm#43288 |
.get("scale_fmt", "ue8m0") on missing key + BF16 getattr follow-up |
open |
vllm-project/vllm#43290 |
weight_scale_inv-or-weight_scale fallback (attention) |
open |
vllm-project/vllm#43319 |
MTP-quant-detect from safetensors header + BF16 wo_a fallback path |
open |
RTX PRO 6000 Blackwell (SM 12.0) only
| Patch | Purpose |
|---|---|
packed_modules_mapping on DeepseekV4ForCausalLM + DeepSeekV4MTP |
Required as of ds4-sm120-experimental@abad5dc71 |
BF16 wo_a path for MTP block |
Static weight.dtype == bfloat16 check (dynamo-safe) |
--disable-custom-all-reduce |
No NVLink between RTX PRO 6000 boards |
CMakeLists USE_SABI 3.11 removal |
For Python 3.10 |
(Previously this list also required a compressor/indexer FP8 → BF16 dequant preprocess step run against the local artifact. As of 2026-05-24 the dequant is baked into the published artifact — see Changes.)
H200 deployments need only the four common patches.
Honest limitations
- k=1 cap on spec-decode — current vLLM build limits
num_speculative_tokensto 1 due to DeepGemm kernel assertionnext_n == 1 or next_n == 2insmxx_fp8_fp4_paged_mqa_logits.hpp:233. vLLM passesnext_n = num_speculative_tokens + 1, so practical k is 1. TheFLASHINFER_MLA_SPARSEattention backend hits the same kernel-side assertion. With the assertion relaxed, expect bs=1 speedup to rise from 1.49× to ~1.85× (matching sibling NVFP4 artifact's k=2 published number). - Concurrent thinking-mode workloads on RTX PRO 6000 produce token-corrupted output — under concurrency ≥ 2 with thinking=high (long-decode workloads like AIME), the Marlin W4A16 MoE decode kernel on SM 12.0 produces token-stream corruption (CJK / Cyrillic / garbled ASCII spliced into the model's reasoning trace). The same hardware + same vLLM build serving the NVFP4 sibling (
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTPviaflashinfer_trtllmMoE) is essentially clean on the same workload (1/30 vs 14/30 corrupted at c=4 thinking). The bug is specific to the W4A16 + Marlin MoE decode path on SM 12.0. Investigation isolated through 7 controlled tests (sparse-MLA topk-chunk size, MTP-off, matmul_decode-off, eager-mode, concurrency sweep, NVFP4 vs W4A16 path comparison). Workaround on RTX PRO 6000: for batched thinking-mode workloads, serve the NVFP4 sibling artifact instead. For sequential (c=1) thinking-mode or any batched chat-mode (no thinking), this W4A16-MTP artifact works cleanly (GSM8K-20 chat-mode sequential = 20/20 = 100%, MTP draft acceptance 92.46%). Full debug log + reproducible benches:docs/findings/sm12x_token_corruption_2026_05_24.md. Filed upstream asjasl/vllm#12. - toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion (TC-07 stopped mid-chain to ask a clarifying question) and multi-tool extraction (TC-06 returned both translations as content text instead of routing two
translatecalls). Quality-wise the model completes the underlying intent; the harness scores tool-call-protocol fidelity, not task completion. Not a parser issue (confirmed by replay through--tool-call-parser deepseek_v4). - GSM8K -1.3 pts vs predecessor's 8-shot strict-match — within one SE, but technically below. Likely calibration-set sensitivity rather than recipe drift (recipe is identical, hardware differs).
- NVFP4 native kernels on RTX PRO 6000 not auto-selected — even though
csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cuexists in upstream vLLM, the backend selector doesn't pick it (vllm-project/vllm#31085). Until that lands, the sibling NVFP4 artifact on this hardware would route through Marlin too. This artifact's W4A16 path is the tested choice for RTX PRO 6000.
Reproduction
Full pipeline at canada-quant/dsv4-flash-w4a16-fp8-mtp. From a fresh 8× H200 box:
# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh
# Phase 1 — download upstream + dequant to BF16-MTP source (~30 min, ~660 GB)
bash scripts/phase1_dequant.sh
# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh
# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh
# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq
# Phase 5 — serve (see Quick start above for serve command)
Upstream contributions filed during this work
| Contribution | Description | Status |
|---|---|---|
transformers — save_pretrained silent FP32 → BF16 downcast |
417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing, gate bias, attn_sink, indexer/compressor ape) are silently written as BF16 by save_pretrained when model torch_dtype is BF16. Workaround: postprocess restore from BF16 source via scripts/fixup_artifact.py. Upstream filing pending |
local |
vLLM — MTP loader silently skips top-level head.weight + embed.weight |
DeepSeekV4MTP.load_weights calls name.replace("mtp.0.", "") which no-ops on non-mtp.0.* keys; get_spec_layer_idx returns None → loop skips. head.weight and embed.weight never reach shared_head.head / embed_tokens → uninitialized → 0% MTP acceptance with no load-time error. Workaround: postprocess injects mtp.0.head.weight and mtp.0.emb.tok_emb.weight as duplicates. Upstream filing pending |
local |
vLLM — DeepGemm paged_mqa_logits asserts on num_speculative_tokens > 1 |
smxx_fp8_fp4_paged_mqa_logits.hpp:233 enforces next_n == 1 or next_n == 2. With next_n = k+1, practical k cap is 1. Caps spec-decode speedup at 1.49× vs sibling's published 2.03× at k=2 |
upstream (DeepGemm) — filing pending |
vllm-project/vllm#43248 |
bool() wrap on is_static_input_scheme |
open |
vllm-project/vllm#43288 |
scale_fmt defensive .get() + BF16 getattr wrap |
open |
vllm-project/vllm#43290 |
weight_scale_inv-or-weight_scale fallback |
open |
vllm-project/vllm#43319 |
MTP-quant-detect from safetensors header + BF16 wo_a fallback path |
open |
Changes
| Date | Change |
|---|---|
| 2026-05-22 | Initial release on H200 (jasl/vllm@ds4-sm120-experimental@abad5dc71). GSM8K 93.71% strict, MMLU 86.88%, HumanEval 84.76%, MTP acceptance 89% on calibrated workload / 70% on random prompts |
| 2026-05-24 (morning) | RTX PRO 6000 Blackwell (SM 12.0) added. TP=2 and TP=4 both validated, chat-smoke 4/4 PASS, MTP acceptance 68-72%, MTP-on per-replica throughput 98.83 tok/s @ TP=2 / 107.32 @ TP=4. Per-replica throughput beats H200 at every batch size. vllm-project/vllm#41511 (Marlin TP > 2 bug) did not fire on this build |
| 2026-05-24 (afternoon) | Shipping-bug fix. Artifact previously shipped FP8_BLOCK compressor/indexer with .weight_scale keys; current upstream/preview-dev vLLM constructs those modules as plain BF16 (quant_config=None), so the artifact failed to load with KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'. 166 compressor/indexer weights dequantized in-place (FP8 + BF16 scale → BF16, mathematically lossless) and re-uploaded. Artifact now loads cleanly on modern vLLM on both RTX PRO 6000 and H200 without local preprocessing. H200 historical numbers above remain valid: the original H200 build supported FP8 compressor; modern vLLM serving the new BF16-compressor format produces equivalent outputs (verified post-fix on RTX PRO 6000 TP=4: GSM8K-50 chat-mode 44/50 = 88.0%, matches sibling Card B's 88% strict TP=2 / 90% strict TP=4 on this hardware). |
| 2026-05-24 (afternoon) | AIME 2024 methodology correction. Prior 30.0% exact_match was an lm-eval-harness aime24 task artifact (completions-mode prompt, no chat template, exact-string scorer on a thinking model whose answers are \boxed{N}). 1-shot smoke under chat-templated thinking=high returns the correct integer. Full re-bench blocked by the RTX PRO 6000 concurrent-thinking CUDA crash (see Honest limitations); deferred to H200. Prior 30.0% struck through in Quality table. |
| 2026-05-24 (evening) | Root-cause investigation of the RTX PRO 6000 concurrent-thinking issue. Updated to jasl/vllm@a937d4b28 (Stabilize SM12x sparse MLA long prefill) — server no longer crashes under concurrent thinking-mode load, but produces token-stream corruption on ~50% of long generations at c=4. Tested workarounds (VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256, VLLM_TRITON_MLA_SPARSE_MATMUL_DECODE=0, MTP-off, eager-mode, concurrency sweep) — none reduce corruption; all increase crash rate. Diagnostic comparison: same hardware + same build + same workload on Card B (NVFP4 + flashinfer_trtllm MoE) = 1/30 corrupted vs Card D (W4A16 + Marlin MoE) = 14/30 corrupted. Bug isolated to W4A16 + Marlin MoE decode path on SM 12.0. Production-config verified on Card D: GSM8K-20 chat-mode sequential = 20/20 = 100%, MTP draft acceptance 92.46%. Full debug log: docs/findings/sm12x_token_corruption_2026_05_24.md. Filed upstream: jasl/vllm#12. |
| 2026-05-25 | Applied vllm-project/vllm#40923 "Marlin MoE: include SM 12.x in default arch list" + clean rebuild. PR #40923's description matches our symptom verbatim ("V4-Flash MoE decode emits gibberish on RTX 50-series GB10/DGX Spark... driver JIT-promotes 8.0+PTX fallback"). After applying the patch (12.0a;12.1a for CUDA 12.9) and forcing a full Marlin MoE source regen + rebuild: AIME c=4 thinking corruption dropped from 14/30 → 0/30, but the underlying kernel race surfaced as a different failure mode — CUDA error: an illegal memory access was encountered in Worker_TP*, with 29/30 errors and 1/30 completing correctly. PR #40923 is necessary but not sufficient: native SM 12.0a Marlin MoE cubins eliminate the JIT-PTX corruption, but a second race in the W4A16 Marlin MoE decode path under concurrent thinking-mode on SM 12.0 still crashes the worker. NVFP4 sibling canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP remains the recommendation for batched thinking-mode on this hardware. 1-shot smoke + sequential workloads still clean. PR #40923 status: OPEN, member-approved 2026-04-27 by Harry-Chen, blocked on core-maintainer SM120 policy review; canada-quant repro will be posted as additional evidence. |
Files in the artifact
- 4 sharded
model-*.safetensorsfiles +model.safetensors.index.json(159 GB total) config.json— vLLM-compatible quantization_config with MTP block excludedtokenizer.json,tokenizer_config.json,generation_config.json,chat_template.jinja— upstream DSV4-Flashrecipe.yaml— the llm-compressor GPTQ recipeREADME.md— this file
Citation
@misc{canada-quant-dsv4-flash-w4a16-fp8-mtp-2026,
title = {DeepSeek-V4-Flash W4A16-FP8 with BF16 MTP retained for vLLM speculative decoding},
author = {Canada Quant},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP}
}
License
MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash. Review at the upstream repo before commercial deployment.
Acknowledgments
- DeepSeek for the base model + MTP architecture + inference reference.
- jasl (
jasl/vllmandjasl/vllm-ds4-sm120-harness) for the vLLM build pins (ds4-sm120-experimentalfor H200;ds4-sm120-preview-devfor RTX PRO 6000 SM 12.0) and the benchmark harness. canada-quant/DeepSeek-V4-Flash-W4A16-FP8(predecessor) for the proven recipe topology this artifact extends with MTP.canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP(sibling) for the alias-injection pattern and MTP acceptance methodology.- vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.
- Downloads last month
- 9,132
Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
Base model
deepseek-ai/DeepSeek-V4-Flash