canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

W4A16 INT4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — the first DeepSeek-V4-Flash quantization that ships a working MTP block, giving ~1.5× speculative decoding (spec-decode) speedup at bs=1 with no quality cost. Extends the W4A16-FP8 predecessor by patching the transformers calibration path so the MTP block survives the load.

TL;DR

Recommended hardware RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated · or 8× H200 TP=2
Quality GSM8K 93.71% (8-shot strict); HumanEval 84.76% pass@1; MMLU 86.88%
Throughput RTX PRO 6000 98.83 @ TP=2 / 107.32 @ TP=4 at bs=1; 88.35 on H200 TP=2
MTP acceptance 89% calibrated workload / 70% on random prompts at bs=1 k=1
Spec-decode speedup 1.49× at bs=1, k=1 (TPOT 6.02 ms vs 8.93 ms, same artifact)
Differentiator First V4-Flash W4A16 quant where MTP survives the calibration load; transformers 5.8.1 silently strips MTP keys by default

Family / related artifacts

Repo Role Relation to this artifact
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 predecessor Same W4A16 + FP8 recipe; MTP dropped at load (the bug this artifact fixes)
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP sibling Same MTP-retention pattern; NVFP4 routed experts instead of W4A16 (Blackwell-native)
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP larger sibling V4-Pro at NVFP4 + MTP, B300-only deployment
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 upstream reference Original NVFP4 recipe (no MTP — same silent-drop bug)

Why this exists

The W4A16-FP8 predecessor and RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 both drop the MTP block because transformers 5.8.1's DeepseekV4PreTrainedModel declares:

_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]

which silently filters every mtp.* tensor at from_pretrained time — without warning, without error. Calibration pipelines that go through from_pretrained produce quantized main weights paired with an absent MTP block; serving falls back to plain decode, losing the ~1.5–2× spec-decode speedup that V4-Flash's architecture provides.

This artifact bypasses the silent drop, runs the full 8-rank GPTQ calibration on a 768-sample corpus against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.

Architecture & precision

Base model

Property Value
Total parameters 284 B (13 B active per token)
Decoder layers 43
Routed experts / layer 256 (top-K = 6)
Hidden size 4096
Base BF16 size ~543 GB
Quantized size 159 GB (+3 GB vs predecessor for the BF16 MTP block)

Component precisions

Component Format Method
Routed experts (256 × 43 layers × 3 projections) W4A16 INT4, group_size=128, symmetric GPTQ via llm-compressor, 768 calibration samples
Attention path (wq_a, wq_b, wkv, wo_a, wo_b, indexer, compressor) FP8_BLOCK 128×128 Dynamic scales, scale_fmt=ue8m0
MTP block (mtp.0.*) BF16 Excluded from quantization, preserved verbatim
HC plumbing (hc_attn_*, hc_ffn_*, hc_head_*), attn_sink, ffn.gate.bias, indexer/compressor ape FP32 Restored post-save from BF16 source (see Upstream contributions)
head.weight (LM head) FP32 Upcast from BF16 to match sibling artifact's MTP loader path
Embeddings (embed.weight, mtp.0.emb.tok_emb.weight) BF16 Source dtype preserved

Hardware validated

Platform SM HBM/GPU Interconnect TP Role
8× NVIDIA H200 SXM5 9.0a 141 GB HBM3e NVLink 2 (4× replicas) Calibration + initial benchmarks (p5en.48xlarge)
4× NVIDIA RTX PRO 6000 Blackwell Server Edition 12.0, sm_120 96 GB HBM PCIe TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) Workstation Blackwell deployment + $/token sweet spot

Same artifact, no weight changes between SKUs. Both validated cuda graphs ON.

Benchmarks

All numbers from the same artifact, vLLM HEAD 50d9dd902 + 4 patches cherry-picked (PRs #43248 / #43288 / #43290 / #43319).

Quality

Sampling: greedy, temperature 0. Methodology disclosed per row.

Benchmark Setting This artifact Predecessor (W4A16-FP8, no MTP) RedHat (NVFP4-FP8, no MTP) Delta
GSM8K 8-shot, strict-match 93.71% ± 0.67 95.07% (RTX PRO 6000) / 95.45% (Spark) 91.0% (self-reported) -1.28 pts vs predecessor (within 1 SE)
GSM8K 8-shot, flexible-extract 93.63% ± 0.67 95.37% (Spark) within SE
MMLU 5-shot 86.88% ± 0.27 87.27% (H200) -0.39 pts (within SE)
MMLU-Pro 5-shot, 12k prompts, custom-extract 71.28% ± 0.40 sibling NVFP4-FP8-MTP scored 81.13% on B300 — expected gap given W4A16 has more quant noise than NVFP4 on knowledge-heavy harder benchmarks
HumanEval 0-shot pass@1, --confirm_run_unsafe_code 84.76% ± 2.82 80.49% (corrected, see predecessor card "Changes") +4.27 pts vs corrected predecessor number
AIME 2024 30 problems, thinking=high, c=4, max_tokens=64K 29/30 (96.7%) ✓ verified 2026-05-29 in fresh Docker (TP=4 RTX PRO 6000) the prior 30.0% number was a scoring artifact (see footnote); proper chat-template thinking=high + max_tokens at the model-len cap (so reasoning isn't truncated) returns the right answer. TP=2 same config: 27/30 (90.0%).
GSM8K-50 chat-mode cross-check (RTX PRO 6000 TP=4, 2026-05-24 post-shipping-fix) greedy, no thinking, concurrency=1 44/50 = 88.0% matches Card B sibling's 88% strict TP=2 / 90% strict TP=4 on the same hardware — confirms dequant'd artifact preserves quality
IFEval prompt-strict chat-template, no thinking TBD² not yet measured cleanly on this build
chat-smoke (quick / quality / coding) harness 4/4 · 4/4 · 2/2 4/4 · 4/4 · 2/2 match
toolcall15 1 round, 30 points 24/30 (80%) 26/30 (87%) -2 pts — see Honest limitations

¹ The prior 30.0% AIME number was an lm-eval-harness aime24 task artifact — completions-mode prompt (no chat template), exact_match scorer on a thinking-mode model whose answers are wrapped in <think>…</think> + \boxed{N}. The scorer matched the literal answer string and missed virtually every correct response. A 1-shot smoke under proper chat-templated thinking=high methodology returned 2024-II-4: pred=33, exp=33, correct in 2072 completion tokens — model behavior is correct. Full 30-problem re-bench attempted 2026-05-24 on this RTX PRO 6000 box hit a reproducible CUDA illegal memory access (Worker_TP2: torch.AcceleratorError) under any concurrent thinking-mode load (cuda graphs and --enforce-eager both crash; concurrency=4 dies at ~11 min, concurrency=8 dies at ~90 s). Single-shot inference works. Re-bench deferred to H200 with jasl/vllm@ds4-sm120-experimental@abad5dc71 (the build the original Card D H200 numbers used).

² IFEval re-bench attempted 2026-05-24 hit the same RTX PRO 6000 stability issue. Deferred alongside AIME.

Throughput

vllm bench serve random 256-in / 256-out, MTP-spec num_speculative_tokens=1 (k=1 cap on this build — see Honest limitations), cuda graphs ON.

Hardware TP bs=1 output tok/s bs=1 TPOT median bs=4 output tok/s bs=16 output tok/s MTP acceptance @ bs=1
8× H200 2 (per replica) 88.35 6.02 ms 138.80 367.13 89% calibrated / 70% random
4× RTX PRO 6000 box TP=2 (per replica, 2 replicas fit) 98.83 8.55 ms 219.53 482.61 71%
4× RTX PRO 6000 box TP=4 (single replica) 107.32 7.77 ms 221.52 584.04 68%

Per-replica, RTX PRO 6000 wins output throughput at every batch size; H200 still wins per-token TPOT median.

MTP draft-token acceptance per workload

Same artifact, bs=1, k=1.

Workload Prompts Accepted / emitted Acceptance
Random 256-token prompts (200 samples) random 21024 / 30058 69.94%
Code, raw completion (15 short signature+docstring prompts) code-raw 1847 / 1988 92.91%
Chat-templated prose (15 prompts) chat-prose 1946 / 2376 81.90%
Raw natural language (15 continuation prompts) nl-raw 1745 / 2086 83.65%

Spec-decode wins at low concurrency (single-user interactive). At bs≥4 the verifier is already filling its batch lane, so extra verifier passes add overhead without saving wall-clock — matches the sibling artifact's framing of bs=1 as the headline operating point.

Cost per output token (node-level)

Boxes priced for cloud-rented hardware. Single-replica numbers measured; multi-replica totals are linear extrapolation.

Box Replicas bs=1 total tok/s bs=16 total tok/s $/h $/(1000 tok/h) at bs=1
p5en.48xlarge (8× H200) 4× TP=2 ~353 ~1468 $98 $278
g7e.24xlarge (4× RTX PRO 6000) 2× TP=2 ~198 ~965 $19.92 $101
g7e.24xlarge (4× RTX PRO 6000) 1× TP=4 107.32 584.04 $19.92 $186

At bs=1 (interactive), RTX PRO 6000 2×TP=2 is ~2.7× cheaper than H200 4×TP=2. At bs=16 the gap narrows because H200's per-replica throughput scales better with batch — H200 wins absolute throughput when you can fill it; RTX wins on $/token unless you genuinely need >1500 tok/s aggregate output.

Cross-validation: 2026-05-29 fresh Docker on RTX PRO 6000 ✓

Hardware coverage: all numbers below are from RTX PRO 6000 Blackwell Server Edition (SM 12.0a) on a Brev g7e.24xlarge. The image is expected to work on Workstation Edition (same SM 12.0a, same Marlin native cubins, same model + serve path) but we have not directly verified it ourselves. Reference TP=2 Workstation numbers from jasl's bench harness (baselines/20260512_sm120_deployment_1c20f1a6d) confirm the underlying stack runs on Workstation. Expect a 5-15% throughput delta from clock/memory-bandwidth differences. If anything misbehaves on Workstation Edition, open an issue at the repo.

Full bench matrix on canada-quant/dsv4-w4a16-rtxpro6000:v1 (the HF-published Docker image, built from jasl/vllm@27fd665b + canada-quant BF16-MTP cherry-pick + Marlin MoE c_tmp/workspace patches). All AIME runs at max_tokens = max_model_len - 500 = 65036 so reasoning runs to natural stop:

AIME-2024 thinking-mode sweep (c=4, n=30) TP=2 (max_num_seqs=4) TP=4 (max_num_seqs=16)
chat (no think) 18/30 · MTP 95.78% · 53m 19/30 · MTP 93.06% · 5m
thinking-high 27/30 · MTP 91.97% · 152m 29/30 · MTP 91.01% · 13m
thinking-max 24/30 · MTP 92.52% · 177m 27/30 · MTP 91.68% · 26m
AIME-2024 single-shot reference (c=1, thinking-high, n=30) TP=2 TP=4
c=1 high 27/30 · MTP 91.68% · 48m 28/30 · MTP 90.76% · 41m
GSM8K (n=50, 8-shot) TP=2 TP=4
flexible-extract 45/50 (90.0%) 43/50 (86.0%)
strict-match 42/50 (84.0%) 40/50 (80.0%)
Throughput random 256/256 (single replica, MTP on) TP=2 tok/s @ TPOT p50 TP=4 tok/s @ TPOT p50
bs=1 95.2 @ 8.05 ms 108.1 @ 7.32 ms
bs=4 40.6 @ 83.18 ms 104.3 @ 11.31 ms
bs=8 45.7 @ 79.02 ms 433.2 @ 16.44 ms (sweet spot)
bs=16 34.9 (capped by max_num_seqs=4) 164.3 (scheduler thrash)
Throughput random 1024/1024 TP=2 TP=4
bs=1 30.7 tok/s 138.1 tok/s
bs=4 45.1 tok/s 363.7 tok/s

Headlines from this run:

  • Zero CUDA illegal-memory-access in 240 AIME thinking-mode problems across c=4 chat/high/max + c=1 high on both TP=2 and TP=4 = the Marlin MoE concurrent-decode race is fixed by the c_tmp clamp removal in PR vllm#43730 (which is baked into the v3 image via jasl/vllm@27fd665b).
  • TP=4 is 7-12× faster than TP=2 at AIME (chat 53m→5m, high 152m→13m, max 177m→26m). MoE expert sharding across 4 GPUs decisively wins.
  • Thinking-max regresses correctness AND triples wall (TP=4: high 29/30 in 13m vs max 27/30 in 26m). The artifact's sweet spot is reasoning_effort=high.
  • MTP holds 91-93% across all thinking modes and TP configs — the BF16-retained draft head is doing its job everywhere.

Raw JSON + per-bench logs in the reproduction repo.

Tuning attempts that DID NOT win on TP=4 Server (documenting so you don't repeat them)

We A/B-tested adopting jasl's TP=2 Workstation env tunings at TP=4 Server — none of them transferred. Stick with the v3 image defaults:

Change from defaults Result Why
num_speculative_tokens=2 (jasl's deepseek_mtp k=2 default) −86% bs=8 (433 → 60 tok/s) k=2 doubles main-model forward cost; at TP=4 the all-reduce overhead exceeds the ~1.5 tokens-per-draft acceptance gain that's net-positive at TP=2
--enable-expert-parallel (jasl recommends) similarly bad combined with k=2 TP=4 all-to-all expert-gather is expensive
VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=512 + ..._TOPK_CHUNK_SIZE=512 (jasl's chunk-size tunings) CUDA illegal memory access at cudagraph capture Tuned for SM 12.0a c128a Workstation single-request prefill; exceed safe limits at TP=4 Server
--no-enable-flashinfer-autotune (jasl recommends) −74% bs=8 (433 → 111 tok/s) Triton block-FP8 autotune is load-bearing at TP=4 — disabling locks in default tile sizes that don't match the 4-GPU shape
--gpu-memory-utilization 0.985 (jasl recommends) crash potential combined with sparse-MLA env 0.95 is the safe value the v3 image ships with at TP=4

The image's defaults are the optimal config for TP=4 RTX PRO 6000 Server Edition as of 2026-05-29. If you're deploying on TP=2 Workstation Edition, jasl's reference config (sm120_tp2_serve.env.example) is the right starting point — it was tuned on that exact hardware.

Quick start

RTX PRO 6000 Blackwell — Docker (recommended)

The pre-built canada-quant/dsv4-w4a16-rtxpro6000:v1 image bakes the full 13-layer recipe (jasl/vllm@27fd665b + canada-quant BF16 MTP cherry-pick + Marlin MoE c_tmp/workspace patches + cute.arch.fmin shim). ~3-5 min from docker load to a working endpoint on a g7e.24xlarge.

# 1. Pull the image tarball (~14 GB compressed)
hf download canada-quant/dsv4-flash-w4a16-rtxpro6000-image \
    --include "*.tar.gz" --local-dir .
docker load < dsv4-w4a16-rtxpro6000-v1.tar.gz

# 2. Cache the W4A16 model onto NVMe (~159 GB, ~1-2 min via xet on Brev)
HF_HOME=/opt/dlami/nvme/hf-cache hf download \
    canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

# 3. Pull the serve helper
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp

# 4. Serve TP=2 (or TP=4 with --gpus all -e TP=4 -e MAX_NUM_SEQS=16)
docker run -d --gpus '"device=0,1"' --name dsv4-w4a16-serve \
    --shm-size=16g --ipc=host -p 8000:8000 \
    -v /opt/dlami/nvme/hf-cache:/root/.cache/huggingface \
    -v $(pwd)/scripts:/workspace/scripts:ro \
    -e TP=2 -e MAX_NUM_SEQS=4 -e MAX_MODEL_LEN=65536 -e GPU_MEM_UTIL=0.95 \
    canada-quant/dsv4-w4a16-rtxpro6000:v1 \
    bash /workspace/scripts/serve_rtx6000pro_w4a16.sh

# 5. Wait for /v1/models (~3-5 min model load + cudagraph capture)
until curl -sf http://127.0.0.1:8000/v1/models >/dev/null; do sleep 5; done

# 6. Run the full bench matrix (AIME chat/high/max + GSM8K + throughput)
docker exec dsv4-w4a16-serve bash -c \
    "TAG=tp2_64k MAX_MODEL_LEN=65536 bash /workspace/scripts/bench_matrix.sh"

RTX PRO 6000 Blackwell — from-source install (advanced)

# 1. Bootstrap vLLM (~25 min for source build)
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
bash scripts/bootstrap_rtx6000pro.sh

# 2. Extra pins
source ~/venv-serve/bin/activate
pip install --quiet "flashinfer-python==0.6.8.post1" "flashinfer-cubin==0.6.8.post1" \
    "numba==0.65.0" "tilelang==0.1.9" "apache-tvm-ffi==0.1.9" "fastsafetensors>=0.2.2"

# 3. Apply patches
python scripts/patch_v4_forcausal_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_mtp_packed_mapping.py        "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_nvidia_attn_scale.py         "$(python -c 'import vllm; print(vllm.__path__[0])')"
bash   scripts/patch_wo_a_bf16_path.sh             "$(python -c 'import vllm; print(vllm.__path__[0])')"

# 4. Download artifact (159 GiB) — already dequant'd in-artifact as of 2026-05-24,
#    no local preprocessing step required.
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --local-dir /scratch/weights/w4a16-fp8-mtp-gptq

# 5. Serve TP=2 (or TP=4 with 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1 bash scripts/serve_rtx6000pro.sh \
    /scratch/weights/w4a16-fp8-mtp-gptq 8000 2

Required runtime env vars on SM 12.x (already set inside serve_rtx6000pro.sh but worth knowing):

export VLLM_TRITON_MLA_SPARSE=1
export VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4
export VLLM_USE_FLASHINFER_SAMPLER=0

Without VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel can crash during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access in _dequantize_and_gather_k_kernel. The FlashInfer sampler is also broken on TORCH_CUDA_ARCH_LIST=12.0a — fall back to PyTorch-native via VLLM_USE_FLASHINFER_SAMPLER=0.

H200

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8 --block-size 256 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.80 \
    --no-enable-prefix-caching \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --trust-remote-code

Quantization recipe

Property Value
Dataset HuggingFaceH4/ultrachat_200k (V4 chat template)
Samples 768
Max sequence length 512
Per-rank batch size 4
Calibration hardware 8× NVIDIA H200 (p5en.48xlarge)
Walltime ~15.4h (15.09h oneshot + ~16 min save)
Per-subgraph cadence ~20 min/subgraph × 44 subgraphs (43 MoE + 1 MTP no-op)

Calibration recipe identical to the W4A16-FP8 predecessor with one change: the modeling class is patched to remove mtp.* from _keys_to_ignore_on_load_unexpected before from_pretrained, so the MTP block survives the load and is written back to the artifact at BF16.

vLLM build

Common patches (all platforms)

PR Purpose Status
vllm-project/vllm#43248 bool() wrap on is_static_input_scheme open
vllm-project/vllm#43288 .get("scale_fmt", "ue8m0") on missing key + BF16 getattr follow-up open
vllm-project/vllm#43290 weight_scale_inv-or-weight_scale fallback (attention) open
vllm-project/vllm#43319 MTP-quant-detect from safetensors header + BF16 wo_a fallback path open

RTX PRO 6000 Blackwell (SM 12.0) only

Patch Purpose
packed_modules_mapping on DeepseekV4ForCausalLM + DeepSeekV4MTP Required as of ds4-sm120-experimental@abad5dc71
BF16 wo_a path for MTP block Static weight.dtype == bfloat16 check (dynamo-safe)
--disable-custom-all-reduce No NVLink between RTX PRO 6000 boards
CMakeLists USE_SABI 3.11 removal For Python 3.10

(Previously this list also required a compressor/indexer FP8 → BF16 dequant preprocess step run against the local artifact. As of 2026-05-24 the dequant is baked into the published artifact — see Changes.)

H200 deployments need only the four common patches.

Honest limitations

  1. k=1 cap on spec-decode — current vLLM build limits num_speculative_tokens to 1 due to DeepGemm kernel assertion next_n == 1 or next_n == 2 in smxx_fp8_fp4_paged_mqa_logits.hpp:233. vLLM passes next_n = num_speculative_tokens + 1, so practical k is 1. The FLASHINFER_MLA_SPARSE attention backend hits the same kernel-side assertion. With the assertion relaxed, expect bs=1 speedup to rise from 1.49× to ~1.85× (matching sibling NVFP4 artifact's k=2 published number).
  2. Concurrent thinking-mode workloads on RTX PRO 6000 produce token-corrupted output — under concurrency ≥ 2 with thinking=high (long-decode workloads like AIME), the Marlin W4A16 MoE decode kernel on SM 12.0 produces token-stream corruption (CJK / Cyrillic / garbled ASCII spliced into the model's reasoning trace). The same hardware + same vLLM build serving the NVFP4 sibling (canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP via flashinfer_trtllm MoE) is essentially clean on the same workload (1/30 vs 14/30 corrupted at c=4 thinking). The bug is specific to the W4A16 + Marlin MoE decode path on SM 12.0. Investigation isolated through 7 controlled tests (sparse-MLA topk-chunk size, MTP-off, matmul_decode-off, eager-mode, concurrency sweep, NVFP4 vs W4A16 path comparison). Workaround on RTX PRO 6000: for batched thinking-mode workloads, serve the NVFP4 sibling artifact instead. For sequential (c=1) thinking-mode or any batched chat-mode (no thinking), this W4A16-MTP artifact works cleanly (GSM8K-20 chat-mode sequential = 20/20 = 100%, MTP draft acceptance 92.46%). Full debug log + reproducible benches: docs/findings/sm12x_token_corruption_2026_05_24.md. Filed upstream as jasl/vllm#12.
  3. toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion (TC-07 stopped mid-chain to ask a clarifying question) and multi-tool extraction (TC-06 returned both translations as content text instead of routing two translate calls). Quality-wise the model completes the underlying intent; the harness scores tool-call-protocol fidelity, not task completion. Not a parser issue (confirmed by replay through --tool-call-parser deepseek_v4).
  4. GSM8K -1.3 pts vs predecessor's 8-shot strict-match — within one SE, but technically below. Likely calibration-set sensitivity rather than recipe drift (recipe is identical, hardware differs).
  5. NVFP4 native kernels on RTX PRO 6000 not auto-selected — even though csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu exists in upstream vLLM, the backend selector doesn't pick it (vllm-project/vllm#31085). Until that lands, the sibling NVFP4 artifact on this hardware would route through Marlin too. This artifact's W4A16 path is the tested choice for RTX PRO 6000.

Reproduction

Full pipeline at canada-quant/dsv4-flash-w4a16-fp8-mtp. From a fresh 8× H200 box:

# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh

# Phase 1 — download upstream + dequant to BF16-MTP source (~30 min, ~660 GB)
bash scripts/phase1_dequant.sh

# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh

# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh

# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq

# Phase 5 — serve (see Quick start above for serve command)

Upstream contributions filed during this work

Contribution Description Status
transformers — save_pretrained silent FP32 → BF16 downcast 417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing, gate bias, attn_sink, indexer/compressor ape) are silently written as BF16 by save_pretrained when model torch_dtype is BF16. Workaround: postprocess restore from BF16 source via scripts/fixup_artifact.py. Upstream filing pending local
vLLM — MTP loader silently skips top-level head.weight + embed.weight DeepSeekV4MTP.load_weights calls name.replace("mtp.0.", "") which no-ops on non-mtp.0.* keys; get_spec_layer_idx returns None → loop skips. head.weight and embed.weight never reach shared_head.head / embed_tokens → uninitialized → 0% MTP acceptance with no load-time error. Workaround: postprocess injects mtp.0.head.weight and mtp.0.emb.tok_emb.weight as duplicates. Upstream filing pending local
vLLM — DeepGemm paged_mqa_logits asserts on num_speculative_tokens > 1 smxx_fp8_fp4_paged_mqa_logits.hpp:233 enforces next_n == 1 or next_n == 2. With next_n = k+1, practical k cap is 1. Caps spec-decode speedup at 1.49× vs sibling's published 2.03× at k=2 upstream (DeepGemm) — filing pending
vllm-project/vllm#43248 bool() wrap on is_static_input_scheme open
vllm-project/vllm#43288 scale_fmt defensive .get() + BF16 getattr wrap open
vllm-project/vllm#43290 weight_scale_inv-or-weight_scale fallback open
vllm-project/vllm#43319 MTP-quant-detect from safetensors header + BF16 wo_a fallback path open

Changes

Date Change
2026-05-22 Initial release on H200 (jasl/vllm@ds4-sm120-experimental@abad5dc71). GSM8K 93.71% strict, MMLU 86.88%, HumanEval 84.76%, MTP acceptance 89% on calibrated workload / 70% on random prompts
2026-05-24 (morning) RTX PRO 6000 Blackwell (SM 12.0) added. TP=2 and TP=4 both validated, chat-smoke 4/4 PASS, MTP acceptance 68-72%, MTP-on per-replica throughput 98.83 tok/s @ TP=2 / 107.32 @ TP=4. Per-replica throughput beats H200 at every batch size. vllm-project/vllm#41511 (Marlin TP > 2 bug) did not fire on this build
2026-05-24 (afternoon) Shipping-bug fix. Artifact previously shipped FP8_BLOCK compressor/indexer with .weight_scale keys; current upstream/preview-dev vLLM constructs those modules as plain BF16 (quant_config=None), so the artifact failed to load with KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'. 166 compressor/indexer weights dequantized in-place (FP8 + BF16 scale → BF16, mathematically lossless) and re-uploaded. Artifact now loads cleanly on modern vLLM on both RTX PRO 6000 and H200 without local preprocessing. H200 historical numbers above remain valid: the original H200 build supported FP8 compressor; modern vLLM serving the new BF16-compressor format produces equivalent outputs (verified post-fix on RTX PRO 6000 TP=4: GSM8K-50 chat-mode 44/50 = 88.0%, matches sibling Card B's 88% strict TP=2 / 90% strict TP=4 on this hardware).
2026-05-24 (afternoon) AIME 2024 methodology correction. Prior 30.0% exact_match was an lm-eval-harness aime24 task artifact (completions-mode prompt, no chat template, exact-string scorer on a thinking model whose answers are \boxed{N}). 1-shot smoke under chat-templated thinking=high returns the correct integer. Full re-bench blocked by the RTX PRO 6000 concurrent-thinking CUDA crash (see Honest limitations); deferred to H200. Prior 30.0% struck through in Quality table.
2026-05-24 (evening) Root-cause investigation of the RTX PRO 6000 concurrent-thinking issue. Updated to jasl/vllm@a937d4b28 (Stabilize SM12x sparse MLA long prefill) — server no longer crashes under concurrent thinking-mode load, but produces token-stream corruption on ~50% of long generations at c=4. Tested workarounds (VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256, VLLM_TRITON_MLA_SPARSE_MATMUL_DECODE=0, MTP-off, eager-mode, concurrency sweep) — none reduce corruption; all increase crash rate. Diagnostic comparison: same hardware + same build + same workload on Card B (NVFP4 + flashinfer_trtllm MoE) = 1/30 corrupted vs Card D (W4A16 + Marlin MoE) = 14/30 corrupted. Bug isolated to W4A16 + Marlin MoE decode path on SM 12.0. Production-config verified on Card D: GSM8K-20 chat-mode sequential = 20/20 = 100%, MTP draft acceptance 92.46%. Full debug log: docs/findings/sm12x_token_corruption_2026_05_24.md. Filed upstream: jasl/vllm#12.
2026-05-25 Applied vllm-project/vllm#40923 "Marlin MoE: include SM 12.x in default arch list" + clean rebuild. PR #40923's description matches our symptom verbatim ("V4-Flash MoE decode emits gibberish on RTX 50-series GB10/DGX Spark... driver JIT-promotes 8.0+PTX fallback"). After applying the patch (12.0a;12.1a for CUDA 12.9) and forcing a full Marlin MoE source regen + rebuild: AIME c=4 thinking corruption dropped from 14/30 → 0/30, but the underlying kernel race surfaced as a different failure mode — CUDA error: an illegal memory access was encountered in Worker_TP*, with 29/30 errors and 1/30 completing correctly. PR #40923 is necessary but not sufficient: native SM 12.0a Marlin MoE cubins eliminate the JIT-PTX corruption, but a second race in the W4A16 Marlin MoE decode path under concurrent thinking-mode on SM 12.0 still crashes the worker. NVFP4 sibling canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP remains the recommendation for batched thinking-mode on this hardware. 1-shot smoke + sequential workloads still clean. PR #40923 status: OPEN, member-approved 2026-04-27 by Harry-Chen, blocked on core-maintainer SM120 policy review; canada-quant repro will be posted as additional evidence.

Files in the artifact

  • 4 sharded model-*.safetensors files + model.safetensors.index.json (159 GB total)
  • config.json — vLLM-compatible quantization_config with MTP block excluded
  • tokenizer.json, tokenizer_config.json, generation_config.json, chat_template.jinja — upstream DSV4-Flash
  • recipe.yaml — the llm-compressor GPTQ recipe
  • README.md — this file

Citation

@misc{canada-quant-dsv4-flash-w4a16-fp8-mtp-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 with BF16 MTP retained for vLLM speculative decoding},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash. Review at the upstream repo before commercial deployment.

Acknowledgments

Downloads last month
9,132
Safetensors
Model size
51B params
Tensor type
I64
·
F32
·
I32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

Quantized
(65)
this model