canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

W4A16 INT4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained. This is the first DeepSeek-V4-Flash quantization that ships a working MTP block, giving a ~1.5× speculative decoding (spec-decode) speedup at bs=1 with no quality cost from speculation itself (drafted tokens are verified by the main model; benchmark deltas against the predecessor are listed under Quality). Extends the W4A16-FP8 predecessor by patching the transformers calibration path so the MTP block survives the load.

TL;DR


Recommended hardware	RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated · or 8× H200 TP=2
Quality	GSM8K 93.71% (8-shot strict); HumanEval 84.76% pass@1; MMLU 86.88%
Throughput	Output tok/s per replica at bs=1: RTX PRO 6000 98.83 @ TP=2 / 107.32 @ TP=4; H200 88.35 @ TP=2
MTP acceptance	89% calibrated workload / 70% on random prompts at bs=1 k=1
Spec-decode speedup	1.49× at bs=1, k=1 (TPOT 6.02 ms vs 8.93 ms, same artifact)
Differentiator	First V4-Flash W4A16 quant where MTP survives the calibration load; `transformers` 5.8.1 silently strips MTP keys by default

Family / related artifacts

Repo	Role	Relation to this artifact
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8`	predecessor	Same W4A16 + FP8 recipe; MTP dropped at load (the bug this artifact fixes)
`canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP`	sibling	Same MTP-retention pattern; NVFP4 routed experts instead of W4A16 (Blackwell-native)
`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`	larger sibling	V4-Pro at NVFP4 + MTP, B300-only deployment
`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`	upstream reference	Original NVFP4 recipe (no MTP — same silent-drop bug)

Why this exists

The W4A16-FP8 predecessor and RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 both drop the MTP block because transformers 5.8.1's DeepseekV4PreTrainedModel declares:

_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]

which silently filters every mtp.* tensor at from_pretrained time — without warning, without error. Calibration pipelines that go through from_pretrained produce quantized main weights paired with an absent MTP block; serving falls back to plain decode, losing the ~1.5–2× spec-decode speedup that V4-Flash's architecture provides.
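
To make the effect of that pattern concrete, here is a minimal sketch that applies it to a few checkpoint key names. It assumes the loader tests each unexpected key with `re.search`; the key names other than the `mtp.0.*` ones quoted in this card are illustrative only.

```python
import re

# Pattern quoted above from transformers 5.8.1's DeepseekV4PreTrainedModel.
IGNORE_PATTERNS = [r"(^|\.)mtp\..*"]


def is_silently_dropped(key, patterns=IGNORE_PATTERNS):
    """Return True if a checkpoint key matches any ignore pattern (search semantics assumed)."""
    return any(re.search(pattern, key) is not None for pattern in patterns)


# Illustrative key names; only the mtp.0.* names appear in this card.
example_keys = [
    "mtp.0.emb.tok_emb.weight",
    "mtp.0.head.weight",
    "layers.10.attn.wq_a.weight",
    "head.weight",
]

for key in example_keys:
    print(f"{key:32s} dropped={is_silently_dropped(key)}")
```

Every key under `mtp.` matches, whether it sits at the top level or behind a dotted prefix, which is why the whole draft block disappears without any error being raised.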

This artifact bypasses the silent drop, runs the full 8-rank GPTQ calibration on a 768-sample corpus against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.

Architecture & precision

Base model

Property	Value
Total parameters	~284 B (~13 B active per token)
Decoder layers	43
Routed experts / layer	256 (top-K = 6)
Hidden size	4096
Base BF16 size	~543 GB
Quantized size	159 GB (+3 GB vs predecessor for the BF16 MTP block)

Component precisions

Component	Format	Method
Routed experts (256 × 43 layers × 3 projections)	W4A16 INT4, group_size=128, symmetric	GPTQ via llm-compressor, 768 calibration samples
Attention path (`wq_a`, `wq_b`, `wkv`, `wo_a`, `wo_b`, indexer, compressor)	FP8_BLOCK 128×128	Dynamic scales, `scale_fmt=ue8m0`
MTP block (`mtp.0.*`)	BF16	Excluded from quantization, preserved verbatim
HC plumbing (`hc_attn_*`, `hc_ffn_*`, `hc_head_*`), `attn_sink`, `ffn.gate.bias`, indexer/compressor `ape`	FP32	Restored post-save from BF16 source (see Upstream contributions)
`head.weight` (LM head)	FP32	Upcast from BF16 to match sibling artifact's MTP loader path
Embeddings (`embed.weight`, `mtp.0.emb.tok_emb.weight`)	BF16	Source dtype preserved
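
As a rough illustration of what "INT4, group_size=128, symmetric" means for the routed experts, the sketch below quantizes one group with plain round-to-nearest. It is not GPTQ (GPTQ additionally compensates rounding error across columns using calibration statistics), and the 16-bit per-group scale used in the size estimate is an assumption, not something this card states.

```python
import math


def quantize_group_symmetric(values, bits=4):
    """Symmetric quantization of one group: a single scale and no zero point."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to floating-point values."""
    return [code * scale for code in codes]


def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Storage cost per weight once the per-group scale is amortized (scale_bits is assumed)."""
    return bits + scale_bits / group_size


group = [math.sin(i) for i in range(128)]   # toy weights for one group of 128
codes, scale = quantize_group_symmetric(group)
restored = dequantize_group(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.5f} worst_error={worst_error:.5f}")
print(f"effective bits/weight={effective_bits_per_weight()}")
```

With round-to-nearest the per-weight error is bounded by half a quantization step, which is the noise floor GPTQ then tries to redistribute.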

Hardware validated

Platform	SM	Memory/GPU	Interconnect	TP	Role
8× NVIDIA H200 SXM5	9.0a	141 GB HBM3e	NVLink	2 (4× replicas)	Calibration + initial benchmarks (`p5en.48xlarge`)
4× NVIDIA RTX PRO 6000 Blackwell Server Edition	12.0, sm_120	96 GB GDDR7	PCIe	TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica)	Workstation Blackwell deployment + $/token sweet spot

Same artifact, no weight changes between SKUs. Both platforms were validated with CUDA graphs on.

Benchmarks

All numbers from the same artifact, vLLM HEAD 50d9dd902 + 4 patches cherry-picked (PRs #43248 / #43288 / #43290 / #43319).

Quality

Sampling: greedy, temperature 0. Methodology disclosed per row.

Benchmark	Setting	This artifact	Predecessor (W4A16-FP8, no MTP)	RedHat (NVFP4-FP8, no MTP)	Delta
GSM8K	8-shot, strict-match	93.71% ± 0.67	95.07% (RTX PRO 6000) / 95.45% (Spark)	91.0% (self-reported)	-1.36 pts vs predecessor's 95.07% (about 2 SE at ±0.67)
GSM8K	8-shot, flexible-extract	93.63% ± 0.67	95.37% (Spark)	—	-1.74 pts vs predecessor (about 2.6 SE)
MMLU	5-shot	86.88% ± 0.27	87.27% (H200)	—	-0.39 pts (about 1.4 SE)
MMLU-Pro	5-shot, 12k prompts, custom-extract	71.28% ± 0.40	—	—	sibling NVFP4-FP8-MTP scored 81.13% on B300 — expected gap given W4A16 has more quant noise than NVFP4 on knowledge-heavy harder benchmarks
HumanEval	0-shot pass@1, `--confirm_run_unsafe_code`	84.76% ± 2.82	80.49% (corrected, see predecessor card "Changes")	—	+4.27 pts vs corrected predecessor number
AIME 2024	30 problems, thinking=high, c=4, max_tokens=64K	29/30 (96.7%) ✓ verified 2026-05-29 in fresh Docker (TP=4 RTX PRO 6000)	—	—	the prior 30.0% number was a scoring artifact (see footnote ¹); proper chat-template thinking=high + max_tokens at the model-len cap (so reasoning isn't truncated) returns the right answer. TP=2 same config: 27/30 (90.0%).
GSM8K-50 chat-mode cross-check (RTX PRO 6000 TP=4, 2026-05-24 post-shipping-fix)	greedy, no thinking, concurrency=1	44/50 = 88.0%	—	—	matches Card B sibling's 88% strict TP=2 / 90% strict TP=4 on the same hardware — confirms dequant'd artifact preserves quality
IFEval prompt-strict	chat-template, no thinking	TBD²	—	—	not yet measured cleanly on this build
chat-smoke (quick / quality / coding)	harness	4/4 · 4/4 · 2/2	4/4 · 4/4 · 2/2	—	match
toolcall15	1 round, 30 points	24/30 (80%)	26/30 (87%)	—	-2 pts — see Honest limitations

¹ The prior 30.0% AIME number was an lm-eval-harness aime24 task artifact: a completions-mode prompt (no chat template) scored with exact_match on a thinking-mode model whose answers are wrapped in <think>…</think> + \boxed{N}. The scorer matched the literal answer string and missed virtually every correct response. A 1-shot smoke under proper chat-templated thinking=high methodology returned 2024-II-4: pred=33, exp=33, correct in 2072 completion tokens, so model behavior is correct. A full 30-problem re-bench attempted 2026-05-24 on this RTX PRO 6000 box hit a reproducible CUDA illegal memory access (Worker_TP2: torch.AcceleratorError) under any concurrent thinking-mode load (cuda graphs and --enforce-eager both crash; concurrency=4 dies at ~11 min, concurrency=8 dies at ~90 s), while single-shot inference worked. At that point the re-bench was deferred to H200 with jasl/vllm@ds4-sm120-experimental@abad5dc71 (the build the original Card D H200 numbers used); the 29/30 figure in the table comes from the later 2026-05-29 fresh-Docker run on RTX PRO 6000 (see Cross-validation).

² IFEval re-bench attempted 2026-05-24 hit the same RTX PRO 6000 stability issue. Deferred alongside AIME.

Throughput

vllm bench serve random 256-in / 256-out, MTP-spec num_speculative_tokens=1 (k=1 cap on this build — see Honest limitations), cuda graphs ON.

Hardware	TP	bs=1 output tok/s	bs=1 TPOT median	bs=4 output tok/s	bs=16 output tok/s	MTP acceptance @ bs=1
8× H200	2 (per replica)	88.35	6.02 ms	138.80	367.13	89% calibrated / 70% random
4× RTX PRO 6000 box	TP=2 (per replica, 2 replicas fit)	98.83	8.55 ms	219.53	482.61	71%
4× RTX PRO 6000 box	TP=4 (single replica)	107.32	7.77 ms	221.52	584.04	68%

Per-replica, RTX PRO 6000 wins output throughput at every batch size; H200 still wins per-token TPOT median.

MTP draft-token acceptance per workload

Same artifact, bs=1, k=1.

Workload	Prompts	Accepted / emitted	Acceptance
Random 256-token prompts (200 samples)	random	21024 / 30058	69.94%
Code, raw completion (15 short signature+docstring prompts)	code-raw	1847 / 1988	92.91%
Chat-templated prose (15 prompts)	chat-prose	1946 / 2376	81.90%
Raw natural language (15 continuation prompts)	nl-raw	1745 / 2086	83.65%

Spec-decode wins at low concurrency (single-user interactive). At bs≥4 the verifier is already filling its batch lane, so extra verifier passes add overhead without saving wall-clock — matches the sibling artifact's framing of bs=1 as the headline operating point.
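
The acceptance column is simply accepted draft tokens divided by emitted draft tokens. The sketch below recomputes it from the table and derives the expected tokens per verifier pass at k=1, which is an upper bound on the speedup before the draft head's own cost is counted. The dictionary keys reuse the table's prompt labels.

```python
# (accepted, emitted) draft-token counts copied from the table above.
COUNTS = {
    "random": (21024, 30058),
    "code-raw": (1847, 1988),
    "chat-prose": (1946, 2376),
    "nl-raw": (1745, 2086),
}


def acceptance_rate(accepted, emitted):
    """Fraction of drafted tokens that the main model accepted."""
    return accepted / emitted


def tokens_per_verifier_pass_k1(rate):
    """At k=1 each pass yields one guaranteed token plus the draft token when accepted."""
    return 1.0 + rate


for label, (accepted, emitted) in COUNTS.items():
    rate = acceptance_rate(accepted, emitted)
    print(f"{label:10s} acceptance={rate:.2%} tokens/pass={tokens_per_verifier_pass_k1(rate):.3f}")
```

The gap between this bound and the measured bs=1 speedup is the overhead of running the draft head and verifying its output.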

Cost per output token (node-level)

Boxes priced for cloud-rented hardware. Single-replica numbers measured; multi-replica totals are linear extrapolation.

Box	Replicas	bs=1 total tok/s	bs=16 total tok/s	$/h	$/h per 1000 tok/s at bs=1
`p5en.48xlarge` (8× H200)	4× TP=2	~353	~1468	$98	$278
`g7e.24xlarge` (4× RTX PRO 6000)	2× TP=2	~198	~965	$19.92	$101
`g7e.24xlarge` (4× RTX PRO 6000)	1× TP=4	107.32	584.04	$19.92	$186

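At bs=1 (interactive), RTX PRO 6000 2×TP=2 is ~2.7× cheaper than H200 4×TP=2. At bs=16 the same arithmetic gives roughly $67 vs $21 per 1000 tok/s, so the gap widens to ~3.2× rather than narrowing: from bs=1 to bs=16, per-replica throughput grows ~4.9× on RTX PRO 6000 versus ~4.2× on H200. H200 still wins absolute node throughput when you can fill it (~1468 vs ~965 tok/s); RTX wins on $/token unless you genuinely need more aggregate output than one RTX box delivers.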
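
The last column of the cost table is the hourly price divided by throughput expressed in thousands of tok/s. The sketch below reproduces it from the table's own numbers and also converts to dollars per million output tokens; that second metric is not reported in this card and is shown only for convenience.

```python
def dollars_per_hour_per_ktoks(dollars_per_hour, tok_per_s):
    """Hourly price needed to sustain 1000 output tok/s at the measured rate."""
    return dollars_per_hour / (tok_per_s / 1000.0)


def dollars_per_million_tokens(dollars_per_hour, tok_per_s):
    """Price of one million output tokens at the measured rate."""
    tokens_per_hour = tok_per_s * 3600.0
    return dollars_per_hour / tokens_per_hour * 1e6


# (label, $/h, total tok/s at bs=1) copied from the table above.
BOXES = [
    ("p5en.48xlarge 4x TP=2", 98.00, 353.0),
    ("g7e.24xlarge 2x TP=2", 19.92, 198.0),
    ("g7e.24xlarge 1x TP=4", 19.92, 107.32),
]

for label, price, rate in BOXES:
    per_ktoks = dollars_per_hour_per_ktoks(price, rate)
    per_mtok = dollars_per_million_tokens(price, rate)
    print(f"{label:24s} ${per_ktoks:6.0f}/h per 1000 tok/s   ${per_mtok:6.2f} per 1M tokens")
```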

Cross-validation: 2026-05-29 fresh Docker on RTX PRO 6000 ✓

Hardware coverage: all numbers below are from RTX PRO 6000 Blackwell Server Edition (SM 12.0a) on a Brev g7e.24xlarge. The image is expected to work on Workstation Edition (same SM 12.0a, same Marlin native cubins, same model + serve path) but we have not directly verified it ourselves. Reference TP=2 Workstation numbers from jasl's bench harness (baselines/20260512_sm120_deployment_1c20f1a6d) confirm the underlying stack runs on Workstation. Expect a 5-15% throughput delta from clock/memory-bandwidth differences. If anything misbehaves on Workstation Edition, open an issue at the repo.

Full bench matrix on canada-quant/dsv4-w4a16-rtxpro6000:v1 (the HF-published Docker image, built from jasl/vllm@27fd665b + canada-quant BF16-MTP cherry-pick + Marlin MoE c_tmp/workspace patches). All AIME runs at max_tokens = max_model_len - 500 = 65036 so reasoning runs to natural stop:

AIME-2024 thinking-mode sweep (c=4, n=30)	TP=2 (max_num_seqs=4)	TP=4 (max_num_seqs=16)
chat (no think)	18/30 · MTP 95.78% · 53m	19/30 · MTP 93.06% · 5m
thinking-high	27/30 · MTP 91.97% · 152m	29/30 · MTP 91.01% · 13m
thinking-max	24/30 · MTP 92.52% · 177m	27/30 · MTP 91.68% · 26m

AIME-2024 single-shot reference (c=1, thinking-high, n=30)	TP=2	TP=4
c=1 high	27/30 · MTP 91.68% · 48m	28/30 · MTP 90.76% · 41m

GSM8K (n=50, 8-shot)	TP=2	TP=4
flexible-extract	45/50 (90.0%)	43/50 (86.0%)
strict-match	42/50 (84.0%)	40/50 (80.0%)

Throughput random 256/256 (single replica, MTP on)	TP=2 tok/s @ TPOT p50	TP=4 tok/s @ TPOT p50
bs=1	95.2 @ 8.05 ms	108.1 @ 7.32 ms
bs=4	40.6 @ 83.18 ms	104.3 @ 11.31 ms
bs=8	45.7 @ 79.02 ms	433.2 @ 16.44 ms (sweet spot)
bs=16	34.9 (capped by max_num_seqs=4)	164.3 (scheduler thrash)

Throughput random 1024/1024	TP=2	TP=4
bs=1	30.7 tok/s	138.1 tok/s
bs=4	45.1 tok/s	363.7 tok/s

Headlines from this run:

Zero CUDA illegal-memory-access errors in 240 AIME problems (c=4 chat/high/max + c=1 high, on both TP=2 and TP=4): the Marlin MoE concurrent-decode race is fixed by the c_tmp clamp removal in PR vllm#43730 (which is baked into the v3 image via jasl/vllm@27fd665b).
TP=4 is 7-12× faster than TP=2 at AIME (chat 53m→5m, high 152m→13m, max 177m→26m). MoE expert sharding across 4 GPUs decisively wins.
Thinking-max regresses correctness and doubles wall-clock time (TP=4: high 29/30 in 13m vs max 27/30 in 26m). The artifact's sweet spot is reasoning_effort=high.
MTP acceptance holds at roughly 91-96% across all modes and TP configs (lowest: 90.76% on the c=1 TP=4 run), so the BF16-retained draft head is doing its job everywhere.

Raw JSON + per-bench logs in the reproduction repo.
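
The TP=2 versus TP=4 wall-clock comparison can be recomputed directly from the sweep table. This is a small arithmetic sketch using the minute counts printed above, nothing more.

```python
# (TP=2 minutes, TP=4 minutes) for the c=4 AIME-2024 sweep, copied from the table above.
WALL_MINUTES = {
    "chat": (53, 5),
    "thinking-high": (152, 13),
    "thinking-max": (177, 26),
}


def tp4_speedup(tp2_minutes, tp4_minutes):
    """How many times faster the TP=4 run finished than the TP=2 run."""
    return tp2_minutes / tp4_minutes


SPEEDUPS = {mode: tp4_speedup(*minutes) for mode, minutes in WALL_MINUTES.items()}

for mode, ratio in SPEEDUPS.items():
    print(f"{mode:14s} TP=4 is {ratio:.1f}x faster than TP=2")
```

Note that the two columns also differ in `max_num_seqs` (4 versus 16), so the ratio reflects the whole serving configuration rather than tensor parallelism alone.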

Tuning attempts that DID NOT win on TP=4 Server (documenting so you don't repeat them)

We A/B-tested adopting jasl's TP=2 Workstation env tunings at TP=4 Server — none of them transferred. Stick with the v3 image defaults:

Change from defaults	Result	Why
`num_speculative_tokens=2` (jasl's `deepseek_mtp` k=2 default)	−86% bs=8 (433 → 60 tok/s)	k=2 doubles main-model forward cost; at TP=4 the all-reduce overhead exceeds the ~1.5 tokens-per-draft acceptance gain that's net-positive at TP=2
`--enable-expert-parallel` (jasl recommends)	similarly bad combined with k=2	TP=4 all-to-all expert-gather is expensive
`VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=512` + `..._TOPK_CHUNK_SIZE=512` (jasl's chunk-size tunings)	CUDA illegal memory access at cudagraph capture	Tuned for SM 12.0a c128a Workstation single-request prefill; exceed safe limits at TP=4 Server
`--no-enable-flashinfer-autotune` (jasl recommends)	−74% bs=8 (433 → 111 tok/s)	Triton block-FP8 autotune is load-bearing at TP=4 — disabling locks in default tile sizes that don't match the 4-GPU shape
`--gpu-memory-utilization 0.985` (jasl recommends)	crash potential combined with sparse-MLA env	0.95 is the safe value the v3 image ships with at TP=4

The image's defaults are the optimal config for TP=4 RTX PRO 6000 Server Edition as of 2026-05-29. If you're deploying on TP=2 Workstation Edition, jasl's reference config (sm120_tp2_serve.env.example) is the right starting point — it was tuned on that exact hardware.

Quick start

RTX PRO 6000 Blackwell — Docker (recommended)

The pre-built canada-quant/dsv4-w4a16-rtxpro6000:v1 image bakes the full 13-layer recipe (jasl/vllm@27fd665b + canada-quant BF16 MTP cherry-pick + Marlin MoE c_tmp/workspace patches + cute.arch.fmin shim). ~3-5 min from docker load to a working endpoint on a g7e.24xlarge.

# 1. Pull the image tarball (~14 GB compressed)
hf download canada-quant/dsv4-flash-w4a16-rtxpro6000-image \
    --include "*.tar.gz" --local-dir .
docker load < dsv4-w4a16-rtxpro6000-v1.tar.gz

# 2. Cache the W4A16 model onto NVMe (~159 GB, ~1-2 min via xet on Brev)
HF_HOME=/opt/dlami/nvme/hf-cache hf download \
    canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

# 3. Pull the serve helper
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp

# 4. Serve TP=2 (or TP=4 with --gpus all -e TP=4 -e MAX_NUM_SEQS=16)
docker run -d --gpus '"device=0,1"' --name dsv4-w4a16-serve \
    --shm-size=16g --ipc=host -p 8000:8000 \
    -v /opt/dlami/nvme/hf-cache:/root/.cache/huggingface \
    -v "$(pwd)/scripts":/workspace/scripts:ro \
    -e TP=2 -e MAX_NUM_SEQS=4 -e MAX_MODEL_LEN=65536 -e GPU_MEM_UTIL=0.95 \
    canada-quant/dsv4-w4a16-rtxpro6000:v1 \
    bash /workspace/scripts/serve_rtx6000pro_w4a16.sh

# 5. Wait for /v1/models (~3-5 min model load + cudagraph capture)
until curl -sf http://127.0.0.1:8000/v1/models >/dev/null; do sleep 5; done

# 6. Run the full bench matrix (AIME chat/high/max + GSM8K + throughput)
docker exec dsv4-w4a16-serve bash -c \
    "TAG=tp2_64k MAX_MODEL_LEN=65536 bash /workspace/scripts/bench_matrix.sh"

RTX PRO 6000 Blackwell — from-source install (advanced)

# 1. Bootstrap vLLM (~25 min for source build)
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
bash scripts/bootstrap_rtx6000pro.sh

# 2. Extra pins
source ~/venv-serve/bin/activate
pip install --quiet "flashinfer-python==0.6.8.post1" "flashinfer-cubin==0.6.8.post1" \
    "numba==0.65.0" "tilelang==0.1.9" "apache-tvm-ffi==0.1.9" "fastsafetensors>=0.2.2"

# 3. Apply patches
python scripts/patch_v4_forcausal_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_mtp_packed_mapping.py        "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_nvidia_attn_scale.py         "$(python -c 'import vllm; print(vllm.__path__[0])')"
bash   scripts/patch_wo_a_bf16_path.sh             "$(python -c 'import vllm; print(vllm.__path__[0])')"

# 4. Download artifact (159 GB); already dequant'd in-artifact as of 2026-05-24,
#    no local preprocessing step required.
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --local-dir /scratch/weights/w4a16-fp8-mtp-gptq

# 5. Serve TP=2 (or TP=4 with 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1 bash scripts/serve_rtx6000pro.sh \
    /scratch/weights/w4a16-fp8-mtp-gptq 8000 2

Required runtime env vars on SM 12.x (already set inside serve_rtx6000pro.sh but worth knowing):

export VLLM_TRITON_MLA_SPARSE=1
export VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4
export VLLM_USE_FLASHINFER_SAMPLER=0

Without VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel can crash during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access in _dequantize_and_gather_k_kernel. The FlashInfer sampler is also broken on TORCH_CUDA_ARCH_LIST=12.0a — fall back to PyTorch-native via VLLM_USE_FLASHINFER_SAMPLER=0.
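
If you launch vLLM from your own wrapper instead of the provided serve script, a pre-flight check along these lines can catch a missing variable before the warmup crash. This is a hypothetical helper, not part of the repository; the variable names and values are the three listed above.

```python
import os

# Values taken from the export lines above.
REQUIRED_SM12X_ENV = {
    "VLLM_TRITON_MLA_SPARSE": "1",
    "VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE": "4",
    "VLLM_USE_FLASHINFER_SAMPLER": "0",
}


def misconfigured_env(env=None):
    """Return {name: (actual, expected)} for every variable that is unset or has another value."""
    env = os.environ if env is None else env
    return {
        name: (env.get(name), expected)
        for name, expected in REQUIRED_SM12X_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    problems = misconfigured_env()
    for name, (actual, expected) in problems.items():
        print(f"{name}: found {actual!r}, expected {expected!r}")
    if not problems:
        print("SM 12.x runtime environment looks correct")
```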

H200

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8 --block-size 256 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.80 \
    --no-enable-prefix-caching \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --trust-remote-code

Quantization recipe

Property	Value
Dataset	`HuggingFaceH4/ultrachat_200k` (V4 chat template)
Samples	768
Max sequence length	512
Per-rank batch size	4
Calibration hardware	8× NVIDIA H200 (`p5en.48xlarge`)
Walltime	~15.4h (15.09h oneshot + ~16 min save)
Per-subgraph cadence	~20 min/subgraph × 44 subgraphs (43 MoE + 1 MTP no-op)

Calibration recipe identical to the W4A16-FP8 predecessor with one change: the modeling class is patched to remove mtp.* from _keys_to_ignore_on_load_unexpected before from_pretrained, so the MTP block survives the load and is written back to the artifact at BF16.
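
The class-level patch described above can be sketched as follows. To keep the example runnable without transformers installed it operates on a stand-in class; in a real calibration script the same function would be applied to the actual modeling class before calling `from_pretrained`. The helper name and probe key are hypothetical, and the repository's real patch may do more than this.

```python
import re


class StandInDeepseekV4PreTrainedModel:
    """Stand-in for the real modeling class so the sketch runs anywhere."""

    _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*", r"rotary_emb\.inv_freq"]


def keep_mtp_on_load(model_cls, probe_key="mtp.0.head.weight"):
    """Drop every ignore pattern that would match an MTP key; leave the others alone."""
    patterns = list(getattr(model_cls, "_keys_to_ignore_on_load_unexpected", None) or [])
    kept = [p for p in patterns if re.search(p, probe_key) is None]
    model_cls._keys_to_ignore_on_load_unexpected = kept
    return kept


remaining = keep_mtp_on_load(StandInDeepseekV4PreTrainedModel)
print("remaining ignore patterns:", remaining)
```

Filtering by a probe key rather than by exact string comparison keeps the patch working if the upstream pattern is reworded, at the cost of also removing any broader pattern that happens to cover `mtp.` keys.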

vLLM build

Common patches (all platforms)

PR	Purpose	Status
`vllm-project/vllm#43248`	`bool()` wrap on `is_static_input_scheme`	open
`vllm-project/vllm#43288`	`.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up	open
`vllm-project/vllm#43290`	`weight_scale_inv`-or-`weight_scale` fallback (attention)	open
`vllm-project/vllm#43319`	MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path	open

RTX PRO 6000 Blackwell (SM 12.0) only

Patch	Purpose
`packed_modules_mapping` on `DeepseekV4ForCausalLM` + `DeepSeekV4MTP`	Required as of `ds4-sm120-experimental@abad5dc71`
BF16 `wo_a` path for MTP block	Static `weight.dtype == bfloat16` check (dynamo-safe)
`--disable-custom-all-reduce`	No NVLink between RTX PRO 6000 boards
CMakeLists `USE_SABI 3.11` removal	For Python 3.10

(Previously this list also required a compressor/indexer FP8 → BF16 dequant preprocess step run against the local artifact. As of 2026-05-24 the dequant is baked into the published artifact — see Changes.)

H200 deployments need only the four common patches.

Honest limitations

k=1 cap on spec-decode — current vLLM build limits num_speculative_tokens to 1 due to DeepGemm kernel assertion next_n == 1 or next_n == 2 in smxx_fp8_fp4_paged_mqa_logits.hpp:233. vLLM passes next_n = num_speculative_tokens + 1, so practical k is 1. The FLASHINFER_MLA_SPARSE attention backend hits the same kernel-side assertion. With the assertion relaxed, expect bs=1 speedup to rise from 1.49× to ~1.85× (matching sibling NVFP4 artifact's k=2 published number).
Concurrent thinking-mode workloads on RTX PRO 6000 produce token-corrupted output: under concurrency ≥ 2 with thinking=high (long-decode workloads like AIME), the Marlin W4A16 MoE decode kernel on SM 12.0 produces token-stream corruption (CJK / Cyrillic / garbled ASCII spliced into the model's reasoning trace). The same hardware + same vLLM build serving the NVFP4 sibling (canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP via flashinfer_trtllm MoE) is essentially clean on the same workload (1/30 vs 14/30 corrupted at c=4 thinking), so the bug is specific to the W4A16 + Marlin MoE decode path on SM 12.0. Investigation isolated it through 7 controlled tests (sparse-MLA topk-chunk size, MTP-off, matmul_decode-off, eager-mode, concurrency sweep, NVFP4 vs W4A16 path comparison). Workaround on RTX PRO 6000: for batched thinking-mode workloads, serve the NVFP4 sibling artifact instead. For sequential (c=1) thinking-mode or any batched chat-mode (no thinking), this W4A16-MTP artifact works cleanly (GSM8K-20 chat-mode sequential = 20/20 = 100%, MTP draft acceptance 92.46%). Full debug log + reproducible benches: docs/findings/sm12x_token_corruption_2026_05_24.md. Filed upstream as jasl/vllm#12. This limitation describes builds without the Marlin MoE c_tmp clamp removal (PR vllm#43730): the 2026-05-29 cross-validation run on the patched image reports zero illegal-memory-access errors across 240 AIME problems at c=4 and c=1.
toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion (TC-07 stopped mid-chain to ask a clarifying question) and multi-tool extraction (TC-06 returned both translations as content text instead of routing two translate calls). Quality-wise the model completes the underlying intent; the harness scores tool-call-protocol fidelity, not task completion. Not a parser issue (confirmed by replay through --tool-call-parser deepseek_v4).
GSM8K -1.36 pts vs predecessor's 8-shot strict-match (93.71% vs 95.07%): roughly two standard errors at ±0.67, so small but not strictly within noise. Likely calibration-set sensitivity rather than recipe drift (recipe is identical, hardware differs).
NVFP4 native kernels on RTX PRO 6000 not auto-selected — even though csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu exists in upstream vLLM, the backend selector doesn't pick it (vllm-project/vllm#31085). Until that lands, the sibling NVFP4 artifact on this hardware would route through Marlin too. This artifact's W4A16 path is the tested choice for RTX PRO 6000.
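
The k=1 cap in the first item above follows from two facts stated there: the kernel accepts only `next_n` of 1 or 2, and vLLM passes `next_n = num_speculative_tokens + 1`. A minimal sketch of that constraint (the function names are illustrative, not vLLM or DeepGemm APIs):

```python
# Values the DeepGemm assertion accepts, as quoted in the k=1 item above.
ALLOWED_NEXT_N = (1, 2)


def next_n_for(num_speculative_tokens):
    """vLLM passes the draft length plus the one token the verifier always produces."""
    return num_speculative_tokens + 1


def is_supported(num_speculative_tokens):
    """True if this draft length passes the kernel-side assertion."""
    return next_n_for(num_speculative_tokens) in ALLOWED_NEXT_N


supported = [k for k in range(0, 4) if is_supported(k)]
print("supported num_speculative_tokens:", supported)
```

k=0 corresponds to plain decoding, so k=1 is the only setting on this build that actually speculates.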

Reproduction

Full pipeline at canada-quant/dsv4-flash-w4a16-fp8-mtp. From a fresh 8× H200 box:

# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh

# Phase 1 — download upstream + dequant to BF16-MTP source (~30 min, ~660 GB)
bash scripts/phase1_dequant.sh

# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh

# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh

# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq

# Phase 5 — serve (see Quick start above for serve command)
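
Independently of the repository's verify script, a quick sanity check is to confirm that MTP tensors are present in the sharded index. The sketch below assumes the standard `weight_map` layout of `model.safetensors.index.json` and uses a toy index so it runs without the artifact; the shard file names and the `layers.*` key are made up for illustration.

```python
import json


def summarize_mtp(index):
    """Count MTP tensors and list the shards that hold them, given a parsed index file."""
    weight_map = index["weight_map"]
    mtp_keys = sorted(key for key in weight_map if key.startswith("mtp."))
    shards = sorted({weight_map[key] for key in mtp_keys})
    return {"mtp_tensor_count": len(mtp_keys), "mtp_shards": shards}


# Toy stand-in for json.load(open("model.safetensors.index.json")).
toy_index = json.loads("""
{
  "metadata": {},
  "weight_map": {
    "embed.weight": "model-00001-of-00004.safetensors",
    "layers.0.attn.wq_a.weight": "model-00001-of-00004.safetensors",
    "mtp.0.emb.tok_emb.weight": "model-00004-of-00004.safetensors",
    "mtp.0.head.weight": "model-00004-of-00004.safetensors",
    "head.weight": "model-00004-of-00004.safetensors"
  }
}
""")

summary = summarize_mtp(toy_index)
print(summary)
if summary["mtp_tensor_count"] == 0:
    print("no mtp.* tensors found: the MTP block was dropped at load or save time")
```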

Upstream contributions filed during this work

Contribution	Description	Status
transformers: `save_pretrained` silent FP32 → BF16 downcast	417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing, gate bias, attn_sink, indexer/compressor `ape`) are silently written as BF16 by `save_pretrained` when model `torch_dtype` is BF16. Workaround: postprocess restore from BF16 source via `scripts/fixup_artifact.py`. Upstream filing pending	local
vLLM: MTP loader silently skips top-level `head.weight` + `embed.weight`	`DeepSeekV4MTP.load_weights` calls `name.replace("mtp.0.", "")` which no-ops on non-`mtp.0.` keys; `get_spec_layer_idx` returns None → loop skips. `head.weight` and `embed.weight` never reach `shared_head.head` / `embed_tokens` → uninitialized → 0% MTP acceptance with no load-time error. Workaround: postprocess injects `mtp.0.head.weight` and `mtp.0.emb.tok_emb.weight` as duplicates. Upstream filing pending	local
vLLM: DeepGemm `paged_mqa_logits` asserts on `num_speculative_tokens > 1`	`smxx_fp8_fp4_paged_mqa_logits.hpp:233` enforces `next_n == 1 or next_n == 2`. With `next_n = k+1`, practical k cap is 1. Caps spec-decode speedup at 1.49× vs sibling's published 2.03× at k=2	upstream (DeepGemm), filing pending
`vllm-project/vllm#43248`	`bool()` wrap on `is_static_input_scheme`	open
`vllm-project/vllm#43288`	`scale_fmt` defensive `.get()` + BF16 `getattr` wrap	open
`vllm-project/vllm#43290`	`weight_scale_inv`-or-`weight_scale` fallback	open
`vllm-project/vllm#43319`	MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path	open
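
The alias-injection workaround for the MTP loader issue in the table above amounts to duplicating two top-level tensors under `mtp.0.` names. A minimal sketch on a plain mapping is shown below; placeholder strings stand in for tensors, the helper name is hypothetical, and this is not the repository's postprocess script.

```python
# Source key -> alias key, using the names given in the contributions table.
MTP_ALIASES = {
    "head.weight": "mtp.0.head.weight",
    "embed.weight": "mtp.0.emb.tok_emb.weight",
}


def inject_mtp_aliases(tensors):
    """Return a copy of the mapping with MTP aliases added where they are missing."""
    patched = dict(tensors)
    for source, alias in MTP_ALIASES.items():
        if source in patched and alias not in patched:
            patched[alias] = patched[source]
    return patched


# Placeholder values; a real state dict would hold tensors.
state = {"head.weight": "HEAD", "embed.weight": "EMBED", "mtp.0.norm.weight": "NORM"}
patched_state = inject_mtp_aliases(state)
print(sorted(patched_state))
```

Existing aliases are left untouched, so running the step twice is harmless.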

Changes

Date	Change
2026-05-22	Initial release on H200 (`jasl/vllm@ds4-sm120-experimental@abad5dc71`). GSM8K 93.71% strict, MMLU 86.88%, HumanEval 84.76%, MTP acceptance 89% on calibrated workload / 70% on random prompts
2026-05-24 (morning)	RTX PRO 6000 Blackwell (SM 12.0) added. TP=2 and TP=4 both validated, chat-smoke 4/4 PASS, MTP acceptance 68-72%, MTP-on per-replica throughput 98.83 tok/s @ TP=2 / 107.32 @ TP=4. Per-replica throughput beats H200 at every batch size. `vllm-project/vllm#41511` (Marlin TP > 2 bug) did not fire on this build
2026-05-24 (afternoon)	Shipping-bug fix. Artifact previously shipped FP8_BLOCK compressor/indexer with `.weight_scale` keys; current upstream/preview-dev vLLM constructs those modules as plain BF16 (`quant_config=None`), so the artifact failed to load with `KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'`. 166 compressor/indexer weights dequantized in-place (FP8 + BF16 scale → BF16, mathematically lossless) and re-uploaded. Artifact now loads cleanly on modern vLLM on both RTX PRO 6000 and H200 without local preprocessing. H200 historical numbers above remain valid: the original H200 build supported FP8 compressor; modern vLLM serving the new BF16-compressor format produces equivalent outputs (verified post-fix on RTX PRO 6000 TP=4: GSM8K-50 chat-mode 44/50 = 88.0%, matches sibling Card B's 88% strict TP=2 / 90% strict TP=4 on this hardware).
2026-05-24 (afternoon)	AIME 2024 methodology correction. Prior 30.0% `exact_match` was an lm-eval-harness `aime24` task artifact (completions-mode prompt, no chat template, exact-string scorer on a thinking model whose answers are `\boxed{N}`). 1-shot smoke under chat-templated thinking=high returns the correct integer. Full re-bench blocked by the RTX PRO 6000 concurrent-thinking CUDA crash (see Honest limitations); deferred to H200. Prior 30.0% struck through in Quality table.
2026-05-24 (evening)	Root-cause investigation of the RTX PRO 6000 concurrent-thinking issue. Updated to `jasl/vllm@a937d4b28` (Stabilize SM12x sparse MLA long prefill) — server no longer crashes under concurrent thinking-mode load, but produces token-stream corruption on ~50% of long generations at c=4. Tested workarounds (`VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256`, `VLLM_TRITON_MLA_SPARSE_MATMUL_DECODE=0`, MTP-off, eager-mode, concurrency sweep) — none reduce corruption; all increase crash rate. Diagnostic comparison: same hardware + same build + same workload on Card B (NVFP4 + flashinfer_trtllm MoE) = 1/30 corrupted vs Card D (W4A16 + Marlin MoE) = 14/30 corrupted. Bug isolated to W4A16 + Marlin MoE decode path on SM 12.0. Production-config verified on Card D: GSM8K-20 chat-mode sequential = 20/20 = 100%, MTP draft acceptance 92.46%. Full debug log: `docs/findings/sm12x_token_corruption_2026_05_24.md`. Filed upstream: `jasl/vllm#12`.
2026-05-25	Applied `vllm-project/vllm#40923` "Marlin MoE: include SM 12.x in default arch list" + clean rebuild. PR #40923's description matches our symptom verbatim ("V4-Flash MoE decode emits gibberish on RTX 50-series GB10/DGX Spark... driver JIT-promotes 8.0+PTX fallback"). After applying the patch (12.0a;12.1a for CUDA 12.9) and forcing a full Marlin MoE source regen + rebuild: AIME c=4 thinking corruption dropped from 14/30 → 0/30, but the underlying kernel race surfaced as a different failure mode: `CUDA error: an illegal memory access was encountered` in Worker_TP, with 29/30 errors and 1/30 completing correctly. PR #40923 is necessary but not sufficient: native SM 12.0a Marlin MoE cubins eliminate the JIT-PTX corruption, but a second race in the W4A16 Marlin MoE decode path under concurrent thinking-mode on SM 12.0 still crashes the worker. NVFP4 sibling `canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP` remains the recommendation for batched thinking-mode on this hardware. 1-shot smoke + sequential workloads still clean. PR #40923 status: OPEN, member-approved 2026-04-27 by Harry-Chen, blocked on core-maintainer SM120 policy review; canada-quant repro will be posted as additional evidence.

Files in the artifact

4 sharded model-*.safetensors files + model.safetensors.index.json (159 GB total)
config.json — vLLM-compatible quantization_config with MTP block excluded
tokenizer.json, tokenizer_config.json, generation_config.json, chat_template.jinja — upstream DSV4-Flash
recipe.yaml — the llm-compressor GPTQ recipe
README.md — this file

Citation

@misc{canada-quant-dsv4-flash-w4a16-fp8-mtp-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 with BF16 MTP retained for vLLM speculative decoding},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash. Review at the upstream repo before commercial deployment.

Acknowledgments

DeepSeek for the base model + MTP architecture + inference reference.
jasl (jasl/vllm and jasl/vllm-ds4-sm120-harness) for the vLLM build pins (ds4-sm120-experimental for H200; ds4-sm120-preview-dev for RTX PRO 6000 SM 12.0) and the benchmark harness.
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 (predecessor) for the proven recipe topology this artifact extends with MTP.
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP (sibling) for the alias-injection pattern and MTP acceptance methodology.
vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.
