MTP-layer weights?

#1
by CosmicRaisins - opened

Any plans on publishing an NVFP4 quant with MTP-layer weights?

StepFun org

Yes, this is planned. We will release an updated version with NVFP4 + MTP.

For my understanding, are MTP layers available in the GGUF variants?

Yeah, but NVFP4 + MTP will be faster on NVIDIA hardware than a comparably sized GGUF + MTP (and potentially more accurate as well).

StepFun org

Update: the HF checkpoint has now been updated, so stepfun-ai/Step-3.7-Flash-NVFP4 should work with vLLM MTP speculative decoding directly.

NVFP4 + MTP

The Step-3.7-Flash-NVFP4 checkpoint has been updated with MTP draft layers and now supports vLLM speculative decoding with:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
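As a rough sketch of how the flag above might fit into a full launch command, the snippet below assembles a `vllm serve` invocation and shell-quotes the JSON argument. The model name and the speculative config come from this thread; the tensor-parallel size of 4 and the helper name `build_serve_command` are assumptions for illustration, and any extra flags a particular GPU setup might need are not covered here.

```python
import json
import shlex

# Model name and speculative settings as given in this thread.
MODEL = "stepfun-ai/Step-3.7-Flash-NVFP4"
speculative_config = {"method": "mtp", "num_speculative_tokens": 3}


def build_serve_command(model, spec_config, tensor_parallel_size):
    """Hypothetical helper: return the argv list for a `vllm serve` launch."""
    return [
        "vllm", "serve", model,
        # Assumption: adjust to the number of GPUs you actually have.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # The JSON is serialized once so quoting mistakes are avoided.
        "--speculative-config", json.dumps(spec_config),
    ]


command = build_serve_command(MODEL, speculative_config, tensor_parallel_size=4)

# shlex.join adds the single quotes the shell needs around the JSON argument.
print(shlex.join(command))
```

Building the argument from a dict rather than typing the JSON by hand keeps the quoting consistent when the number of speculative tokens is changed for experiments.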

On GPQA Diamond (avg@16), NVFP4 + MTP matches the quality of the same NVFP4 checkpoint without MTP within statistical noise: 77.81% vs. 78.41% item accuracy over 3168 records (198 questions x 16 samples).
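To make "within statistical noise" concrete, here is a minimal back-of-the-envelope check using a two-proportion z-score. It assumes each configuration was scored on 3168 records and treats those records as independent, which is a simplification: the 16 samples per question are correlated, so the true uncertainty is likely larger than this estimate. The function name `two_proportion_z` is just illustrative.

```python
import math

# Figures from the post; the per-configuration record count is an assumption.
N_RECORDS = 3168
acc_mtp = 0.7781      # NVFP4 + MTP
acc_no_mtp = 0.7841   # NVFP4 without MTP


def two_proportion_z(p1, p2, n1, n2):
    """Return (z, standard_error) for the difference p1 - p2.

    Uses the unpooled standard error and assumes independent records.
    """
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se, se


z, se = two_proportion_z(acc_mtp, acc_no_mtp, N_RECORDS, N_RECORDS)
print(f"difference = {acc_mtp - acc_no_mtp:+.4f}, standard error = {se:.4f}, z = {z:.2f}")
```

Under these assumptions the 0.6-point gap is well under one standard error, which is consistent with the claim that the two runs are statistically indistinguishable.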

On a GB200 TP=4 vLLM setup with GPQA-style long-reasoning streaming prompts (~250 token prompt, ~1.6K token completion), NVFP4 + MTP improves aggregate decode throughput:

| Concurrency | NVFP4 + MTP | NVFP4 no-MTP | Speedup |
|---|---|---|---|
| 8 | 1309 tok/s | 1155 tok/s | 1.13x |
| 32 | 4391 tok/s | 3480 tok/s | 1.26x |
| 64 | 8229 tok/s | 5667 tok/s | 1.45x |
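The speedup column is the ratio of the two throughput figures at each concurrency level. The short sketch below recomputes it from the reported numbers; the variable names are illustrative and the data is copied from the table above.

```python
# (concurrency, tok/s with MTP, tok/s without MTP), as reported in the post.
results = [
    (8, 1309, 1155),
    (32, 4391, 3480),
    (64, 8229, 5667),
]

# Speedup = throughput with MTP divided by throughput without MTP.
speedups = {
    concurrency: round(with_mtp / without_mtp, 2)
    for concurrency, with_mtp, without_mtp in results
}

for concurrency, speedup in speedups.items():
    print(f"concurrency {concurrency}: {speedup:.2f}x")
```

In these measurements the relative gain from MTP grows with concurrency.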

This makes the NVFP4 checkpoint a practical option for high-throughput long-reasoning workloads while keeping the original NVFP4 model weights unchanged.

Can you share a working vLLM config, please? I tried everything here from your model card. The latest nightly with b12x isn't starting, and your own Docker image complains and falls back on a Step3VLProcessor error.
I'm on 2x RTX PRO 6000 cards 😀

Same ask here; we want to benchmark on 2x RTX PRO 6000.

StepFun org

I will take a look at running on the RTX PRO 6000.
