Support this work → · X · GitHub · REAP paper · Cerebras REAP

DeepSeek-V4-Flash-162B-GGUF

GGUF quantization of 0xSero/DeepSeek-V4-Flash-162B.

At a glance

| Field | Value |
| --- | --- |
| Base model | 0xSero/DeepSeek-V4-Flash-162B |
| Format | GGUF |
| Total params | 162B |
| Active / token | |
| Experts / layer | |
| Layers | |
| Hidden size | |
| Context | |
| On-disk size | 149 GB |

Which variant should I pick?

| Variant | Format | Link |
| --- | --- | --- |
| DeepSeek-V4-Flash-162B | BF16 | link |
| DeepSeek-V4-Flash-162B-GGUF (this) | GGUF | link |
| DeepSeek-V4-Flash-180B | BF16 | link |
| DeepSeek-V4-Flash-180B-GGUF | GGUF | link |
| DeepSeek-V4-Flash-213B | BF16 | link |

This repository contains DS4/DwarfStar GGUF conversions of DeepSeek-V4-Flash-Spark-Mini.

The GGUFs point back to the original Spark Hugging Face model:

Files

| File | Size | SHA256 |
| --- | --- | --- |
| DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf | 48.98 GiB | e917278028d7a9e25dfc9d04bf5848375dad7573c5aeab1720d6a83714352406 |
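The SHA256 listed above can be checked after download. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `verify` are illustrative and not part of this repo, and the local file path is whatever you saved the GGUF as.

```python
import hashlib

# Digest copied from the Files table in this model card.
EXPECTED_SHA256 = "e917278028d7a9e25dfc9d04bf5848375dad7573c5aeab1720d6a83714352406"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a ~49 GiB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected.lower()


# Example call (the path is an assumption about where the file was saved):
# verify("DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf")
```

Hashing a file of this size takes a while on most disks, so it is usually worth running once right after the download completes rather than on every load.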

Quantization

  • Q2-REAP-ds4: compact DS4 profile using IQ2_XXS routed gate/up experts, Q2_K routed down experts, and Q8_0 shared/output/attention projections.

These are DS4/DwarfStar-specific GGUF files for DeepSeek-V4 Flash REAP checkpoints. They are not generic llama.cpp files unless your runtime supports the same DeepSeek-V4 Flash tensor layout and DS4 metadata.
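Before pointing a runtime at the file, it can help to confirm that it is at least a well-formed GGUF container. The sketch below reads only the fixed-size GGUF header (magic, version, tensor count, metadata key/value count) and assumes GGUF version 2 or later, where both counts are 64-bit little-endian integers. It does not inspect the DS4 metadata or tensor layout, so a successful read says nothing about whether a given runtime supports these files. The function name `read_gguf_header` is illustrative.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Read the fixed GGUF header fields (assumes GGUF v2+ layout)."""
    with open(path, "rb") as handle:
        raw = handle.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    # "<" selects little-endian with no padding: uint32, uint64, uint64.
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", raw[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Example call (the path is an assumption about where the file was saved):
# print(read_gguf_header("DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf"))
```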

Validation

Validation summaries are uploaded in this repo under:

  • validation/20260528T160633Z/SUMMARY.md
  • validation/20260528T160633Z/summary.json

The Mini Q2 GGUF completed the DS4 context sweep through a 200,000-token context on one DGX Spark:

| Context (tokens) | Prefill tok/s | Decode tok/s | KV bytes |
| --- | --- | --- | --- |
| 2,048 | 348.19 | 12.75 | 52,184,460 |
| 4,096 | 358.51 | 13.50 | 80,373,132 |
| 8,192 | 352.29 | 13.32 | 136,750,476 |
| 16,384 | 348.25 | 13.24 | 249,505,164 |
| 32,768 | 322.07 | 12.40 | 475,014,540 |
| 65,536 | 287.26 | 11.49 | 926,033,292 |
| 131,072 | 241.57 | 9.81 | 1,828,070,796 |
| 200,000 | 194.24 | 9.17 | 2,776,775,308 |
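The KV bytes column grows almost linearly with context, which makes it easy to estimate the cache cost of an intermediate window. The sketch below takes the (context, KV bytes) pairs from the sweep table and computes the incremental bytes per token between consecutive rows. On these figures the increment works out to about 13,764 bytes per token, with the last interval marginally lower. This is arithmetic on the published numbers only, not a statement about how the DS4 runtime allocates its cache.

```python
# (context tokens, KV bytes) pairs copied from the context sweep table.
SWEEP = [
    (2_048, 52_184_460),
    (4_096, 80_373_132),
    (8_192, 136_750_476),
    (16_384, 249_505_164),
    (32_768, 475_014_540),
    (65_536, 926_033_292),
    (131_072, 1_828_070_796),
    (200_000, 2_776_775_308),
]


def kv_bytes_per_token(rows):
    """Incremental KV bytes per added context token between consecutive rows."""
    slopes = []
    for (ctx_a, kv_a), (ctx_b, kv_b) in zip(rows, rows[1:]):
        slopes.append((kv_b - kv_a) / (ctx_b - ctx_a))
    return slopes


def fixed_overhead(rows):
    """Bytes left over at the first row after removing the per-token cost."""
    per_token = kv_bytes_per_token(rows)[0]
    ctx, kv = rows[0]
    return kv - per_token * ctx


for (ctx, _), slope in zip(SWEEP[1:], kv_bytes_per_token(SWEEP)):
    print(f"up to {ctx:>7,} tokens: {slope:,.1f} bytes/token")
```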

API probes completed through at least the 131,072-token window before spark-2822 became unreachable during the tail of the 200,000-token validation step:

| Context (tokens) | Prompt tokens | TTFT (s) | Prefill tok/s | Decode tok/s | Marker visible |
| --- | --- | --- | --- | --- | --- |
| 65,536 | 59,867 | 176.54 | 339.12 | 13.01 | true |
| 131,072 | 119,696 | 390.59 | 306.45 | 11.70 | true |
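The prefill rates in the API probe table are consistent with dividing prompt tokens by time to first token (TTFT). That relationship is inferred from the numbers, not documented in this repo, so the sketch below should be read as a sanity check under that assumption. The helper name `prefill_rate` is illustrative.

```python
# (context window, prompt tokens, TTFT seconds, reported prefill tok/s)
# copied from the API probe table.
PROBES = [
    (65_536, 59_867, 176.54, 339.12),
    (131_072, 119_696, 390.59, 306.45),
]


def prefill_rate(prompt_tokens, ttft_seconds):
    """Prefill throughput, assuming TTFT is dominated by prompt processing."""
    return prompt_tokens / ttft_seconds


for window, tokens, ttft, reported in PROBES:
    derived = prefill_rate(tokens, ttft)
    print(f"{window:>7,} window: derived {derived:.2f} tok/s, reported {reported:.2f} tok/s")
```

The derived values agree with the reported ones to within rounding, which suggests the TTFT column can be used directly to budget wait time for a long prompt.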

This repo publishes the validated Q2 long-context profile only.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
