
Libraries

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="0xSero/DeepSeek-V4-Flash-162B-GGUF",
	filename="DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF
# Run inference directly in the terminal:
llama-cli -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF
# Run inference directly in the terminal:
llama-cli -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF
# Run inference directly in the terminal:
./llama-cli -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF

Use Docker

docker model run hf.co/0xSero/DeepSeek-V4-Flash-162B-GGUF


vLLM

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/DeepSeek-V4-Flash-162B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/DeepSeek-V4-Flash-162B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
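
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server from the curl example is listening on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field. The helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumed defaults copied from the curl example above; adjust to your setup.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "0xSero/DeepSeek-V4-Flash-162B-GGUF"


def build_chat_request(prompt, model=MODEL_ID, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` should return the reply as a string. For the llama.cpp server described elsewhere on this page, pass `base_url="http://localhost:8080/v1"` instead.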

Use Docker

docker model run hf.co/0xSero/DeepSeek-V4-Flash-162B-GGUF

Ollama
How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with Ollama:
```
ollama run hf.co/0xSero/DeepSeek-V4-Flash-162B-GGUF
```

Unsloth Studio

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for 0xSero/DeepSeek-V4-Flash-162B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for 0xSero/DeepSeek-V4-Flash-162B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for 0xSero/DeepSeek-V4-Flash-162B-GGUF to start chatting

Pi

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "0xSero/DeepSeek-V4-Flash-162B-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf 0xSero/DeepSeek-V4-Flash-162B-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default 0xSero/DeepSeek-V4-Flash-162B-GGUF

Run Hermes

hermes

Docker Model Runner
How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with Docker Model Runner:
```
docker model run hf.co/0xSero/DeepSeek-V4-Flash-162B-GGUF
```

Lemonade

How to use 0xSero/DeepSeek-V4-Flash-162B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull 0xSero/DeepSeek-V4-Flash-162B-GGUF

Run and chat with the model

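# {{QUANT_TAG}} is an unfilled template placeholder; substitute the quantization tag of the model you pulled
lemonade run user.DeepSeek-V4-Flash-162B-GGUF-{{QUANT_TAG}}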

List all available models

lemonade list


DeepSeek-V4-Flash-162B-GGUF

GGUF quantization of 0xSero/DeepSeek-V4-Flash-162B.

At a glance


| Property | Value |
| --- | --- |
| Base model | 0xSero/DeepSeek-V4-Flash-162B |
| Format | GGUF |
| Total params | 162B |
| Active / token | — |
| Experts / layer | — |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 149 GB |

Which variant should I pick?

| Variant | Format | Link |
| --- | --- | --- |
| `DeepSeek-V4-Flash-162B` | BF16 | link |
| `DeepSeek-V4-Flash-162B-GGUF` (this) | GGUF | link |
| `DeepSeek-V4-Flash-180B` | BF16 | link |
| `DeepSeek-V4-Flash-180B-GGUF` | GGUF | link |
| `DeepSeek-V4-Flash-213B` | BF16 | link |

This repository contains DS4/DwarfStar GGUF conversions of DeepSeek-V4-Flash-Spark-Mini.

The GGUFs point back to the original Spark Hugging Face model:

Original Spark model: https://huggingface.co/0xSero/DeepSeek-V4-Flash-162B
Conversion source checkpoint: https://huggingface.co/0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
Runtime/converter repo: https://github.com/antirez/ds4
Spark deployment repo: https://github.com/0xSero/deepseek-spark

Files

| File | Size | SHA256 |
| --- | --- | --- |
| `DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf` | 48.98 GiB | `e917278028d7a9e25dfc9d04bf5848375dad7573c5aeab1720d6a83714352406` |
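
A download of this size is worth checking against the published hash. The sketch below streams the file through SHA-256 in chunks so it never loads all 48.98 GiB into memory. `GGUF_PATH` is an assumed local path and the function names are illustrative.

```python
import hashlib

# Expected digest copied from the Files table above.
EXPECTED_SHA256 = "e917278028d7a9e25dfc9d04bf5848375dad7573c5aeab1720d6a83714352406"
# Assumption: the file was downloaded into the current directory.
GGUF_PATH = "DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected(GGUF_PATH)` returns `True` only if the local file is byte-identical to the published one.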

Quantization

Q2-REAP-ds4: compact DS4 profile using IQ2_XXS routed gate/up experts, Q2_K routed down experts, and Q8_0 shared/output/attention projections.

These are DS4/DwarfStar-specific GGUF files for DeepSeek-V4 Flash REAP checkpoints. They are not generic llama.cpp files unless your runtime supports the same DeepSeek-V4 Flash tensor layout and DS4 metadata.
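
As a first sanity check before loading a file in any runtime, the fixed-size GGUF header can be read directly. This sketch assumes a little-endian GGUF of version 2 or later, where the header is the 4-byte magic `GGUF`, a `uint32` version, a `uint64` tensor count, and a `uint64` metadata key/value count. It only confirms that the file is a GGUF container; it does not verify the DS4 tensor layout or metadata, and `read_gguf_header` is an illustrative name.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 + uint64 + uint64


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF v2+ file."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    # Assumption: little-endian layout, which is the common case.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```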

Validation

Validation summaries are uploaded in this repo under:

validation/20260528T160633Z/SUMMARY.md
validation/20260528T160633Z/summary.json

The Mini Q2 GGUF completed the DS4 context sweep up to a 200,000-token context on a single DGX Spark:

| Context (tokens) | Prefill tok/s | Decode tok/s | KV bytes |
| --- | --- | --- | --- |
| 2,048 | 348.19 | 12.75 | 52,184,460 |
| 4,096 | 358.51 | 13.50 | 80,373,132 |
| 8,192 | 352.29 | 13.32 | 136,750,476 |
| 16,384 | 348.25 | 13.24 | 249,505,164 |
| 32,768 | 322.07 | 12.40 | 475,014,540 |
| 65,536 | 287.26 | 11.49 | 926,033,292 |
| 131,072 | 241.57 | 9.81 | 1,828,070,796 |
| 200,000 | 194.24 | 9.17 | 2,776,775,308 |
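
To see how the KV footprint scales, the incremental cost per added context token can be derived from adjacent rows of the sweep. The sketch below uses only the context and KV-byte columns from the table above; the function name is illustrative.

```python
# (context tokens, KV bytes) rows copied from the sweep table above.
SWEEP = [
    (2_048, 52_184_460),
    (4_096, 80_373_132),
    (8_192, 136_750_476),
    (16_384, 249_505_164),
    (32_768, 475_014_540),
    (65_536, 926_033_292),
    (131_072, 1_828_070_796),
    (200_000, 2_776_775_308),
]


def kv_bytes_per_token(rows):
    """Incremental KV bytes per added context token between adjacent rows."""
    return [
        (kv_b - kv_a) / (ctx_b - ctx_a)
        for (ctx_a, kv_a), (ctx_b, kv_b) in zip(rows, rows[1:])
    ]


slopes = kv_bytes_per_token(SWEEP)
for (ctx, _), slope in zip(SWEEP[1:], slopes):
    print(f"up to {ctx:>7,} tokens: {slope:,.1f} bytes/token")
```

On these rows the increment is 13,764 bytes per token for every step up to 131,072 and about 13,763.7 for the final step, so KV memory grows almost exactly linearly with context. Extrapolating that line back to zero context suggests roughly 24 MB of fixed overhead.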

API probes completed through at least the 131,072-token window. The host spark-2822 then became unreachable during the tail of the 200,000-token validation step, so no API probe result is reported for that context:

| Context (tokens) | Prompt tokens | TTFT (s) | Prefill tok/s | Decode tok/s | Marker visible |
| --- | --- | --- | --- | --- | --- |
| 65,536 | 59,867 | 176.54 | 339.12 | 13.01 | true |
| 131,072 | 119,696 | 390.59 | 306.45 | 11.70 | true |
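
The prefill figures in the API probe table look consistent with prompt tokens divided by time to first token. That relationship is inferred from the numbers, not stated in the validation summary, so treat the check below as a sketch. It is still useful for estimating TTFT at other prompt lengths.

```python
# (context, prompt tokens, TTFT seconds, reported prefill tok/s) from the table above.
PROBES = [
    (65_536, 59_867, 176.54, 339.12),
    (131_072, 119_696, 390.59, 306.45),
]


def prefill_rate(prompt_tokens, ttft_seconds):
    """Average prefill throughput implied by a prompt length and its TTFT."""
    return prompt_tokens / ttft_seconds


for ctx, prompt_tokens, ttft, reported in PROBES:
    derived = prefill_rate(prompt_tokens, ttft)
    print(f"{ctx:>7,}: derived {derived:.2f} tok/s, reported {reported:.2f} tok/s")
```

Both derived values land within 0.01 tok/s of the reported ones, which is the size of difference that rounding in the published TTFT would produce.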

This repo publishes the validated Q2 long-context profile only.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}