Instructions for using 0xSero/Step-3.7-Flash-173B with libraries and local apps.

Libraries

How to use 0xSero/Step-3.7-Flash-173B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="0xSero/Step-3.7-Flash-173B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("0xSero/Step-3.7-Flash-173B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use 0xSero/Step-3.7-Flash-173B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/Step-3.7-Flash-173B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Step-3.7-Flash-173B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
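
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. `build_chat_payload` and `chat` are illustrative helper names (they are not part of vLLM), and the base URL assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url=None):
    """Build an OpenAI-style chat completion body (hypothetical helper)."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Same image_url message part as in the curl example above.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(base_url, payload, timeout=600):
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Build the request body; sending it requires a running server.
payload = build_chat_payload(
    "0xSero/Step-3.7-Flash-173B",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

With a server running, `chat("http://localhost:8000", payload)` should return the completion as a dictionary. The generated text is normally found under `["choices"][0]["message"]["content"]`. For the SGLang server, the same helper should work with port 30000.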

Use Docker

docker model run hf.co/0xSero/Step-3.7-Flash-173B

SGLang

How to use 0xSero/Step-3.7-Flash-173B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/Step-3.7-Flash-173B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Step-3.7-Flash-173B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/Step-3.7-Flash-173B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Step-3.7-Flash-173B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use 0xSero/Step-3.7-Flash-173B with Docker Model Runner:
```
docker model run hf.co/0xSero/Step-3.7-Flash-173B
```


Step-3.7-Flash-173B

REAP-pruned stepfun-ai/Step-3.7-Flash-NVFP4.

At a glance


| Property | Value |
|---|---|
| Base model | stepfun-ai/Step-3.7-Flash-NVFP4 |
| Format | NVFP4 expert weights with FP8 KV cache |
| Effective size | ~173B |
| Parameters removed | 25.10B |
| Experts / MoE layer | 250 kept / 288 original |
| Experts pruned / MoE layer | 38 |
| MoE layers | 42 |
| Hidden size | 4096 |
| Context | 262,144 |
| On-disk size | 108 GB |
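
The figures above can be cross-checked with a little arithmetic. The sketch below only uses numbers from the table. The "implied base size" assumes the effective size and the removed parameters simply add up, which the card does not state explicitly, and the per-expert figure assumes all pruned experts are the same size.

```python
# Figures copied from the "At a glance" table.
ORIGINAL_EXPERTS = 288
KEPT_EXPERTS = 250
MOE_LAYERS = 42
PARAMS_REMOVED = 25.10e9
EFFECTIVE_SIZE = 173e9  # approximate ("~173B")

pruned_per_layer = ORIGINAL_EXPERTS - KEPT_EXPERTS
pruned_total = pruned_per_layer * MOE_LAYERS
pruned_fraction = pruned_per_layer / ORIGINAL_EXPERTS

# Assumption: every pruned expert has the same parameter count.
params_per_expert = PARAMS_REMOVED / pruned_total

# Assumption: effective size + removed parameters = pre-pruning size.
implied_base_size = EFFECTIVE_SIZE + PARAMS_REMOVED

print(f"experts pruned per layer: {pruned_per_layer}")
print(f"experts pruned in total:  {pruned_total}")
print(f"fraction of experts pruned: {pruned_fraction:.1%}")
print(f"parameters per pruned expert: {params_per_expert / 1e6:.2f}M")
print(f"implied pre-pruning size: {implied_base_size / 1e9:.1f}B")
```

The per-layer count reproduces the 38 pruned experts listed in the table, which is a quick consistency check on the card's numbers.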

Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| `Step-3.7-Flash-173B` (this) | NVFP4 | link |
| `Step-3.7-Flash-148B` | NVFP4 | link |

173B effective parameters | REAP-pruned | private experimental checkpoint

This is a structurally pruned Step 3.7 Flash NVFP4 checkpoint. It is not the original StepFun model. It is a derivative checkpoint built by ranking routed MoE experts with router-weighted activation evidence and physically removing colder routed experts while preserving routing-critical and non-expert components.

This 173B variant is the more conservative of the two Step 3.7 REAP checkpoints. It removes about 25.10B parameters while keeping 250 of 288 routed experts per MoE layer.

What this is

Base: stepfun-ai/Step-3.7-Flash-NVFP4
Pruning: REAP (Router-weighted Expert Activation Pruning)
Routed experts kept per MoE layer: 250
Routed experts pruned per MoE layer: 38
Quantization: NVFP4 with FP8 KV cache metadata
Architecture: Step 3.7 Flash vision-language sparse MoE
Intended serving path: vLLM/SGLang/Transformers paths that support Step 3.7 Flash remote code and ModelOpt NVFP4
Status: private experimental checkpoint; validate fit, generation, and benchmark behavior before production use

How the REAP checkpoint was made

REAP is a one-shot MoE compression method that uses router-weighted expert activation observations to rank experts by practical usefulness. The observation pass records per-layer routed expert activity under calibration prompts, then each MoE layer is pruned independently.

For this checkpoint:

Start from the Step 3.7 Flash NVFP4 checkpoint.
Run calibration data through the model and record router/expert activation observations.
Aggregate expert scores with the reap_score metric.
Keep the top routed experts per MoE layer.
Rewrite the checkpoint with pruned expert tensors and updated routing metadata.
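
The scoring and top-k steps above can be sketched in a few lines. This is a toy illustration, not the pipeline used for this checkpoint. It assumes the score is the mean of gate weight times expert output norm over the tokens routed to each expert, which is one reading of the REAP criterion. The actual reap_score implementation may differ, and `expert_scores` and `select_experts` are hypothetical names.

```python
from collections import defaultdict


def expert_scores(observations):
    """Mean of gate_weight * output_norm over the tokens routed to each expert.

    observations: iterable of (expert_id, gate_weight, output_norm) tuples,
    one per (token, routed expert) pair seen during calibration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output_norm in observations:
        totals[expert_id] += gate_weight * output_norm
        counts[expert_id] += 1
    return {expert_id: totals[expert_id] / counts[expert_id] for expert_id in totals}


def select_experts(scores, num_experts, keep):
    """Return the sorted ids of the `keep` highest-scoring experts in one layer.

    Experts never observed during calibration score 0.0 (an assumption of this
    sketch); ties go to the lower expert id.
    """
    ranked = sorted(range(num_experts), key=lambda e: (-scores.get(e, 0.0), e))
    return sorted(ranked[:keep])


# Toy layer: 4 experts, keep 2 (this checkpoint keeps 250 of 288 per layer).
toy_observations = [
    (0, 0.9, 2.0), (0, 0.7, 2.0),  # expert 0: mean 1.6
    (1, 0.1, 1.0),                 # expert 1: mean 0.1
    (2, 0.5, 4.0), (2, 0.5, 2.0),  # expert 2: mean 1.5
]                                  # expert 3: never routed, scores 0.0
toy_scores = expert_scores(toy_observations)
kept = select_experts(toy_scores, num_experts=4, keep=2)
print(kept)  # [0, 2]
```

Each MoE layer would be ranked independently in this way, so the set of kept expert ids can differ from layer to layer.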

Embeddings, attention blocks, normalization, router gates, shared experts, selected routed experts, vision components, tokenizer files, and generation/config files are preserved. The prune_summary.json and layer_expert_metrics.parquet files in this repo contain the exact pruning map and expert metrics.

Calibration evidence

The pruning pass used Step 3.7 Flash REAP observation artifacts uploaded to:

Dataset: 0xSero/step-3.7-flash-reap-observations-v2
Manifest size: 24,576 samples
Uploaded observation frontier: 24,576 / 24,576
Aggregate rows used for this prune: 13,696
Aggregate tokens by sequence length: 101,735,663
Sources: open-r1/Mixture-of-Thoughts/{math,science,code} and SWE-bench/SWE-smith-trajectories/tool
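
As a rough sanity check on the numbers above, the sketch below derives the share of manifest samples that fed this prune and the mean sequence length. It assumes the aggregate rows are a subset of the manifest samples and that the token total covers exactly those rows. Neither assumption is confirmed by the card.

```python
# Figures copied from the calibration list above.
MANIFEST_SAMPLES = 24_576
AGGREGATE_ROWS = 13_696
AGGREGATE_TOKENS = 101_735_663

# Assumption: aggregate rows are drawn from the manifest samples.
rows_fraction = AGGREGATE_ROWS / MANIFEST_SAMPLES

# Assumption: the token total covers exactly the aggregate rows.
mean_tokens_per_row = AGGREGATE_TOKENS / AGGREGATE_ROWS

print(f"share of manifest used: {rows_fraction:.1%}")
print(f"mean tokens per row:    {mean_tokens_per_row:.0f}")
```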

Benchmark status

Terminal-Bench artifacts are uploaded separately to 0xSero/step37-prune-terminal-bench-artifacts.

Do not treat the current Terminal-Bench evidence as a final score. The available 50B-pruned diagnostic run was interrupted, and the corrected rerun hit harness/client timeouts before it produced any score-bearing proxy rows. The artifacts are useful for debugging the benchmark path, not for claiming model quality.

Loading

Use trust_remote_code=True and a runtime that supports Step 3.7 Flash plus ModelOpt NVFP4. For vLLM, start from StepFun's Step 3.7-compatible image and adapt the base NVFP4 launch profile:

python3 -m vllm.entrypoints.openai.api_server \
  --model 0xSero/Step-3.7-Flash-173B \
  --served-model-name step3p7-flash-173b \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --trust-remote-code \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5

Hardware fit is not guaranteed by the upload alone. Run a load smoke test, a generation smoke test, and a memory audit before longer evaluation or serving.
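
Before a real memory audit, a back-of-the-envelope estimate can show whether a given GPU setup is even plausible. The sketch below assumes the 108 GB of weights split evenly across tensor-parallel ranks and that the runtime may use 90% of each GPU, which mirrors the usual default of vLLM's `--gpu-memory-utilization`. The 80 GB GPU size is a hypothetical example, and both function names are illustrative.

```python
def weights_per_gpu_gb(on_disk_gb, tensor_parallel_size):
    """Approximate weight memory per GPU, assuming an even split across ranks."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return on_disk_gb / tensor_parallel_size


def headroom_gb(gpu_memory_gb, on_disk_gb, tensor_parallel_size, utilization=0.9):
    """Memory left per GPU for KV cache and activations after loading weights."""
    budget = gpu_memory_gb * utilization
    return budget - weights_per_gpu_gb(on_disk_gb, tensor_parallel_size)


# 108 GB on disk and tensor-parallel size 4 come from this card;
# the 80 GB per-GPU figure is a hypothetical example.
per_gpu = weights_per_gpu_gb(108, 4)
left = headroom_gb(80, 108, 4)
print(f"weights per GPU: ~{per_gpu:.0f} GB, headroom: ~{left:.0f} GB")
```

A negative headroom means the configuration cannot hold the weights at all. A positive value still has to cover the KV cache at the chosen context length, so it is a lower bar, not a guarantee.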

Limitations

This is an experimental private derivative, not an official StepFun release.
No full quality benchmark should be inferred from the pruning summary alone.
Some serving stacks may need patched Step 3.7 Flash support for the pruned expert count.
Model cards and manifests intentionally avoid hostnames, IPs, absolute local paths, and credentials.


Safetensors

Model size: 104B params
Tensor types: F32, BF16, F8_E4M3

Model tree for 0xSero/Step-3.7-Flash-173B

Base model: stepfun-ai/Step-3.7-Flash-NVFP4
Quantized (6): this model

Paper for 0xSero/Step-3.7-Flash-173B

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (paper 2510.13999, published Oct 15, 2025)