Instructions to use microsoft/Phi-4-mini-flash-reasoning with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use microsoft/Phi-4-mini-flash-reasoning with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-4-mini-flash-reasoning", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-flash-reasoning", trust_remote_code=True, dtype="auto")
```
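
The direct-load snippet above stops once the weights are loaded. Here is a minimal sketch of how generation could follow, assuming the tokenizer ships a chat template and that the installed transformers version accepts the `dtype` argument (older releases call it `torch_dtype`). `build_messages` and `generate_reply` are illustrative names, not part of the library:

```python
from typing import Dict, List

MODEL_ID = "microsoft/Phi-4-mini-flash-reasoning"


def build_messages(prompt: str) -> List[Dict[str, str]]:
    """Wrap a single user prompt in the chat-message format used above."""
    return [{"role": "user", "content": prompt}]


def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Load tokenizer and model, then generate one reply (downloads weights)."""
    # Imported lazily so build_messages works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, dtype="auto"
    )
    # Render the conversation with the model's own chat template and tokenize it.
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("Who are you?")` downloads the weights, and the model's remote code may need a CUDA GPU plus extra packages, so treat this as a starting point rather than a verified recipe.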

Local Apps

vLLM

How to use microsoft/Phi-4-mini-flash-reasoning with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-4-mini-flash-reasoning"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-flash-reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```
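
The curl call above can also be made from Python. This is a minimal sketch using only the standard library; it assumes the server is already running locally and returns the usual OpenAI-style `choices[0].message.content` field. `build_chat_request` and `ask` are illustrative helper names, not part of vLLM:

```python
import json
import urllib.request
from typing import Any, Dict

MODEL_ID = "microsoft/Phi-4-mini-flash-reasoning"


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(
    prompt: str, base_url: str = "http://localhost:8000", timeout: float = 60.0
) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body: Dict[str, Any] = json.load(response)
    # Assumes the OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

With a server listening on port 8000, `ask("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` points the same helper at a server on port 30000 instead.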

Use Docker

```
docker model run hf.co/microsoft/Phi-4-mini-flash-reasoning
```

SGLang

How to use microsoft/Phi-4-mini-flash-reasoning with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-4-mini-flash-reasoning" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-flash-reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-4-mini-flash-reasoning" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-flash-reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```

Docker Model Runner
How to use microsoft/Phi-4-mini-flash-reasoning with Docker Model Runner:
```
docker model run hf.co/microsoft/Phi-4-mini-flash-reasoning
```

Why isn't there even a single quantized version of this model?

#11

by kalashshah19 - opened Nov 7, 2025

Discussion

kalashshah19

Nov 7, 2025

•

edited Nov 7, 2025

I looked for a quantized version of this model but didn't find any. Why is that?


skyasher27

Dec 19, 2025

Phi-4-mini-flash-reasoning isn't readily available in GGUF format because its SambaY architecture, a hybrid that mixes Mamba state-space layers with attention, differs from conventional Transformer models. That complicates direct conversion to GGUF, whose tooling is built mainly around Llama-style Transformer structures. Community efforts are underway to support the model's low-latency, long-context inference on consumer hardware.

Why the Confusion/Difficulty?

New Architecture: Unlike the original Phi-4-mini (which is Transformer-based and converts to GGUF easily), the "flash" version is built on SambaY, a hybrid architecture that combines State Space Model (SSM) layers with attention, so its computational structure is different.
GGUF's Focus: GGUF (GPT-Generated Unified Format) was designed primarily for running Transformer-based models (like Llama, Mistral) efficiently on CPUs and GPUs with tools like llama.cpp.
Conversion Challenges: Because the architecture differs, standard conversion scripts (like llama.cpp's convert_hf_to_gguf.py) struggle or fail: they expect Transformer layers, not SambaY's self-decoder/cross-decoder setup.
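
To make the conversion point concrete, here is a rough sketch of the kind of check a converter performs. It reads the `architectures` field of a Hugging Face `config.json` and refuses names it does not recognise. The allow-list and the `Phi4FlashForCausalLM` name below are assumptions for illustration, not llama.cpp's actual registry:

```python
import json
from typing import Iterable, List

# Hypothetical allow-list for illustration; not llama.cpp's real registry.
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "Phi3ForCausalLM",
}


def unsupported_architectures(
    config_json: str, supported: Iterable[str] = SUPPORTED_ARCHITECTURES
) -> List[str]:
    """Return the declared architecture names that are not in `supported`."""
    config = json.loads(config_json)
    declared = config.get("architectures") or []
    supported_set = set(supported)
    return [name for name in declared if name not in supported_set]


# Assumed config excerpts; real config.json files contain many more fields.
transformer_config = '{"architectures": ["Phi3ForCausalLM"]}'
hybrid_config = '{"architectures": ["Phi4FlashForCausalLM"]}'

print(unsupported_architectures(transformer_config))  # []
print(unsupported_architectures(hybrid_config))  # ['Phi4FlashForCausalLM']
```

An empty list means the converter knows how to map every layer. A non-empty list means someone first has to add support for the new layer types.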

What's the Goal (and Solution)?

Speed & Context: The Flash model's architecture gives it much lower latency and better long-context handling, which suits latency-sensitive deployments.
Community Efforts: Enthusiasts and developers are working on dedicated tools, or on adapting llama.cpp, to support this architecture for local inference, much as the original Phi-4-mini was made accessible.

In short, it's a format compatibility issue due to a new, efficient underlying model design, not a bug, and people are working on making it work.

kalashshah19

Dec 22, 2025

Oh I see, thanks man!

kalashshah19 changed discussion status to closed Dec 22, 2025
