PrunedHub Qwen3.5-35B-A3B 80%: Expert Pruning for 24GB Mac

A pruned variant of Qwen3.5-35B-A3B that fits entirely in GPU memory on a 24GB Mac while staying within 1 pp of the original Q3_K_M on MMLU and GSM8K. The official Q4_K_M (21.2 GB) requires swap on that hardware; this model runs fully GPU-resident.

Highlights

13.67 GB: fully GPU-resident on 24GB Apple Silicon (no swap, no SSD streaming)
19% smaller than original Q3_K_M (16.80 GB), 36% smaller than Q4_K_M (21.2 GB)
MMLU 80% (-1pp from original 81%) -- near-lossless knowledge
GSM8K 82% (equal to original) -- math reasoning fully preserved
LiveCodeBench Easy 83.1% (142 problems, contamination-free; -9.9pp from original 93.0%) -- code generation remains strong
DeltaNet hybrid architecture (75% Gated DeltaNet + 25% Full Attention)
Compatible with llama.cpp (build 8140+)

Benchmark Results

Benchmark	Original Q3_K_M (16.80 GB)	This Model (13.67 GB)	Delta
MMLU (0-shot, 100Q, no-think)	81%	80%	-1 pp
GSM8K (0-shot, 50Q, no-think)	82%	82%	0 pp
LCB Easy (142Q, no-think)	93.0%	83.1%	-9.9 pp
HumanEval (50Q)	50%	38%	-12 pp
JA Quality (20Q)	100%	85%	-15 pp
File size	16.80 GB	13.67 GB	-18.6%

Note: LiveCodeBench Easy (142 competitive programming problems from LeetCode/AtCoder/Codeforces, 2024+) is the primary code benchmark. HumanEval 50Q is unreliable at this sample size -- LCB and HumanEval gave opposite conclusions for model comparisons in our testing.
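
To give a rough sense of why 50 questions is a small sample, the sketch below computes an approximate 95% margin of error for an accuracy score. It assumes the questions are independent and uses the normal approximation to the binomial; `margin_of_error` is an illustrative helper, not part of any benchmark harness.

```python
import math

def margin_of_error(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for an accuracy.

    Assumes independent questions and the normal approximation to the binomial.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error

# Scores and sample sizes taken from the benchmark table above.
for name, acc, n in [("HumanEval", 0.38, 50), ("LCB Easy", 0.831, 142)]:
    print(f"{name}: {acc:.1%} +/- {margin_of_error(acc, n):.1%} (n={n})")
```

Under these assumptions the HumanEval margin is about 13 pp, roughly twice the 6 pp margin for LCB Easy.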

Why Not MxMoE?

We also tested Weight-80% + MxMoE (mixed quantization, 12.69 GB) but found it damages code quality:

Model	Size	MMLU	LCB Easy
This model (Weight-80%)	13.67 GB	80%	83.1%
Weight-80% + MxMoE	12.69 GB	78%	77.5%

The extra 0.98 GB saved is not worth a 5.6 pp drop on LiveCodeBench Easy, on top of a 2 pp drop on MMLU.

Model Details

Property	Value
Base model	Qwen/Qwen3.5-35B-A3B
Total parameters	~28B after pruning (base model: ~35B); ~3B active per token
Architecture	Gated DeltaNet (75%) + Full Attention (25%) with Sparse MoE
Layers	40 (30 DeltaNet + 10 Full Attention)
Experts per layer	204 (pruned from 256)
Routing	Top-8, softmax
Expert FFN dim	512 (SwiGLU)
Shared expert	1 per layer (FFN dim 512)
Hidden size	2048
Attention heads	16 (2 KV heads, GQA)
Context length	262K tokens
Quantization	Q3_K_M (Unsloth imatrix)
Pruning	Weight-based importance, 80% expert retention
File size	13.67 GB (12.73 GiB)
License	Apache 2.0

How to Use

With llama.cpp (recommended)

Requires llama.cpp build 8140+ for Qwen3.5 DeltaNet support.

llama-server \
  -m PrunedHub-Qwen3.5-35B-A3B-80pct-Q3_K_M.gguf \
  --port 8090 \
  -ngl 99 \
  -c 4096

With OpenAI-compatible API

Once llama-server is running:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8090/v1", api_key="none")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a Python function to find the longest palindromic substring."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
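
If the `openai` package is not installed, the same endpoint can be reached with the standard library alone. This is a minimal sketch: `build_chat_request` and `chat` are hypothetical helper names, and the base URL assumes the server was started with `--port 8090` as in the command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8090/v1"  # assumption: llama-server started with --port 8090

def build_chat_request(prompt: str, max_tokens: int = 1024,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` requires the server to be running; `build_chat_request` can be inspected without one.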

Requirements

llama.cpp build 8140+ (for Qwen3.5 / DeltaNet support)
24 GB of unified memory (or VRAM) for fully GPU-resident inference (the model uses ~13.7 GB; the rest is available for the KV cache)
Apple Silicon (Metal) recommended; CUDA also supported

Compression Methodology

Weight-Based Expert Pruning (80% Keep)

Each expert is scored by weight magnitude: the sum of the Frobenius norms (the L2 norm of each flattened matrix) of its three FFN weight matrices:

importance(layer, expert) = ||gate_weight||_F + ||up_weight||_F + ||down_weight||_F

The bottom 20% of experts per layer are removed (256 -> 204 experts/layer). Weight-based scoring outperformed activation-based calibration for this model:

Method	MMLU	GSM8K	Notes
Weight-based (this model)	80%	82%	Weight magnitude captures full training data
Activation calibration	78%	78%	55 prompts insufficient for 256 experts
Union (weight + activation)	78%	80%	No improvement over weight-only
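
The scoring and selection rule described above can be sketched with NumPy as follows. This is an illustration under assumptions, not the pipeline used to build this model: the function names and the stacked tensor layout are hypothetical, and the toy weights are random and much smaller than the real 512 x 2048 matrices.

```python
import numpy as np

def expert_importance(gate: np.ndarray, up: np.ndarray, down: np.ndarray) -> np.ndarray:
    """Score each expert as the sum of the Frobenius norms of its FFN matrices.

    Each input stacks one matrix per expert along axis 0, e.g. gate and up
    with shape (n_experts, ffn_dim, hidden) and down with (n_experts, hidden, ffn_dim).
    """
    def frobenius(w: np.ndarray) -> np.ndarray:
        # Flatten each expert's matrix and take its L2 norm.
        return np.linalg.norm(w.reshape(w.shape[0], -1), axis=1)
    return frobenius(gate) + frobenius(up) + frobenius(down)

def experts_to_keep(scores: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Return the sorted indices of the highest-scoring experts in one layer."""
    n_keep = int(len(scores) * keep_ratio)  # 256 * 0.8 -> 204
    top = np.argsort(scores)[::-1][:n_keep]
    return np.sort(top)

# Toy layer: 256 experts with small random weights (hypothetical shapes).
rng = np.random.default_rng(0)
gate = rng.normal(size=(256, 8, 16))
up = rng.normal(size=(256, 8, 16))
down = rng.normal(size=(256, 16, 8))

scores = expert_importance(gate, up, down)
keep = experts_to_keep(scores)
print(len(keep))  # 204
```

In a real model the router's per-expert rows would presumably need to be sliced with the same indices so that routing stays aligned with the remaining experts; that step is omitted here.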

Why Weight-Based Wins for 256 Experts

With 256 experts per layer, activation-based calibration (55 prompts x 32 tokens) cannot adequately cover all experts. Weight magnitude, which encodes information from the full training data, provides a more reliable importance signal. This is consistent with findings that weight-based pruning dominates for high expert-count models.
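
A back-of-the-envelope calculation makes the coverage problem concrete. It uses the figures from this card (55 prompts, 32 tokens each, top-8 routing, 256 experts) and assumes, unrealistically, that routing is perfectly uniform across experts.

```python
prompts = 55
tokens_per_prompt = 32
experts_per_layer = 256
top_k = 8  # experts routed per token

tokens = prompts * tokens_per_prompt
routed_slots = tokens * top_k                  # expert activations per layer
mean_hits = routed_slots / experts_per_layer   # under perfectly uniform routing

print(tokens, routed_slots, mean_hits)  # 1760 14080 55.0
```

Even in this idealized case each expert sees only about 55 routed tokens per layer. Real routing is typically skewed, so many experts would see fewer, which makes activation statistics for them noisy.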

Size Breakdown

Original Q4_K_M:      21.20 GB (256 experts x 40 layers)
Original Q3_K_M:      16.80 GB (256 experts x 40 layers)
After 80% pruning:    13.67 GB (204 experts x 40 layers, Q3_K_M)
Total reduction:      -18.6% (from Q3_K_M), -35.5% (from Q4_K_M)
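
The percentages and the GiB figure quoted in this card can be reproduced from the file sizes above; `percent_change` is just an illustrative helper.

```python
sizes_gb = {"Q4_K_M": 21.20, "Q3_K_M": 16.80, "pruned Q3_K_M": 13.67}

def percent_change(new: float, old: float) -> float:
    """Relative size change in percent (negative means smaller)."""
    return (new - old) / old * 100.0

pruned = sizes_gb["pruned Q3_K_M"]
print(f"vs Q3_K_M: {percent_change(pruned, sizes_gb['Q3_K_M']):.1f}%")  # -18.6%
print(f"vs Q4_K_M: {percent_change(pruned, sizes_gb['Q4_K_M']):.1f}%")  # -35.5%
print(f"{pruned * 1e9 / 2**30:.2f} GiB")                                # 12.73 GiB
```

The file shrinks by slightly less than the share of experts removed (52 of 256, about 20%), presumably because the attention, DeltaNet, shared-expert, and embedding tensors are not pruned.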

Pruning Curve (Qwen3.5-35B-A3B)

Keep %	Experts/Layer	MMLU	Notes
100% (original Q3_K_M)	256	81%	Baseline
90%	230	79%	Minor loss
80% (this model)	204	80%	Best tradeoff
70%	179	~65%	Quality cliff

Qwen3.5-35B-A3B vs Qwen3.5-27B (Dense)

The Dense 27B model outperforms the MoE 35B-A3B on many benchmarks, but at 9x inference cost:

Metric	35B-A3B (MoE)	27B (Dense)
MMLU-Pro	85.3	86.1
LiveCodeBench v6	74.6	80.7
Active params	3B	27B
Inference cost	1x	9x

MoE's value is cost efficiency: on the two benchmarks above, 3B active parameters reach 92% (LiveCodeBench v6) to 99% (MMLU-Pro) of the Dense 27B's scores. In addition, only MoE models can be further compressed via expert pruning.

Limitations

Japanese quality: -15pp (100% -> 85%) due to pruning of some language-specialized experts. Expert tuning could recover this.
Code generation: -9.9pp on LiveCodeBench Easy, the trade-off for a 19% size reduction.
No post-pruning training: pruned without fine-tuning. Quality could be improved with expert tuning.
llama.cpp 8140+ required: older builds do not support the Qwen3.5 / DeltaNet architecture.

Related Models

Model	Size	MMLU	Notes
PrunedHub-GPT-OSS-20B-28x	10.4 GB	78%	Lossless pruning, GPU-resident on 16GB
PrunedHub-Qwen3-30B-A3B-EN-MxMoE	13.5 GB	70%	EN-optimized, mixed quantization
PrunedHub-Qwen3-30B-A3B-JP-MxMoE	13.5 GB	73%	JP-optimized, mixed quantization
This model	13.67 GB	80%	Qwen3.5, best quality-per-GB

Citation

@misc{goba2026qwen35prune,
  title   = {Weight-Based Expert Pruning for Qwen3.5-35B-A3B},
  author  = {GOBA-AI-Labs},
  year    = {2026},
  url     = {https://huggingface.co/GOBA-AI-Labs/PrunedHub-Qwen3.5-35B-A3B-80pct},
  note    = {80\% expert retention achieves 19\% size reduction with MMLU -1pp on 256-expert DeltaNet MoE}
}

Acknowledgments

Qwen Team for releasing Qwen3.5-35B-A3B under Apache 2.0
Unsloth for the Q3_K_M quantization with imatrix
llama.cpp for DeltaNet support and GGUF format

License

This model inherits the Apache 2.0 License from the base model (Qwen/Qwen3.5-35B-A3B).
