𓌳 REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper • 💻 Code • 📝 Blog

GLM-4.7-REAP-40-W4A16

✨ Highlights

40% Expert-Pruned + INT4 Quantized — double compression for efficient deployment.

  • ~6.5x Total Compression: 700GB → ~108GB
  • REAP + AutoRound: Expert pruning + weight quantization
  • Optimized for Code & Tools: Calibrated on code generation and function calling
  • Lower VRAM: Fits on 2-4x fewer GPUs than BF16

πŸ™ Acknowledgments


📋 Model Specifications

Property             Value
Base Model           GLM-4.7-REAP-40
Original (GLM-4.7)   358B params, ~700GB
After REAP 40%       218B params
After W4A16 Quant    ~108GB on disk
Quantization         INT4 weights, FP16 activations
Group Size           128
Format               GPTQ (AutoRound)
Experts per Layer    96 (was 160)
VRAM Required        ~115GB

Compression Pipeline

GLM-4.7 (358B, 700GB)
        │
        ▼  REAP 40% expert pruning
        │
GLM-4.7-REAP-40 (218B)
        │
        ▼  AutoRound W4A16 quantization
        │
GLM-4.7-REAP-40-W4A16 (~108GB)  ◀── This model

Total: ~6.5x compression

🔬 Calibration Dataset: Deep Dive

REAP's effectiveness depends critically on calibration data that represents the target use case. We specifically optimized for code generation, function/tool calling, and agentic workflows.

Why These 3 Datasets?

  • evol-codealpaca-v1: 700 samples (51% of mix), code generation. Code tasks activate specific expert pathways; pruning without code calibration destroys coding ability.
  • xlam-function-calling-60k: 330 samples (24% of mix), function/tool calling. Tool use requires structured JSON output; the experts handling schema generation must be preserved.
  • SWE-smith-trajectories: 330 samples (24% of mix), agentic multi-turn. Real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning.

The Science Behind Dataset Selection

REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
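
As an illustration of steps 3-4, here is a minimal saliency-and-pruning sketch. It is not the Cerebras implementation; the tensor shapes and names are assumptions for the example, given per-expert router gate weights and expert-output norms recorded over the calibration tokens:

import torch

def expert_saliency(router_weights: torch.Tensor, output_norms: torch.Tensor) -> torch.Tensor:
    """Per-expert saliency: mean over routed tokens of gate_weight * ||expert_output||.

    router_weights: [num_experts, num_tokens] gate weights (0 where a token was
                    not routed to that expert).
    output_norms:   [num_experts, num_tokens] L2 norms of each expert's output.
    """
    routed = router_weights.count_nonzero(dim=1).clamp(min=1)
    return (router_weights * output_norms).sum(dim=1) / routed

def experts_to_prune(saliency: torch.Tensor, compression_ratio: float) -> torch.Tensor:
    """Indices of the lowest-saliency experts to drop (0.40 prunes 40% of them)."""
    k = int(saliency.numel() * compression_ratio)
    return torch.argsort(saliency)[:k]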

Cerebras' Original Mix (from paper)

Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:

  • evol-codealpaca-v1 for code generation
  • xlam-function-calling-60k for tool calling
  • SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

Combined Dataset

Our calibration mix: 0xSero/glm47-reap-calibration-v2
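
For reference, the three-source mix could be re-drawn along these lines with the datasets library. The Hub IDs, the train split, and the shuffle seed below are assumptions, and each subset still has to be mapped into a common chat/text format before being combined, so treat this as a sketch rather than the exact build script:

from datasets import load_dataset

# Sample counts matching the list above: 700 / 330 / 330 (~51% / 24% / 24% of the mix).
SOURCES = {
    "theblackcat102/evol-codealpaca-v1": 700,      # code generation
    "Salesforce/xlam-function-calling-60k": 330,   # function / tool calling
    "SWE-bench/SWE-smith-trajectories": 330,       # agentic multi-turn
}

def sample_calibration_sources(seed: int = 42) -> dict:
    """Draw the per-dataset sample counts used for the calibration mix."""
    subsets = {}
    for name, n in SOURCES.items():
        ds = load_dataset(name, split="train")
        subsets[name] = ds.shuffle(seed=seed).select(range(n))
    return subsets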


🚀 Deployment

vLLM (Recommended)

vllm serve 0xSero/GLM-4.7-REAP-40-W4A16 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --quantization gptq
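
The server exposes an OpenAI-compatible API (port 8000 by default). A minimal client call with the openai package, assuming a local server; the prompt is illustrative:

from openai import OpenAI

# vLLM ignores the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-40-W4A16",
    messages=[{"role": "user", "content": "Write a Python function that parses a JSON config file."}],
    max_tokens=256,
)
print(response.choices[0].message.content)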

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-4.7-REAP-40-W4A16",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-40-W4A16", trust_remote_code=True)
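
A short generation example on top of the snippet above; the prompt and decoding settings are illustrative, and it assumes the tokenizer ships a chat template:

messages = [{"role": "user", "content": "Write a function that merges two sorted lists."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))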

🧩 Reproduction

Step 1: REAP Pruning

#!/usr/bin/env python3
"""
REAP Pruning Script for MoE Models
Adapted from: https://github.com/CerebrasResearch/reap
"""

import subprocess
import sys

def run_reap(
    model_path: str,
    compression_ratio: float,
    dataset: str = "0xSero/glm47-reap-calibration-v2",
    samples: int = 1360,
    seed: int = 42,
    distance: str = "angular",
    reuse_observations: str | None = None,
):
    """
    Run REAP expert pruning.

    Args:
        model_path: Path to base model
        compression_ratio: 0.30 = prune 30%, keep 70%
        dataset: Calibration dataset (code + tools + agentic)
        samples: Number of calibration samples
        seed: Random seed for reproducibility
        distance: Distance metric for expert clustering
        reuse_observations: Path to pre-computed observations for instant pruning
    """
    cmd = [
        sys.executable, "src/reap/prune.py",
        "--model-name", model_path,
        "--dataset-name", dataset,
        "--compression-ratio", str(compression_ratio),
        "--prune-method", "reap",
        "--seed", str(seed),
        "--samples_per_category", str(samples),
        "--model_max_length", "2048",
        "--distance_measure", distance,
        "--record_pruning_metrics_only", "true",
    ]

    if reuse_observations:
        # Instant pruning: skip calibration, reuse precomputed expert scores
        cmd.extend(["--load_observations", reuse_observations])

    subprocess.run(cmd, check=True)

# Example: Create 40% pruned model
run_reap(
    model_path="/path/to/GLM-4.7",
    compression_ratio=0.40,  # Prune 40% of experts
)

Step 2: AutoRound Quantization

#!/usr/bin/env python3
"""
AutoRound W4A16 Quantization
Intel's state-of-the-art weight quantization using signed gradient descent.
"""

from auto_round import AutoRound

def quantize_w4a16(
    model_path: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    format: str = "auto_gptq",
):
    """
    Quantize model to INT4 weights with FP16 activations.

    Args:
        model_path: Path to REAP-pruned model
        output_dir: Output directory
        bits: Weight bit width (4 for W4A16)
        group_size: Quantization group size (128 is optimal)
        format: Output format (auto_gptq for vLLM compatibility)
    """
    ar = AutoRound(
        model_path,
        scheme="W4A16",
        device="cuda",
        device_map="auto",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )
    ar.quantize_and_save(output_dir, format=format)

# Example: Quantize REAP-40 to W4A16
quantize_w4a16(
    model_path="./GLM-4.7-REAP-40",
    output_dir="./GLM-4.7-REAP-40-W4A16",
)
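
A quick sanity check on the saved checkpoint. The n_routed_experts field name is an assumption for GLM-style MoE configs and may differ for this architecture:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("./GLM-4.7-REAP-40-W4A16", trust_remote_code=True)

# 96 experts per layer are expected after pruning 40% of the original 160.
print(getattr(config, "n_routed_experts", None))
# The quantization_config block should report 4-bit GPTQ-format weights, group size 128.
print(getattr(config, "quantization_config", None))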

βš–οΈ License

Apache 2.0


🧾 Citation

@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}