Qwopus3.6-35B-A3B-v1-int4-mixed
W4A16 mixed-precision quantization of Jackrong/Qwopus3.6-35B-A3B-v1, created with llm-compressor. Native vLLM support via compressed-tensors format.
Only MoE expert layers are quantized to INT4. All attention layers, shared experts, vision encoder, and MTP layers are preserved in BF16 to maintain quality and prevent infinite thinking loops.
Quantization Details
| Parameter | Value |
|---|---|
| Method | llm-compressor AWQ + GPTQ (compressed-tensors) |
| Scheme | W4A16 (4-bit weights, 16-bit activations) |
| Group size | 32 |
| Symmetric | Yes |
| AWQ smoothing | duo_scaling (both), n_grid=20 |
| Calibration dataset | ultrachat_200k (train_sft) |
| Calibration seq_length | 512 |
Layers preserved in BF16
To maintain quality and prevent infinite thinking loops, the following layers are kept in full precision:
- self_attn — full attention q/k/v/o projections (layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39)
- linear_attn — all linear attention projections (in_proj_qkv, in_proj_z, in_proj_a, in_proj_b, out_proj)
- shared_expert — shared expert gate/up/down projections
- shared_expert_gate — shared expert routing
- mlp.gate — MoE router
- embed_tokens — input embeddings
- visual — entire vision encoder
- mtp — multi-token prediction layers
- lm_head — output head
What is quantized
Only the MoE expert layers (256 experts × 3 Linear layers × 40 blocks) are quantized to INT4. These account for ~90% of total parameters but are the least sensitive to quantization.
Benchmark Results
Best scores from 3 evaluation runs using EvalScope with thinking mode enabled.
| Benchmark | bf16 Original | INT4 (this model) | Difference |
|---|---|---|---|
| GPQA Diamond | 74.8 | 73.2 | -1.6 |
| GSM8K | 96.7 | 96.9 | +0.2 |
| HumanEval (pass@1) | 96.3 | 95.1 | -1.2 |
| IFEval (prompt strict) | 84.3 | 85.4 | +1.1 |
| IFEval (inst strict) | 88.4 | 88.9 | +0.5 |
Evaluation parameters
Evaluation framework: EvalScope (evalscope)Generation parameters (both models):
{
"max_tokens": 20000,
"temperature": 1.0,
"top_p": 0.8,
"top_k": 20,
"min_p": 0.0,
"presence_penalty": 1.5,
"repetition_penalty": 1.0
}
EvalScope command:
OPENAI_TIMEOUT=600 evalscope eval \
--model <model_name> \
--api-url http://localhost:8000/v1 \
--api-key none \
--datasets gpqa_diamond humaneval gsm8k ifeval \
--eval-batch-size 8 \
--generation-config '{"max_tokens": 20000, "temperature": 1.0, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0}'
Note: presence_penalty=1.5 is Qwen's official recommendation for thinking mode to prevent infinite thinking loops. Scores vary significantly with different generation parameters — results without presence_penalty are substantially lower for both bf16 and INT4.
Usage
vLLM (recommended)
vllm serve Avesed/Qwopus3.6-35B-A3B-v1-int4-mixed \
--tokenizer Qwen/Qwen3.6-35B-A3B \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--generation-config vllm \
--override-generation-config '{"temperature": 1.0, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0, "max_tokens": 65536}'
Sampling parameters
Qwen3.6 in thinking mode benefits from presence_penalty=1.5 to prevent infinite thinking loops. The generation config above reflects Qwen's official recommendations.
Acknowledgements
- Downloads last month
- 1,977
Model tree for Avesed/Qwopus3.6-35B-A3B-v1-int4-mixed
Base model
Qwen/Qwen3.6-35B-A3BEvaluation results
- accuracy on GPQA Diamondself-reported0.732
- accuracy on GSM8Kself-reported0.969
- pass@1 on HumanEvalself-reported0.951
- prompt_level_strict_acc on IFEvalself-reported0.854
- inst_level_strict_acc on IFEvalself-reported0.889