Qwopus3.6-35B-A3B-v1-int4-mixed

W4A16 mixed-precision quantization of Jackrong/Qwopus3.6-35B-A3B-v1, created with llm-compressor. Native vLLM support via compressed-tensors format.

Only MoE expert layers are quantized to INT4. All attention layers, shared experts, vision encoder, and MTP layers are preserved in BF16 to maintain quality and prevent infinite thinking loops.

Quantization Details

Parameter Value
Method llm-compressor AWQ + GPTQ (compressed-tensors)
Scheme W4A16 (4-bit weights, 16-bit activations)
Group size 32
Symmetric Yes
AWQ smoothing duo_scaling (both), n_grid=20
Calibration dataset ultrachat_200k (train_sft)
Calibration seq_length 512

Layers preserved in BF16

To maintain quality and prevent infinite thinking loops, the following layers are kept in full precision:

  • self_attn — full attention q/k/v/o projections (layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39)
  • linear_attn — all linear attention projections (in_proj_qkv, in_proj_z, in_proj_a, in_proj_b, out_proj)
  • shared_expert — shared expert gate/up/down projections
  • shared_expert_gate — shared expert routing
  • mlp.gate — MoE router
  • embed_tokens — input embeddings
  • visual — entire vision encoder
  • mtp — multi-token prediction layers
  • lm_head — output head

What is quantized

Only the MoE expert layers (256 experts × 3 Linear layers × 40 blocks) are quantized to INT4. These account for ~90% of total parameters but are the least sensitive to quantization.

Benchmark Results

Best scores from 3 evaluation runs using EvalScope with thinking mode enabled.

Benchmark bf16 Original INT4 (this model) Difference
GPQA Diamond 74.8 73.2 -1.6
GSM8K 96.7 96.9 +0.2
HumanEval (pass@1) 96.3 95.1 -1.2
IFEval (prompt strict) 84.3 85.4 +1.1
IFEval (inst strict) 88.4 88.9 +0.5
Evaluation parameters Evaluation framework: EvalScope (evalscope)

Generation parameters (both models):

{
  "max_tokens": 20000,
  "temperature": 1.0,
  "top_p": 0.8,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 1.5,
  "repetition_penalty": 1.0
}

EvalScope command:

OPENAI_TIMEOUT=600 evalscope eval \
  --model <model_name> \
  --api-url http://localhost:8000/v1 \
  --api-key none \
  --datasets gpqa_diamond humaneval gsm8k ifeval \
  --eval-batch-size 8 \
  --generation-config '{"max_tokens": 20000, "temperature": 1.0, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0}'

Note: presence_penalty=1.5 is Qwen's official recommendation for thinking mode to prevent infinite thinking loops. Scores vary significantly with different generation parameters — results without presence_penalty are substantially lower for both bf16 and INT4.

Usage

vLLM (recommended)

vllm serve Avesed/Qwopus3.6-35B-A3B-v1-int4-mixed \
  --tokenizer Qwen/Qwen3.6-35B-A3B \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --generation-config vllm \
  --override-generation-config '{"temperature": 1.0, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0, "max_tokens": 65536}'

Sampling parameters

Qwen3.6 in thinking mode benefits from presence_penalty=1.5 to prevent infinite thinking loops. The generation config above reflects Qwen's official recommendations.

Acknowledgements

Downloads last month
1,977
Safetensors
Model size
36B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Avesed/Qwopus3.6-35B-A3B-v1-int4-mixed

Quantized
(20)
this model

Evaluation results