Qwopus3.6-35B-A3B-v1-int4-mixed

W4A16 mixed-precision quantization of Jackrong/Qwopus3.6-35B-A3B-v1, created with llm-compressor. Native vLLM support via compressed-tensors format.

Only MoE expert layers are quantized to INT4. All attention layers, shared experts, vision encoder, and MTP layers are preserved in BF16 to maintain quality and prevent infinite thinking loops.

Quantization Details

Parameter	Value
Method	llm-compressor AWQ + GPTQ (compressed-tensors)
Scheme	W4A16 (4-bit weights, 16-bit activations)
Group size	32
Symmetric	Yes
AWQ smoothing	duo_scaling (both), n_grid=20
Calibration dataset	ultrachat_200k (train_sft)
Calibration seq_length	512

Layers preserved in BF16

To maintain quality and prevent infinite thinking loops, the following layers are kept in full precision:

self_attn — full attention q/k/v/o projections (layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39)
linear_attn — all linear attention projections (in_proj_qkv, in_proj_z, in_proj_a, in_proj_b, out_proj)
shared_expert — shared expert gate/up/down projections
shared_expert_gate — shared expert routing
mlp.gate — MoE router
embed_tokens — input embeddings
visual — entire vision encoder
mtp — multi-token prediction layers
lm_head — output head

What is quantized

Only the MoE expert layers (256 experts × 3 Linear layers × 40 blocks) are quantized to INT4. These account for ~90% of total parameters but are the least sensitive to quantization.

Benchmark Results

Best scores from 3 evaluation runs using EvalScope with thinking mode enabled.

Benchmark	bf16 Original	INT4 (this model)	Difference
GPQA Diamond	74.8	73.2	-1.6
GSM8K	96.7	96.9	+0.2
HumanEval (pass@1)	96.3	95.1	-1.2
IFEval (prompt strict)	84.3	85.4	+1.1
IFEval (inst strict)	88.4	88.9	+0.5

Evaluation parameters

Evaluation framework: EvalScope (evalscope)

Generation parameters (both models):

{
  "max_tokens": 20000,
  "temperature": 1.0,
  "top_p": 0.8,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 1.5,
  "repetition_penalty": 1.0
}

EvalScope command:

OPENAI_TIMEOUT=600 evalscope eval \
  --model <model_name> \
  --api-url http://localhost:8000/v1 \
  --api-key none \
  --datasets gpqa_diamond humaneval gsm8k ifeval \
  --eval-batch-size 8 \
  --generation-config '{"max_tokens": 20000, "temperature": 1.0, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0}'

Note: presence_penalty=1.5 is Qwen's official recommendation for thinking mode to prevent infinite thinking loops. Scores vary significantly with different generation parameters — results without presence_penalty are substantially lower for both bf16 and INT4.

Usage

vLLM (recommended)

vllm serve Avesed/Qwopus3.6-35B-A3B-v1-int4-mixed \
  --tokenizer Qwen/Qwen3.6-35B-A3B \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --generation-config vllm \
  --override-generation-config '{"temperature": 1.0, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0, "max_tokens": 65536}'

Sampling parameters

Qwen3.6 in thinking mode benefits from presence_penalty=1.5 to prevent infinite thinking loops. The generation config above reflects Qwen's official recommendations.

Acknowledgements

Original model by Jackrong
Original base model by Qwen

Downloads last month: 1,977

Safetensors

Model size

36B params

Tensor type

I64

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Avesed/Qwopus3.6-35B-A3B-v1-int4-mixed

Base model

Qwen/Qwen3.6-35B-A3B

Finetuned

unsloth/Qwen3.6-35B-A3B

Adapter

Jackrong/Qwopus3.6-35B-A3B-v1

Quantized

(20)

this model

Evaluation results

accuracy on GPQA Diamond
self-reported

0.732
accuracy on GSM8K
self-reported

0.969
pass@1 on HumanEval
self-reported

0.951
prompt_level_strict_acc on IFEval
self-reported

0.854
inst_level_strict_acc on IFEval
self-reported

0.889