Gemma 4 12B IT — INT4 (MLP-only, auto-round)

4-bit (W4A16, group size 128) quantization of google/gemma-4-12B-it, produced with intel/auto-round in RTN mode and exported to the GPTQ-compatible format.

Following NVIDIA's dense-Gemma-4 recipe (nvidia/Gemma-4-31B-IT-NVFP4), only the MLP / feed-forward linear layers are quantized to 4-bit, while the attention projections (q/k/v/o) are kept in BF16. Gemma 4's attention activations carry large per-channel outliers that 4-bit quantization cannot represent well, so quantizing attention degrades quality and breaks several inference kernels. Keeping attention in BF16 avoids this.

The vision/audio embedders, token embeddings and lm_head are also left unquantized in BF16.
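For readers unfamiliar with the scheme named above, here is a minimal pure-Python sketch of symmetric, group-wise round-to-nearest (RTN) quantization. It is a generic illustration, not auto-round's implementation: the function names are hypothetical, and the choice of scale (largest magnitude in the group mapped to the code 7) is an assumption that real toolkits may handle differently.

```python
import math


def quantize_rtn_symmetric(weights, bits=4, group_size=128):
    """Round each weight to the nearest signed integer code, one scale per group."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumption: the largest magnitude in the group maps to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)                    # round to nearest, no calibration
            codes.append(max(qmin, min(qmax, q)))   # clamp into the 4-bit range
    return codes, scales


def dequantize(codes, scales, group_size=128):
    """Map integer codes back to floats using the scale of their group."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


# Two groups of 128 synthetic weights.
weights = [0.05 * math.sin(i) for i in range(256)]
codes, scales = quantize_rtn_symmetric(weights)
restored = dequantize(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups={len(scales)} max_error={max_error:.6f}")
```

Because the scale is set by the largest value in each group, a single outlier coarsens the grid for the other 127 weights in that group, which is one reason outlier-heavy tensors quantize poorly at 4 bits.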

Base model: google/gemma-4-12B-it (dense, 11.95B params)
Method: auto-round, RTN mode (--iters 0 --disable_opt_rtn)
Scheme: W4A16, group size 128, symmetric
Quantized layers: MLP only (gate_proj, up_proj, down_proj)
Kept in BF16: attention (q/k/v/o), embeddings, lm_head, vision/audio
Format: auto_gptq (GPTQ-compatible)
Checkpoint size: ~11 GB (vs ~24 GB BF16)
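The size figures listed above can be sanity-checked with some rough arithmetic. The sketch below uses only the numbers on this card (11.95B parameters, ~11 GB, group size 128) and assumes one 16-bit scale per group, 1 GB = 10^9 bytes, and no overhead for zero-points or file metadata. The function names are hypothetical, and the resulting fraction is what the card's numbers imply, not a measured parameter count.

```python
TOTAL_PARAMS = 11.95e9   # parameter count stated on this card
BF16_BYTES = 2.0         # bytes per unquantized 16-bit weight
BITS = 4
GROUP_SIZE = 128
SCALE_BITS = 16          # assumption: one 16-bit scale stored per group


def quantized_bytes_per_weight():
    """Average storage per 4-bit weight, including its share of the group scale."""
    return (BITS + SCALE_BITS / GROUP_SIZE) / 8


def checkpoint_gb(quantized_fraction):
    """Estimated checkpoint size when this fraction of weights is stored in 4-bit."""
    quantized = TOTAL_PARAMS * quantized_fraction
    kept = TOTAL_PARAMS - quantized
    return (quantized * quantized_bytes_per_weight() + kept * BF16_BYTES) / 1e9


def implied_quantized_fraction(target_gb):
    """Fraction of weights that must be 4-bit for the checkpoint to reach target_gb."""
    saved_bytes = TOTAL_PARAMS * BF16_BYTES - target_gb * 1e9
    saving_per_weight = BF16_BYTES - quantized_bytes_per_weight()
    return saved_bytes / (saving_per_weight * TOTAL_PARAMS)


bf16_gb = checkpoint_gb(0.0)
fraction = implied_quantized_fraction(11.0)
print(f"BF16 size: {bf16_gb:.1f} GB, implied 4-bit fraction: {fraction:.2f}")
```

Under these assumptions, a ~11 GB checkpoint corresponds to a bit under three quarters of the weights being stored in 4-bit, which is consistent with only the MLP layers being quantized.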

Serving with vLLM (verified)

Tested on RTX 5090 (Blackwell, sm120), CUDA 13.

Gemma 4 12B "unified" support landed in vllm-project/vllm#44429 and is not yet in a stable release — you need a vLLM nightly build. On Blackwell, the FlashInfer sampler fails to JIT-compile, so disable it.

Install nightly (CUDA 13; use cu129 URLs on CUDA 12.9 hosts):

uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match

Serve:

export VLLM_USE_FLASHINFER_SAMPLER=0
vllm serve <path-to-this-model> \
  --served-model-name gemma4-12b \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000

The model loads in ~11 GB, leaving plenty of room on a 32 GB card for KV cache (raise --max-model-len accordingly). Recommended sampling for Gemma 4: temperature=1.0, top_p=0.95, top_k=64.

Quick test:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-12b",
    "messages": [{"role": "user", "content": "Explain quantization in one paragraph."}],
    "max_tokens": 200, "temperature": 1.0, "top_p": 0.95, "top_k": 64
  }'
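The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names build_request and chat are hypothetical, and it assumes the server started by the serve command is reachable at localhost:8000 and returns the usual OpenAI-style response shape.

```python
import json
import urllib.request


def build_request(prompt, base_url="http://localhost:8000", model="gemma4-12b"):
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        # Sampling settings recommended on this card.
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Explain quantization in one paragraph.") returns the generated paragraph as a string.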

Usage (transformers)

The checkpoint also loads under transformers (requires gptqmodel). Install the dependencies:

pip install transformers torch gptqmodel optimum

Then load the model and generate:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aleksandard/gemma-4-12B-it-int4-MLPonly-AutoRound"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda"
)

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=False
).to("cuda")
out = model.generate(ids, max_new_tokens=256)
print(tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))

Notes

  • vLLM stable (<= 0.22.0) does not serve Gemma 4 dense 12B — it hits a shape mismatch in the attention path caused by Gemma 4's heterogeneous head dimensions (head_dim 256 for sliding-window layers vs 512 for global layers). Use a nightly build as described above.
  • On Blackwell, VLLM_USE_FLASHINFER_SAMPLER=0 is required to avoid a FlashInfer JIT-compile failure during sampling.

Reproduce

auto-round \
  --model google/gemma-4-12B-it \
  --scheme W4A16 \
  --iters 0 \
  --disable_opt_rtn \
  --layer_config '{"model.language_model.layers.\d+.self_attn.q_proj":{"bits":16},"model.language_model.layers.\d+.self_attn.k_proj":{"bits":16},"model.language_model.layers.\d+.self_attn.v_proj":{"bits":16},"model.language_model.layers.\d+.self_attn.o_proj":{"bits":16}}' \
  --format auto_gptq \
  --output_dir ./gemma-4-12B-it-int4-MLPonly
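The --layer_config keys above look like regular expressions over module names. Assuming they are matched that way (the exact matching rule inside auto-round is not confirmed here), the sketch below shows which modules the four patterns would select. The module names in the sample list are illustrative and the helper kept_in_16bit is hypothetical.

```python
import re

# The four patterns from --layer_config, one per attention projection.
ATTN_PATTERNS = [
    rf"model.language_model.layers.\d+.self_attn.{p}_proj" for p in "qkvo"
]


def kept_in_16bit(module_name):
    """True if any attention pattern matches, i.e. the layer is set to 16 bits."""
    # Assumption: patterns are applied with a regex search over the module name.
    return any(re.search(pattern, module_name) for pattern in ATTN_PATTERNS)


# Illustrative module names; the real checkpoint may name layers differently.
sample_names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.17.self_attn.o_proj",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.language_model.layers.17.mlp.down_proj",
]

for name in sample_names:
    status = "16-bit (kept)" if kept_in_16bit(name) else "4-bit (quantized)"
    print(f"{name}: {status}")
```

The unescaped dots in the patterns match any character, which is harmless for these names but worth keeping in mind when adapting the config to other models.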

Limitations

This is a quantized derivative; it inherits all limitations and biases of the base model and may show additional deviation due to 4-bit quantization. See the base model card for full details. Quantization was calibration-free (RTN); a calibrated build may recover some quality.

License

Apache 2.0, inherited from the base model. This repository changes only the numeric precision of the weights.
