gemma-4-26B-A4B-it MTP GGUF

Speculative-decoding bundle for google/gemma-4-26B-A4B-it using Google's official MTP drafter (gemma-4-26B-A4B-it-assistant), packaged for llama.cpp and LM Studio.

Achieves about 1.55x the throughput of the no-MTP baseline (143.2 vs 92.5 tok/s) on an Apple M5 Max, using the Q2_K drafter with a maximum draft length of 3 and thinking off.

Files

| File | Role | Size |
|---|---|---|
| `gemma-4-26B-A4B-it-Q8_0.gguf` | target (26B-A4B MoE, 128 experts) | 26 GB |
| `gemma-4-26B-A4B-it-assistant-Q2_K.gguf` | drafter (recommended) | 278 MB |
| `gemma-4-26B-A4B-it-assistant-Q4_K_M.gguf` | drafter (balanced) | 310 MB |
| `gemma-4-26B-A4B-it-assistant-Q8_0.gguf` | drafter (high-precision) | 440 MB |
| `gemma-4-26B-A4B-it-assistant-F16.gguf` | drafter (reference) | 816 MB |

Target re-hosted from unsloth/gemma-4-26B-A4B-it-GGUF (Unsloth Dynamic 2.0 quant, Apache-2.0). Drafter built from google/gemma-4-26B-A4B-it-assistant via llama.cpp PR #23398 (am17an, WIP).

Requirements

Requires llama.cpp built from am17an's gemma4-mtp branch; Gemma-4 MTP support is not yet merged to master.

git clone -b gemma4-mtp https://github.com/am17an/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF
cmake --build llama.cpp/build -j --target llama-server llama-quantize

Run

llama-server \
  -m  gemma-4-26B-A4B-it-Q8_0.gguf \
  -md gemma-4-26B-A4B-it-assistant-Q2_K.gguf \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  -ngl 99 -c 8192 -fa on \
  --host 0.0.0.0 --port 8080

Or pull directly:

llama-server \
  -hf ironbcc/gemma-4-26B-A4B-it-MTP-GGUF \
  -hfd ironbcc/gemma-4-26B-A4B-it-MTP-GGUF:Q2_K \
  --spec-type draft-mtp --spec-draft-n-max 3 -ngl 99 -fa on

LM Studio

1. Search for this repo and download the target plus one drafter.
2. Load the target.
3. In the load settings, open Speculative Decoding and select the drafter file.

(Requires an LM Studio build that includes am17an's PR, or a custom llama.cpp runtime. As of 2026-05, the mainline LM Studio runtime doesn't yet support draft-mtp for Gemma-4; track the upstream merge.)

Recommended client request

{
  "messages": [{"role": "user", "content": "Your prompt"}],
  "temperature": 1.0, "top_p": 0.95, "top_k": 64,
  "chat_template_kwargs": {"enable_thinking": false}
}

Setting enable_thinking to false skips the <|channel>thought block, so the model answers directly. Measured effect: about +12% throughput and +9 pp accept rate. For reasoning tasks (math, code review), enable thinking: it is slower but gives better answers.
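As a minimal sketch, the request above can be sent from Python with only the standard library. It assumes llama-server is running on port 8080 as in the Run section and that the reply follows the usual OpenAI-compatible shape. The names `build_payload`, `chat`, and `SERVER_URL` are illustrative and not part of this repo.

```python
import json
import urllib.request

# Assumption: llama-server from the Run section is listening on port 8080.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(prompt, thinking=False):
    """Return the recommended request body; `thinking` toggles the thought block."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def chat(prompt, thinking=False, url=SERVER_URL, timeout=600):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    # OpenAI-compatible response shape: choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

With the server up, `chat("Your prompt")` returns the direct answer, and `chat("Your prompt", thinking=True)` switches the thought block back on for reasoning-heavy tasks.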

Benchmark (Mac M5 Max, 500-token gen, long prompt)

| Config | tok/s | Accept |
|---|---|---|
| baseline (no MTP) | 92.5 | n/a |
| MTP Q8_0 drafter | 121.7 | 65.9% |
| MTP Q4_K_M drafter | 127.9 | 67.7% |
| MTP Q2_K drafter | 131.2 | 70.3% |
| MTP + thinking off | 143.2 | 74.6% |

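A lower-bit drafter quant is faster (less memory bandwidth per draft step) and, counterintuitively, also shows a higher accept rate in these runs. A likely explanation is that quantization rounds the drafter's predictions toward the target's argmax.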
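For reference, the speedup of each configuration over the baseline follows directly from the benchmark figures. The short sketch below only divides the published tok/s numbers; the names `RUNS` and `speedups` are illustrative.

```python
# Throughput (tok/s) and accept rate (%) copied from the benchmark table.
BASELINE_TOK_S = 92.5
RUNS = {
    "MTP Q8_0 drafter": (121.7, 65.9),
    "MTP Q4_K_M drafter": (127.9, 67.7),
    "MTP Q2_K drafter": (131.2, 70.3),
    "MTP + thinking off": (143.2, 74.6),
}


def speedups(runs=RUNS, baseline=BASELINE_TOK_S):
    """Return each configuration's throughput as a multiple of the baseline."""
    return {name: tok_s / baseline for name, (tok_s, _accept) in runs.items()}


for name, ratio in speedups().items():
    print(f"{name:<20} {ratio:.2f}x")
```

This prints 1.32x, 1.38x, 1.42x, and 1.55x for the four MTP rows, the last one matching the headline figure.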

Use with Hermes (CLI agent)

Hermes validates models through LM Studio's native /api/v1/models endpoint, which llama-server doesn't expose. A short Python shim bridges the gap.

1. Launcher — `mtp-server.sh`

#!/usr/bin/env bash
set -euo pipefail
GGUF=~/gemma4-build/gguf
LLAMA=~/gemma4-build/llama.cpp/build/bin/llama-server
exec "$LLAMA" \
  -m  "$GGUF/gemma-4-26B-A4B-it-Q8_0.gguf" \
  -md "$GGUF/gemma-4-26B-A4B-it-assistant-Q2_K.gguf" \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  -ngl 99 -c 262144 -fa on \
  --host 127.0.0.1 --port 8080 \
  --alias "gemma-4-26B-A4B-it-MTP"

-c 262144 sets the full 256K-token context (the Gemma-4 maximum). Resident memory (RSS) is about 31 GB at startup and grows with usage. Drop to -c 65536 on low-memory machines.

2. LM Studio shim — `lmstudio-shim.py`

#!/usr/bin/env python3
"""Reverse proxy adapting llama-server to LM Studio's native API for Hermes."""
import http.server, json, socketserver, urllib.request, urllib.error, os
UPSTREAM = os.environ.get("UPSTREAM", "http://127.0.0.1:8080")
PORT = int(os.environ.get("PORT", "8081"))

class H(http.server.BaseHTTPRequestHandler):
    def log_message(self, *a): pass

    def _proxy(self, method):
        # Forward the request to llama-server, minus headers urllib sets itself.
        body = self.rfile.read(int(self.headers.get("Content-Length") or 0)) or None
        req = urllib.request.Request(UPSTREAM + self.path, data=body, method=method)
        for k, v in self.headers.items():
            if k.lower() not in ("host", "content-length", "transfer-encoding"):
                req.add_header(k, v)
        try:
            # The whole upstream response is buffered, so streamed (SSE) replies
            # reach the client only after generation has finished.
            with urllib.request.urlopen(req, timeout=600) as r:
                status, headers, data = r.status, r.headers.items(), r.read()
        except urllib.error.HTTPError as e:
            # Relay upstream 4xx/5xx responses together with their body.
            status, headers, data = e.code, e.headers.items(), e.read()
        except urllib.error.URLError as e:
            # llama-server unreachable: answer 502 instead of dropping the connection.
            status, headers, data = 502, [], str(e.reason).encode()
        self.send_response(status)
        for k, v in headers:
            if k.lower() not in ("transfer-encoding", "connection", "content-length"):
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def _synth_models(self):
        with urllib.request.urlopen(UPSTREAM + "/v1/models", timeout=5) as r:
            src = json.loads(r.read())
        out = []
        for m in src.get("data", []):
            meta = m.get("meta", {}) or {}
            out.append({
                "id": m["id"], "type": "llm", "publisher": "ironbcc",
                "arch": "gemma4", "compatibility_type": "gguf",
                "quantization": "Q8_0", "state": "loaded",
                "max_context_length": meta.get("n_ctx_train", 8192),
                "loaded_context_length": meta.get("n_ctx", 8192),
                "capabilities": {
                    "reasoning": {"allowed_options": ["off", "low", "medium", "high"]},
                    "chat": True, "tool_use": True,
                },
            })
        r = json.dumps({"object": "list", "data": out, "models": out}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(r)))
        self.end_headers()
        self.wfile.write(r)

    def do_GET(self):
        if self.path.startswith("/api/v1/models"): return self._synth_models()
        self._proxy("GET")
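    def do_POST(self):
        # No-op for LM Studio's model-load call: llama-server already has the model loaded.
        if self.path.startswith("/api/v1/models/load"):
            r = b'{"ok": true, "already_loaded": true}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(r)))
            self.end_headers(); self.wfile.write(r); return
        self._proxy("POST")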
    def do_DELETE(self): self._proxy("DELETE")
    def do_PUT(self):    self._proxy("PUT")

class S(socketserver.ThreadingMixIn, http.server.HTTPServer):
    daemon_threads = True; allow_reuse_address = True

if __name__ == "__main__":
    print(f"lmstudio-shim :{PORT} -> {UPSTREAM}", flush=True)
    S(("127.0.0.1", PORT), H).serve_forever()

The shim synthesizes GET /api/v1/models from llama-server's /v1/models, turns POST /api/v1/models/load into a no-op, and proxies everything else unchanged.
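The /api/v1/models synthesis can also be written as a pure function, which makes the field mapping easy to check without a running server. This is a sketch of the same mapping the shim performs. The function name `to_lmstudio_models` and the sample payload are hypothetical, and the sample keeps only the fields the shim reads.

```python
def to_lmstudio_models(llama_models, publisher="ironbcc", quantization="Q8_0"):
    """Convert llama-server's /v1/models payload into the shape the shim serves."""
    out = []
    for m in llama_models.get("data", []):
        meta = m.get("meta") or {}
        out.append({
            "id": m["id"], "type": "llm", "publisher": publisher,
            "arch": "gemma4", "compatibility_type": "gguf",
            "quantization": quantization, "state": "loaded",
            # Fall back to 8192 when llama-server does not report the field.
            "max_context_length": meta.get("n_ctx_train", 8192),
            "loaded_context_length": meta.get("n_ctx", 8192),
            "capabilities": {
                "reasoning": {"allowed_options": ["off", "low", "medium", "high"]},
                "chat": True, "tool_use": True,
            },
        })
    # The list is served under both "data" and "models", as in the shim.
    return {"object": "list", "data": out, "models": out}


# Hypothetical upstream payload; the id comes from llama-server's --alias.
sample = {
    "object": "list",
    "data": [{"id": "gemma-4-26B-A4B-it-MTP", "meta": {"n_ctx_train": 262144}}],
}
converted = to_lmstudio_models(sample)
print([m["id"] for m in converted["models"]])
```

The sample carries no n_ctx value, so loaded_context_length falls back to 8192 even though the server was started with a larger context.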

3. Start both

nohup ./mtp-server.sh    >/tmp/mtp-srv.log  2>&1 & disown
nohup python3 ./lmstudio-shim.py >/tmp/mtp-shim.log 2>&1 & disown

Verify:

curl -s http://127.0.0.1:8081/api/v1/models | jq '.models[].id'
# -> "gemma-4-26B-A4B-it-MTP"

4. Add to `~/.hermes/config.yaml`

model_aliases:
  gemma-4-mtp: { model: gemma-4-26B-A4B-it-MTP, provider: lmstudio, base_url: http://127.0.0.1:8081/v1 }

Optional default:

model:
  default: gemma-4-26B-A4B-it-MTP
  provider: lmstudio
  base_url: http://127.0.0.1:8081/v1

5. Run

hermes --model gemma-4-mtp

Hermes hits :8081/api/v1/models for validation → shim returns LM Studio shape with our model id → Hermes accepts → chat goes to :8081/v1/chat/completions → shim proxies to llama-server on :8080.
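The flow above can be replayed by hand to confirm both hops work before starting Hermes. The sketch below approximates that sequence (model validation, then one chat completion through the shim); Hermes's actual requests may differ. The names `http_json` and `smoke_check` are illustrative, and the `fetch` parameter exists only so the check can run against a stub.

```python
import json
import urllib.request

SHIM = "http://127.0.0.1:8081"  # shim address from the steps above


def http_json(url, payload=None, timeout=600):
    """GET `url`, or POST `payload` as JSON when given; return the parsed reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def smoke_check(model_id, base=SHIM, fetch=http_json):
    """Replay the two calls: model validation, then one chat completion."""
    models = fetch(base + "/api/v1/models")
    ids = [m["id"] for m in models.get("models", [])]
    if model_id not in ids:
        raise LookupError(f"{model_id!r} not offered by the shim; got {ids}")
    reply = fetch(base + "/v1/chat/completions", {
        "model": model_id,
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    })
    return reply["choices"][0]["message"]["content"]
```

With both processes running, `smoke_check("gemma-4-26B-A4B-it-MTP")` returns a short reply; a `LookupError` means the shim is up but is not advertising the expected alias.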

Stop

pkill -f "llama-server|lmstudio-shim"

License

Apache-2.0 (inherits from Google Gemma terms). Accept the Gemma license on HF before download.

Attribution

google/gemma-4-26B-A4B-it — base model, Google.
google/gemma-4-26B-A4B-it-assistant — MTP drafter, Google.
unsloth/gemma-4-26B-A4B-it-GGUF — target Q8_0, Unsloth (Dynamic 2.0 quant).
llama.cpp MTP infra: am17an (Aman Gupta), PR #23398.
