Higgs Audio v3 TTS
Higgs Audio v3 TTS is built for voice chat: it speaks, not just reads. It turns model responses into expressive conversational speech across 100+ languages, with zero-shot voice cloning and inline control over emotion, style, prosody, pauses, and sound effects.
Released for research and non-commercial use under the Boson Higgs Audio v3 Research and Non-Commercial License. Production, hosted APIs, or revenue-generating use requires a separate commercial license. Prohibited: voice cloning without consent, impersonation, fraud, election deception, biometric surveillance, or any unlawful use.
The Higgs autoregressive decoder consumes interleaved text and audio tokens. Audio is encoded by the Higgs Tokenizer into 8 codebooks at 25 fps, staggered via a delay pattern, then mapped to backbone hidden states through a multi-codebook fused embedding. Output codes pass through a multi-codebook fused head, are de-delayed, and are decoded back to a waveform.
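The staggering and de-delaying steps can be illustrated with a minimal sketch. It assumes the common layout in which codebook k is shifted right by k steps and empty slots hold a filler id. The one-step offset and the `PAD` value are assumptions for illustration, not confirmed details of the Higgs Tokenizer.

```python
PAD = -1  # hypothetical filler id for slots that hold no code yet


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k steps (assumed offset of one step per codebook).

    codes: list of K rows, each holding T code ids (one per audio frame).
    Returns K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        tail = total_steps - k - num_frames
        delayed.append([pad] * k + list(row) + [pad] * tail)
    return delayed


def remove_delay(delayed):
    """Undo apply_delay: realign every codebook to the same frame index."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]  # toy case: 3 codebooks, 3 frames
delayed = apply_delay(codes)
restored = remove_delay(delayed)
```

With this layout, each decoding step emits one code per codebook, and codebook k at a given step belongs to the frame k steps earlier.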
| Component | Spec |
|---|---|
| Backbone | ~4B autoregressive decoder (36 L, hidden=2560, GQA 32/8) |
| Multi-codebook embedding / head | Fused single-tensor, tied with text embedding |
| Context length | 8,192 tokens (training sequence length) |
| Audio tokens | 8 codebooks × 1026 vocab, delay pattern |
| Sample rate | 24 kHz |
| Frame rate | 25 fps (40 ms / frame) |
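The figures in the table allow a rough size estimate for a clip. The sketch below uses only the frame rate, codebook count, sample rate, and context length listed above. It additionally assumes that each audio frame occupies one backbone position (suggested by the fused embedding, but not stated) and that the delay pattern adds one extra step per additional codebook; both assumptions are marked in the code.

```python
import math

FRAME_RATE_HZ = 25       # from the spec table
NUM_CODEBOOKS = 8        # from the spec table
SAMPLE_RATE_HZ = 24_000  # from the spec table
CONTEXT_TOKENS = 8_192   # from the spec table

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # 960 samples per 40 ms frame


def audio_frames(seconds):
    """Number of codec frames needed to cover a clip of the given length."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def total_codes(seconds):
    """Individual codebook entries for the clip (frames x codebooks)."""
    return audio_frames(seconds) * NUM_CODEBOOKS


def decoding_steps(seconds):
    """ASSUMPTION: one backbone position per frame, plus K-1 delay steps."""
    return audio_frames(seconds) + NUM_CODEBOOKS - 1


def max_audio_seconds(text_tokens=0):
    """Upper bound on audio length in one context, under the same assumption."""
    return (CONTEXT_TOKENS - text_tokens - (NUM_CODEBOOKS - 1)) / FRAME_RATE_HZ
```

Under these assumptions a 10-second clip takes 250 frames, 2,000 codes, and 257 decoding steps.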
Supported Languages
The model reaches single-digit WER/CER on 102 languages, which split into two tiers.
WER/CER under 5 — polished, production-quality (85)
🇿🇦 Afrikaans · 🇸🇦🇪🇬 Arabic · 🇦🇲 Armenian · 🇮🇳 Assamese · 🇪🇸 Asturian · 🇦🇿 Azerbaijani · 🇷🇺 Bashkir · 🇪🇸 Basque · 🇧🇾 Belarusian · 🇧🇩🇮🇳 Bengali · 🇧🇦 Bosnian · 🇧🇬 Bulgarian · 🇪🇸 Catalan · 🇵🇭 Cebuano · 🇮🇶 Central Kurdish · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱🇧🇪 Dutch · 🇷🇺 Eastern Mari · 🇺🇸🇬🇧🇦🇺 English · 🌐 Esperanto · 🇪🇪 Estonian · 🇫🇮 Finnish · 🇫🇷🇨🇦 French · 🇪🇸 Galician · 🇬🇪 Georgian · 🇩🇪🇦🇹 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇭🇹 Haitian Creole · 🇳🇬 Hausa · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇩 Javanese · 🇮🇳 Kannada · 🇰🇿 Kazakh · 🇰🇷 Korean · 🇷🇼 Kinyarwanda · 🇰🇬 Kyrgyz · 🇱🇻 Latvian · 🇨🇩 Lingala · 🇱🇹 Lithuanian · 🇰🇪 Luo · 🇲🇰 Macedonian · 🇲🇾🇮🇩 Malay · 🇮🇳 Malayalam · 🇲🇹 Maltese · 🇳🇿 Māori · 🇮🇳 Marathi · 🇲🇳 Mongolian · 🇳🇵 Nepali · 🇳🇴 Norwegian · 🇫🇷 Occitan · 🇮🇷🇦🇫 Persian · 🇵🇱 Polish · 🇵🇹🇧🇷 Portuguese · 🇷🇴 Romanian · 🇷🇺 Russian · 🇿🇦 Sepedi · 🇷🇸 Serbian · 🇿🇼 Shona · 🇸🇰 Slovak · 🇸🇮 Slovene · 🇪🇸🇲🇽 Spanish · 🇹🇿🇰🇪 Swahili · 🇸🇪 Swedish · 🇵🇭 Tagalog · 🇹🇯 Tajik · 🇮🇳🇱🇰 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai · 🇹🇷 Turkish · 🇺🇦 Ukrainian · 🇵🇰🇮🇳 Urdu · 🇨🇳 Uyghur · 🇺🇿 Uzbek · 🇻🇳 Vietnamese · 🇿🇦 Xhosa · 🇿🇦 Zulu
WER/CER between 5 and 10 — usable, but less polished (17)
🇦🇱 Albanian · 🇲🇼🇿🇲 Chichewa/Nyanja · 🇮🇳🇵🇰 Eastern Punjabi · 🇺🇬 Ganda · 🇮🇸 Icelandic · 🇮🇪 Irish · 🇩🇿 Kabyle · 🇨🇻 Kabuverdianu · 🇰🇪 Kamba · 🇻🇦 Latin · 🇱🇺 Luxembourgish · 🇪🇹🇰🇪 Oromo · 🇦🇫🇵🇰 Pashto · 🇵🇰🇮🇳 Sindhi · 🇸🇴 Somali · 🇦🇴 Umbundu · 🇬🇧 Welsh
Control Tokens
All tags follow <|category:value|> syntax and can be inserted mid-utterance.
- Emotion: elation, amusement, enthusiasm, determination, pride, contentment, affection, relief, contemplation, confusion, surprise, awe, longing, arousal, anger, fear, disgust, bitterness, sadness, shame, helplessness
| Token | Description |
|---|---|
| <|emotion:elation|> | Elation / joy |
| <|emotion:amusement|> | Amusement / playful laughter |
| <|emotion:enthusiasm|> | Enthusiasm / excitement |
| <|emotion:determination|> | Determination / firmness |
| <|emotion:pride|> | Pride / confidence |
| <|emotion:contentment|> | Calm satisfaction |
| <|emotion:affection|> | Warmth / affection |
| <|emotion:relief|> | Relief |
| <|emotion:contemplation|> | Thoughtful / reflective |
| <|emotion:confusion|> | Confused |
| <|emotion:surprise|> | Surprised |
| <|emotion:awe|> | Awe / wonder |
| <|emotion:longing|> | Longing / yearning |
| <|emotion:arousal|> | Heightened desire |
| <|emotion:anger|> | Anger |
| <|emotion:fear|> | Fear |
| <|emotion:disgust|> | Disgust |
| <|emotion:bitterness|> | Bitterness |
| <|emotion:sadness|> | Sadness |
| <|emotion:shame|> | Shame |
| <|emotion:helplessness|> | Helplessness |
- Style: singing, shouting, whispering
| Token | Description |
|---|---|
| <|style:singing|> | Singing |
| <|style:shouting|> | Shouting / projected voice |
| <|style:whispering|> | Whisper |
- Sound effects: cough, laughter, crying, screaming, burping, humming, sigh, sniff, sneeze
Pair each token with the matching onomatopoeia immediately after it.
| Token | Description | Suggested onomatopoeia |
|---|---|---|
| <|sfx:cough|> | Cough | Ahem |
| <|sfx:laughter|> | Laughter | Haha / Hehe |
| <|sfx:crying|> | Crying | Boohoo / Sob |
| <|sfx:screaming|> | Screaming | Ahh / Aaah |
| <|sfx:burping|> | Burping | Burp |
| <|sfx:humming|> | Humming | Hmm / Mmm |
| <|sfx:sigh|> | Sigh | Uh / Ahh |
| <|sfx:sniff|> | Sniff | Sff |
| <|sfx:sneeze|> | Sneeze | Achoo |
- Prosody
  - Speed: speed_very_slow, speed_slow, speed_fast, speed_very_fast
  - Pauses: pause, long_pause
  - Pitch: pitch_low, pitch_high
  - Delivery: expressive_high, expressive_low
| Token | Effect |
|---|---|
| <|prosody:speed_very_slow|> | ≈0.65× speed |
| <|prosody:speed_slow|> | ≈0.85× speed |
| <|prosody:speed_fast|> | ≈1.2× speed |
| <|prosody:speed_very_fast|> | ≈1.4× speed |
| <|prosody:pitch_low|> | ≈−3 semitones |
| <|prosody:pitch_high|> | ≈+2.5 semitones |
| <|prosody:pause|> | ≈400–700 ms pause |
| <|prosody:long_pause|> | ≈700–1500 ms pause |
| <|prosody:expressive_high|> | More expressive delivery |
| <|prosody:expressive_low|> | Flatter delivery |
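Because a misspelled tag is easy to miss, it can help to build tags from a lookup rather than typing them by hand. The sketch below is a small client-side helper and not part of any Higgs or SGLang-Omni API. The names `CONTROL_VALUES` and `control_tag` are hypothetical, and the value lists are copied from the tables above.

```python
# Allowed values per category, copied from the tables in this section.
CONTROL_VALUES = {
    "emotion": {
        "elation", "amusement", "enthusiasm", "determination", "pride",
        "contentment", "affection", "relief", "contemplation", "confusion",
        "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
        "bitterness", "sadness", "shame", "helplessness",
    },
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "cough", "laughter", "crying", "screaming", "burping",
        "humming", "sigh", "sniff", "sneeze",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pause", "long_pause", "pitch_low", "pitch_high",
        "expressive_high", "expressive_low",
    },
}


def control_tag(category, value):
    """Return '<|category:value|>' after checking it against the documented lists."""
    allowed = CONTROL_VALUES.get(category)
    if allowed is None:
        raise ValueError(f"unknown category: {category!r}")
    if value not in allowed:
        raise ValueError(f"unknown {category} value: {value!r}")
    return f"<|{category}:{value}|>"


text = control_tag("emotion", "relief") + "Finally, it works."
```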
Evaluation Benchmarks
Multilingual Voice Clone
We evaluate Higgs Audio v3 TTS on public multilingual TTS suites and our internal 111-language Higgs-Multilingual set, covering both common and lower-resource languages.
WER / CER (↓, ×100) macro-averaged across each benchmark's language set. Lower is better; bold marks the best per row. All numbers are reproducible end-to-end with original metrics and normalization.
| Benchmark | Higgs Audio v2 | Higgs Audio v3 | Fish Audio S2 Pro | Qwen3-TTS-1.7B | VibeVoice-7B | IndexTTS-2 | MiMo-Audio-7B-Instruct | MOSS-TTS-v1.5 | OmniVoice | ChatterBox | FireRedTTS-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SeedTTS | 2.10 | **1.11** | 1.31 | 1.30 | 3.59 | 1.63 | 3.70 | 1.73 | 1.21 | 17.00 | 1.72 |
| CV3 | 21.19 | **4.41** | 4.60 | 7.73 | 11.66 | 129.26 | 71.55 | 6.11 | 4.92 | 32.62 | 19.20 |
| MiniMax-Multilingual | 49.86 | **2.74** | 5.15 | 27.41 | 8.21 | 112.91 | 85.67 | 3.78 | 2.98 | 49.30 | 12.52 |
| Higgs-Multilingual | 52.24 | **3.61** | 8.68 | 97.09 | 13.74 | 57.71 | 59.61 | 21.28 | 3.63 | 57.52 | 33.69 |
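For readers unfamiliar with the metric, the sketch below shows how a WER / CER figure on the ×100 scale and a macro-average over languages are commonly computed. It is a generic illustration and not the benchmark's own scoring code. Text normalization, the ASR model used for transcription, and the aggregation within each language are not described here and are left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by reference length, scaled by 100."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are characters, spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def macro_average(score_by_language):
    """Unweighted mean over languages, so every language counts equally."""
    return sum(score_by_language.values()) / len(score_by_language)
```

Because insertions count as errors, the rate can exceed 100 when the transcript is much longer than the reference, which is why some cells in the table are above 100.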
Emergent TTS
Win-rate (↑) per category — judge preference vs the BASELINE row; bold marks the highest win-rate per column. For a fair comparison, every model shares the same reference audio per prompt, and we run the benchmark text verbatim — no inline control tags inserted.
| Model | Overall ↑ | Emotions ↑ | Foreign Words ↑ | Paralinguistics ↑ | Complex Pronunciation ↑ | Questions ↑ | Syntactic Complexity ↑ |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 | **53.65%** | 53.75% | **48.75%** | **68.57%** | 25.10% | **61.43%** | **60.71%** |
| Fish Audio S2 Pro | 43.80% | 53.04% | 33.93% | 53.75% | 18.16% | 55.00% | 45.71% |
| Qwen3-TTS-1.7B | 38.84% | 45.54% | 24.64% | 44.29% | **30.00%** | 53.39% | 34.11% |
| IndexTTS-2 | 31.12% | 39.29% | 5.36% | 42.50% | 12.45% | 45.89% | 38.93% |
| MOSS-TTS-v1.5 | 43.89% | 60.54% | 35.18% | 51.43% | 11.63% | 53.21% | 47.32% |
| OmniVoice | 40.82% | **61.07%** | 28.75% | 52.68% | 13.67% | 45.00% | 40.36% |
Usage
SGLang Usage
Pair the weights in this repo with SGLang-Omni, a production serving stack with continuous batching for multi-codebook decoding and the same inline tag controls. See the Higgs TTS cookbook for installation, server launch, request examples, and the full API reference.
Install and Serve
docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
  lmsysorg/sglang-omni:dev /bin/zsh
git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hf download bosonai/higgs-audio-v3-tts-4b
sgl-omni serve \
  --model-path bosonai/higgs-audio-v3-tts-4b \
  --port 8000
Zero-shot synthesis
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
Voice cloning
Supplying the reference transcript (text) materially improves cloning fidelity.
import requests
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Have a nice day and enjoy south california sunshine.",
        "references": [{
            "audio_path": "ref.wav",
            "text": "Hey, Adam here. Let's create something that feels real, sounds human, and connects every time.",
        }],
        "temperature": 0.8, "top_k": 50, "max_new_tokens": 1024,
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
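A quick sanity check on the returned bytes can catch an error body that was saved as `output.wav`. The sketch below reads the header with the standard-library `wave` module. It assumes the server returns integer PCM WAV, which `wave` can parse; other encodings would raise `wave.Error`. The helper name `describe_wav` is hypothetical.

```python
import io
import wave


def describe_wav(wav_bytes):
    """Return (sample_rate_hz, channels, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


# Usage with the response from the request above (not executed here):
#     rate, channels, seconds = describe_wav(resp.content)
```

Given the 24 kHz sample rate in the spec table, a different rate in the header would be worth investigating.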
Streaming (Server-Sent Events)
Set "stream": true to receive base64-encoded WAV chunks as the vocoder emits them — sub-second time-to-first-audio. Each event carries audio.data (base64 WAV bytes); the terminal event has finish_reason: "stop" plus usage metadata.
import requests, base64, json
with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Get the trust fund to the bank early.", "stream": True},
    stream=True,
) as resp, open("output.wav", "wb") as f:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        event = json.loads(line[6:])
        if event.get("finish_reason") == "stop":
            break
        audio = event.get("audio") or {}
        if audio.get("data"):
            f.write(base64.b64decode(audio["data"]))
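The event handling can also be factored into a generator that is testable without a running server. This is a sketch of the parsing described above and not an official client. The function name `iter_audio_chunks` is hypothetical, and, unlike the loop above, it yields any audio carried by the terminal event before stopping, in case a server attaches data to it.

```python
import base64
import json

DATA_PREFIX = b"data: "


def iter_audio_chunks(lines):
    """Yield decoded audio bytes from an iterable of SSE lines (bytes)."""
    for line in lines:
        if not line.startswith(DATA_PREFIX):
            continue  # blank keep-alive lines, comments, other SSE fields
        payload = line[len(DATA_PREFIX):]
        if payload == b"[DONE]":
            break
        event = json.loads(payload)
        audio = event.get("audio") or {}
        if audio.get("data"):
            yield base64.b64decode(audio["data"])
        if event.get("finish_reason") == "stop":
            break


# Usage with requests (not executed here):
#     for chunk in iter_audio_chunks(resp.iter_lines()):
#         f.write(chunk)
```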
Inline control tokens
Embed <|emotion:…|>, <|style:…|>, <|prosody:…|>, and <|sfx:…|> tokens directly in input. Two rules:
- Delivery tokens first. Emotion, style, and the prosody speed / pitch / expressive tokens shape the whole turn, so put them at the start of input. Positional tokens (<|prosody:pause|>, <|prosody:long_pause|>, <|sfx:…|>) go inline exactly where they fire.
- Pair every <|sfx:…|> with its onomatopoeia. E.g. <|sfx:laughter|>Haha, <|sfx:sigh|>Uh, <|sfx:sneeze|>Achoo. The written sound gives the model the acoustic cue to realize the effect.
Example — amusement + laughter:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "<|emotion:amusement|><|prosody:expressive_high|>Wait, wait, that was kind of hilarious. <|sfx:laughter|>Hehe, no, seriously, I was not ready for that."}' \
  --output output.wav
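The two placement rules can be checked on the client before a request is sent. The sketch below is a hypothetical linter and not part of the serving stack. It flags delivery tags that appear after spoken text or a positional tag, and `<|sfx:…|>` tags with no written text directly after them. It does not verify that the text is the suggested onomatopoeia.

```python
import re

TAG_RE = re.compile(r"<\|(\w+):(\w+)\|>")
POSITIONAL_PROSODY = {"pause", "long_pause"}


def is_delivery_tag(category, value):
    """Emotion, style, and non-pause prosody tags shape the whole turn."""
    if category in ("emotion", "style"):
        return True
    return category == "prosody" and value not in POSITIONAL_PROSODY


def lint_input(text):
    """Return a list of warnings for an input string; empty means no issue found."""
    warnings = []
    leading_run_over = False  # True once text or a positional tag has appeared
    cursor = 0
    matches = list(TAG_RE.finditer(text))
    for index, match in enumerate(matches):
        if text[cursor:match.start()].strip():
            leading_run_over = True
        cursor = match.end()
        category, value = match.groups()
        if is_delivery_tag(category, value):
            if leading_run_over:
                warnings.append(f"{match.group(0)}: move delivery tags to the start of the input")
            continue
        leading_run_over = True
        if category == "sfx":
            has_next = index + 1 < len(matches)
            next_start = matches[index + 1].start() if has_next else len(text)
            if not text[match.end():next_start].strip():
                warnings.append(f"{match.group(0)}: add the written sound right after the tag")
    return warnings


example = ("<|emotion:amusement|><|prosody:expressive_high|>Wait, wait, that was "
           "kind of hilarious. <|sfx:laughter|>Hehe, no, seriously, I was not ready for that.")
issues = lint_input(example)
```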
Throughput
Throughput on Seed-TTS EN (full set, N=1088 per run). Client --max-concurrency sweep against a Higgs server (max_running_requests=16, bf16, CUDA Graph on). Each row is the mean of 3 runs. Hardware: 1× H100.
| Concurrency | Throughput (req/s) | Mean latency | RTF (per-req) | audio_s/s |
|---|---|---|---|---|
| 1 | 1.62 | 617 ms | 0.147 | 6.89 |
| 2 | 2.70 | 742 ms | 0.180 | 11.37 |
| 4 | 5.45 | 733 ms | 0.177 | 22.84 |
| 8 | 8.91 | 898 ms | 0.217 | 37.38 |
| 16 | 14.74 | 1079 ms | 0.262 | 61.84 |
- Concurrency: Maximum number of in-flight client requests (--max-concurrency).
- Throughput (req/s): Completed requests divided by total benchmark wall-clock time.
- Mean latency: Average end-to-end time per request (send to full response received).
- RTF (per-req): Average ratio of processing time to generated audio duration per request (<1 is faster than real time).
- audio_s/s: Total seconds of audio produced divided by total benchmark wall-clock time.
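The column definitions above translate directly into a few lines of arithmetic. The sketch below computes them from per-request records and is not the benchmark script itself. The `RequestRecord` type is hypothetical, and it assumes that "processing time" in the RTF definition equals the end-to-end latency of the request.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float  # send to full response received
    audio_s: float    # duration of the generated audio


def summarize(records, wall_clock_s):
    """Compute the table's columns from completed requests and total wall-clock time."""
    count = len(records)
    return {
        "throughput_req_s": count / wall_clock_s,
        "mean_latency_s": sum(r.latency_s for r in records) / count,
        # ASSUMPTION: processing time is taken to be the request latency.
        "rtf_per_req": sum(r.latency_s / r.audio_s for r in records) / count,
        "audio_s_per_s": sum(r.audio_s for r in records) / wall_clock_s,
    }


stats = summarize(
    [RequestRecord(latency_s=1.0, audio_s=4.0), RequestRecord(latency_s=2.0, audio_s=4.0)],
    wall_clock_s=2.0,
)
```

Dividing audio_s/s by throughput gives the mean audio length per request; for the concurrency-1 row that is 6.89 / 1.62 ≈ 4.25 s.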
To reproduce the results, follow the instructions in this script.
API Usage
For zero-ops deployment, use the Boson AI API.
Citation
@misc{bosonai_higgs_audio_tts_v3_2026,
title = {Higgs Audio v3 TTS: Conversational Speech for Voice AI from Boson AI},
author = {Boson AI},
year = {2026},
howpublished = {https://huggingface.co/bosonai/higgs-audio-v3-tts-4b},
}
License
Boson Higgs Audio v3 Research and Non-Commercial License — see LICENSE.
