Higgs Audio v3 TTS

Higgs Audio v3 TTS is built for voice chat: it speaks, not just reads. It turns model responses into expressive conversational speech across 100+ languages, with zero-shot voice cloning and inline control over emotion, style, prosody, pauses, and sound effects.

Released for research and non-commercial use under the Boson Higgs Audio v3 Research and Non-Commercial License. Production, hosted APIs, or revenue-generating use requires a separate commercial license. Prohibited: voice cloning without consent, impersonation, fraud, election deception, biometric surveillance, or any unlawful use.

Higgs Audio v3 TTS Architecture

The Higgs autoregressive decoder consumes interleaved text and audio tokens. The Higgs Tokenizer encodes audio into 8 codebooks at 25 fps; the codebooks are staggered via a delay pattern, then mapped to backbone hidden states through a multi-codebook fused embedding. Output codes pass through a multi-codebook fused head, are de-delayed, and are decoded back to a waveform.
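The exact delay pattern is not spelled out here. As a minimal sketch, assume the common scheme in which codebook k is shifted right by k frames, so that step t of the decoder sees frame t of codebook 0, frame t-1 of codebook 1, and so on. The `PAD` id and both function names are hypothetical, not part of the released code.

```python
PAD = -1  # hypothetical filler id; the real model's padding/special ids may differ


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k frames.

    codes: list of K rows, each a list of T code ids (frame-aligned).
    Returns K rows of length T + K - 1.
    """
    k = len(codes)
    return [[pad] * i + list(row) + [pad] * (k - 1 - i) for i, row in enumerate(codes)]


def remove_delay(delayed):
    """Invert apply_delay, recovering the K x T frame-aligned codes."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]


# Toy example: 3 codebooks, 4 frames.
codes = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
delayed = apply_delay(codes)
restored = remove_delay(delayed)
```

Under this assumption, K codebooks add K - 1 extra decoding steps per utterance, and de-delaying is a pure index shift with no loss.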

Component                        Spec
Backbone                         ~4B autoregressive decoder (36 L, hidden=2560, GQA 32/8)
Multi-codebook embedding / head  Fused single-tensor, tied with text embedding
Context length                   8,192 tokens (training sequence length)
Audio tokens                     8 codebooks × 1026 vocab, delay pattern
Sample rate                      24 kHz
Frame rate                       25 fps (40 ms / frame)
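The figures in the spec table fit together arithmetically. The sketch below derives samples per frame and codes per second from them. The last line assumes one backbone position per audio frame (which the fused multi-codebook embedding suggests but this card does not state), so treat it as a rough upper bound that ignores text tokens, reference audio, and delay padding.

```python
import math

SAMPLE_RATE_HZ = 24_000
FRAME_RATE_FPS = 25
NUM_CODEBOOKS = 8
CONTEXT_LENGTH = 8_192

samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_FPS  # waveform samples covered by one frame
frame_ms = 1000 / FRAME_RATE_FPS                      # duration of one frame in milliseconds
codes_per_second = FRAME_RATE_FPS * NUM_CODEBOOKS     # discrete codes emitted per second of audio


def frames_for(seconds):
    """Number of audio frames needed to cover `seconds` of speech."""
    return math.ceil(seconds * FRAME_RATE_FPS)


# Assumption: one backbone position per frame (not confirmed by this card).
max_audio_seconds = CONTEXT_LENGTH / FRAME_RATE_FPS
```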

Supported Languages

The model reaches single-digit WER/CER on 102 languages, which split into two tiers.

WER/CER under 5 — polished, production-quality (85)

🇿🇦 Afrikaans · 🇸🇦🇪🇬 Arabic · 🇦🇲 Armenian · 🇮🇳 Assamese · 🇪🇸 Asturian · 🇦🇿 Azerbaijani · 🇷🇺 Bashkir · 🇪🇸 Basque · 🇧🇾 Belarusian · 🇧🇩🇮🇳 Bengali · 🇧🇦 Bosnian · 🇧🇬 Bulgarian · 🇪🇸 Catalan · 🇵🇭 Cebuano · 🇮🇶 Central Kurdish · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱🇧🇪 Dutch · 🇷🇺 Eastern Mari · 🇺🇸🇬🇧🇦🇺 English · 🌐 Esperanto · 🇪🇪 Estonian · 🇫🇮 Finnish · 🇫🇷🇨🇦 French · 🇪🇸 Galician · 🇬🇪 Georgian · 🇩🇪🇦🇹 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇭🇹 Haitian Creole · 🇳🇬 Hausa · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇩 Javanese · 🇮🇳 Kannada · 🇰🇿 Kazakh · 🇰🇷 Korean · 🇷🇼 Kinyarwanda · 🇰🇬 Kyrgyz · 🇱🇻 Latvian · 🇨🇩 Lingala · 🇱🇹 Lithuanian · 🇰🇪 Luo · 🇲🇰 Macedonian · 🇲🇾🇮🇩 Malay · 🇮🇳 Malayalam · 🇲🇹 Maltese · 🇳🇿 Māori · 🇮🇳 Marathi · 🇲🇳 Mongolian · 🇳🇵 Nepali · 🇳🇴 Norwegian · 🇫🇷 Occitan · 🇮🇷🇦🇫 Persian · 🇵🇱 Polish · 🇵🇹🇧🇷 Portuguese · 🇷🇴 Romanian · 🇷🇺 Russian · 🇿🇦 Sepedi · 🇷🇸 Serbian · 🇿🇼 Shona · 🇸🇰 Slovak · 🇸🇮 Slovene · 🇪🇸🇲🇽 Spanish · 🇹🇿🇰🇪 Swahili · 🇸🇪 Swedish · 🇵🇭 Tagalog · 🇹🇯 Tajik · 🇮🇳🇱🇰 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai · 🇹🇷 Turkish · 🇺🇦 Ukrainian · 🇵🇰🇮🇳 Urdu · 🇨🇳 Uyghur · 🇺🇿 Uzbek · 🇻🇳 Vietnamese · 🇿🇦 Xhosa · 🇿🇦 Zulu

WER/CER between 5 and 10 — usable, but less polished (17)

🇦🇱 Albanian · 🇲🇼🇿🇲 Chichewa/Nyanja · 🇮🇳🇵🇰 Eastern Punjabi · 🇺🇬 Ganda · 🇮🇸 Icelandic · 🇮🇪 Irish · 🇩🇿 Kabyle · 🇨🇻 Kabuverdianu · 🇰🇪 Kamba · 🇻🇦 Latin · 🇱🇺 Luxembourgish · 🇪🇹🇰🇪 Oromo · 🇦🇫🇵🇰 Pashto · 🇵🇰🇮🇳 Sindhi · 🇸🇴 Somali · 🇦🇴 Umbundu · 🇬🇧 Welsh

Control Tokens

All tags follow <|category:value|> syntax and can be inserted mid-utterance.

  • Emotion: elation, amusement, enthusiasm, determination, pride, contentment, affection, relief, contemplation, confusion, surprise, awe, longing, arousal, anger, fear, disgust, bitterness, sadness, shame, helplessness
Token                      Description
<|emotion:elation|>        Elation / joy
<|emotion:amusement|>      Amusement / playful laughter
<|emotion:enthusiasm|>     Enthusiasm / excitement
<|emotion:determination|>  Determination / firmness
<|emotion:pride|>          Pride / confidence
<|emotion:contentment|>    Calm satisfaction
<|emotion:affection|>      Warmth / affection
<|emotion:relief|>         Relief
<|emotion:contemplation|>  Thoughtful / reflective
<|emotion:confusion|>      Confused
<|emotion:surprise|>       Surprised
<|emotion:awe|>            Awe / wonder
<|emotion:longing|>        Longing / yearning
<|emotion:arousal|>        Heightened desire
<|emotion:anger|>          Anger
<|emotion:fear|>           Fear
<|emotion:disgust|>        Disgust
<|emotion:bitterness|>     Bitterness
<|emotion:sadness|>        Sadness
<|emotion:shame|>          Shame
<|emotion:helplessness|>   Helplessness
  • Style: singing, shouting, whispering
Token                 Description
<|style:singing|>     Singing
<|style:shouting|>    Shouting / projected voice
<|style:whispering|>  Whisper
  • Sound effects: cough, laughter, crying, screaming, burping, humming, sigh, sniff, sneeze

Pair each token with the matching onomatopoeia immediately after it.

Token              Description  Suggested onomatopoeia
<|sfx:cough|>      Cough        Ahem
<|sfx:laughter|>   Laughter     Haha / Hehe
<|sfx:crying|>     Crying       Boohoo / Sob
<|sfx:screaming|>  Screaming    Ahh / Aaah
<|sfx:burping|>    Burping      Burp
<|sfx:humming|>    Humming      Hmm / Mmm
<|sfx:sigh|>       Sigh         Uh / Ahh
<|sfx:sniff|>      Sniff        Sff
<|sfx:sneeze|>     Sneeze       Achoo
  • Prosody
    • Speed: speed_very_slow, speed_slow, speed_fast, speed_very_fast
    • Pauses: pause, long_pause
    • Pitch: pitch_low, pitch_high
    • Delivery: expressive_high, expressive_low
Token                        Effect
<|prosody:speed_very_slow|>  ≈0.65× speed
<|prosody:speed_slow|>       ≈0.85× speed
<|prosody:speed_fast|>       ≈1.2× speed
<|prosody:speed_very_fast|>  ≈1.4× speed
<|prosody:pitch_low|>        ≈−3 semitones
<|prosody:pitch_high|>       ≈+2.5 semitones
<|prosody:pause|>            ≈400–700 ms pause
<|prosody:long_pause|>       ≈700–1500 ms pause
<|prosody:expressive_high|>  More expressive delivery
<|prosody:expressive_low|>   Flatter delivery
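Because every tag follows the same <|category:value|> shape, a typo in a tag is easy to catch before a request is sent. The sketch below is a client-side check only: `KNOWN_TAGS` is transcribed from the lists above, while `TAG_RE`, `find_tags`, and `unknown_tags` are hypothetical helper names. How the server treats an unrecognized tag is not documented here.

```python
import re

# Vocabulary transcribed from the tables above.
KNOWN_TAGS = {
    "emotion": {
        "elation", "amusement", "enthusiasm", "determination", "pride",
        "contentment", "affection", "relief", "contemplation", "confusion",
        "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
        "bitterness", "sadness", "shame", "helplessness",
    },
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "cough", "laughter", "crying", "screaming", "burping",
        "humming", "sigh", "sniff", "sneeze",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pause", "long_pause", "pitch_low", "pitch_high",
        "expressive_high", "expressive_low",
    },
}

# Matches <|category:value|> with lowercase/underscore names.
TAG_RE = re.compile(r"<\|([a-z_]+):([a-z_]+)\|>")


def find_tags(text):
    """Return (category, value) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(text)]


def unknown_tags(text):
    """Return the tags in `text` that are not in KNOWN_TAGS."""
    return [(c, v) for c, v in find_tags(text) if v not in KNOWN_TAGS.get(c, set())]
```

For example, `unknown_tags("<|emotion:joy|>Hi")` flags `("emotion", "joy")`, since the documented token is `elation`.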

Evaluation Benchmarks

Multilingual Voice Clone

We evaluate Higgs Audio v3 TTS on public multilingual TTS suites and our internal 111-language Higgs-Multilingual set, covering both common and lower-resource languages.

WER / CER (↓, ×100) macro-averaged across each benchmark's language set. Lower is better; bold marks the best per row. All numbers are reproducible end-to-end with original metrics and normalization.

Benchmark             Higgs Audio v2  Higgs Audio v3  Fish Audio S2 Pro  Qwen3-TTS-1.7B  VibeVoice-7B  IndexTTS-2  MiMo-Audio-7B-Instruct  MOSS-TTS-v1.5  OmniVoice  ChatterBox  FireRedTTS-2
SeedTTS               2.10            1.11            1.31               1.30            3.59          1.63        3.70                    1.73           1.21       17.00       1.72
CV3                   21.19           4.41            4.60               7.73            11.66         129.26      71.55                   6.11           4.92       32.62       19.20
MiniMax-Multilingual  49.86           2.74            5.15               27.41           8.21          112.91      85.67                   3.78           2.98       49.30       12.52
Higgs-Multilingual    52.24           3.61            8.68               97.09           13.74         57.71       59.61                   21.28          3.63       57.52       33.69
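For readers unfamiliar with the metric, the sketch below shows one way a macro-averaged WER/CER (×100) can be computed. It is not the benchmark's own scoring code. It assumes corpus-level error rates per language, whitespace tokenization for WER, and no text normalization, and all function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs, unit="word"):
    """WER (unit="word") or CER (unit="char"), scaled by 100, over (ref, hyp) pairs."""
    split = str.split if unit == "word" else list
    errors = sum(edit_distance(split(ref), split(hyp)) for ref, hyp in pairs)
    total = sum(len(split(ref)) for ref, _ in pairs)
    return 100.0 * errors / total


def macro_average(per_language):
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)
```

Because insertions count as errors, the rate is not capped at 100: a hypothesis much longer than its reference scores above 100, which is how values such as 129.26 can appear in the table.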

Emergent TTS

Win-rate (↑) per category — judge preference vs the BASELINE row; bold marks the highest win-rate per column. For a fair comparison, every model shares the same reference audio per prompt, and we run the benchmark text verbatim — no inline control tags inserted.

Model              Overall ↑  Emotions ↑  Foreign Words ↑  Paralinguistics ↑  Complex Pronunciation ↑  Questions ↑  Syntactic Complexity ↑
Higgs Audio v3     53.65%     53.75%      48.75%           68.57%             25.10%                   61.43%       60.71%
Fish Audio S2 Pro  43.80%     53.04%      33.93%           53.75%             18.16%                   55.00%       45.71%
Qwen3-TTS-1.7B     38.84%     45.54%      24.64%           44.29%             30.00%                   53.39%       34.11%
IndexTTS-2         31.12%     39.29%      5.36%            42.50%             12.45%                   45.89%       38.93%
MOSS-TTS-v1.5      43.89%     60.54%      35.18%           51.43%             11.63%                   53.21%       47.32%
OmniVoice          40.82%     61.07%      28.75%           52.68%             13.67%                   45.00%       40.36%

Usage

SGLang Usage

Pair the weights in this repo with SGLang-Omni, a production serving stack with continuous batching for multi-codebook decoding and the same inline tag controls. The Higgs TTS cookbook covers installation, server launch, request examples, and the full API reference.

Install and Serve

docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
  lmsysorg/sglang-omni:dev /bin/zsh

git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hf download bosonai/higgs-audio-v3-tts-4b

sgl-omni serve \
  --model-path bosonai/higgs-audio-v3-tts-4b \
  --port 8000

Zero-shot synthesis

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
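The same request can be issued from Python without third-party packages. This is a sketch that mirrors the curl command above using only the standard library. `build_speech_request` and `synthesize` are hypothetical helper names, and the server is assumed to be running locally on port 8000 as launched earlier.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000"):
    """Build (but do not send) the POST request used for zero-shot synthesis."""
    payload = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav"):
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(build_speech_request(text)) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Calling `synthesize("Hello, how are you?")` against a running server should produce the same `output.wav` as the curl example.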

Voice cloning

Supplying the reference transcript (text) materially improves cloning fidelity.

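import requests

# Clone the voice in ref.wav; the reference transcript ("text") improves fidelity.
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Have a nice day and enjoy the Southern California sunshine.",
        "references": [{
            "audio_path": "ref.wav",
            "text": "Hey, Adam here. Let's create something that feels real, sounds human, and connects every time.",
        }],
        # Sampling controls and an upper bound on generated tokens.
        "temperature": 0.8, "top_k": 50, "max_new_tokens": 1024,
    },
)
resp.raise_for_status()  # fail loudly instead of writing an error body to disk
with open("output.wav", "wb") as f:
    f.write(resp.content)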

Streaming (Server-Sent Events)

Set "stream": true to receive base64-encoded WAV chunks as the vocoder emits them — sub-second time-to-first-audio. Each event carries audio.data (base64 WAV bytes); the terminal event has finish_reason: "stop" plus usage metadata.

import base64
import json

import requests

with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Get the trust fund to the bank early.", "stream": True},
    stream=True,  # let requests yield the body incrementally
) as resp, open("output.wav", "wb") as f:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Skip keep-alives, non-data lines, and the [DONE] sentinel.
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        event = json.loads(line[6:])  # drop the "data: " prefix
        if event.get("finish_reason") == "stop":
            break  # terminal event carries usage metadata, not audio
        audio = event.get("audio") or {}
        if audio.get("data"):
            f.write(base64.b64decode(audio["data"]))
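To exercise the event handling without a running server, the parsing step can be pulled into a generator and fed synthetic lines. This is a sketch: `iter_audio_chunks` is a hypothetical helper, and the fake events only follow the shape described above (audio.data, a terminal finish_reason, and a [DONE] sentinel). Real events may carry additional fields.

```python
import base64
import json


def iter_audio_chunks(lines):
    """Yield decoded audio bytes from an iterable of raw SSE lines (bytes)."""
    prefix = b"data: "
    for line in lines:
        # Skip keep-alives, non-data lines, and the [DONE] sentinel.
        if not line or not line.startswith(prefix) or line == b"data: [DONE]":
            continue
        event = json.loads(line[len(prefix):])
        if event.get("finish_reason") == "stop":
            return  # terminal event: stop before any trailing lines
        data = (event.get("audio") or {}).get("data")
        if data:
            yield base64.b64decode(data)


# Synthetic stream; payloads are placeholders, not real WAV data.
fake_stream = [
    b'data: {"audio": {"data": "%s"}}' % base64.b64encode(b"chunk-1"),
    b"",
    b'data: {"audio": {"data": "%s"}}' % base64.b64encode(b"chunk-2"),
    b'data: {"finish_reason": "stop", "usage": {}}',
    b"data: [DONE]",
]
chunks = list(iter_audio_chunks(fake_stream))
```

With a live server, the same generator can be fed `resp.iter_lines()` from the streaming request above.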

Inline control tokens

Embed <|emotion:…|>, <|style:…|>, <|prosody:…|>, and <|sfx:…|> tokens directly in input. Two rules:

  1. Delivery tokens first. Emotion, style, and the prosody speed / pitch / expressive tokens shape the whole turn — put them at the start of input. Positional tokens (<|prosody:pause|>, <|prosody:long_pause|>, <|sfx:…|>) go inline exactly where they fire.
  2. Pair every <|sfx:…|> with its onomatopoeia. E.g. <|sfx:laughter|>Haha, <|sfx:sigh|>Uh, <|sfx:sneeze|>Achoo. The written sound gives the model the acoustic cue to realize the effect.

Example — amusement + laughter:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "<|emotion:amusement|><|prosody:expressive_high|>Wait, wait, that was kind of hilarious. <|sfx:laughter|>Hehe, no, seriously, I was not ready for that."}' \
  --output output.wav
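The two rules above are easy to apply mechanically when the input string is assembled in code. The helpers below are a hypothetical sketch (`tag`, `sfx`, and `build_input` are not part of any released client). The default onomatopoeia are taken from the first suggestion listed for each sound effect.

```python
# First suggested onomatopoeia for each sound effect, from the table above.
DEFAULT_ONOMATOPOEIA = {
    "cough": "Ahem", "laughter": "Haha", "crying": "Boohoo",
    "screaming": "Ahh", "burping": "Burp", "humming": "Hmm",
    "sigh": "Uh", "sniff": "Sff", "sneeze": "Achoo",
}


def tag(category, value):
    """Format one control token, e.g. tag("emotion", "awe") -> "<|emotion:awe|>"."""
    return f"<|{category}:{value}|>"


def sfx(name, sound=None):
    """Rule 2: a sound-effect token immediately followed by its onomatopoeia."""
    return tag("sfx", name) + (sound or DEFAULT_ONOMATOPOEIA[name])


def build_input(delivery, text):
    """Rule 1: delivery tokens (emotion, style, speed/pitch/expressive) go first."""
    return "".join(tag(category, value) for category, value in delivery) + text


text = build_input(
    [("emotion", "amusement"), ("prosody", "expressive_high")],
    "Wait, wait, that was kind of hilarious. "
    + sfx("laughter", "Hehe")
    + ", no, seriously, I was not ready for that.",
)
```

Here `text` is identical to the `input` string in the curl example above, so it can be dropped straight into the request body.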

Throughput

Throughput on Seed-TTS EN (full set, N=1088 per run). Client --max-concurrency sweep against a Higgs server (max_running_requests=16, bf16, CUDA Graph on). Each row is the mean of 3 runs. Hardware: 1× H100.

Concurrency  Throughput (req/s)  Mean latency  RTF (per-req)  audio_s/s
1            1.62                617 ms        0.147          6.89
2            2.70                742 ms        0.180          11.37
4            5.45                733 ms        0.177          22.84
8            8.91                898 ms        0.217          37.38
16           14.74               1079 ms       0.262          61.84
  • Concurrency — Maximum number of in-flight client requests (--max-concurrency).
  • Throughput (req/s) — Completed requests divided by total benchmark wall-clock time.
  • Mean latency — Average end-to-end time per request (send to full response received).
  • RTF (per-req) — Average ratio of processing time to generated audio duration per request (<1 is faster than real time).
  • audio_s/s — Total seconds of audio produced divided by total benchmark wall-clock time.
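The metric definitions above translate directly into code. The sketch below is not the benchmark script. It assumes each request is recorded as an end-to-end latency and a generated audio duration, and it treats that latency as the "processing time" in the RTF definition. `RequestResult` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    latency_s: float  # send to full response received
    audio_s: float    # duration of the generated audio


def summarize(results, wall_clock_s):
    """Aggregate per-request records into the four reported metrics."""
    n = len(results)
    return {
        "throughput_req_s": n / wall_clock_s,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        # Assumption: processing time is approximated by end-to-end latency.
        "rtf_per_req": sum(r.latency_s / r.audio_s for r in results) / n,
        "audio_s_per_s": sum(r.audio_s for r in results) / wall_clock_s,
    }


# Toy run: two requests finishing within a 2-second benchmark window.
summary = summarize([RequestResult(0.5, 4.0), RequestResult(1.0, 4.0)], wall_clock_s=2.0)
```

As a cross-check on the table, dividing audio_s/s by throughput gives roughly 4.2 s of audio per request in every row, and multiplying that by the per-request RTF lands close to the reported mean latency.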

To reproduce the results, follow the instructions in this script.

API Usage

For zero-ops deployment, use the Boson AI API.

Citation

@misc{bosonai_higgs_audio_tts_v3_2026,
  title  = {Higgs Audio v3 TTS: Conversational Speech for Voice AI from Boson AI},
  author = {Boson AI},
  year   = {2026},
  howpublished = {https://huggingface.co/bosonai/higgs-audio-v3-tts-4b},
}

License

Boson Higgs Audio v3 Research and Non-Commercial License — see LICENSE.
