PhoneticXeus

Multilingual phone recognition that turns speech into IPA phones, built on the XEUS speech encoder. Trained on 70+ languages (IPAPack++).

🎤 Try it in your browser: demo Space
💻 Code & training recipe: GitHub
📄 Paper: arXiv 2603.29042

Usage

pip install torch torchaudio transformers huggingface_hub safetensors soundfile numpy pyyaml typeguard

import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "changelinglab/PhoneticXeus", trust_remote_code=True
).eval()

wav, sr = torchaudio.load("audio.wav")
wav = wav.mean(0)                                    # mono, shape (samples,)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

print(model.transcribe(wav, sampling_rate=16000)[0]["processed_transcript"])
# e.g. "aɪhædðætkʰjʊɹiɑsətipɪsaɪd…"

model.transcribe(...) returns a list of dicts with processed_transcript (joined IPA) and predicted_transcript (slash-separated phones). Calling model(input_values) returns frame-level CTC logits (batch, frames, 428) for custom decoding.
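For custom decoding, the simplest option is greedy (best-path) CTC decoding. For each frame, take the highest-scoring class, merge consecutive repeats, then drop blanks. The sketch below assumes the blank token is index 0 and that you have an index-to-phone list (`id_to_phone`, hypothetical) built from the repo's vocabulary. This card documents neither, so check the repo before relying on them. The function works on one utterance's logits as a NumPy array.

```python
import numpy as np


def ctc_greedy_decode(logits, id_to_phone, blank_id=0):
    """Greedy (best-path) CTC decoding for a single utterance.

    logits:      array-like of shape (frames, vocab_size), e.g. one row of the
                 (batch, frames, 428) output converted to NumPy.
    id_to_phone: HYPOTHETICAL list mapping vocabulary index -> phone string;
                 build it from the repo's vocabulary.
    blank_id:    ASSUMED index of the CTC blank token; verify against the repo.
    """
    best_ids = np.asarray(logits).argmax(axis=-1)  # most likely class per frame
    phones = []
    prev_id = None
    for idx in best_ids.tolist():
        # CTC rule: merge consecutive repeats, then drop blanks.
        # A blank between two identical ids keeps them as separate phones.
        if idx != prev_id and idx != blank_id:
            phones.append(id_to_phone[idx])
        prev_id = idx
    return phones


# Toy example with a made-up 3-token vocabulary (blank, "a", "b").
toy_vocab = ["<blank>", "a", "b"]
toy_logits = np.eye(3)[[1, 1, 0, 1, 2, 2]]  # one-hot scores for 6 frames
phones = ctc_greedy_decode(toy_logits, toy_vocab)
print("".join(phones))  # joined string, same format as processed_transcript -> aab
print("/".join(phones))  # slash-separated, same format as predicted_transcript -> a/a/b
```

With real output, something like `ctc_greedy_decode(logits[0].detach().cpu().numpy(), id_to_phone)` would decode the first utterance, assuming `logits` is the (batch, frames, 428) tensor described above. transcribe() may apply extra post-processing, so its transcripts are not guaranteed to match this sketch exactly. Beam search or a phone language model can improve on greedy decoding, but greedy is the usual baseline for CTC.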

Audio must be mono 16 kHz. The model runs custom code stored in the repo, so loading it requires trust_remote_code=True, as in the example above.
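You may load audio with soundfile (already in the install list) instead of torchaudio. If you do, note that `soundfile.read` returns shape (frames,) or (frames, channels). Channels are on the last axis, whereas torchaudio.load puts them first. Below is a minimal sketch of a helper that downmixes such an array and enforces the 16 kHz requirement. The name `prepare_soundfile_audio` is hypothetical, and the helper deliberately does not resample.

```python
import numpy as np

TARGET_SR = 16_000  # the model card requires mono 16 kHz input


def prepare_soundfile_audio(data, sr):
    """Downmix audio as returned by soundfile.read() to mono float32.

    soundfile returns (frames,) for mono or (frames, channels) for
    multi-channel audio, i.e. channels on the LAST axis.
    """
    data = np.asarray(data, dtype=np.float32)
    if data.ndim == 2:
        data = data.mean(axis=1)  # average the channels -> shape (frames,)
    elif data.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {data.shape}")
    if sr != TARGET_SR:
        # This sketch does not resample; use e.g. torchaudio.functional.resample first.
        raise ValueError(f"expected {TARGET_SR} Hz audio, got {sr} Hz")
    return data
```

Typical use would be `data, sr = soundfile.read("audio.wav", dtype="float32")` followed by `wav = torch.from_numpy(prepare_soundfile_audio(data, sr))`, then the same `model.transcribe(wav, sampling_rate=16000)` call as above.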

Citation

@misc{pxeus26,
      title={An Empirical Recipe for Universal Phone Recognition},
      author={Shikhar Bharadwaj and Chin-Jou Li and Kwanghee Choi and Eunjung Yeo and William Chen and Shinji Watanabe and David R. Mortensen},
      year={2026},
      eprint={2603.29042},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.29042},
}


Model details

Safetensors weights, 0.6B params, tensor type F32.
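As a quick sanity check, 0.6B parameters stored as F32 (4 bytes each) take roughly 0.6e9 × 4 bytes ≈ 2.4 GB for the weights alone. Activations and framework overhead add to that, especially on long audio. The short sketch below shows the arithmetic. The float16 line is only a comparison, since this card does not say whether the model runs correctly in half precision.

```python
# Bytes per parameter for common floating-point dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype="float32"):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores activations and framework overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 0.6e9  # "0.6B params" from the model card (a rounded figure)
print(f"F32 weights: ~{weight_memory_gb(num_params):.1f} GB")  # ~2.4 GB
print(f"F16 weights: ~{weight_memory_gb(num_params, 'float16'):.1f} GB")  # ~1.2 GB
```

Because the parameter count is rounded, treat these as estimates rather than exact file or memory sizes.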


Paper: An Empirical Recipe for Universal Phone Recognition (arXiv 2603.29042), published Mar 30.