---
language:
- en
- zu
tags:
- translation
- african-languages
- scientific-translation
- afriscience-mt
- nllb
license: apache-2.0
base_model: facebook/nllb-200-1.3B
datasets:
- afriscience-mt
pipeline_tag: translation
model-index:
- name: nllb_200_1.3b-eng-zul
  results:
  - task:
      type: translation
    metrics:
    - name: BLEU (test)
      type: bleu
      value: 37.55
    - name: chrF (test)
      type: chrf
      value: 67.23
    - name: SSA-COMET (test)
      type: comet
      value: 66.08
---

# nllb_200_1.3b-eng-zul

[dsfsi/nllb_200_1.3b-eng-zul on Hugging Face](https://huggingface.co/dsfsi/nllb_200_1.3b-eng-zul)

This model is part of the **AfriScience-MT** project, which focuses on machine translation of scientific texts for African languages.

## Model Description

| Property | Value |
|----------|-------|
| **Model Type** | Seq2Seq Translation |
| **Translation Direction** | English → isiZulu |
| **Base Model** | [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) |
| **Domain** | Scientific/Academic texts |
| **Training** | Full fine-tuning on AfriScience-MT dataset |

## Evaluation Results

Performance on the AfriScience-MT validation and test sets:

| Split | BLEU | chrF | SSA-COMET |
|-------|------|------|-----------|
| Validation | 33.79 | 63.39 | 65.94 |
| **Test** | **37.55** | **67.23** | **66.08** |

**Metrics explanation:**
- **BLEU**: Measures word n-gram overlap with reference translations (0-100, higher is better)
- **chrF**: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
- **SSA-COMET**: Neural metric trained for Sub-Saharan African languages ([McGill-NLP/ssa-comet-stl](https://huggingface.co/McGill-NLP/ssa-comet-stl)), reported as a percentage (0-100, higher is better)
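
To illustrate why a character-level score suits a morphologically rich language such as isiZulu, here is a minimal sketch of a chrF-style score in plain Python. It is a simplified illustration, not the evaluation code behind the table above, and its scores will not match those of a standard implementation exactly. The function names `char_ngrams` and `simple_chrf` are hypothetical, and the choices of character n-gram orders 1 to 6 and β = 2 are assumptions based on the common chrF defaults.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF-style score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# "ngiyabonga" (I thank) and "siyabonga" (we thank) differ only in the
# subject prefix, so they share most character n-grams.
print(round(simple_chrf("ngiyabonga", "siyabonga"), 2))
```

The two words in the example never match as whole words, so a word n-gram metric gives no credit, while the character-level score rewards the shared stem.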

## Usage

### Quick Start

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "dsfsi/nllb_200_1.3b-eng-zul"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set the source language (NLLB language code for English)
tokenizer.src_lang = "eng_Latn"

# Tokenize the input text
text = "The mitochondrion is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate, forcing the decoder to start with the isiZulu language code
forced_bos_token_id = tokenizer.convert_tokens_to_ids("zul_Latn")
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```

### Batch Translation

```python
# Reuses `model`, `tokenizer`, and `forced_bos_token_id` from the Quick Start example.
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development."
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```
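
For larger inputs, passing every sentence to the model in a single call may exhaust memory. One possible approach, sketched below, is to translate in fixed-size chunks. The helper names `batched` and `translate_in_batches`, the `translate_batch` callback, and the default `batch_size=8` are illustrative assumptions, not part of `transformers` or the AfriScience-MT code.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune to the available memory
) -> List[str]:
    """Apply `translate_batch` chunk by chunk, preserving the input order."""
    translations: List[str] = []
    for batch in batched(texts, batch_size):
        translations.extend(translate_batch(list(batch)))
    return translations


# Stand-in "translator" so the sketch runs without downloading the model.
demo = translate_in_batches(["a", "b", "c"], lambda b: [s.upper() for s in b], batch_size=2)
print(demo)
```

In practice, `translate_batch` would wrap the tokenizer call, `model.generate`, and `tokenizer.batch_decode` from the example above.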

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 4 |
| Learning Rate | 2e-05 |

### Training Data

- **Dataset**: AfriScience-MT
- **Domain**: Scientific abstracts and papers
- **Languages**: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)

## Reproducibility

To reproduce this model:

```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang eng \
    --target_lang zul \
    --model_name facebook/nllb-200-1.3B \
    --model_type nllb \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```

## Limitations

- **Domain Specificity**: The model is optimized for scientific and academic text and may perform poorly on colloquial or informal text.
- **Language Coverage**: Supports only the English → isiZulu direction.
- **Input Length**: Maximum input length is 256 tokens; split longer texts into segments before translating.
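
One way to split longer texts is to pack whole sentences greedily into segments that stay within the token limit. The sketch below is a hedged illustration: `split_into_segments` and its `count_tokens` callback are hypothetical names, and the regex-based sentence splitter is a rough assumption that will mishandle abbreviations and similar cases.

```python
import re
from typing import Callable, List


def split_into_segments(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 256,
) -> List[str]:
    """Greedily pack whole sentences into segments of at most `max_tokens`.

    A single sentence longer than `max_tokens` is kept as its own segment
    and would still be truncated by the tokenizer.
    """
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)  # close the full segment
            current = sentence        # start a new one
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


# Whitespace word count as a stand-in token counter so the sketch runs offline.
text = "One two three. Four five six. Seven eight nine."
print(split_into_segments(text, lambda s: len(s.split()), max_tokens=6))
```

With the real model, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`, which also counts the special tokens the tokenizer adds. Each segment can then be translated separately and the outputs joined in order.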

## Citation

If you use this model, please cite the AfriScience-MT paper ([arXiv:2605.29741](https://arxiv.org/abs/2605.29741)):

```bibtex
@article{abdulmumin2026afriscience,
  title = {AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation},
  author = {Abdulmumin, Idris and Gwadabe, Tajuddeen and Muhammad, Shamsuddeen Hassan and Adelani, David Ifeoluwa and Khalo, Nomonde and Ahmad, Ibrahim Said and Modupe, Abiodun and Mumm, Anina and Biyela, Sibusiso and Rabie, Michelle and Havemann, Johanna and Rei, Marek and Abbott, Jade and Marivate, Vukosi},
  journal = {arXiv preprint arXiv:2605.29741},
  year = {2026},
  url = {https://arxiv.org/abs/2605.29741}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

- Built on top of [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)
- Evaluated with [SSA-COMET](https://huggingface.co/McGill-NLP/ssa-comet-stl) for African language assessment