Introduction
VietQuill✒️
VietQuill is a Vietnamese paraphrasing model designed to generate high-quality rewrites for both declarative sentences and questions. It supports controllable variation at the semantic, syntactic, and lexical levels, making it useful for data augmentation, style rewriting, query expansion, and other practical Vietnamese NLP applications.
Two versions of VietQuill are available to match different deployment needs:
| Model | Size | Description |
|---|---|---|
| vietquill-base | 0.2B | Recommended for best performance |
| vietquill-small | 0.1B | Lightweight version with reduced accuracy |
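The model was trained on two supervised Vietnamese paraphrase datasets: ViSP (sentence paraphrases) and ViQP (question paraphrases), both referenced in the Citation section.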
Usage
from transformers import pipeline
import torch

# Load model
model_name = "ngwgsang/vietquill-base"
device = 0 if torch.cuda.is_available() else -1

# Initialize paraphrasing pipeline
model = pipeline(
    "text2text-generation",
    model=model_name,
    tokenizer=model_name,
    device=device,
    num_beams=5,
    num_return_sequences=5,
)

# Control levels for semantic, syntactic, and lexical variation
control = {
    "sem": 85,
    "syn": 85,
    "lex": 65,
}

# Input sentence
sentence = "Hôm qua em đến trường mẹ dắt tay từng bước."

# Build prefix following the model's control format
prefix = f"SEM_{control['sem']} SYN_{control['syn']} LEX_{control['lex']} : "

# Generate paraphrases
outputs = model(prefix + sentence)
outputs
>> [{'generated_text': 'Hôm qua, em đến trường mẹ dắt tay từng bước.'},
>>  {'generated_text': 'Hôm qua em đến trường mẹ dắt tay từng bước.'},
>>  {'generated_text': 'Hôm qua, em đã được mẹ dắt tay từng bước đến trường.'},
>>  {'generated_text': 'Hôm qua, em được mẹ dắt tay từng bước đến trường.'},
>>  {'generated_text': 'Hôm qua em đến trường được mẹ dắt tay từng bước.'}]
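The control prefix in the example above is assembled by hand. A minimal sketch of a helper that builds the same `SEM_… SYN_… LEX_… : ` string and enumerates several control settings might look like the following. The names `build_prefix` and `prefix_grid` are hypothetical, and the 0 to 100 range check is an assumption: the model card does not state which values the model accepts.

```python
from itertools import product


def build_prefix(sem: int, syn: int, lex: int) -> str:
    """Format control levels in the prefix layout shown in the usage example."""
    for name, value in (("sem", sem), ("syn", syn), ("lex", lex)):
        # Assumption: levels are integers in [0, 100]; the valid range is
        # not documented in the model card.
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [0, 100], got {value!r}")
    return f"SEM_{sem} SYN_{syn} LEX_{lex} : "


def prefix_grid(sem_levels, syn_levels, lex_levels):
    """Return one prefix per combination of the given control levels."""
    return [
        build_prefix(sem, syn, lex)
        for sem, syn, lex in product(sem_levels, syn_levels, lex_levels)
    ]


# Two settings that differ only in the lexical level.
prefixes = prefix_grid([85], [85], [65, 85])
```

Each prefix can then be concatenated with the input sentence and passed to the pipeline exactly as in the usage example, which makes it easy to compare outputs across control settings.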
Recommended Use Cases
VietQuill is suitable for a wide range of Vietnamese NLP tasks that benefit from controlled paraphrasing, such as:
- Data augmentation for classification, retrieval, NLU, and QA systems
- Query expansion in search, recommendation, and dialogue systems
- Style and structure rewriting for education, writing assistance, and content generation
- Generating diverse prompts for downstream language model training
- Rewriting user queries to normalize phrasing in chatbots or customer service pipelines
Its ability to handle both declarative sentences and questions makes it practical for real-world applications where linguistic variety improves model robustness.
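For the data augmentation use case, beam search can return candidates that match the input or differ from it only in punctuation, as the sample output in the Usage section shows. The sketch below is one possible way to filter such near-duplicates before adding paraphrases to a labeled dataset. The helpers `normalize`, `distinct_paraphrases`, and `augment` are hypothetical names, and the punctuation- and case-insensitive comparison is an illustrative choice rather than part of VietQuill.

```python
import unicodedata
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Comparison key: NFC-normalized, lowercased, punctuation removed."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


def distinct_paraphrases(source: str, candidates: Iterable[str]) -> list[str]:
    """Keep candidates that differ from the source and from each other."""
    seen = {normalize(source)}
    result = []
    for candidate in candidates:
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            result.append(candidate)
    return result


def augment(
    dataset: Iterable[tuple[str, int]],
    paraphrase: Callable[[str], Iterable[str]],
) -> list[tuple[str, int]]:
    """Return the original examples plus label-preserving paraphrases."""
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))
        for candidate in distinct_paraphrases(text, paraphrase(text)):
            augmented.append((candidate, label))
    return augmented
```

With the pipeline from the Usage section, the `paraphrase` argument could be a small wrapper such as `lambda s: [o["generated_text"] for o in model(prefix + s)]`. Keeping it as a plain callable lets the filtering logic be tested without loading the model.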
Citation
Please cite our papers when VietQuill is used to help produce published results or is incorporated into other software.
@inproceedings{nguyen-nguyen-2025-large,
title = "A Large-Scale Benchmark for {V}ietnamese Sentence Paraphrases",
author = "Nguyen, Sang Quang and
Nguyen, Kiet Van",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-naacl.59/",
pages = "1045--1060",
ISBN = "979-8-89176-195-7",
abstract = "This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original{--}paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks. The dataset is available for research purposes at \url{https://github.com/ngwgsang/ViSP}."
}
@INPROCEEDINGS{10288738,
author={Nguyen, Sang Quang and Vo, Thuc Dinh and Nguyen, Duc P.A and Tran, Dang T. and Nguyen, Kiet Van},
booktitle={2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)},
title={ViQP: Dataset for Vietnamese Question Paraphrasing},
year={2023},
volume={},
number={},
pages={1-6},
doi={10.1109/MAPR59823.2023.10288738}
}
Base model: VietAI/vit5-base