Introduction

VietQuill✒️

Model for Vietnamese Quality-Controlled Paraphrase Generation

VietQuill is a Vietnamese paraphrasing model designed to generate high-quality rewrites for both declarative sentences and questions. It supports controllable variation at the semantic, syntactic, and lexical levels, making it useful for data augmentation, style rewriting, query expansion, and other practical Vietnamese NLP applications.

We provide two versions of VietQuill to match different deployment needs:

Model            Size  Description
vietquill-base   0.2B  Recommended for best performance
vietquill-small  0.1B  Lightweight version with reduced accuracy

The model was trained on two supervised Vietnamese paraphrase datasets:

Dataset  Domain
ViQP     Question
ViSP     Sentence

Usage

from transformers import pipeline
import torch

# Load model
model_name = "ngwgsang/vietquill-base"
device = 0 if torch.cuda.is_available() else -1

# Initialize paraphrasing pipeline
model = pipeline(
    "text2text-generation",
    model=model_name,
    tokenizer=model_name,
    device=device,
    num_beams=5,
    num_return_sequences=5,
)

# Control levels for semantic, syntactic, and lexical variation
control = {
    "sem": 85,
    "syn": 85,
    "lex": 65,
}

# Input sentence
sentence = "Hôm qua em đến trường mẹ dắt tay từng bước."

# Build prefix following the model’s control format
prefix = f"SEM_{control['sem']} SYN_{control['syn']} LEX_{control['lex']} : "

# Generate paraphrases
outputs = model(prefix + sentence)
outputs
# Output:
# [{'generated_text': 'Hôm qua, em đến trường mẹ dắt tay từng bước.'},
#  {'generated_text': 'Hôm qua em đến trường mẹ dắt tay từng bước.'},
#  {'generated_text': 'Hôm qua, em đã được mẹ dắt tay từng bước đến trường.'},
#  {'generated_text': 'Hôm qua, em được mẹ dắt tay từng bước đến trường.'},
#  {'generated_text': 'Hôm qua em đến trường được mẹ dắt tay từng bước.'}]
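
The usage example builds the control prefix by hand, and its sample output includes one candidate that repeats the input verbatim. The sketch below wraps both concerns in two small helpers. The names build_prefix and drop_trivial are illustrative and not part of VietQuill or transformers, and the 0 to 100 range check is an assumption, since the valid range of the control values is not stated here.

```python
import unicodedata
from typing import Dict, List


def build_prefix(sem: int, syn: int, lex: int) -> str:
    """Format control levels as in the usage example: 'SEM_85 SYN_85 LEX_65 : '."""
    for name, value in (("sem", sem), ("syn", syn), ("lex", lex)):
        # Assumption: levels are integers in 0..100 (range not documented).
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [0, 100], got {value!r}")
    return f"SEM_{sem} SYN_{syn} LEX_{lex} : "


def _normalize(text: str) -> str:
    """Normalize Unicode form, case, and whitespace before comparing strings."""
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def drop_trivial(source: str, outputs: List[Dict[str, str]]) -> List[str]:
    """Keep generated texts that differ from the source, dropping duplicates."""
    seen = {_normalize(source)}
    kept = []
    for item in outputs:
        text = item["generated_text"].strip()
        key = _normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


if __name__ == "__main__":
    sentence = "Hôm qua em đến trường mẹ dắt tay từng bước."
    # Stand-in for the pipeline output; replace with model(build_prefix(...) + sentence).
    fake_outputs = [
        {"generated_text": "Hôm qua em đến trường mẹ dắt tay từng bước."},
        {"generated_text": "Hôm qua, em được mẹ dắt tay từng bước đến trường."},
    ]
    print(build_prefix(85, 85, 65) + sentence)
    print(drop_trivial(sentence, fake_outputs))
```

The comparison ignores case and whitespace but not punctuation, so a candidate that only adds a comma is still kept. Tighten _normalize if such near-copies are not useful for your task.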

Recommended Use Cases

VietQuill is suitable for a wide range of Vietnamese NLP tasks that benefit from controlled paraphrasing, such as:

  • Data augmentation for classification, retrieval, NLU, and QA systems

  • Query expansion in search, recommendation, and dialogue systems

  • Style and structure rewriting for education, writing assistance, and content generation

  • Generating diverse prompts for downstream language model training

  • Rewriting user queries to normalize phrasing in chatbots or customer service pipelines

Its ability to handle both declarative sentences and questions makes it practical for real-world applications where linguistic variety improves model robustness.
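
As a rough illustration of the data augmentation use case, the sketch below expands a labeled dataset with paraphrases. The augment function and its parameters are hypothetical, not part of VietQuill. It takes any callable that maps a sentence to a list of paraphrases, and it assumes that a paraphrase keeps the label of its source sentence, which is worth spot-checking on your own data.

```python
from typing import Callable, List, Tuple


def augment(
    examples: List[Tuple[str, str]],
    paraphrase: Callable[[str], List[str]],
    max_per_example: int = 2,
) -> List[Tuple[str, str]]:
    """Return each (text, label) pair plus up to max_per_example distinct paraphrases."""
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        seen = {text}
        added = 0
        for candidate in paraphrase(text):
            if added >= max_per_example:
                break
            if candidate in seen:
                continue  # skip copies of the input and repeated candidates
            seen.add(candidate)
            # Assumption: the paraphrase preserves the original label.
            augmented.append((candidate, label))
            added += 1
    return augmented


if __name__ == "__main__":
    # Stub paraphraser for demonstration. With the pipeline from the Usage
    # section (named `model` there), one could instead use:
    #   lambda s: [o["generated_text"] for o in model("SEM_85 SYN_85 LEX_65 : " + s)]
    def stub(sentence: str) -> List[str]:
        return [sentence, sentence + " (v1)", sentence + " (v2)", sentence + " (v3)"]

    for row in augment([("câu hỏi mẫu", "question")], stub):
        print(row)
```

Passing the paraphraser as a callable keeps the loop independent of the model, so the same code works with either VietQuill version or with a cached set of paraphrases.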

Citation

Please cite our papers when VietQuill is used to help produce published results or is incorporated into other software.

@inproceedings{nguyen-nguyen-2025-large,
    title = "A Large-Scale Benchmark for {V}ietnamese Sentence Paraphrases",
    author = "Nguyen, Sang Quang  and
      Nguyen, Kiet Van",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.59/",
    pages = "1045--1060",
    ISBN = "979-8-89176-195-7",
    abstract = "This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original{--}paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks. The dataset is available for research purposes at \url{https://github.com/ngwgsang/ViSP}."
}

@INPROCEEDINGS{10288738,
  author={Nguyen, Sang Quang and Vo, Thuc Dinh and Nguyen, Duc P.A and Tran, Dang T. and Nguyen, Kiet Van},
  booktitle={2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)}, 
  title={ViQP: Dataset for Vietnamese Question Paraphrasing}, 
  year={2023},
  volume={},
  number={},
  pages={1-6},
  doi={10.1109/MAPR59823.2023.10288738}
}

Base model

ngwgsang/vietquill-base is fine-tuned from VietAI/vit5-base.
