arXiv:2512.15489

Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Published on Dec 17 · Submitted by Wei Du on Dec 19

Abstract

AI-generated summary: Nemotron-Math, a large-scale mathematical reasoning dataset, enhances performance and robustness through diverse problem integration and efficient long-context training strategies.

High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2-3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
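The abstract describes the sequential bucketed strategy only at a high level. Below is a minimal sketch of one way length-bucketed batching for long-context fine-tuning could look, assuming each sample carries a precomputed token count; the bucket boundaries, the `num_tokens` field, and the token-budget packing are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def bucket_by_length(samples, boundaries=(8_192, 32_768, 131_072)):
    """Group samples into context-length buckets (hypothetical 8K/32K/128K edges)."""
    buckets = defaultdict(list)
    for sample in samples:
        # Place each sample in the smallest bucket that can hold it.
        for edge in boundaries:
            if sample["num_tokens"] <= edge:
                buckets[edge].append(sample)
                break
    return buckets

def sequential_bucketed_batches(buckets, tokens_per_batch=262_144):
    """Yield batches bucket by bucket (shortest first), packing each batch up to a
    fixed token budget so short traces are never padded out to the full 128K context."""
    for edge in sorted(buckets):
        batch, batch_tokens = [], 0
        for sample in buckets[edge]:
            if batch and batch_tokens + sample["num_tokens"] > tokens_per_batch:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(sample)
            batch_tokens += sample["num_tokens"]
        if batch:
            yield batch

# Toy usage with made-up token counts.
samples = [{"id": i, "num_tokens": n} for i, n in enumerate([900, 4_000, 30_000, 120_000])]
for batch in sequential_bucketed_batches(bucket_by_length(samples)):
    print([s["num_tokens"] for s in batch])
```

The intuition behind this kind of scheme is that padding every sequence to 128K wastes most of the compute on short traces; processing buckets sequentially keeps padding within each batch small, which is consistent with the reported 2-3× speedup, though the paper's exact mechanism may differ.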

Community

Nemotron-Math is a large-scale mathematical reasoning dataset with 7.5 million solution traces spanning high, medium, and low reasoning modes, each available with and without Python tool-integrated reasoning (TIR). It combines 85K curated AoPS competition problems with 262K community-sourced StackExchange-Math questions, balancing structured tasks and real-world mathematical queries. Experiments show that Nemotron-Math consistently outperforms prior datasets, improving robustness and generalization—especially on HLE-Math—while maintaining strong performance on competition benchmarks. With an efficient long-context training strategy, it enables state-of-the-art results, including 100% maj@16 accuracy on AIME 2024 and 2025 using Python TIR.
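For reference, maj@16 means majority voting over 16 sampled solutions per problem. The snippet below illustrates the metric for a single problem; the vote counts and answer strings are hypothetical, not results from the paper.

```python
from collections import Counter

def maj_at_k(sampled_answers, reference_answer):
    """Majority-vote correctness for one problem: the most frequent final
    answer among the k sampled solutions is compared to the reference."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return majority_answer == reference_answer

# Illustrative only: 16 hypothetical final answers for one AIME-style problem.
votes = ["204"] * 11 + ["210"] * 3 + ["96"] * 2
print(maj_at_k(votes, "204"))  # True -> this problem counts as solved under maj@16
```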
