arXiv:2512.15489

Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Published on Dec 17 · Submitted by Wei Du on Dec 19

Abstract

AI-generated summary: Nemotron-Math, a large-scale mathematical reasoning dataset, enhances performance and robustness through diverse problem integration and efficient long-context training strategies.

High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2-3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
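The abstract describes the sequential bucketed strategy only at a high level. Below is a minimal sketch of one way length-bucketed batching for long-context fine-tuning could look, assuming each sample carries a precomputed token count; the bucket boundaries, the `num_tokens` field, and the token-budget packing are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def bucket_by_length(samples, boundaries=(8_192, 32_768, 131_072)):
    """Group samples into context-length buckets (hypothetical 8K/32K/128K edges)."""
    buckets = defaultdict(list)
    for sample in samples:
        # Place each sample in the smallest bucket that can hold it.
        for edge in boundaries:
            if sample["num_tokens"] <= edge:
                buckets[edge].append(sample)
                break
    return buckets

def sequential_bucketed_batches(buckets, tokens_per_batch=262_144):
    """Yield batches bucket by bucket (shortest first), packing each batch up to a
    fixed token budget so short traces are never padded out to the full 128K context."""
    for edge in sorted(buckets):
        batch, batch_tokens = [], 0
        for sample in buckets[edge]:
            if batch and batch_tokens + sample["num_tokens"] > tokens_per_batch:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(sample)
            batch_tokens += sample["num_tokens"]
        if batch:
            yield batch

# Toy usage with made-up token counts.
samples = [{"id": i, "num_tokens": n} for i, n in enumerate([900, 4_000, 30_000, 120_000])]
for batch in sequential_bucketed_batches(bucket_by_length(samples)):
    print([s["num_tokens"] for s in batch])
```

The intuition behind this kind of scheme is that padding every sequence to 128K wastes most of the compute on short traces; processing buckets sequentially keeps padding within each batch small, which is consistent with the reported 2-3× speedup, though the paper's exact mechanism may differ.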

Community

Nemotron-Math is a large-scale mathematical reasoning dataset with 7.5 million solution traces spanning high, medium, and low reasoning modes, each available with and without Python tool-integrated reasoning (TIR). It combines 85K curated AoPS competition problems with 262K community-sourced StackExchange-Math questions, balancing structured tasks and real-world mathematical queries. Experiments show that Nemotron-Math consistently outperforms prior datasets, improving robustness and generalization—especially on HLE-Math—while maintaining strong performance on competition benchmarks. With an efficient long-context training strategy, it enables state-of-the-art results, including 100% maj@16 accuracy on AIME 2024 and 2025 using Python TIR.
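For reference, maj@16 means majority voting over 16 sampled solutions per problem. The snippet below illustrates the metric for a single problem; the vote counts and answer strings are hypothetical, not results from the paper.

```python
from collections import Counter

def maj_at_k(sampled_answers, reference_answer):
    """Majority-vote correctness for one problem: the most frequent final
    answer among the k sampled solutions is compared to the reference."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return majority_answer == reference_answer

# Illustrative only: 16 hypothetical final answers for one AIME-style problem.
votes = ["204"] * 11 + ["210"] * 3 + ["96"] * 2
print(maj_at_k(votes, "204"))  # True -> this problem counts as solved under maj@16
```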
