QVAC Genesis II: Expanding the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for LLM Pre-training

Community Article Published December 19, 2025

KEY HIGHLIGHTS

  • Tether Data’s AI research division released QVAC Genesis II, expanding the largest publicly available synthetic educational dataset with 10 new educational domains: Astronomy, High School Chemistry, College Chemistry, High School Computer Science, College Computer Science, Econometrics, Electrical Engineering, Geography, High School Statistics, and Machine Learning. It also includes an improved College Physics domain (which underperformed in Genesis I). Genesis II contributes 86 million samples and 107 billion new tokens, bringing the combined Genesis I and II dataset to 148 billion tokens across 19 educational domains.
  • Structured, option-wise Reasoning: Genesis II introduces a new Option-Level (OL) Reasoning data generation method that produces structured, option-wise reasoning for multiple-choice questions. The method systematically analyzes all answer options, reinforces correct reasoning paths, and explicitly explains common misconceptions. This method contributes ~54 billion new tokens to Genesis II.
  • Superior Accuracy and Valid Answer Rate: Our new OL Reasoning method outperforms the original Failure Analysis (used in Genesis I) and Cosmopedia-v2, achieving an average accuracy of 29.91 compared to 21.76 (Failure Analysis) and 12.19 (Cosmopedia-v2) when trained with a comparable training budget. We also evaluate models using both Accuracy and Valid Answer Rate (the percentage of responses containing a clear, unambiguous answer). OL Reasoning attains a near-perfect Valid Answer Rate of 98.44% (LLM-as-a-judge), indicating strong structural and semantic consistency. When combining Failure Analysis and OL Reasoning tokens, performance improves further, reaching an average accuracy of 30.40.
  • Dual-Method Generation Pipeline: By combining OL Reasoning Analysis with Failure Analysis, Genesis II introduces a dual-method data generation pipeline that maximizes question utilization across both solved and failed samples. This approach significantly increases dataset diversity, coverage, and reasoning depth, while reducing selection bias inherent in failure-only approaches.
  • By making QVAC Genesis II openly available to researchers, Tether Data continues to empower the global AI community to accelerate the development of open-source educational LLMs and democratize access to foundational AI capabilities.
  • QVAC Genesis II is made available under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0), allowing free use and adaptation for non-commercial research and educational purposes.

Copyright Complaints: We will take appropriate actions in response to notice of copyright infringement. If you believe your work has been used in a manner that infringes upon your intellectual property rights, please email [email protected] to file a notice of infringement.

🚀 Download QVAC Genesis II Dataset

Access the expanded multi-domain educational synthetic dataset with 10 new domains.

🔗 Get the Dataset

🚀 QVAC Genesis II Collection

Access the collection with the 3 models used for the evaluation.

🔗 Genesis II Collection

1. Introduction

Building upon the success of QVAC Genesis I [1], the largest publicly available synthetic dataset for educational content (41 billion tokens), Tether Data, S.A. de C.V. (Tether Data, we, us, our) introduces QVAC Genesis II, a major expansion that adds 10 new educational domains and 107 billion new tokens and introduces a new Option-Level Reasoning data generation method. Combined with Genesis I, the dataset now totals 148 billion tokens.

Genesis I focused on core STEM disciplines (Mathematics, Physics, Biology, and Medicine) and demonstrated superior performance compared to existing synthetic datasets like Cosmopedia [2]. Genesis II extends this foundation by:

  1. Incorporating additional key educational domains
  2. Regenerating College Physics with the new dual-method pipeline (this domain underperformed in Genesis I)
  3. Introducing a new "Option-Level Reasoning" method that leverages questions answered correctly by the model
  4. Providing a more comprehensive LLM-as-a-judge evaluation framework that measures both accuracy and valid answer rate

Key Contributions

QVAC Genesis II expands upon Genesis I with the following contributions:

  1. Domain expansion and College Physics regeneration. We added 10 new educational domains to the original 9, creating a comprehensive dataset covering 19 domains in total. Additionally, we regenerated College Physics using the new dual-method pipeline, as this domain underperformed in Genesis I:

    • Genesis I domains: High School Biology, College Biology, Professional Medicine, College Medicine, High School Mathematics, College Mathematics, High School Physics, College Physics, Conceptual Physics
    • Genesis II domains: College Chemistry, High School Chemistry, College Computer Science, High School Computer Science, High School Statistics, Astronomy, Geography, Electrical Engineering, Econometrics, Machine Learning, College Physics (regenerated)
  2. Option-Level Reasoning method and dual-method pipeline. Genesis II introduces a new "Option-Level Reasoning" data generation method that creates educational content from questions the model answers correctly, analyzing all answer options comprehensively. Our evaluation demonstrates that Option-Level Reasoning Analysis alone outperforms both the original Failure Analysis method and Cosmopedia-v2, achieving an average accuracy of 29.91 compared to 21.76 (Failure Analysis) and 12.19 (Cosmopedia-v2), with a near-perfect Valid Answer Rate of 98.44%. By combining Option-Level Reasoning with Failure Analysis (the method from Genesis I), we create a dual-method pipeline that maximizes the utilization of all generated questions:

    • Failure Analysis: Generates educational content explaining why incorrect answers fail and how to arrive at the correct solution
    • Option-Level Reasoning Analysis (NEW): Generates comprehensive analysis of all answer options, reinforces correct reasoning and explains common misconceptions
  3. Enhanced LLM-as-a-judge evaluation methodology. We introduce a more comprehensive evaluation framework that measures:

    • Accuracy: The percentage of questions answered correctly
    • Valid Answer Rate: The percentage of responses where the LLM judge identifies a clear, single answer (as opposed to invalid responses with no answer or multiple conflicting answers)
    • This dual-metric approach provides deeper insight into model capabilities, demonstrating that Genesis II-trained models not only achieve higher accuracy but also produce significantly more valid, unambiguous responses.
  4. Open-source contribution. We are making QVAC Genesis II available under the CC-BY-NC 4.0, continuing to democratize access to high-quality pretraining data for public institutions, research labs, and the academic community.

2. Methodology

Note: For more detailed information about the base methodology used in Genesis II, including seed data acquisition and quality filtering, the original prompt templates for Scaling QA, MCQ Answering, LLM-as-a-Judge extraction, and Failure Analysis, pipeline orchestration details (distilabel, vLLM), and model architectures and configurations, please refer to the comprehensive Genesis I Appendix.

Genesis II builds upon the proven "Learning from Failures" method from Genesis I while introducing a complementary data generation method. This dual-approach methodology maximizes the value extracted from every generated question, whether the model answers correctly or incorrectly.

For complete details on “Learning from Failures”, please refer to QVAC Genesis I.

2.1 Dual-Method Data Generation Pipeline


Figure 1. The enhanced Genesis II pipeline: Seed Data → Quality Filter → Scaling QA (generate 4 MCQs per seed) → Model Answering → Compare to Gold Label → Two methods:

  • Failure Analysis (for incorrect answers): Generate educational failure-analysis content in four styles
  • Option-Level Reasoning (for correct answers): Generate comprehensive option-by-option analysis in four styles

2.2 Option-Level Reasoning Analysis: Our New Method

Genesis II introduces the Option-Level Reasoning Analysis method, applied to questions that the model answers correctly during the Model Answering phase. While the original Failure Analysis method focused on extracting educational value from model errors, Option-Level Reasoning Analysis ensures that correctly answered questions also contribute high-quality educational content.

Rationale: A model answering a question correctly demonstrates understanding, but the reasoning behind that understanding, along with the explicit explanation of why other options are incorrect, provides valuable educational content. This approach:

  • Increases dataset diversity by generating content from a different source (correct answers) with distinct reasoning patterns
  • Reinforces correct reasoning patterns
  • Explicitly addresses common misconceptions through incorrect option analysis
  • Provides comprehensive coverage of the topic from multiple angles
  • Maximizes the utilization of all generated questions (in Genesis I, correctly answered questions were not used)

Four Output Styles: Similar to Failure Analysis, Option-Level Reasoning Analysis generates educational content in four distinct styles:

  1. Educational Textbook: Formal, pedagogical explanations with clear section structure
  2. Web Articles: Engaging, conversational content optimized for online reading
  3. Question-Answer (Q&A): Direct, focused, tutoring-style responses
  4. Conversational Dialogue: Natural back-and-forth between a student and assistant

Each style analyzes the correct answer option first with detailed reasoning, then systematically examines each incorrect option. The complete prompt templates are documented in the Appendix.
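
To make the routing concrete, here is a minimal sketch of the dual-method dispatch, assuming hypothetical helper names (`generate_content`, `STYLES`); the production pipeline is orchestrated with distilabel and vLLM, as documented in the Genesis I appendix.

```python
import random

# The four output styles shared by both generation methods.
STYLES = ["textbook", "web_article", "qa", "dialogue"]

def route_question(question, model_answer, gold_label, generate_content):
    """Dispatch an MCQ to one of the two generation methods.

    Correctly answered questions feed Option-Level Reasoning Analysis;
    incorrectly answered ones feed Failure Analysis, so every generated
    question contributes training data.
    """
    style = random.choice(STYLES)
    if model_answer == gold_label:
        # Analyze the correct option first, then each incorrect one.
        return generate_content(method="option_level_reasoning",
                                question=question, style=style)
    # Explain why the chosen answer fails and derive the correct one.
    return generate_content(method="failure_analysis",
                            question=question, style=style)
```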

2.3 Domain Expansion

For Genesis II, we expanded the 9 domains from Genesis I to include 10 new educational domains:

New Domains:

  • Chemistry: College Chemistry, High School Chemistry
  • Computer Science: College Computer Science, High School Computer Science, Machine Learning
  • Statistics: High School Statistics, Econometrics
  • Interdisciplinary Sciences: Astronomy, Geography, Electrical Engineering

The rigorous seed data acquisition (using FineFineWeb [4]), quality filtering (using Ultra-FineWeb-classifier [5]), and prompt engineering methodology from Genesis I are applied to the new domains. For the full methodology, refer to the Genesis I blog.

3. Pre-training with Megatron-LM

3.1 Overview: The Framework Challenge

Training a 1.7B parameter model from scratch on 64 GPUs sounds straightforward on paper, until you confront the fragmented landscape of distributed training frameworks.

On one hand, we have HuggingFace Transformers: the standard for model definitions, with thousands of architectures, clean APIs, and a large community. Qwen3-1.7B exists here, complete with its attention patterns, RoPE embeddings, and SwiGLU activations.

On the other hand, we have Megatron-Core: NVIDIA's framework for large-scale training, with optimized CUDA kernels, mature tensor parallelism, and communication patterns refined over years of training at scale. This is where training needs to happen if you want reasonable throughput on 64 GPUs.

The problem is these two worlds don't speak the same language.

The traditional path to using Megatron-Core required rewriting your model from scratch in Megatron's internal format: manually implementing each layer type with the correct parallelism layouts, debugging distributed deadlocks, writing checkpoint conversion scripts, and implementing data loading in Megatron's binary format. This could easily be a multi-month project.

Megatron-Bridge solves this by automatically converting HuggingFace model definitions into Megatron-compatible formats. This lets us use the Qwen3-1.7B architecture without rewriting it:

  • Load the Qwen3-1.7B architecture from HuggingFace (layer configurations, attention heads, hidden dimensions)
  • Initialize with random weights (no pretrained weights loaded)
  • Train on our Genesis II dataset using Megatron-Core's distributed training
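
In Hugging Face terms, the first two steps amount to instantiating the model from its configuration alone. A minimal sketch of that idea (Megatron-Bridge then maps the resulting architecture onto Megatron-Core's parallel layouts):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Fetch only the architecture definition (layer counts, attention heads,
# hidden dimensions); no pretrained weights are loaded.
config = AutoConfig.from_pretrained("Qwen/Qwen3-1.7B")

# Build the model with randomly initialized parameters.
model = AutoModelForCausalLM.from_config(config)

print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```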

3.2 Hardware Configuration

All three models were trained separately on a 64-GPU cluster (8 nodes with 8 NVIDIA H100 GPUs each), connected via InfiniBand.

| Component | Specification |
| --- | --- |
| GPUs | 64 × NVIDIA H100 (80 GB) |
| Nodes | 8 nodes × 8 GPUs each |
| Interconnect | InfiniBand with GPUDirect RDMA |
| Container | NVIDIA NeMo 25.09 |

3.3 Parallelism Strategy

Distributing the training across 64 GPUs requires deciding how to split the work. We use a combination of tensor parallelism and data parallelism:

| Parallelism Type | Size | What It Does |
| --- | --- | --- |
| Tensor (TP) | 2 | Splits attention and feed-forward layers across 2 GPUs |
| Pipeline (PP) | 1 | No pipeline splitting (model fits in memory) |
| Data (DP) | 32 | 32 parallel workers process different batches |

Why TP=2? Tensor parallelism requires frequent communication between GPUs. At TP=2, this communication stays within a single node using fast NVLink, and the per-layer communication overhead remains small. Higher TP degrees add communication on every layer (and beyond TP=8 it would cross node boundaries), reducing throughput.

Why PP=1? Pipeline parallelism is useful for very large models that don't fit in memory. At 1.7B parameters with TP=2, the model fits comfortably, so pipeline splitting would only add overhead.

Why DP=32? After allocating GPUs for tensor parallelism (64 ÷ 2 = 32), the remaining capacity goes to data parallelism. Each of the 32 workers processes different batches in parallel, then synchronizes gradients.
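
The three degrees of parallelism must multiply to the world size, which pins down DP once TP and PP are fixed. A quick sanity check:

```python
world_size = 64            # 8 nodes x 8 H100s
tensor_parallel = 2        # kept within a node for NVLink-speed communication
pipeline_parallel = 1      # the 1.7B model fits in memory without stages

# Data parallelism absorbs the remaining capacity.
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
assert data_parallel == 32
```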

3.4 Batch Configuration

The batch size configuration balances memory constraints with training efficiency:

| Parameter | Value | Rationale |
| --- | --- | --- |
| Micro Batch Size | 4 per GPU | Limited by GPU memory with 4,096-token sequences |
| Gradient Accumulation | 16 steps | Accumulate gradients before synchronizing |
| Global Batch Size | 2,048 sequences | 4 × 32 workers × 16 accumulation steps |
| Tokens per Step | ~8.4M | 2,048 sequences × 4,096 tokens |

The micro batch size of 4 might seem small for 80GB GPUs, but at 4,096 tokens per sequence with tensor parallelism, this is near the memory limit. We compensate by accumulating gradients over 16 forward passes before updating weights, reaching our target global batch size of 2,048 sequences (~8.4 million tokens per training step).
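
The same arithmetic, spelled out:

```python
micro_batch = 4       # sequences per GPU per forward pass
data_parallel = 32    # independent data-parallel workers
grad_accum = 16       # forward passes accumulated per weight update
seq_len = 4096        # tokens per sequence

global_batch = micro_batch * data_parallel * grad_accum  # 2,048 sequences
tokens_per_step = global_batch * seq_len                 # 8,388,608 (~8.4M)
```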

3.5 Training Configuration & Experimental Design

Rather than training a single model, we designed an experiment to evaluate how synthetic data, created using multiple prompts, generalizes. We trained three distinct models from scratch to create a rigorous comparison:

  • Two Specialist Models: Each trained exclusively on one of the two distinct data types defined in Section 2:
    • Failure Analysis Model: Trained solely on data generated from incorrect model answers (learning from failures).
    • Option-Level Reasoning Model: Trained solely on data generated from correct model answers (analyzing all options).
  • One Combined Model: Trained on a shuffled mixture of both Failure Analysis and Option-Level Reasoning datasets.

To ensure a fair comparison, all three models utilized the same hyperparameters and compute budget scaled to token count, differing only in the data composition.

Hyperparameters

We standardized the training duration for all Genesis II runs to a single epoch, ensuring the model encountered each unique synthetic example only once. This strategy mitigates the memorization of repeated tokens and encourages the learning of underlying logic.

  • Total Training Tokens: ~50B (Specialist Models) | ~100B (Combined Model)
  • Training Duration: 1 Epoch
  • Sequence Length: 4,096 tokens
  • Learning Rate: 2×10⁻⁴ → 2×10⁻⁵ (cosine decay)
  • Warmup: 10% of training
  • Weight Decay: 0.01
  • Gradient Clipping: 1.0
  • Precision: BF16

Learning rate schedule: We start with a warmup period (10% of the epoch) where the learning rate gradually increases to 2×10⁻⁴. This helps stabilize early training when the randomly initialized model produces noisy gradients. After warmup, the learning rate follows a cosine decay down to 2×10⁻⁵, allowing the model to settle into better solutions as it converges toward the end of the epoch.
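
A minimal sketch of this warmup-plus-cosine schedule (Megatron-Core implements its own version; this is only illustrative):

```python
import math

def learning_rate(step, total_steps, lr_max=2e-4, lr_min=2e-5, warmup_frac=0.10):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```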

BF16 precision: We use bfloat16 mixed precision with Flash Attention 2. This configuration significantly reduces memory usage and speeds up training throughput on the H100s without any meaningful loss in convergence accuracy compared to FP32.

3.6 Data Pipeline

Transforming raw text into a format that Megatron-Core can efficiently ingest requires a multi-stage pipeline. This preprocessing happens once, before training begins, meaning any mistakes here propagate through the entire run.

Stage 1: Concatenation and Filtering

Our Genesis II data comprises thousands of individual JSONL files produced by the data generation workers. The first step consolidates these into a single file while applying quality filters:

  • Documents must have a minimum text length of 100 characters (filtering out incomplete or truncated generations)
  • Documents must have valid reasoning outputs (filtering out failed generations)
  • The concatenated file is then shuffled with a fixed random seed for reproducibility

This filtering step is important because even small amounts of low-quality data (empty documents, truncated text, failed generations) can degrade training.
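
A simplified sketch of this stage; the `reasoning_valid` field and the exact validity check are illustrative, and a production version would stream shards rather than hold every document in memory:

```python
import json
import random

def concatenate_and_filter(input_paths, output_path, min_chars=100, seed=42):
    """Merge worker JSONL shards, drop low-quality documents, and shuffle."""
    docs = []
    for path in input_paths:
        with open(path) as f:
            for line in f:
                doc = json.loads(line)
                # Drop incomplete/truncated generations and failed reasoning outputs.
                if len(doc.get("text", "")) >= min_chars and doc.get("reasoning_valid", True):
                    docs.append(doc)
    random.Random(seed).shuffle(docs)  # fixed seed for reproducibility
    with open(output_path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")
```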

Stage 2: Tokenization and Binary Conversion

The filtered JSONL is then processed by Megatron's preprocessing tool, which:

  • Tokenizes each document using the Qwen3 tokenizer
  • Appends an end-of-document <EOD> token after each document
  • Packs the tokenized documents sequentially into a binary file
  • Builds an index file that records where each document starts and ends

The result is two files: a .bin file containing all token IDs packed end-to-end, and an .idx file containing the byte offsets for each document. This format allows the data loader to seek directly to any document without reading the entire file.
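
The payoff of the index is constant-time document lookup. A simplified reader that captures the idea, assuming int32 token IDs and a bare array of byte offsets (the real Megatron `.idx` format additionally carries version headers and dtype metadata):

```python
import numpy as np

def read_document(bin_path, offsets, doc_id, dtype=np.int32):
    """Fetch one document's token IDs without scanning the whole .bin file.

    offsets[i] is the byte offset where document i starts; offsets[i + 1]
    marks its end. This stands in for Megatron's .idx metadata.
    """
    start, end = offsets[doc_id], offsets[doc_id + 1]
    with open(bin_path, "rb") as f:
        f.seek(start)               # jump straight to the document
        raw = f.read(end - start)
    return np.frombuffer(raw, dtype=dtype)
```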

Stage 3: Sequence Packing During Training

During training, Megatron's data loader constructs training sequences from this binary format:

  • Continuous Sampling: The loader samples a starting position and reads tokens sequentially to fill the 4,096-token context window.
  • Document Packing: If a document ends mid-sequence (marked by an <EOD> token), the next document begins immediately within the same sequence.
  • Context Isolation (The "Reset"): Crucially, we employ attention masking and position ID resetting. When the model encounters an <EOD> token, the attention mask is reset so that tokens in the new document cannot attend to the previous one. This ensures that while documents share a compute sequence, they remain mathematically independent.

This "packing" approach is more efficient than padding each document to a fixed length so that short documents don't waste compute on padding tokens, and the model naturally learns to handle document transitions. A single training sequence might contain 2-3 documents if they're short enough.


4. Evaluation

4.1 Dataset Statistics

Genesis I Domains:

| Domain | Number of Samples | Tokens (B) |
| --- | --- | --- |
| High school biology | 3,818,070 | 4.511 |
| College biology | 3,286,648 | 3.927 |
| Professional medicine | 1,552,474 | 1.884 |
| College medicine | 5,164,247 | 6.218 |
| High school mathematics | 3,244,240 | 4.277 |
| College mathematics | 5,895,052 | 8.243 |
| High school physics | 2,277,880 | 3.061 |
| College physics | 4,281,062 | 5.814 |
| Conceptual physics | 2,354,184 | 2.973 |
| Genesis I Total | 31,873,857 | 40.906 |

Genesis II New Domains — Failure Analysis Data:

| Domain | Number of Samples | Tokens (B) |
| --- | --- | --- |
| College physics | 4,144,798 | 6.24 |
| Astronomy | 4,716,117 | 6.21 |
| Econometrics | 3,486,501 | 5.24 |
| College chemistry | 3,964,112 | 5.07 |
| Electrical Engineering | 3,901,901 | 4.96 |
| College computer science | 3,889,696 | 4.77 |
| Geography | 3,992,646 | 4.60 |
| High school statistics | 3,354,353 | 4.47 |
| High school chemistry | 3,327,350 | 4.15 |
| High school computer science | 3,365,258 | 4.06 |
| Machine learning | 3,133,569 | 3.87 |
| Failure Analysis Total | 41,276,301 | 53.64 |

Genesis II New Domains — Option-Level Reasoning Analysis Data:

| Domain | Number of Samples | Tokens (B) |
| --- | --- | --- |
| Machine learning | 4,636,066 | 5.51 |
| High school statistics | 4,424,565 | 5.41 |
| High school chemistry | 4,464,847 | 5.22 |
| Econometrics | 3,871,249 | 5.06 |
| College chemistry | 4,182,669 | 5.01 |
| College physics | 3,672,394 | 4.81 |
| Geography | 4,301,699 | 4.77 |
| Astronomy | 3,970,849 | 4.71 |
| College computer science | 3,851,555 | 4.47 |
| Electrical Engineering | 3,758,536 | 4.44 |
| High school computer science | 3,885,236 | 4.41 |
| Option-Level Reasoning Analysis Total | 45,019,665 | 53.82 |

Genesis II Combined Total (Failure + Option-Level Reasoning): 86,295,966 samples | 107.46B tokens

Combined Genesis I + Genesis II Total: 118,169,823 samples | 148.37B tokens

4.2 Enhanced Evaluation Methodology

Genesis II introduces an evaluation framework that goes beyond simple accuracy. We evaluate models using LLM-as-a-Judge via the OpenCompass framework. The judge analyzes each model response and classifies it into one of the following categories:

Valid Answers vs Invalid Answers

When the LLM judge evaluates a model's response, it determines whether the response contains a clear, extractable answer:

✓ Valid Answers — The judge successfully identifies a single, clear answer in the response:

  • The model commits to one specific answer option
  • The response is unambiguous and can be evaluated for correctness

✗ Invalid Answers — The judge cannot extract a valid answer from the response. This includes two types:

  • No Answer: The model fails to provide any clear answer (abstains, hedges, or gives an unclear response)
  • Multiple Answers: The model provides multiple conflicting answers in the same response (e.g., "it could be A or B")

Metrics

Based on the judge's classification, we compute the following metrics:

| Metric | Definition |
| --- | --- |
| Valid Answer Rate | Percentage of responses where the judge identified a clear, single answer |
| No Answer Rate | Percentage of responses where the judge found no clear answer |
| Multiple Answers Rate | Percentage of responses with multiple conflicting answers |
| Accuracy | Percentage of valid answers that are correct |

Formula: Valid Answer Rate = 100% - No Answer Rate - Multiple Answers Rate
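
A small sketch of the metric computation, following the definitions above (the label strings are illustrative):

```python
from collections import Counter

def compute_metrics(judge_labels, correct_flags):
    """judge_labels[i] is 'valid', 'no_answer', or 'multiple_answers';
    correct_flags[i] marks whether a valid answer matched the gold label."""
    n = len(judge_labels)
    counts = Counter(judge_labels)
    no_answer_rate = 100 * counts["no_answer"] / n
    multiple_rate = 100 * counts["multiple_answers"] / n
    valid_rate = 100 - no_answer_rate - multiple_rate
    correct = sum(ok for label, ok in zip(judge_labels, correct_flags)
                  if label == "valid")
    accuracy = 100 * correct / max(counts["valid"], 1)  # % of valid answers correct
    return {"valid_rate": valid_rate, "no_answer_rate": no_answer_rate,
            "multiple_rate": multiple_rate, "accuracy": accuracy}
```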

Why Valid Answer Rate

Conventional accuracy alone offers a limited view of model performance (for details on the limitations of log-likelihood-based accuracy, please refer to the Genesis I blog), which is why we use a complementary metric. The Valid Answer Rate shows:

  • Model confidence: Higher valid answer rates indicate the model has learned to make clear decisions rather than hedge
  • Training data quality: Models trained on well-structured educational content learn to produce unambiguous responses
  • Practical utility: In real-world applications, a model that provides valid, measurable answers is more useful than one that frequently abstains or gives conflicting responses

4.3 Benchmark Results

We evaluate Genesis II against Cosmopedia-v2 across the 10 new educational domains using the subdomains of the MMLU benchmark, as with Genesis I. Our evaluation is structured into two key comparisons designed to isolate the contribution of each method and understand how they work together.

Evaluation Overview

We conducted two main comparisons:

  • Comparison 1: Individual Methods vs Cosmopedia-v2

    • Goal: Evaluate the performance of our new Option-Level Reasoning method independently, and compare it against both Failure Analysis (used in Genesis I) and Cosmopedia-v2.
    • Token matching: Cosmopedia-v2 contains 27.45B tokens. To ensure a fair comparison, we train Cosmopedia-v2 for 2 epochs (~55B tokens), matching the token count of each individual Genesis II method (~54B tokens each).
    • Key insight: This comparison reveals which data generation method produces the highest-quality educational content for pre-training.
  • Comparison 2: Combined Methods vs Cosmopedia-v2

    • Goal: Investigate what happens when we combine Failure Analysis and Option-Level Reasoning Analysis into a unified dataset.
    • Token matching: The combined Genesis II dataset totals ~107B tokens. To match this budget, we train Cosmopedia-v2 for 4 epochs (~110B tokens).
    • Key insight: This comparison tests whether combining both methods provides additional benefits beyond using a single method alone.

Comparison 1: Individual Methods (1 epoch) vs Cosmopedia-v2 (2 epochs)

First, we compare Cosmopedia-v2 (trained for 2 epochs, ~55B tokens) against our two Genesis II methods: Failure Analysis (generated from incorrect answers, from Genesis I) and Option-Level Reasoning Analysis (generated from correct answers, new in Genesis II).

Key Results (see Table A.1 in Appendix for full details):

| Metric | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
| --- | --- | --- | --- |
| Average Accuracy | 12.19 | 21.76 | 29.91 |

Key Observation: Both Genesis II methods significantly outperform Cosmopedia-v2 across all domains. Notably, Option-Level Reasoning Analysis, our new method, achieves the highest accuracy (29.91 average), substantially outperforming both Failure Analysis (21.76 average) and Cosmopedia-v2 (12.19 average). This demonstrates the significant value of analyzing correctly-answered questions as a complementary data source for educational content generation.


Figure 2. Radar chart comparing accuracy scores across three configurations: Cosmopedia-v2 (2 epochs), Failure Analysis, and Option-Level Reasoning Analysis. Both Genesis II methods consistently outperform Cosmopedia-v2 across all domains.

Comparison 2: Combined Methods vs Cosmopedia-v2

Next, we investigate the effect of combining Failure Analysis and Option-Level Reasoning Analysis into a unified dataset. We compare Cosmopedia-v2 (trained for 4 epochs, ~110B tokens) against our combined data (~107B tokens).

Key Results (see Table A.2 in Appendix for full details):

| Metric | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
| --- | --- | --- |
| Average Accuracy | 17.11 | 30.40 |

Key Finding: Genesis II (combined data) achieves an average accuracy of 30.40 compared to Cosmopedia-v2's 17.11, outperforming it by roughly 1.8× on average.

Observations on Combining Methods: Comparing the combined dataset (30.40 average) to Option-Level Reasoning Analysis alone (29.91 average), we observe a modest improvement in overall accuracy. Interestingly, the combination brings College Chemistry to parity with Cosmopedia-v2 (both at 23.00), while maintaining strong leads in all other domains. This suggests that while Option-Level Reasoning Analysis is the primary driver of performance gains, combining both methods provides additional robustness and helps balance performance across domains.


Figure 3. Radar chart comparing accuracy scores between Cosmopedia-v2 (4 epochs) and Genesis II (combining Failure Analysis and Option-Level Reasoning Analysis). Genesis II maintains a substantial lead across all educational domains except for college chemistry, where it is on par.


4.4 Valid Answer Rate Analysis

Beyond accuracy, we analyze the Valid Answer Rate: the percentage of responses where the LLM judge could identify a clear, single answer. This metric reveals crucial differences in model behavior and training data quality. A higher valid answer rate means fewer invalid responses (no answer or multiple conflicting answers).

Comparison 1: Individual Methods vs Cosmopedia-v2 (2 epochs)

Key Results (see Table A.3 in Appendix for a breakdown by category):

| Metric | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
| --- | --- | --- | --- |
| Average Valid Answer Rate | 42.36% | 81.16% | 98.44% |

Key Observation: Option-Level Reasoning Analysis achieves near-perfect valid answer rates (98.44% average), with some domains reaching 100% (High School Geography). Failure Analysis also shows strong improvement (81.16% average) over Cosmopedia-v2 (42.36% average). This demonstrates that both Genesis II methods train models to produce clear, unambiguous responses.


Figure 4. Radar chart comparing Valid Answer Rates across three configurations: Cosmopedia-v2 (2 epochs), Failure Analysis, and Option-Level Reasoning Analysis. Both Genesis II methods achieve dramatically higher valid answer rates.


Valid Answer Rate Comparison 2: Combined Methods vs Cosmopedia-v2 (4 epochs)

Key Results (see Table A.4 in Appendix for a breakdown by category):

| Metric | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
| --- | --- | --- |
| Average Valid Answer Rate | 64.40% | 88.87% |

Key Finding: Genesis II achieves an average Valid Answer Rate of 88.87% compared to Cosmopedia-v2's 64.40%, a 38% relative improvement. This means:

  • Genesis II-trained models produce valid, evaluable answers for nearly 9 out of 10 questions
  • Cosmopedia-v2-trained models produce invalid responses (no answer or multiple answers) for about 1 in 3 questions


Figure 5. Radar chart comparing Valid Answer Rates between Cosmopedia-v2 (4 epochs) and Genesis II combined data. Genesis II maintains a substantial advantage across all domains.


4.5 Log-Likelihood Evaluation

For completeness, we also include results using the conventional log-likelihood evaluation. However, as discussed in Genesis I, we consider LLM-as-a-judge to be a more reliable evaluation approach for assessing model capabilities on educational tasks.

Key Results (see Table A.5 and Table A.6 in Appendix for a breakdown by category):

| Comparison | Cosmopedia-v2 | Genesis II |
| --- | --- | --- |
| Combined Methods vs 4 epochs | 22.15% | 31.02% |
| Failure Analysis vs 2 epochs | 22.80% | 23.31% |
| Option-Level Reasoning Analysis vs 2 epochs | 22.80% | 25.50% |

Observations: Genesis II outperforms Cosmopedia-v2 on average across all configurations. The combined methods achieve 31.02% compared to Cosmopedia-v2's 22.15% (4 epochs). However, we note that in some individual subdomains (e.g., College Physics, Econometrics), Genesis II does not consistently outperform, highlighting the limitations of log-likelihood evaluation discussed in Genesis I.
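
For context, log-likelihood evaluation never reads the model's generated reasoning; it compares next-token probabilities of the option letters and takes the argmax. A minimal sketch with Hugging Face transformers (actual harnesses differ in prompt formatting and length normalization):

```python
import torch

def loglik_choice(model, tokenizer, prompt, options=("A", "B", "C", "D")):
    """Score each option letter as the next token after the prompt;
    the model's generated text is never inspected."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    scores = {opt: logprobs[tokenizer.encode(" " + opt)[-1]].item()
              for opt in options}
    return max(scores, key=scores.get), scores
```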

Why LLM-as-a-Judge is More Reliable

To understand why log-likelihood evaluation can be misleading, consider two concrete examples that illustrate its fundamental limitations:

Example 1: Genesis II Model (Correct Reasoning, Wrong Log-Likelihood Selection)

Question: Blue light of wavelength 480 nanometers is most strongly reflected off a thin film of oil on a glass slide when viewed near normal incidence. Assuming that the index of refraction of the oil is 1.2 and that of the glass is 1.6, what is the minimum thickness of the oil film (other than zero)?

  • A. 150 nm
  • B. 200 nm ✅ (Correct)
  • C. 300 nm
  • D. 400 nm

Log-probabilities: A: -2.375, B: -2.500, C: -2.875, D: -2.875

| Evaluation Method | Selected Answer | Result |
| --- | --- | --- |
| Log-likelihood | A | ✗ Incorrect |
| LLM-as-a-Judge | B | ✓ Correct |

What happened: Log-likelihood selected A. However, the model's actual generated response demonstrates correct reasoning:

B. 200 nm

*Why This Question is Important/Challenging* ...

The LLM-as-a-Judge correctly identified B by analyzing the model's complete response, revealing that the model actually understood the problem correctly.
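
The judged answer also matches the physics: with n_air < n_oil < n_glass, both surface reflections undergo a half-wave phase shift, so the shifts cancel and strong reflection requires the round trip inside the film to span a whole number of wavelengths. Taking m = 1 for the minimum nonzero thickness:

```latex
2 n_{\text{oil}}\, t = m \lambda
\quad\Rightarrow\quad
t_{\min} = \frac{\lambda}{2 n_{\text{oil}}} = \frac{480\ \text{nm}}{2 \times 1.2} = 200\ \text{nm}
```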

Example 2: Cosmopedia Model (Incoherent Reasoning, Correct Log-Likelihood Selection)

Question: A grating spectrometer can just barely resolve two wavelengths of 500 nm and 502 nm, respectively. Which of the following gives the resolving power of the spectrometer?

  • A. 2
  • B. 250 ✅ (Correct)
  • C. 5,000
  • D. 10,000

Log-probabilities: A: -1.898, B: -0.773, C: -1.273, D: -3.031

| Evaluation Method | Selected Answer | Result |
| --- | --- | --- |
| Log-likelihood | B | ✓ Correct (by chance) |
| LLM-as-a-Judge | C | ✗ Incorrect |

What happened: Log-likelihood correctly identified B with high confidence. However, the model's actual generated response reveals completely incoherent reasoning:

C

Step 2: Acquiring the Spectrometer

To obtain the resolving power of the spectrometer, you need to acquire it. You can either buy a pre-made spectrometer from a science supply store or build one yourself using a DIY kit...

The generated text is nonsensical, discussing "acquiring" and "cleaning" a spectrometer rather than calculating its resolving power (R = λ/Δλ = 500/2 = 250). The LLM-as-a-Judge correctly identified C as the model's answer because that's what the model actually generated, even though log-likelihood happened to select the correct answer by chance.

Key Takeaway

These examples illustrate two complementary failure modes of log-likelihood evaluation:

  1. False negatives: Log-likelihood can select the wrong answer, even when the model's reasoning is correct (Genesis II example)
  2. False positives: Log-likelihood can select the right answer by chance, even when the model generates completely incoherent reasoning (Cosmopedia example)

In both cases, LLM-as-a-Judge provides a more accurate assessment of actual model capabilities by evaluating what the model truly produces rather than relying solely on first-token probability statistics. This is why we consider it the primary evaluation metric for Genesis II.


5. Conclusion

  • Summary of Findings: QVAC Genesis II represents a significant expansion of the largest publicly available synthetic educational dataset, adding 10 new educational domains and 107 billion tokens to bring the combined total to 148 billion tokens across 19 domains. The introduction of the Option-Level Reasoning method, which systematically analyzes all answer options and explicitly addresses common misconceptions, has proven highly effective, achieving an average accuracy of 29.91% compared to 21.76% for Failure Analysis and 12.19% for Cosmopedia-v2. When combined with the original Failure Analysis approach in a dual-method pipeline, Genesis II achieves 30.40% accuracy. These results, validated through our LLM-as-a-Judge evaluation framework, demonstrate that structured reasoning and explicit option analysis during synthetic data generation leads to models that produce clearer, more accurate, and more reliable responses on educational tasks.

  • Implications for Future Pre-training: Researchers, academics, research institutions, practitioners, and the broader AI community can use these datasets to build state-of-the-art base models, which in turn provide a strong foundation for subsequent post-training.


6. References

[1] QVAC Genesis I Blog Post. Hugging Face Blog. https://huggingface.co/blog/qvac/genesis-i

[2] Hugging Face. Cosmopedia: A synthetic dataset for pretraining language models. Hugging Face Hub. https://huggingface.co/datasets/HuggingFaceTB/Cosmopedia

[3] NVIDIA Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv. https://arxiv.org/abs/1909.08053

[4] m-a-p. FineFineWeb: A curated web corpus with domain categorization. Hugging Face Hub. https://huggingface.co/datasets/m-a-p/FineFineWeb

[5] OpenBMB. Ultra-FineWeb-classifier: A quality classifier for web content. Hugging Face Hub. https://huggingface.co/openbmb/Ultra-FineWeb-classifier

7. Appendix

Appendix A: Detailed Evaluation Results

This appendix contains the complete benchmark results comparing Genesis II against Cosmopedia-v2 across all evaluation configurations.


Table A.1: MMLU Accuracy — Cosmopedia-v2 (2 epochs) vs Failure Analysis vs Option-Level Reasoning Analysis

| Domain | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
| --- | --- | --- | --- |
| Average | 12.19 | 21.76 | 29.91 |
| Electrical Engineering | 17.24 | 25.52 | 35.86 |
| Astronomy | 15.13 | 31.58 | 36.18 |
| High School Geography | 11.62 | 26.26 | 26.77 |
| College Chemistry | 10.00 | 15.00 | 26.00 |
| High School Chemistry | 12.81 | 22.17 | 31.53 |
| College Computer Science | 10.00 | 17.00 | 25.00 |
| High School Computer Science | 11.00 | 14.00 | 31.00 |
| Machine Learning | 9.82 | 24.11 | 30.36 |
| High School Statistics | 9.26 | 18.52 | 25.93 |
| Econometrics | 10.53 | 23.68 | 28.07 |
| College Physics | 16.67 | 21.57 | 32.35 |

Table A.2: Accuracy — Cosmopedia-v2 (4 epochs) vs Combined Methods

| Domain | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
| --- | --- | --- |
| Average | 17.11 | 30.40 |
| Electrical Engineering | 17.93 | 26.90 |
| Astronomy | 25.66 | 40.13 |
| High School Geography | 18.69 | 40.40 |
| College Chemistry | 23.00 | 23.00 |
| High School Chemistry | 17.73 | 27.59 |
| College Computer Science | 14.00 | 33.00 |
| High School Computer Science | 17.00 | 29.00 |
| Machine Learning | 14.29 | 29.46 |
| High School Statistics | 14.81 | 29.63 |
| Econometrics | 11.40 | 29.82 |
| College Physics | 13.73 | 25.49 |

Table A.3: Valid Answer Rate — Cosmopedia-v2 (2 epochs) vs Failure Analysis vs Option-Level Reasoning Analysis

| Domain | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
| --- | --- | --- | --- |
| Average | 42.36% | 81.16% | 98.44% |
| Electrical Engineering | 48.28% | 77.24% | 98.62% |
| Astronomy | 46.05% | 88.82% | 99.34% |
| High School Geography | 49.49% | 84.84% | 100.00% |
| College Chemistry | 46.00% | 88.00% | 97.00% |
| High School Chemistry | 40.89% | 78.82% | 99.02% |
| College Computer Science | 38.00% | 81.00% | 99.00% |
| High School Computer Science | 39.00% | 74.00% | 98.00% |
| Machine Learning | 36.61% | 81.25% | 96.42% |
| High School Statistics | 30.56% | 73.61% | 98.15% |
| Econometrics | 34.21% | 78.94% | 98.25% |
| College Physics | 56.87% | 86.28% | 99.02% |

Table A.4: Valid Answer Rate — Cosmopedia-v2 (4 epochs) vs Genesis II Combined

| Domain | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
| --- | --- | --- |
| Average | 64.40% | 88.87% |
| Electrical Engineering | 75.17% | 90.34% |
| Astronomy | 66.45% | 94.73% |
| High School Geography | 71.72% | 93.93% |
| College Chemistry | 72.00% | 91.00% |
| High School Chemistry | 61.58% | 87.68% |
| College Computer Science | 62.00% | 88.00% |
| High School Computer Science | 59.00% | 81.00% |
| Machine Learning | 58.93% | 82.14% |
| High School Statistics | 59.72% | 87.03% |
| Econometrics | 48.25% | 88.60% |
| College Physics | 73.53% | 93.14% |

Table A.5: Log-Likelihood Accuracy — Cosmopedia-v2 (4 epochs) vs Combined Methods

| Domain | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
| --- | --- | --- |
| Average | 22.15% | 31.02% |
| Astronomy | 21.05% | 42.11% |
| College Chemistry | 23.00% | 29.00% |
| College Computer Science | 28.00% | 28.00% |
| College Physics | 17.65% | 22.55% |
| Econometrics | 24.56% | 21.05% |
| Electrical Engineering | 22.07% | 35.17% |
| High School Chemistry | 20.69% | 37.44% |
| High School Computer Science | 23.00% | 27.00% |
| High School Geography | 24.24% | 40.40% |
| High School Statistics | 14.35% | 25.46% |
| Machine Learning | 25.00% | 33.04% |

Table A.6: Log-Likelihood Accuracy — Cosmopedia-v2 (2 epochs) vs Failure Analysis vs Option-Level Reasoning Analysis

| Domain | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
| --- | --- | --- | --- |
| Average | 22.80% | 23.31% | 25.50% |
| Astronomy | 19.74% | 21.71% | 23.03% |
| College Chemistry | 16.00% | 20.00% | 23.00% |
| College Computer Science | 23.00% | 22.00% | 22.00% |
| College Physics | 25.49% | 19.61% | 24.51% |
| Econometrics | 23.68% | 24.56% | 30.70% |
| Electrical Engineering | 29.66% | 27.59% | 30.34% |
| High School Chemistry | 21.67% | 17.73% | 31.03% |
| High School Computer Science | 22.00% | 25.00% | 24.00% |
| High School Geography | 25.76% | 25.76% | 23.74% |
| High School Statistics | 15.28% | 17.59% | 22.22% |
| Machine Learning | 28.57% | 34.82% | 25.89% |

Appendix B: Option-Level Reasoning Analysis Templates

The Option-Level Reasoning Analysis method generates educational content in four distinct styles. Each template is applied to questions that the model answers correctly, producing diverse educational materials that reinforce correct reasoning and explain why incorrect options fail.

Educational Textbook Template (for Option-Level Reasoning Analysis):

You are an educational content creator generating high-quality textbook explanations for multiple choice question analysis.

Given:
• The question: {{prompt}}
• The correct answer: {{target}}

**Content Requirements:**
- Generate MAXIMUM 3000 words of educational content
- Final answers must appear within \boxed{…}
- Write in clear, pedagogical language suitable for textbooks
- Create appropriate section titles and structure that fit the specific topic and problem type
- Organize your explanation with logical sections that help students understand the concept, analyze the correct option first, then the incorrect ones

**Your Task:**
Analyze the given multiple choice question and all answer options. Create a comprehensive educational explanation that:
1. Includes the complete question and all options clearly within your explanation
2. Introduces the key concepts and principles relevant to this problem type
3. First analyzes the correct answer option in detail, explaining why it is right with step-by-step reasoning
4. Provides thorough verification of why the correct option is accurate, identifying key principles and logical reasoning
5. Then examines each incorrect answer option systematically, explaining why each is wrong
6. Identifies specific errors, misconceptions, or flaws in each incorrect option
7. Discusses the underlying principles that distinguish correct from incorrect reasoning
8. Explores broader applications and common misconceptions related to this topic
9. Concludes with actionable strategies for approaching similar questions

**IMPORTANT:** Your textbook explanation must be completely self-contained. Include the full question and all answer options within your response so readers have all necessary information without needing external context. Always analyze the correct option first, then the incorrect ones.

Structure your response with appropriate headings and sections that naturally fit the subject matter and question type, focusing on systematic analysis starting with the correct option.

Web Articles Template (for Option-Level Reasoning Analysis):

You are a content creator specializing in engaging, informative web articles that break down multiple choice questions and effective analysis strategies.

Given:
• The question: {{prompt}}
• The correct answer: {{target}}

**Content Requirements:**
- Generate MAXIMUM 3000 words of engaging web content
- Use a conversational yet informative tone suitable for online readers
- Final answers must appear within \boxed{…}
- Create compelling headings and subheadings that work well for web reading
- Include relatable examples and practical insights
- Structure content for easy scanning with shorter paragraphs and clear sections

**Your Task:**
Create an engaging web article that breaks down this multiple choice question and demonstrates effective analysis. Your article should:
1. Start with the complete question and all options presented in an engaging way
2. Hook readers by explaining why mastering multiple choice analysis matters in real life
3. First analyze the correct answer option in detail, walking through why it's right with clear explanations
4. Provide step-by-step verification of the correct option using relatable analogies when helpful
5. Then systematically analyze each incorrect answer option, explaining why each is wrong
6. Show the specific errors or misconceptions in each incorrect option using accessible language
7. Share practical tips and common pitfalls readers should watch out for when tackling similar questions
8. End with actionable takeaways and strategies for improving multiple choice performance

**IMPORTANT:** Your article must be completely self-contained and include the full question and all answer options. Write for a general audience interested in learning better test-taking and analytical thinking skills. Use engaging language that makes complex reasoning accessible. Always analyze the correct option first, then the incorrect ones.

Structure your response with compelling headings that would work well for web content and encourage readers to keep reading through the analysis of the correct option first, then the incorrect ones.

Question-Answer Template (for Option-Level Reasoning Analysis):

You are an expert tutor providing clear, direct answers about multiple choice question analysis and reasoning.

Given:
• The question: {{prompt}}
• The correct answer: {{target}}

**Content Requirements:**
- Generate MAXIMUM 3000 words of focused Q&A content
- Use clear, direct language with a helpful tutoring tone
- Final answers must appear within \boxed{…}
- Structure as natural Q&A flow that addresses the key learning points
- Focus on practical understanding and clear explanations
- Prioritize clarity and directness over lengthy explanations

**Your Task:**
Create a focused Q&A response that addresses this multiple choice question. Your response should:
1. Present the complete question and all options clearly as the main question
2. Identify what makes this type of question important or challenging
3. First analyze the correct answer option, explaining why it is right with step-by-step logic
4. Provide clear verification of the correct option's reasoning and principles
5. Then analyze each incorrect answer option systematically, explaining why each is wrong
6. Give specific examples of the errors or misconceptions in each incorrect option
7. Share practical advice for approaching similar multiple choice questions
8. Summarize the key principle or method that helps distinguish correct from incorrect options

**IMPORTANT:** Your Q&A must be completely self-contained and include the full question and all answer options. Write as if directly answering a student's question about how to analyze multiple choice options effectively. Always analyze the correct option first, then the incorrect ones.

Structure your response in a natural Q&A format that flows logically from the question to analysis of the correct option, then the incorrect options.

Conversational Dialogue Template (for Option-Level Reasoning Analysis):

You are creating a natural conversational dialogue between a curious student and a knowledgeable assistant discussing a multiple choice question and its analysis.

Given:
• The question: {{prompt}}
• The correct answer: {{target}}

**Content Requirements:**
- Generate MAXIMUM 3000 words of natural conversational dialogue
- Use "User:" and "Assistant:" to clearly mark each speaker
- Final answers must appear within \boxed{…}
- Make the conversation flow naturally with realistic student questions
- Include follow-up questions and clarifications that feel authentic
- Create an engaging back-and-forth that teaches multiple choice analysis through dialogue

**Your Task:**
Create a natural conversation where a student asks about this multiple choice question and you provide helpful explanations. The dialogue should:
1. Start with the student presenting the complete question and all options they're working on
2. Include the student asking why this type of question is important or how to approach it
3. Have the student share their initial thoughts about the options and ask for guidance
4. Show the assistant first explaining why the correct option is right with detailed step-by-step reasoning
5. Include the student asking for clarification about the correct option's reasoning
6. Have the assistant then analyze each incorrect option, explaining why each is wrong
7. End with the student asking for general tips to improve at multiple choice questions

**IMPORTANT:** Create a completely self-contained dialogue that includes the full question and all answer options naturally within the conversation. Make it feel like an authentic tutoring session with realistic questions and responses about multiple choice strategy. Always analyze the correct option first, then the incorrect ones.

Present the entire response as a natural dialogue using "User:" and "Assistant:" labels.

Note: The complete original prompt templates for Failure Analysis are documented in the Genesis I Appendix and remain unchanged for Genesis II. The four Option-Level Reasoning Analysis templates documented above are the new additions for Genesis II, enabling the dual-method pipeline that processes both correct and incorrect model responses.
