QVAC Genesis II: Expanding the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for LLM Pre-training
KEY HIGHLIGHTS
- Tether Data’s AI research division released QVAC Genesis II, expanding the largest publicly available synthetic educational dataset with 10 new educational domains: Astronomy, High School Chemistry, College Chemistry, High School Computer Science, College Computer Science, Econometrics, Electrical Engineering, Geography, High School Statistics, and Machine Learning. It also includes an improved College Physics domain (which underperformed in Genesis I). Genesis II contributes 86 million samples and 107 billion new tokens, bringing the combined Genesis I and II dataset to 148 billion tokens across 19 educational domains.
- Structured, option-wise Reasoning: Genesis II introduces a new Option-Level (OL) Reasoning data generation method that produces structured, option-wise reasoning for multiple-choice questions. The method systematically analyzes all answer options, reinforces correct reasoning paths, and explicitly explains common misconceptions. This method contributes ~54 billion new tokens to Genesis II.
- Superior Accuracy and Valid Answer Rate: Our new OL Reasoning method outperforms the original Failure Analysis (used in Genesis I) and Cosmopedia-v2, achieving an average accuracy of 29.91 compared to 21.76 (Failure Analysis) and 12.19 (Cosmopedia-v2) when trained with a comparable training budget. We also evaluate models using both Accuracy and Valid Answer Rate (the percentage of responses containing a clear, unambiguous answer). OL Reasoning attains a near-perfect Valid Answer Rate of 98.44% (LLM-as-a-judge), indicating strong structural and semantic consistency. When combining Failure Analysis and OL Reasoning tokens, performance improves further, reaching an average accuracy of 30.40.
- Dual-Method Generation Pipeline: By combining OL Reasoning Analysis with Failure Analysis, Genesis II introduces a dual-method data generation pipeline that maximizes question utilization across both solved and failed samples. This approach significantly increases dataset diversity, coverage, and reasoning depth, while reducing selection bias inherent in failure-only approaches.
- By making QVAC Genesis II openly available to researchers, Tether Data continues to empower the global AI community to accelerate the development of open-source educational LLMs and democratize access to foundational AI capabilities.
- QVAC Genesis II is made available under the CC BY-NC 4.0 license (Creative Commons Attribution-NonCommercial 4.0), allowing free use and adaptation for non-commercial research and educational purposes.
Copyright Complaints: We will take appropriate actions in response to notice of copyright infringement. If you believe your work has been used in a manner that infringes upon your intellectual property rights, please email [email protected] to file a notice of infringement.
🚀 Download QVAC Genesis II Dataset
Access the expanded multi-domain educational synthetic dataset with 10 new domains.
🔗 Get the Dataset
🚀 QVAC Genesis II Collection
Access the collection with the 3 models used for the evaluation.
🔗 Genesis II Collection
1. Introduction
Building upon the success of QVAC Genesis I [1], the largest publicly available synthetic dataset for educational content (41 billion tokens), Tether Data, S.A. de C.V. (Tether Data, we, us, our) introduces QVAC Genesis II, a major expansion that adds 10 new educational domains and 107 billion new tokens, and introduces a new Option-Level Reasoning data generation method. Combined with Genesis I, the dataset now totals 148 billion tokens.
Genesis I focused on core STEM disciplines (Mathematics, Physics, Biology, and Medicine) and demonstrated superior performance compared to existing synthetic datasets like Cosmopedia [2]. Genesis II extends this foundation by:
- Incorporating additional key educational domains
- Regenerating College Physics with the new dual-method pipeline (this domain underperformed in Genesis I)
- Introducing a new "Option-Level Reasoning" method that leverages questions answered correctly by the model
- Providing a more comprehensive LLM-as-a-judge evaluation framework that measures both accuracy and valid answer rate
Key Contributions
QVAC Genesis II expands upon Genesis I with the following contributions:
Domain expansion and College Physics regeneration. We added 10 new educational domains to the original 9, creating a comprehensive dataset covering 19 domains in total. Additionally, we regenerated College Physics using the new dual-method pipeline, as this domain underperformed in Genesis I:
- Genesis I domains: High School Biology, College Biology, Professional Medicine, College Medicine, High School Mathematics, College Mathematics, High School Physics, College Physics, Conceptual Physics
- Genesis II domains: College Chemistry, High School Chemistry, College Computer Science, High School Computer Science, High School Statistics, Astronomy, Geography, Electrical Engineering, Econometrics, Machine Learning, College Physics (regenerated)
Option-Level Reasoning method and dual-method pipeline. Genesis II introduces a new "Option-Level Reasoning" data generation method that creates educational content from questions the model answers correctly, analyzing all answer options comprehensively. Our evaluation demonstrates that Option-Level Reasoning Analysis alone outperforms both the original Failure Analysis method and Cosmopedia-v2, achieving an average accuracy of 29.91 compared to 21.76 (Failure Analysis) and 12.19 (Cosmopedia-v2), with a near-perfect Valid Answer Rate of 98.44%. By combining Option-Level Reasoning with Failure Analysis (the method from Genesis I), we create a dual-method pipeline that maximizes the utilization of all generated questions:
- Failure Analysis: Generates educational content explaining why incorrect answers fail and how to arrive at the correct solution
- Option-Level Reasoning Analysis (NEW): Generates a comprehensive analysis of all answer options, reinforcing correct reasoning and explaining common misconceptions
Enhanced LLM-as-a-judge evaluation methodology. We introduce a more comprehensive evaluation framework that measures:
- Accuracy: The percentage of questions answered correctly
- Valid Answer Rate: The percentage of responses where the LLM judge identifies a clear, single answer (as opposed to invalid responses with no answer or multiple conflicting answers)
This dual-metric approach provides deeper insight into model capabilities, demonstrating that Genesis II-trained models not only achieve higher accuracy but also produce significantly more valid, unambiguous responses.
Open-source contribution. We are making QVAC Genesis II available under the CC BY-NC 4.0 license, continuing to democratize access to high-quality pretraining data for public institutions, research labs, and the academic community.
2. Methodology
Note: For more detailed information about the base methodology used in Genesis II, including seed data acquisition and quality filtering, the original prompt templates for Scaling QA, MCQ Answering, LLM-as-a-Judge extraction, and Failure Analysis, pipeline orchestration (distilabel, vLLM), and model architectures and configurations, please refer to the comprehensive Genesis I Appendix.
Genesis II builds upon the proven "Learning from Failures" method from Genesis I while introducing a complementary data generation method. This dual-approach methodology maximizes the value extracted from every generated question, whether the model answers correctly or incorrectly.
For complete details on “Learning from Failures”, please refer to QVAC Genesis I.
2.1 Dual-Method Data Generation Pipeline
Figure 1. The enhanced Genesis II pipeline: Seed Data → Quality Filter → Scaling QA (generate 4 MCQs per seed) → Model Answering → Compare to Gold Label → Two methods:
- Failure Analysis (for incorrect answers): Generate educational failure-analysis content in four styles
- Option-Level Reasoning (for correct answers): Generate comprehensive option-by-option analysis in four styles
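To make the routing concrete, below is a minimal Python sketch of the decision step in Figure 1. The MCQ record and the generation work items are simplified stand-ins for the actual pipeline, which is orchestrated with distilabel and vLLM (see the Genesis I Appendix).

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: dict          # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    gold: str              # gold label, e.g. "B"

STYLES = ("textbook", "web_article", "qa", "dialogue")  # four output styles

def route(mcq: MCQ, model_answer: str) -> str:
    """Pick the generation method by comparing the model's answer to the gold label."""
    return "option_level_reasoning" if model_answer == mcq.gold else "failure_analysis"

def plan_generations(answered_mcqs):
    """Yield one (mcq, method, style) work item per style; every question is
    routed to exactly one method, so no generated question is discarded."""
    for mcq, model_answer in answered_mcqs:
        method = route(mcq, model_answer)
        for style in STYLES:
            yield mcq, method, style

# One correctly and one incorrectly answered question
q1 = MCQ("What is 2 + 2?", {"A": "3", "B": "4"}, gold="B")
q2 = MCQ("Capital of France?", {"A": "Paris", "B": "Rome"}, gold="A")
for _, method, style in plan_generations([(q1, "B"), (q2, "B")]):
    print(method, style)
```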
2.2 Option-Level Reasoning Analysis: Our New Method
Genesis II introduces the Option-Level Reasoning Analysis method, applied to questions that the model answers correctly during the Model Answering phase. While the original Failure Analysis method focused on extracting educational value from model errors, Option-Level Reasoning Analysis ensures that correctly answered questions also contribute high-quality educational content.
Rationale: A model answering a question correctly demonstrates understanding; making the reasoning behind that answer explicit, along with the explanation of why the other options are incorrect, yields valuable educational content. This approach:
- Increases dataset diversity by generating content from a different source (correct answers) with distinct reasoning patterns
- Reinforces correct reasoning patterns
- Explicitly addresses common misconceptions through incorrect option analysis
- Provides comprehensive coverage of the topic from multiple angles
- Maximizes the utilization of all generated questions (in Genesis I, correctly answered questions were not used)
Four Output Styles: Similar to Failure Analysis, Option-Level Reasoning Analysis generates educational content in four distinct styles:
- Educational Textbook: Formal, pedagogical explanations with clear section structure
- Web Articles: Engaging, conversational content optimized for online reading
- Question-Answer (Q&A): Direct, focused, tutoring-style responses
- Conversational Dialogue: Natural back-and-forth between a student and assistant
Each style analyzes the correct answer option first with detailed reasoning, then systematically examines each incorrect option. The complete prompt templates are documented in the Appendix.
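All four templates expose the same two placeholders, {{prompt}} and {{target}}. As a simple illustration (the substitution helper below is a stand-in, not the production code), filling a template reduces to string replacement:

```python
def fill_template(template: str, prompt: str, target: str) -> str:
    """Substitute the two placeholders used by the Appendix B templates."""
    return template.replace("{{prompt}}", prompt).replace("{{target}}", target)

# Abbreviated header of the Educational Textbook template (full text in Appendix B)
textbook_template = (
    "You are an educational content creator...\n"
    "Given:\n"
    "• The question: {{prompt}}\n"
    "• The correct answer: {{target}}\n"
)

print(fill_template(
    textbook_template,
    prompt="Which planet is largest? A. Mars B. Jupiter C. Venus D. Mercury",
    target="B. Jupiter",
))
```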
2.3 Domain Expansion
For Genesis II, we expanded the 9 domains from Genesis I to include 10 new educational domains:
New Domains:
- Chemistry: College Chemistry, High School Chemistry
- Computer Science: College Computer Science, High School Computer Science, Machine Learning
- Statistics: High School Statistics, Econometrics
- Interdisciplinary Sciences: Astronomy, Geography, Electrical Engineering
The rigorous seed data acquisition (using FineFineWeb [4]), quality filtering (using the Ultra-FineWeb-classifier [5]), and prompt engineering methodology from Genesis I are applied to the new domains. For the full methodology, refer to the Genesis I blog.
3. Pre-training with Megatron-LM
3.1 Overview: The Framework Challenge
Training a 1.7B parameter model from scratch on 64 GPUs sounds straightforward on paper, until you confront the fragmented landscape of distributed training frameworks.
On one hand, we have HuggingFace Transformers: the standard for model definitions, with thousands of architectures, clean APIs, and a large community. Qwen3-1.7B exists here, complete with its attention patterns, RoPE embeddings, and SwiGLU activations.
On the other hand, we have Megatron-Core: NVIDIA's framework for large-scale training, with optimized CUDA kernels, mature tensor parallelism, and communication patterns refined over years of training at scale. This is where training needs to happen if you want reasonable throughput on 64 GPUs.
The problem is these two worlds don't speak the same language.
The traditional path to using Megatron-Core required rewriting your model from scratch in Megatron's internal format: manually implementing each layer type with the correct parallelism layouts, debugging distributed deadlocks, writing checkpoint conversion scripts, and implementing data loading in Megatron's binary format. This could easily be a multi-month project.
Megatron-Bridge solves this by automatically converting HuggingFace model definitions into Megatron-compatible formats. This lets us use the Qwen3-1.7B architecture without rewriting it:
- Load the Qwen3-1.7B architecture from HuggingFace (layer configurations, attention heads, hidden dimensions)
- Initialize with random weights (no pretrained weights loaded)
- Train on our Genesis II dataset using Megatron-Core's distributed training
3.2 Hardware Configuration
All three models were trained separately on a 64-GPU cluster (8 nodes with 8 NVIDIA H100 GPUs each), connected via InfiniBand.
| Component | Specification |
|---|---|
| GPUs | 64 × NVIDIA H100 (80GB) |
| Nodes | 8 nodes × 8 GPUs each |
| Interconnect | InfiniBand with GPUDirect RDMA |
| Container | NVIDIA NeMo 25.09 |
3.3 Parallelism Strategy
Distributing the training across 64 GPUs requires deciding how to split the work. We use a combination of tensor parallelism and data parallelism:
| Parallelism Type | Size | What It Does |
|---|---|---|
| Tensor (TP) | 2 | Splits attention and feed-forward layers across 2 GPUs |
| Pipeline (PP) | 1 | No pipeline splitting (model fits in memory) |
| Data (DP) | 32 | 32 parallel workers process different batches |
Why TP=2? Tensor parallelism requires frequent communication between GPUs. At TP=2, this communication stays within a single node using fast NVLink. Higher TP would require cross-node communication on every layer, reducing throughput.
Why PP=1? Pipeline parallelism is useful for very large models that don't fit in memory. At 1.7B parameters with TP=2, the model fits comfortably, so pipeline splitting would only add overhead.
Why DP=32? After allocating GPUs for tensor parallelism (64 ÷ 2 = 32), the remaining capacity goes to data parallelism. Each of the 32 workers processes different batches in parallel, then synchronizes gradients.
3.4 Batch Configuration
The batch size configuration balances memory constraints with training efficiency:
| Parameter | Value | Rationale |
|---|---|---|
| Micro Batch Size | 4 per GPU | Limited by GPU memory with 4,096 token sequences |
| Gradient Accumulation | 16 steps | Accumulate gradients before synchronizing |
| Global Batch Size | 2,048 sequences | 4 × 32 workers × 16 accumulation steps |
| Tokens per Step | ~8.4M | 2,048 sequences × 4,096 tokens |
The micro batch size of 4 might seem small for 80GB GPUs, but at 4,096 tokens per sequence with tensor parallelism, this is near the memory limit. We compensate by accumulating gradients over 16 forward passes before updating weights, reaching our target global batch size of 2,048 sequences (~8.4 million tokens per training step).
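The parallelism and batch arithmetic from Sections 3.3 and 3.4 can be sanity-checked in a few lines of Python:

```python
world_size = 64                               # 8 nodes x 8 H100s
tp, pp = 2, 1                                 # tensor / pipeline parallel sizes
dp = world_size // (tp * pp)                  # -> 32 data-parallel workers

micro_batch = 4                               # sequences per GPU per forward pass
grad_accum = 16                               # forward passes per optimizer step
seq_len = 4096

global_batch = micro_batch * dp * grad_accum  # -> 2,048 sequences
tokens_per_step = global_batch * seq_len      # -> 8,388,608 (~8.4M) tokens

assert dp == 32 and global_batch == 2048
print(f"DP={dp}, global batch={global_batch}, tokens/step={tokens_per_step:,}")
```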
3.5 Training Configuration & Experimental Design
Rather than training a single model, we designed an experiment to evaluate how synthetic data, created using multiple prompts, generalizes. We trained three distinct models from scratch to create a rigorous comparison:
- Two Specialist Models: Each trained exclusively on one of the two distinct data types defined in Section 2:
- Failure Analysis Model: Trained solely on data generated from incorrect model answers (learning from failures).
- Option-Level Reasoning Model: Trained solely on data generated from correct model answers (analyzing all options).
- One Combined Model: Trained on a shuffled mixture of both Failure Analysis and Option-Level Reasoning datasets.
To ensure a fair comparison, all three models utilized the same hyperparameters and compute budget scaled to token count, differing only in the data composition.
Hyperparameters
We standardized the training duration for all Genesis II runs to a single epoch, ensuring the model encountered each unique synthetic example only once. This strategy mitigates the memorization of repeated tokens and encourages the learning of underlying logic.
- Total Training Tokens: ~54B (Specialist Models) | ~107B (Combined Model)
- Training Duration: 1 Epoch
- Sequence Length: 4,096 tokens
- Learning Rate: 2×10⁻⁴ → 2×10⁻⁵ (cosine decay)
- Warmup: 10% of training
- Weight Decay: 0.01
- Gradient Clipping: 1.0
- Precision: BF16
Learning rate schedule: We start with a warmup period (10% of the epoch) where the learning rate gradually increases to 2×10⁻⁴. This helps stabilize early training when the randomly initialized model produces noisy gradients. After warmup, the learning rate follows a cosine decay down to 2×10⁻⁵, allowing the model to settle into better solutions as it converges toward the end of the epoch.
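As an illustrative sketch (not Megatron's scheduler implementation), the warmup-plus-cosine schedule looks like this:

```python
import math

def lr_at(step: int, total_steps: int, peak: float = 2e-4,
          floor: float = 2e-5, warmup_frac: float = 0.10) -> float:
    """Linear warmup to the peak LR, then cosine decay to the floor."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps                  # warmup phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

total = 10_000
for s in (0, 999, 5_000, 9_999):
    print(f"step {s:>5}: lr = {lr_at(s, total):.2e}")
```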
BF16 precision: We use bfloat16 mixed precision with Flash Attention 2. This configuration significantly reduces memory usage and speeds up training throughput on the H100s without any meaningful loss in convergence accuracy compared to FP32.
3.6 Data Pipeline
Transforming raw text into a format that Megatron-Core can efficiently ingest requires a multi-stage pipeline. This preprocessing happens once before training begins, meaning any mistakes here propagate through the entire run.
Stage 1: Concatenation and Filtering
Our Genesis II data comprises thousands of individual JSONL files produced by the data generation workers. The first step consolidates these into a single file while applying quality filters:
- Documents must have a minimum text length of 100 characters (filtering out incomplete or truncated generations)
- Documents must have valid reasoning outputs (filtering out failed generations)
- The concatenated file is then shuffled with a fixed random seed for reproducibility
This filtering step is important because even small amounts of low-quality data (empty documents, truncated text, failed generations) can degrade training.
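A minimal sketch of this stage, assuming each JSONL record carries its document under a "text" key; the check for valid reasoning outputs is pipeline-specific and stubbed here, and the seed value is illustrative:

```python
import json
import random

MIN_CHARS = 100
rng = random.Random(1234)                     # fixed seed for reproducible shuffling

def looks_valid(record: dict) -> bool:
    """Stub for the 'valid reasoning output' check used by the real pipeline."""
    return bool(record.get("text", "").strip())

def concatenate_and_filter(in_paths, out_path):
    docs = []                                 # kept in memory for brevity
    for path in in_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if len(rec.get("text", "")) < MIN_CHARS:   # drop truncated generations
                    continue
                if not looks_valid(rec):                   # drop failed generations
                    continue
                docs.append(rec)
    rng.shuffle(docs)                         # single deterministic shuffle
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in docs:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```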
Stage 2: Tokenization and Binary Conversion
The filtered JSONL is then processed by Megatron's preprocessing tool, which:
- Tokenizes each document using the Qwen3 tokenizer
- Appends an end-of-document `<EOD>` token after each document
- Packs the tokenized documents sequentially into a binary file
- Builds an index file that records where each document starts and ends
The result is two files: a .bin file containing all token IDs packed end-to-end, and an .idx file containing the byte offsets for each document. This format allows the data loader to seek directly to any document without reading the entire file.
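The following toy example captures the idea behind the indexed-binary format (it is not Megatron's exact on-disk layout): token IDs are packed end-to-end, and an offsets array lets the loader seek straight to any document:

```python
import numpy as np

def write_packed(docs_token_ids, prefix):
    """Pack token IDs end-to-end (.bin) plus cumulative document offsets (.idx.npy)."""
    offsets, flat = [0], []
    for ids in docs_token_ids:
        flat.extend(ids)
        offsets.append(len(flat))             # end offset of each document
    np.array(flat, dtype=np.int32).tofile(prefix + ".bin")
    np.save(prefix + ".idx.npy", np.array(offsets, dtype=np.int64))

def read_doc(prefix, doc_id):
    """Read one document without loading the whole .bin into memory."""
    offsets = np.load(prefix + ".idx.npy")
    tokens = np.memmap(prefix + ".bin", dtype=np.int32, mode="r")
    return tokens[offsets[doc_id]:offsets[doc_id + 1]]

write_packed([[1, 2, 3], [4, 5]], "demo")
print(read_doc("demo", 1))                    # -> [4 5]
```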
Stage 3: Sequence Packing During Training
During training, Megatron's data loader constructs training sequences from this binary format:
- Continuous Sampling: The loader samples a starting position and reads tokens sequentially to fill the 4,096-token context window.
- Document Packing: If a document ends mid-sequence (marked by an `<EOD>` token), the next document begins immediately within the same sequence.
- Context Isolation (The "Reset"): Crucially, we employ attention masking and position ID resetting. When the model encounters an `<EOD>` token, the attention mask is reset so that tokens in the new document cannot attend to the previous one. This ensures that while documents share a compute sequence, they remain mathematically independent.
This "packing" approach is more efficient than padding each document to a fixed length so that short documents don't waste compute on padding tokens, and the model naturally learns to handle document transitions. A single training sequence might contain 2-3 documents if they're short enough.
4. Evaluation
4.1 Dataset Statistics
Genesis I Domains:
| Domain | Number of Samples | Tokens (B) |
|---|---|---|
| High school biology | 3,818,070 | 4.511 |
| College biology | 3,286,648 | 3.927 |
| Professional medicine | 1,552,474 | 1.884 |
| College medicine | 5,164,247 | 6.218 |
| High school mathematics | 3,244,240 | 4.277 |
| College mathematics | 5,895,052 | 8.243 |
| High school physics | 2,277,880 | 3.061 |
| College physics | 4,281,062 | 5.814 |
| Conceptual physics | 2,354,184 | 2.973 |
| Genesis I Total | 31,873,857 | 40.906 |
Genesis II New Domains — Failure Analysis Data:
| Domain | Number of Samples | Tokens (B) |
|---|---|---|
| College physics | 4,144,798 | 6.24 |
| Astronomy | 4,716,117 | 6.21 |
| Econometrics | 3,486,501 | 5.24 |
| College chemistry | 3,964,112 | 5.07 |
| Electrical Engineering | 3,901,901 | 4.96 |
| College computer science | 3,889,696 | 4.77 |
| Geography | 3,992,646 | 4.60 |
| High school statistics | 3,354,353 | 4.47 |
| High school chemistry | 3,327,350 | 4.15 |
| High school computer science | 3,365,258 | 4.06 |
| Machine learning | 3,133,569 | 3.87 |
| Failure Analysis Total | 41,276,301 | 53.64 |
Genesis II New Domains — Option-Level Reasoning Analysis Data:
| Domain | Number of Samples | Tokens (B) |
|---|---|---|
| Machine learning | 4,636,066 | 5.51 |
| High school statistics | 4,424,565 | 5.41 |
| High school chemistry | 4,464,847 | 5.22 |
| Econometrics | 3,871,249 | 5.06 |
| College chemistry | 4,182,669 | 5.01 |
| College physics | 3,672,394 | 4.81 |
| Geography | 4,301,699 | 4.77 |
| Astronomy | 3,970,849 | 4.71 |
| College computer science | 3,851,555 | 4.47 |
| Electrical Engineering | 3,758,536 | 4.44 |
| High school computer science | 3,885,236 | 4.41 |
| Option-Level Reasoning Analysis Total | 45,019,665 | 53.82 |
Genesis II Combined Total (Failure + Option-Level Reasoning): 86,295,966 samples | 107.46B tokens
Combined Genesis I + Genesis II Total: 118,169,823 samples | 148.37B tokens
4.2 Enhanced Evaluation Methodology
Genesis II introduces an evaluation framework that goes beyond simple accuracy. We evaluate models using LLM-as-a-Judge via the OpenCompass framework. The judge analyzes each model response and classifies it into one of the following categories:
Valid Answers vs Invalid Answers
When the LLM judge evaluates a model's response, it determines whether the response contains a clear, extractable answer:
✓ Valid Answers — The judge successfully identifies a single, clear answer in the response:
- The model commits to one specific answer option
- The response is unambiguous and can be evaluated for correctness
✗ Invalid Answers — The judge cannot extract a valid answer from the response. This includes two types:
- No Answer: The model fails to provide any clear answer (abstains, hedges, or gives an unclear response)
- Multiple Answers: The model provides multiple conflicting answers in the same response (e.g., "it could be A or B")
Metrics
Based on the judge's classification, we compute the following metrics:
| Metric | Definition |
|---|---|
| Valid Answer Rate | Percentage of responses where the judge identified a clear, single answer |
| No Answer Rate | Percentage of responses where the judge found no clear answer |
| Multiple Answers Rate | Percentage of responses with multiple conflicting answers |
| Accuracy | Percentage of valid answers that are correct |
Formula: Valid Answer Rate = 100% - No Answer Rate - Multiple Answers Rate
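Given per-response judge labels, the four metrics can be computed as follows (the label scheme is an assumed encoding of the judge's categories):

```python
from collections import Counter

def judge_metrics(labels):
    """Labels: 'correct' / 'incorrect' (valid), 'no_answer' / 'multiple_answers' (invalid)."""
    c = Counter(labels)
    total = sum(c.values())
    no_answer = 100 * c["no_answer"] / total
    multiple = 100 * c["multiple_answers"] / total
    valid = 100 - no_answer - multiple                  # the formula above
    n_valid = c["correct"] + c["incorrect"]
    accuracy = 100 * c["correct"] / n_valid if n_valid else 0.0
    return {"valid_answer_rate": valid, "no_answer_rate": no_answer,
            "multiple_answers_rate": multiple, "accuracy": accuracy}

labels = ["correct"] * 6 + ["incorrect"] * 2 + ["no_answer", "multiple_answers"]
print(judge_metrics(labels))  # valid 80%, accuracy 75% of valid answers
```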
Why Valid Answer Rate
Conventional accuracy can provide only a limited view of model performance (for details on the limitations of log-likelihood-based accuracy, please refer to the Genesis I blog), which is why we utilize a complementary metric. The Valid Answer Rate shows:
- Model confidence: Higher valid answer rates indicate the model has learned to make clear decisions rather than hedge
- Training data quality: Models trained on well-structured educational content learn to produce unambiguous responses
- Practical utility: In real-world applications, a model that provides valid, measurable answers is more useful than one that frequently abstains or gives conflicting responses
4.3 Benchmark Results
We evaluate Genesis II against Cosmopedia-v2 across the 10 new educational domains using the subdomains of the MMLU benchmark, as with Genesis I. Our evaluation is structured into two key comparisons designed to isolate the contribution of each method and understand how they work together.
Evaluation Overview
We conducted two main comparisons:
Comparison 1: Individual Methods vs Cosmopedia-v2
- Goal: Evaluate the performance of our new Option-Level Reasoning method independently, and compare it against both Failure Analysis (used in Genesis I) and Cosmopedia-v2.
- Token matching: Cosmopedia-v2 contains 27.45B tokens. To ensure a fair comparison, we train Cosmopedia-v2 for 2 epochs (~55B tokens), matching the token count of each individual Genesis II method (~54B tokens each).
- Key insight: This comparison reveals which data generation method produces the highest-quality educational content for pre-training.
Comparison 2: Combined Methods vs Cosmopedia-v2
- Goal: Investigate what happens when we combine Failure Analysis and Option-Level Reasoning Analysis into a unified dataset.
- Token matching: The combined Genesis II dataset totals ~107B tokens. To match this budget, we train Cosmopedia-v2 for 4 epochs (~110B tokens).
- Key insight: This comparison tests whether combining both methods provides additional benefits beyond using a single method alone.
Comparison 1: Individual Methods (1 epoch) vs Cosmopedia-v2 (2 epochs)
First, we compare Cosmopedia-v2 (trained for 2 epochs, ~55B tokens) against our two Genesis II methods: Failure Analysis (generated from incorrect answers, from Genesis I) and Option-Level Reasoning Analysis (generated from correct answers, new in Genesis II).
Key Results (see Table A.1 in Appendix for full details):
| Metric | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
|---|---|---|---|
| Average Accuracy | 12.19 | 21.76 | 29.91 |
Key Observation: Both Genesis II methods significantly outperform Cosmopedia-v2 across all domains. Notably, Option-Level Reasoning Analysis, our new method, achieves the highest accuracy (29.91 average), substantially outperforming both Failure Analysis (21.76 average) and Cosmopedia-v2 (12.19 average). This demonstrates the significant value of analyzing correctly-answered questions as a complementary data source for educational content generation.
Figure 2. Radar chart comparing accuracy scores across three configurations: Cosmopedia-v2 (2 epochs), Failure Analysis, and Option-Level Reasoning Analysis. Both Genesis II methods consistently outperform Cosmopedia-v2 across all domains.
Comparison 2: Combined Methods vs Cosmopedia-v2
Next, we investigate the effect of combining Failure Analysis and Option-Level Reasoning Analysis into a unified dataset. We compare Cosmopedia-v2 (trained for 4 epochs, ~110B tokens) against our combined data (~107B tokens).
Key Results (see Table A.2 in Appendix for full details):
| Metric | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
|---|---|---|
| Average Accuracy | 17.11 | 30.40 |
Key Finding: Genesis II (combined data) achieves an average accuracy of 30.40 compared to Cosmopedia-v2's 17.11, outperforming by ~1.8x on average.
Observations on Combining Methods: Comparing the combined dataset (30.40 average) to Option-Level Reasoning Analysis alone (29.91 average), we observe a modest improvement in overall accuracy. Interestingly, the combination brings College Chemistry to parity with Cosmopedia-v2 (both at 23.00), while maintaining strong leads in all other domains. This suggests that while Option-Level Reasoning Analysis is the primary driver of performance gains, combining both methods provides additional robustness and helps balance performance across domains.
Figure 3. Radar chart comparing accuracy scores between Cosmopedia-v2 (4 epochs) and Genesis II (combining Failure Analysis and Option-Level Reasoning Analysis). Genesis II maintains a substantial lead across all educational domains except for college chemistry, where it is on par.
4.4 Valid Answer Rate Analysis
Beyond accuracy, we analyze the Valid Answer Rate: the percentage of responses where the LLM judge could identify a clear, single answer. This metric reveals crucial differences in model behavior and training data quality. A higher valid answer rate means fewer invalid responses (no answer or multiple conflicting answers).
Comparison 1: Individual Methods vs Cosmopedia-v2 (2 epochs)
Key Results (see Table A.3 in Appendix for a breakdown by category):
| Metric | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
|---|---|---|---|
| Average Valid Answer Rate | 42.36% | 81.16% | 98.44% |
**Key Observation:** Option-Level Reasoning Analysis achieves near-perfect valid answer rates (98.44% average), with some domains reaching 100% (High School Geography). Failure Analysis also shows a strong improvement (81.16% average) over Cosmopedia-v2 (42.36% average). This demonstrates that both Genesis II methods train models to produce clear, unambiguous responses.
Figure 4. Radar chart comparing Valid Answer Rates across three configurations: Cosmopedia-v2 (2 epochs), Failure Analysis, and Option-Level Reasoning Analysis. Both Genesis II methods achieve dramatically higher valid answer rates.
Valid Answer Rate Comparison 2: Combined Methods vs Cosmopedia-v2 (4 epochs)
Key Results (see Table A.4 in Appendix for a breakdown by category):
| Metric | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
|---|---|---|
| Average Valid Answer Rate | 64.40% | 88.87% |
Key Finding: Genesis II achieves an average Valid Answer Rate of 88.87% compared to Cosmopedia-v2's 64.40%, a 38% relative improvement. This means:
- Genesis II-trained models produce valid, evaluable answers for nearly 9 out of 10 questions
- Cosmopedia-v2-trained models produce invalid responses (no answer or multiple answers) for about 1 in 3 questions
Figure 5. Radar chart comparing Valid Answer Rates between Cosmopedia-v2 (4 epochs) and Genesis II combined data. Genesis II maintains a substantial advantage across all domains.
4.5 Log-Likelihood Evaluation
For completeness, we also include results using the conventional log-likelihood evaluation. However, as discussed in Genesis I, we consider LLM-as-a-judge to be a more reliable evaluation approach for assessing model capabilities on educational tasks.
Key Results (see Table A.5 and Table A.6 in Appendix for a breakdown by category):
| Comparison | Cosmopedia-v2 | Genesis II |
|---|---|---|
| Combined Methods vs 4 epochs | 22.15% | 31.02% |
| Failure Analysis vs 2 epochs | 22.80% | 23.31% |
| Option-Level Reasoning Analysis vs 2 epochs | 22.80% | 25.50% |
Observations: Genesis II outperforms Cosmopedia-v2 on average across all configurations. The combined methods achieve 31.02% compared to Cosmopedia-v2's 22.15% (4 epochs). However, we note that in some individual subdomains (e.g., College Physics, Econometrics), Genesis II does not consistently outperform, highlighting the limitations of log-likelihood evaluation discussed in Genesis I.
Why LLM-as-a-Judge is More Reliable
To understand why log-likelihood evaluation can be misleading, consider two concrete examples that illustrate its fundamental limitations:
Example 1: Genesis II Model (Correct Reasoning, Wrong Log-Likelihood Selection)
Question: Blue light of wavelength 480 nanometers is most strongly reflected off a thin film of oil on a glass slide when viewed near normal incidence. Assuming that the index of refraction of the oil is 1.2 and that of the glass is 1.6, what is the minimum thickness of the oil film (other than zero)?
- A. 150 nm
- B. 200 nm ✅ (Correct)
- C. 300 nm
- D. 400 nm
Log-probabilities: A: -2.375, B: -2.500, C: -2.875, D: -2.875
| Evaluation Method | Selected Answer | Result |
|---|---|---|
| Log-likelihood | A | ❌ |
| LLM-as-a-Judge | B | ✅ |
What happened: Log-likelihood selected A. However, the model's actual generated response demonstrates correct reasoning:
B. 200 nm
*Why This Question is Important/Challenging* ...
The LLM-as-a-Judge correctly identified B by analyzing the model's complete response, revealing that the model actually understood the problem correctly.
Example 2: Cosmopedia Model (Incoherent Reasoning, Correct Log-Likelihood Selection)
Question: A grating spectrometer can just barely resolve two wavelengths of 500 nm and 502 nm, respectively. Which of the following gives the resolving power of the spectrometer?
- A. 2
- B. 250 ✅ (Correct)
- C. 5,000
- D. 10,000
Log-probabilities: A: -1.898, B: -0.773, C: -1.273, D: -3.031
| Evaluation Method | Selected Answer | Result |
|---|---|---|
| Log-likelihood | B | ✅ |
| LLM-as-a-Judge | C | ❌ |
What happened: Log-likelihood correctly identified B with high confidence. However, the model's actual generated response reveals completely incoherent reasoning:
C
Step 2: Acquiring the Spectrometer
To obtain the resolving power of the spectrometer, you need to acquire it. You can either buy a pre-made spectrometer from a science supply store or build one yourself using a DIY kit...
The generated text is nonsensical, discussing "acquiring" and "cleaning" a spectrometer rather than calculating its resolving power (R = λ/Δλ = 500/2 = 250). The LLM-as-a-Judge correctly identified C as the model's answer because that's what the model actually generated, even though log-likelihood happened to select the correct answer by chance.
Key Takeaway
These examples illustrate two complementary failure modes of log-likelihood evaluation:
- False negatives: Log-likelihood can select the wrong answer, even when the model's reasoning is correct (Genesis II example)
- False positives: Log-likelihood can select the right answer by chance, even when the model generates completely incoherent reasoning (Cosmopedia example)
In both cases, LLM-as-a-Judge provides a more accurate assessment of actual model capabilities by evaluating what the model truly produces rather than relying solely on first-token probability statistics. This is why we consider it the primary evaluation metric for Genesis II.
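For illustration, a minimal version of the first-token log-likelihood baseline critiqued above can be sketched with the transformers library. The model name and prompt format are placeholders (our judge-based evaluation runs through OpenCompass, and our models are trained from scratch):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-1.7B"                      # placeholder checkpoint for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def loglik_choice(question: str, options=("A", "B", "C", "D")) -> str:
    """Score each option letter as the next token and return the argmax.
    Note: this never looks at the model's generated reasoning, which is
    exactly the failure mode shown in the two examples above."""
    ids = tok(question + "\nAnswer:", return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    scores = {o: logprobs[tok(" " + o).input_ids[-1]].item() for o in options}
    return max(scores, key=scores.get)
```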
5. Conclusion
Summary of Findings: QVAC Genesis II represents a significant expansion of the largest publicly available synthetic educational dataset, adding 10 new educational domains and 107 billion tokens to bring the combined total to 148 billion tokens across 19 domains. The introduction of the Option-Level Reasoning method, which systematically analyzes all answer options and explicitly addresses common misconceptions, has proven highly effective, achieving an average accuracy of 29.91% compared to 21.76% for Failure Analysis and 12.19% for Cosmopedia-v2. When combined with the original Failure Analysis approach in a dual-method pipeline, Genesis II achieves 30.40% accuracy. These results, validated through our LLM-as-a-Judge evaluation framework, demonstrate that structured reasoning and explicit option analysis during synthetic data generation leads to models that produce clearer, more accurate, and more reliable responses on educational tasks.
Implications for Future Pre-training: Researchers, academics, public and research institutions, practitioners, and the broader AI community can use these datasets to build state-of-the-art base models, laying a strong foundation for subsequent post-training as well.
6. References
[1] QVAC Genesis I Blog Post. Hugging Face Blog. https://huggingface.co/blog/qvac/genesis-i
[2] Hugging Face. Cosmopedia: A synthetic dataset for pretraining language models. Hugging Face Hub. https://huggingface.co/datasets/HuggingFaceTB/Cosmopedia
[3] NVIDIA Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv. https://arxiv.org/abs/1909.08053
[4] m-a-p. FineFineWeb: A curated web corpus with domain categorization. Hugging Face Hub. https://huggingface.co/datasets/m-a-p/FineFineWeb
[5] OpenBMB. Ultra-FineWeb-classifier: A quality classifier for web content. Hugging Face Hub. https://huggingface.co/openbmb/Ultra-FineWeb-classifier
7. Appendix
Appendix A: Detailed Evaluation Results
This appendix contains the complete benchmark results comparing Genesis II against Cosmopedia-v2 across all evaluation configurations.
Table A.1: MMLU Accuracy — Cosmopedia-v2 (2 epochs) vs Failure Analysis vs Option-Level Reasoning Analysis
| Domain | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
|---|---|---|---|
| Average | 12.19 | 21.76 | 29.91 |
| Electrical Engineering | 17.24 | 25.52 | 35.86 |
| Astronomy | 15.13 | 31.58 | 36.18 |
| High School Geography | 11.62 | 26.26 | 26.77 |
| College Chemistry | 10.00 | 15.00 | 26.00 |
| High School Chemistry | 12.81 | 22.17 | 31.53 |
| College Computer Science | 10.00 | 17.00 | 25.00 |
| High School Computer Science | 11.00 | 14.00 | 31.00 |
| Machine Learning | 9.82 | 24.11 | 30.36 |
| High School Statistics | 9.26 | 18.52 | 25.93 |
| Econometrics | 10.53 | 23.68 | 28.07 |
| College Physics | 16.67 | 21.57 | 32.35 |
Table A.2: Accuracy — Cosmopedia-v2 (4 epochs) vs Combined Methods
| Domain | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
|---|---|---|
| Average | 17.11 | 30.40 |
| Electrical Engineering | 17.93 | 26.90 |
| Astronomy | 25.66 | 40.13 |
| High School Geography | 18.69 | 40.40 |
| College Chemistry | 23.00 | 23.00 |
| High School Chemistry | 17.73 | 27.59 |
| College Computer Science | 14.00 | 33.00 |
| High School Computer Science | 17.00 | 29.00 |
| Machine Learning | 14.29 | 29.46 |
| High School Statistics | 14.81 | 29.63 |
| Econometrics | 11.40 | 29.82 |
| College Physics | 13.73 | 25.49 |
Table A.3: Valid Answer Rate — Cosmopedia-v2 (2 epochs) vs Failure Analysis vs Option-Level Reasoning Analysis
| Domain | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
|---|---|---|---|
| Average | 42.36% | 81.16% | 98.44% |
| Electrical Engineering | 48.28% | 77.24% | 98.62% |
| Astronomy | 46.05% | 88.82% | 99.34% |
| High School Geography | 49.49% | 84.84% | 100.00% |
| College Chemistry | 46.00% | 88.00% | 97.00% |
| High School Chemistry | 40.89% | 78.82% | 99.02% |
| College Computer Science | 38.00% | 81.00% | 99.00% |
| High School Computer Science | 39.00% | 74.00% | 98.00% |
| Machine Learning | 36.61% | 81.25% | 96.42% |
| High School Statistics | 30.56% | 73.61% | 98.15% |
| Econometrics | 34.21% | 78.94% | 98.25% |
| College Physics | 56.87% | 86.28% | 99.02% |
Table A.4: Valid Answer Rate — Cosmopedia-v2 (4 epochs) vs Genesis II Combined
| Domain | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
|---|---|---|
| Average | 64.40% | 88.87% |
| Electrical Engineering | 75.17% | 90.34% |
| Astronomy | 66.45% | 94.73% |
| High School Geography | 71.72% | 93.93% |
| College Chemistry | 72.00% | 91.00% |
| High School Chemistry | 61.58% | 87.68% |
| College Computer Science | 62.00% | 88.00% |
| High School Computer Science | 59.00% | 81.00% |
| Machine Learning | 58.93% | 82.14% |
| High School Statistics | 59.72% | 87.03% |
| Econometrics | 48.25% | 88.60% |
| College Physics | 73.53% | 93.14% |
Table A.5: Log-Likelihood Accuracy — Cosmopedia-v2 (4 epochs) vs Combined Methods
| Domain | Cosmopedia-v2 (4 epochs) | Genesis II (Combined) |
|---|---|---|
| Average | 22.15% | 31.02% |
| Astronomy | 21.05% | 42.11% |
| College Chemistry | 23.00% | 29.00% |
| College Computer Science | 28.00% | 28.00% |
| College Physics | 17.65% | 22.55% |
| Econometrics | 24.56% | 21.05% |
| Electrical Engineering | 22.07% | 35.17% |
| High School Chemistry | 20.69% | 37.44% |
| High School Computer Science | 23.00% | 27.00% |
| High School Geography | 24.24% | 40.40% |
| High School Statistics | 14.35% | 25.46% |
| Machine Learning | 25.00% | 33.04% |
Table A.6: Log-Likelihood Accuracy — Cosmopedia-v2 (2 epochs) vs Failure Analysis vs Option-Level Reasoning Analysis
| Domain | Cosmopedia-v2 (2 epochs) | Failure Analysis | Option-Level Reasoning Analysis |
|---|---|---|---|
| Average | 22.80% | 23.31% | 25.50% |
| Astronomy | 19.74% | 21.71% | 23.03% |
| College Chemistry | 16.00% | 20.00% | 23.00% |
| College Computer Science | 23.00% | 22.00% | 22.00% |
| College Physics | 25.49% | 19.61% | 24.51% |
| Econometrics | 23.68% | 24.56% | 30.70% |
| Electrical Engineering | 29.66% | 27.59% | 30.34% |
| High School Chemistry | 21.67% | 17.73% | 31.03% |
| High School Computer Science | 22.00% | 25.00% | 24.00% |
| High School Geography | 25.76% | 25.76% | 23.74% |
| High School Statistics | 15.28% | 17.59% | 22.22% |
| Machine Learning | 28.57% | 34.82% | 25.89% |
Appendix B: Option-Level Reasoning Analysis Templates
The Option-Level Reasoning Analysis method generates educational content in four distinct styles. Each template is applied to questions that the model answers correctly, producing diverse educational materials that reinforce correct reasoning and explain why incorrect options fail.
Educational Textbook Template (for Option-Level Reasoning Analysis):
You are an educational content creator generating high-quality textbook explanations for multiple choice question analysis.
Given:
• The question: {{prompt}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate MAXIMUM 3000 words of educational content
- Final answers must appear within \boxed{…}
- Write in clear, pedagogical language suitable for textbooks
- Create appropriate section titles and structure that fit the specific topic and problem type
- Organize your explanation with logical sections that help students understand the concept, analyze the correct option first, then the incorrect ones
**Your Task:**
Analyze the given multiple choice question and all answer options. Create a comprehensive educational explanation that:
1. Includes the complete question and all options clearly within your explanation
2. Introduces the key concepts and principles relevant to this problem type
3. First analyzes the correct answer option in detail, explaining why it is right with step-by-step reasoning
4. Provides thorough verification of why the correct option is accurate, identifying key principles and logical reasoning
5. Then examines each incorrect answer option systematically, explaining why each is wrong
6. Identifies specific errors, misconceptions, or flaws in each incorrect option
7. Discusses the underlying principles that distinguish correct from incorrect reasoning
8. Explores broader applications and common misconceptions related to this topic
9. Concludes with actionable strategies for approaching similar questions
**IMPORTANT:** Your textbook explanation must be completely self-contained. Include the full question and all answer options within your response so readers have all necessary information without needing external context. Always analyze the correct option first, then the incorrect ones.
Structure your response with appropriate headings and sections that naturally fit the subject matter and question type, focusing on systematic analysis starting with the correct option.
Web Articles Template (for Option-Level Reasoning Analysis):
You are a content creator specializing in engaging, informative web articles that break down multiple choice questions and effective analysis strategies.
Given:
• The question: {{prompt}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate MAXIMUM 3000 words of engaging web content
- Use a conversational yet informative tone suitable for online readers
- Final answers must appear within \boxed{…}
- Create compelling headings and subheadings that work well for web reading
- Include relatable examples and practical insights
- Structure content for easy scanning with shorter paragraphs and clear sections
**Your Task:**
Create an engaging web article that breaks down this multiple choice question and demonstrates effective analysis. Your article should:
1. Start with the complete question and all options presented in an engaging way
2. Hook readers by explaining why mastering multiple choice analysis matters in real life
3. First analyze the correct answer option in detail, walking through why it's right with clear explanations
4. Provide step-by-step verification of the correct option using relatable analogies when helpful
5. Then systematically analyze each incorrect answer option, explaining why each is wrong
6. Show the specific errors or misconceptions in each incorrect option using accessible language
7. Share practical tips and common pitfalls readers should watch out for when tackling similar questions
8. End with actionable takeaways and strategies for improving multiple choice performance
**IMPORTANT:** Your article must be completely self-contained and include the full question and all answer options. Write for a general audience interested in learning better test-taking and analytical thinking skills. Use engaging language that makes complex reasoning accessible. Always analyze the correct option first, then the incorrect ones.
Structure your response with compelling headings that would work well for web content and encourage readers to keep reading through the analysis of the correct option first, then the incorrect ones.
Question-Answer Template (for Option-Level Reasoning Analysis):
You are an expert tutor providing clear, direct answers about multiple choice question analysis and reasoning.
Given:
• The question: {{prompt}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate MAXIMUM 3000 words of focused Q&A content
- Use clear, direct language with a helpful tutoring tone
- Final answers must appear within \boxed{…}
- Structure as natural Q&A flow that addresses the key learning points
- Focus on practical understanding and clear explanations
- Prioritize clarity and directness over lengthy explanations
**Your Task:**
Create a focused Q&A response that addresses this multiple choice question. Your response should:
1. Present the complete question and all options clearly as the main question
2. Identify what makes this type of question important or challenging
3. First analyze the correct answer option, explaining why it is right with step-by-step logic
4. Provide clear verification of the correct option's reasoning and principles
5. Then analyze each incorrect answer option systematically, explaining why each is wrong
6. Give specific examples of the errors or misconceptions in each incorrect option
7. Share practical advice for approaching similar multiple choice questions
8. Summarize the key principle or method that helps distinguish correct from incorrect options
**IMPORTANT:** Your Q&A must be completely self-contained and include the full question and all answer options. Write as if directly answering a student's question about how to analyze multiple choice options effectively. Always analyze the correct option first, then the incorrect ones.
Structure your response in a natural Q&A format that flows logically from the question to analysis of the correct option, then the incorrect options.
Conversational Dialogue Template (for Option-Level Reasoning Analysis):
You are creating a natural conversational dialogue between a curious student and a knowledgeable assistant discussing a multiple choice question and its analysis.
Given:
• The question: {{prompt}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate MAXIMUM 3000 words of natural conversational dialogue
- Use "User:" and "Assistant:" to clearly mark each speaker
- Final answers must appear within \boxed{…}
- Make the conversation flow naturally with realistic student questions
- Include follow-up questions and clarifications that feel authentic
- Create an engaging back-and-forth that teaches multiple choice analysis through dialogue
**Your Task:**
Create a natural conversation where a student asks about this multiple choice question and you provide helpful explanations. The dialogue should:
1. Start with the student presenting the complete question and all options they're working on
2. Include the student asking why this type of question is important or how to approach it
3. Have the student share their initial thoughts about the options and ask for guidance
4. Show the assistant first explaining why the correct option is right with detailed step-by-step reasoning
5. Include the student asking for clarification about the correct option's reasoning
6. Have the assistant then analyze each incorrect option, explaining why each is wrong
7. End with the student asking for general tips to improve at multiple choice questions
**IMPORTANT:** Create a completely self-contained dialogue that includes the full question and all answer options naturally within the conversation. Make it feel like an authentic tutoring session with realistic questions and responses about multiple choice strategy. Always analyze the correct option first, then the incorrect ones.
Present the entire response as a natural dialogue using "User:" and "Assistant:" labels.
Note: The complete original prompt templates for Failure Analysis are documented in the Genesis I Appendix and remain unchanged for Genesis II. The four Option-Level Reasoning Analysis templates documented above are the new additions for Genesis II, enabling the dual-method pipeline that processes both correct and incorrect model responses.