TransformerLM 90M — OpenWebText pretraining (Phase 1)
A 90M-parameter GPT-2-style decoder-only transformer trained from scratch on OpenWebText (1.8B-token corpus, ~2.5 effective epochs over the train split).
This checkpoint is the result of phase 1 of an experimental training pipeline that benchmarks:
- Hardware ablation (M1 Max · A100 · RTX PRO 6000 Blackwell · H100)
- Precision ablation (fp32 vs bf16 vs fp16)
- Optimizer ablation (AdamW vs Muon with Moonshot-scale fixes)
- Training recipe ablations (LR schedule, grad clipping, GPT-2 init, weight decay)
- Continued pretraining (phase 2 with FineWeb — separate repo)
Architecture
n_embd = 512 # hidden size
n_heads = 8 # attention heads (head_size = 64)
n_transformers = 12 # layers
block_size = 2048 # context length
vocab_size = 50304 # padded gpt2 vocab (real: 50257)
dropout = 0.0 # (disabled for pretraining)
# Total: ~90.4M parameters
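The ~90.4M total can be reproduced from the config with a few lines of arithmetic. The sketch below assumes details this card does not state: bias terms on every linear layer and LayerNorm, a final LayerNorm before the output head, and an `lm_head` that is not tied to the token embedding. The function name `count_params` is illustrative and not part of the training code.

```python
# Back-of-the-envelope parameter count for the config above.
# Assumptions (not confirmed by the model card): biases on all Linear and
# LayerNorm modules, one final LayerNorm, untied lm_head without bias.
n_embd = 512
n_layers = 12        # n_transformers in the config
block_size = 2048
vocab_size = 50304


def count_params(n_embd, n_layers, block_size, vocab_size):
    tok_emb = vocab_size * n_embd                 # token embedding table
    pos_emb = block_size * n_embd                 # learned positional embeddings
    attn = n_embd * 3 * n_embd + 3 * n_embd       # fused QKV weight + bias
    attn += n_embd * n_embd + n_embd              # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd        # MLP up projection (4x expansion)
    mlp += 4 * n_embd * n_embd + n_embd           # MLP down projection
    norms = 2 * (2 * n_embd)                      # two LayerNorms per block
    final_norm = 2 * n_embd
    lm_head = n_embd * vocab_size                 # untied output head
    return tok_emb + pos_emb + n_layers * (attn + mlp + norms) + final_norm + lm_head


total = count_params(n_embd, n_layers, block_size, vocab_size)
# Weight matrices inside the blocks only: QKV (3), attn proj (1), MLP (4 + 4)
block_matrices = n_layers * 12 * n_embd * n_embd

print(f"total:          {total:,} (~{total / 1e6:.1f}M)")
print(f"block matrices: {block_matrices:,} (~{block_matrices / 1e6:.1f}M)")
```

Under these assumptions the total comes to 90,389,504 parameters, and the 2D matrices inside the blocks account for 37,748,736 of them.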
Features:
- Causal self-attention with fused QKV and PyTorch scaled_dot_product_attention (FlashAttention-2 when on NVIDIA).
- Learned positional embeddings (not RoPE).
- LayerNorm (pre-norm) + GELU MLP with 4× hidden expansion.
- GPT-2-style init (weights drawn from a normal distribution with mean 0 and std 0.02, with the std scaled by 1/√(2·n_layers) on residual projections).
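As a rough illustration of the init rule in the last bullet, the sketch below computes the standard deviation used for an ordinary weight matrix and for a residual projection. It assumes 0.02 is a standard deviation (the usual GPT-2 convention) and that the 1/√(2·n_layers) factor is applied to that std. The helper `init_std` is hypothetical and not taken from the training code.

```python
import math


def init_std(n_layers: int, residual_projection: bool, base_std: float = 0.02) -> float:
    """Std of the normal init for one weight matrix (GPT-2 convention, assumed)."""
    if residual_projection:
        # Attention output projection and MLP down projection write into the
        # residual stream, so their init is shrunk as depth grows.
        return base_std / math.sqrt(2 * n_layers)
    return base_std


n_layers = 12
std_default = init_std(n_layers, residual_projection=False)
std_resid = init_std(n_layers, residual_projection=True)

print(f"default std:  {std_default:.4f}")
print(f"residual std: {std_resid:.5f}")
```

For 12 layers the residual projections start with a std of roughly 0.0041, about one fifth of the default.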
Training
| Property | Value |
|---|---|
| Data | chanind/openwebtext-gpt2 — 15 shards = 1.8B tokens |
| Splits | 1.62B train / 180M val |
| Optimizer | Hybrid Muon (Moonshot) for 2D block params (37.7M) + Fused AdamW for embeddings, lm_head, norms |
| LR peak | 3e-4 (AdamW) / 0.02 (Muon) |
| LR schedule | Cosine with 500-step warmup, decay to 10% of peak |
| Gradient clipping | 1.0 |
| Weight decay | 0.1 on 2D params, 0 on 1D |
| Precision | bf16 (model + activations) + fp32 optimizer states |
| Compile | torch.compile (default mode) |
| Batch | 64 sequences × 2048 tokens = 131,072 tokens/step |
| Hardware | Single NVIDIA RTX PRO 6000 Blackwell Server (96 GB GDDR7) |
| Throughput | ~418,000 tokens/sec |
| Training length | 30,000 steps ≈ 3.9B tokens, ~2.5 effective epochs over the 1.62B-token train split |
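The LR schedule row can be sketched as a small function. This is a minimal version under stated assumptions: the warmup is linear, the cosine decay horizon is the full 30,000 steps, and both optimizers share the same schedule shape with different peaks. None of these details are confirmed by the table, and `lr_at` is an illustrative name.

```python
import math


def lr_at(step, peak_lr, warmup_steps=500, total_steps=30_000, min_ratio=0.10):
    """Cosine schedule with linear warmup, decaying to min_ratio * peak_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


for step in (0, 499, 500, 15_000, 30_000):
    adamw_lr = lr_at(step, peak_lr=3e-4)
    muon_lr = lr_at(step, peak_lr=0.02)
    print(f"step {step:>6}: AdamW lr = {adamw_lr:.2e}, Muon lr = {muon_lr:.2e}")
```

With these settings the AdamW rate ends at 3e-5 and the Muon rate at 0.002, i.e. 10% of their respective peaks.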
Loss trajectory
| Step | Train loss | Val loss |
|---|---|---|
| 0 | 11.00 | — |
| 1000 | 4.80 | 4.80 |
| 6000 | 4.05 | 4.07 |
| 12000 (end epoch 1) | 3.97 | 3.96 |
| 18000 | 3.89 | 3.93 |
| 24000 (end epoch 2) | 3.89 | 3.89 |
| 30000 | 3.85 | 3.89 |
Val perplexity at step 30000: exp(3.89) ≈ 49.
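The conversion from loss to perplexity is a one-liner. The snippet below applies it to the validation column of the table, assuming the reported loss is the mean per-token cross-entropy in nats. It also prints the loss of a uniform guess over the padded vocabulary, which gives a reference point for the step-0 value.

```python
import math

# Validation losses copied from the loss trajectory table
val_loss = {1000: 4.80, 6000: 4.07, 12000: 3.96, 18000: 3.93, 24000: 3.89, 30000: 3.89}

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats
val_ppl = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in val_ppl.items():
    print(f"step {step:>5}: val loss {val_loss[step]:.2f} -> perplexity {ppl:.1f}")

# Cross-entropy of a uniform distribution over the padded vocabulary
uniform_loss = math.log(50304)
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

The uniform-guess loss is about 10.83, so a step-0 train loss of 11.00 is close to what an untrained model would give.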
Usage
import torch, tiktoken
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="juliannunezb/llm-training-v1-checkpoints",
    filename="checkpoint.pt",
    repo_type="model",
)
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust
ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Rebuild model (see full model code in the original training repo)
from transformer_lm_v1 import TransformerLM
model = TransformerLM().cuda().to(torch.bfloat16)
model.load_state_dict(ckpt["model"], strict=True)
model.eval()

# Generate
tok = tiktoken.get_encoding("gpt2")
prompt_ids = torch.tensor([tok.encode("Once upon a time")], dtype=torch.long, device="cuda")
with torch.no_grad():
    for _ in range(200):
        # Crop the running sequence to the 2048-token context window
        logits = model(prompt_ids[:, -2048:])
        # Take the last position and cast to fp32 for a more stable softmax
        last = logits[:, -1, :].float()
        # Mask phantom vocab-padding tokens (ids >= 50257)
        last[:, 50257:] = float("-inf")
        probs = torch.softmax(last / 0.9, dim=-1)
        # Optional top-k filter: zero everything below the 40th-largest probability
        v, _ = torch.topk(probs, 40)
        probs[probs < v[:, [-1]]] = 0
        probs = probs / probs.sum(dim=-1, keepdim=True)
        next_id = torch.multinomial(probs, 1)
        prompt_ids = torch.cat([prompt_ids, next_id], dim=1)
print(tok.decode(prompt_ids[0].tolist()))
For a ready-to-run inference script and Gradio playground, see the project's GitHub / companion repo.
Sample generations
Prompt: "Once upon a time", temperature=0.9, top_k=40.
"Once upon a time there were more than 100,000 registered voters in the United States today. For those who may be voting for Obama, that's a huge increase, but also a staggering increase. One of the biggest obstacles to getting to it is to keep the election from running through the next election process..."
Prompt: "The best way to cook pasta is"
"The best way to cook pasta is to cook the first, then add a quick food solution, then choose a simple recipe. The next step is the simple way to cook your bread..."
Prompt: "In 2010, scientists discovered"
"In 2010, scientists discovered that the earliest known human tissue in the human brain was present in the brain. The discovery was the first in the history of human DNA in the UK..."
The model produces fluent English with correct grammar and adapts its style to the prompt (political commentary, recipe, science article), but it has limited factual accuracy and occasionally hallucinates words. This is expected for a 90M model trained on roughly 2× its Chinchilla-optimal token count using OpenWebText alone.
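The "~2× Chinchilla" figure can be checked against the numbers in the training table. The sketch below assumes the common rule of thumb of about 20 training tokens per parameter for a compute-optimal run. It also assumes the quoted throughput holds for the whole run, so the wall-clock estimate is only indicative.

```python
# Numbers taken from the training table
n_params = 90.4e6
tokens_per_step = 64 * 2048
steps = 30_000
train_split_tokens = 1.62e9
throughput = 418_000            # tokens per second

# Assumption: ~20 tokens per parameter as the Chinchilla rule of thumb
chinchilla_tokens = 20 * n_params

total_tokens = steps * tokens_per_step
epochs = total_tokens / train_split_tokens
chinchilla_ratio = total_tokens / chinchilla_tokens
hours = total_tokens / throughput / 3600

print(f"tokens per step:   {tokens_per_step:,}")
print(f"tokens seen:       {total_tokens / 1e9:.2f}B")
print(f"effective epochs:  {epochs:.2f}")
print(f"Chinchilla ratio:  {chinchilla_ratio:.2f}x")
print(f"est. wall clock:   {hours:.1f} h at constant throughput")
```

With these inputs the run sees about 3.93B tokens, which is a little over twice the 1.8B tokens the rule of thumb suggests for a 90M-parameter model.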
Limitations
- English only (trained exclusively on OpenWebText).
- No instruction following — this is a base/pretrained model, not fine-tuned for chat or instructions.
- Small scale — 90M params is ~2 orders of magnitude below modern LLMs. Use it to study training dynamics, not for downstream tasks.
- Factual hallucinations are very common.
- Safety — not aligned, filtered, or moderated in any way.
License
MIT. The model weights are released under the same permissive terms as the training code. OpenWebText is a community recreation of OpenAI's WebText dataset, and its underlying content is covered by the respective original licenses of the crawled pages.
Citation
If you use this model in research, please cite:
@misc{juliannunezb_transformerlm_90m_2026,
  author       = {Juli{\'a}n N{\'u}{\~n}ez Barrero},
  title        = {TransformerLM 90M (OpenWebText pretraining)},
  year         = 2026,
  howpublished = {\url{https://huggingface.co/juliannunezb/llm-training-v1-checkpoints}}
}