TransformerLM 90M — OpenWebText pretraining (Phase 1)

A 90M-parameter GPT-2-style decoder-only transformer trained from scratch on OpenWebText (1.8B tokens, of which 1.62B are used for training; about 2.5 effective epochs).

This checkpoint is the result of phase 1 of an experimental training pipeline that benchmarks:

Hardware ablation (M1 Max · A100 · RTX PRO 6000 Blackwell · H100)
Precision ablation (fp32 vs bf16 vs fp16)
Optimizer ablation (AdamW vs Muon with Moonshot-scale fixes)
Training recipe ablations (LR schedule, grad clipping, GPT-2 init, weight decay)
Continued pretraining (phase 2 with FineWeb — separate repo)

Architecture

n_embd         = 512          # hidden size
n_heads        = 8            # attention heads (head_size = 64)
n_transformers = 12           # layers
block_size     = 2048         # context length
vocab_size     = 50304        # padded gpt2 vocab (real: 50257)
dropout        = 0.0          # (disabled for pretraining)
# Total: ~90.4M parameters
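The total can be cross-checked from the hyperparameters alone. The sketch below assumes details the card does not state: an untied `lm_head` without bias, biases on every linear layer inside the blocks, and LayerNorm with weight and bias. The function name `count_params` is illustrative and not part of the training code.

```python
def count_params(n_embd=512, n_layers=12, block_size=2048, vocab_size=50304, mlp_ratio=4):
    """Rough parameter count for a GPT-2-style decoder.

    Assumptions (not confirmed by the model card): untied lm_head without bias,
    biases on all block linears, LayerNorm with weight and bias.
    """
    hidden = mlp_ratio * n_embd
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # 2D weight matrices in one block: fused QKV, attention out-proj, MLP up, MLP down
    block_2d = n_embd * 3 * n_embd + n_embd * n_embd + n_embd * hidden + hidden * n_embd
    # 1D tensors in one block: the four linear biases plus two LayerNorms (weight + bias)
    block_1d = (3 * n_embd + n_embd + hidden + n_embd) + 2 * (2 * n_embd)
    final_norm = 2 * n_embd
    lm_head = vocab_size * n_embd
    total = tok_emb + pos_emb + n_layers * (block_2d + block_1d) + final_norm + lm_head
    return {"total": total, "block_2d": n_layers * block_2d}


counts = count_params()
print(f"total: {counts['total'] / 1e6:.1f}M, 2D block params: {counts['block_2d'] / 1e6:.1f}M")
```

Under these assumptions the count lands at about 90.4M in total, with about 37.7M in the 2D block matrices, which matches the figures quoted in this card.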

Features:

Causal self-attention with fused QKV and PyTorch scaled_dot_product_attention (FlashAttention-2 when on NVIDIA).
Learned positional embeddings (not RoPE).
LayerNorm (pre-norm) + GELU MLP with 4× hidden expansion.
GPT-2-style init (N(0, 0.02) weights, scaled by 1/√(2·n_layers) on residual projections).
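As a small illustration of the init rule in the last item, the sketch below computes the standard deviation used for each kind of weight matrix. It reads N(0, 0.02) as a standard deviation of 0.02 and takes "residual projections" to mean the attention output projection and the second MLP linear, as in GPT-2; both readings are assumptions, and `init_std` is a hypothetical helper.

```python
import math


def init_std(is_residual_proj, n_layers=12, base_std=0.02):
    """Std of the normal init for one weight matrix (GPT-2-style, assumed)."""
    if is_residual_proj:
        # Residual-branch outputs are scaled down so the residual stream's
        # variance does not grow with depth.
        return base_std / math.sqrt(2 * n_layers)
    return base_std


print(f"regular weights:      std = {init_std(False):.4f}")
print(f"residual projections: std = {init_std(True):.4f}")
```

With 12 layers the residual projections start at roughly a fifth of the base standard deviation.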

Training

Property	Value
Data	`chanind/openwebtext-gpt2` — 15 shards = 1.8B tokens
Splits	1.62B train / 180M val
Optimizer	Hybrid Muon (Moonshot) for 2D block params (37.7M) + Fused AdamW for embeddings, lm_head, norms
LR peak	3e-4 (AdamW) / 0.02 (Muon)
LR schedule	Cosine with 500-step warmup, decay to 10% of peak
Gradient clipping	1.0
Weight decay	0.1 on 2D params, 0 on 1D
Precision	bf16 (model + activations) + fp32 optimizer states
Compile	`torch.compile` (default mode)
Batch	64 sequences × 2048 tokens = 131,072 tokens/step
Hardware	Single NVIDIA RTX PRO 6000 Blackwell Server (96 GB GDDR7)
Throughput	~418,000 tokens/sec
Steps	30,000 (~2.5 effective epochs over the 1.62B-token train split)
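The learning-rate schedule in the table can be sketched as a plain function of the step. The card does not say how the warmup is shaped or over how many steps the cosine decays, so this sketch assumes a linear warmup and a decay horizon equal to the 30,000 training steps; `lr_at` is an illustrative name.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=30_000, min_ratio=0.1):
    """Cosine schedule with linear warmup, decaying to min_ratio * peak_lr.

    Assumptions: linear warmup, decay horizon equal to total_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


for s in (0, 499, 500, 15_000, 30_000):
    print(s, f"AdamW lr = {lr_at(s):.2e}", f"Muon lr = {lr_at(s, peak_lr=0.02):.2e}")
```

The same function covers both optimizers by changing `peak_lr` (3e-4 for AdamW, 0.02 for Muon). Whether the actual run shares one schedule across both parameter groups is not stated here.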

Loss trajectory

Step	Train loss	Val loss
0	11.00	—
1000	4.80	4.80
6000	4.05	4.07
12000 (end epoch 1)	3.97	3.96
18000	3.89	3.93
24000 (end epoch 2)	3.89	3.89
30000	3.85	3.89

Val perplexity at step 30000: exp(3.89) ≈ 49.
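Perplexity is the exponential of the mean cross-entropy in nats, so the figure above follows directly from the validation loss. The sketch below also computes the loss of a uniform guess over the padded vocabulary, a common sanity check for the step-0 loss; the helper names are illustrative.

```python
import math


def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


def uniform_loss(vocab_size):
    """Cross-entropy of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)


print(f"val perplexity at step 30000: {perplexity(3.89):.1f}")
print(f"uniform-guess loss over 50304 tokens: {uniform_loss(50304):.2f}")
```

A freshly initialized model is expected to start near the uniform-guess loss of about 10.83, which is consistent with the 11.00 reported at step 0.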

Usage

import torch, tiktoken
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="juliannunezb/llm-training-v1-checkpoints",
    filename="checkpoint.pt",
    repo_type="model",
)
# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust
ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Rebuild model (see full model code in the original training repo)
from transformer_lm_v1 import TransformerLM
model = TransformerLM().cuda().to(torch.bfloat16)
model.load_state_dict(ckpt["model"], strict=True)
model.eval()

# Generate
tok = tiktoken.get_encoding("gpt2")
prompt_ids = torch.tensor([tok.encode("Once upon a time")], dtype=torch.long, device="cuda")
with torch.no_grad():
    for _ in range(200):
        # Feed at most the last 2048 tokens (block_size); keep only the final position, in fp32
        logits = model(prompt_ids[:, -2048:])[:, -1, :].float()
        # Mask phantom vocab-padding tokens (ids >= 50257)
        logits[:, 50257:] = float("-inf")
        # Temperature 0.9
        probs = torch.softmax(logits / 0.9, dim=-1)
        # Optional top-k filter (k = 40): zero everything below the 40th-largest probability
        v, _ = torch.topk(probs, 40)
        probs[probs < v[:, [-1]]] = 0
        probs = probs / probs.sum(dim=-1, keepdim=True)
        next_id = torch.multinomial(probs, 1)
        prompt_ids = torch.cat([prompt_ids, next_id], dim=1)
print(tok.decode(prompt_ids[0].tolist()))
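To make the sampling step easier to follow in isolation, here is a dependency-free sketch of the same sequence of operations on a plain list of logits: mask the padding ids, apply temperature, softmax, keep the top k, renormalize. It mirrors the loop above but is not the project's code, and `top_k_probs` is a hypothetical helper.

```python
import math


def top_k_probs(logits, k, temperature=1.0, n_real_vocab=None):
    """Sampling probabilities after padding mask, temperature, softmax and top-k."""
    if n_real_vocab is not None:
        # Ids at or beyond n_real_vocab are vocab padding and must never be sampled
        logits = [x if i < n_real_vocab else float("-inf") for i, x in enumerate(logits)]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k largest probabilities, zero the rest, then renormalize
    kth = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= kth else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


# Toy vocabulary of 4 ids where id 3 is padding; keep the 2 most likely real tokens
print(top_k_probs([2.0, 1.0, 0.0, 5.0], k=2, n_real_vocab=3))
```

In the toy call, the padding id has the largest raw logit but ends up with probability 0, which is the reason the mask is applied before the softmax.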

For a ready-to-run inference script and Gradio playground, see the project's GitHub / companion repo.

Sample generations

Prompt: "Once upon a time", temperature=0.9, top_k=40.

"Once upon a time there were more than 100,000 registered voters in the United States today. For those who may be voting for Obama, that's a huge increase, but also a staggering increase. One of the biggest obstacles to getting to it is to keep the election from running through the next election process..."

Prompt: "The best way to cook pasta is"

"The best way to cook pasta is to cook the first, then add a quick food solution, then choose a simple recipe. The next step is the simple way to cook your bread..."

Prompt: "In 2010, scientists discovered"

"In 2010, scientists discovered that the earliest known human tissue in the human brain was present in the brain. The discovery was the first in the history of human DNA in the UK..."

The model produces fluent English with correct grammar and adapts its style to the prompt (political commentary, recipe, science article), but has limited factual accuracy and occasional word hallucinations — expected for a 90M model at ~2× Chinchilla pretraining on OpenWebText alone.
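The "~2× Chinchilla" figure can be reproduced from the training table. The sketch below assumes the common rule of thumb of about 20 training tokens per parameter for compute-optimal training; that constant is an approximation and does not come from this card, and `token_budget` is an illustrative name.

```python
def token_budget(steps=30_000, tokens_per_step=64 * 2048, n_params=90.4e6,
                 train_tokens=1.62e9, chinchilla_tokens_per_param=20.0):
    """Tokens processed in training and how that compares to a Chinchilla-style budget.

    chinchilla_tokens_per_param = 20 is a rule-of-thumb assumption.
    """
    tokens_seen = steps * tokens_per_step
    return {
        "tokens_seen": tokens_seen,
        "epochs": tokens_seen / train_tokens,
        "tokens_per_param": tokens_seen / n_params,
        "chinchilla_multiple": tokens_seen / (chinchilla_tokens_per_param * n_params),
    }


b = token_budget()
print(f"tokens seen: {b['tokens_seen'] / 1e9:.2f}B over {b['epochs']:.2f} epochs")
print(f"{b['tokens_per_param']:.1f} tokens/param, {b['chinchilla_multiple']:.2f}x the 20 tokens/param budget")
```

With the rounded figures from the table this gives about 3.93B tokens processed, a little over twice the rule-of-thumb budget, and roughly 2.4 passes over the train split, close to the ~2.5 effective epochs quoted.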

Limitations

English only (trained exclusively on OpenWebText).
No instruction following — this is a base/pretrained model, not fine-tuned for chat or instructions.
Small scale — 90M params is ~2 orders of magnitude below modern LLMs. Use it to study training dynamics, not for downstream tasks.
Factual hallucinations are very common.
Safety — not aligned, filtered, or moderated in any way.

License

MIT. The model weights are released under the same permissive terms as the training code. OpenWebText is a community recreation, and its underlying content remains covered by the respective original licenses of the crawled pages.

Citation

If you use this model in research, please cite:

@misc{juliannunezb_transformerlm_90m_2026,
  author       = {Juli{\'a}n N{\'u}{\~n}ez Barrero},
  title        = {TransformerLM 90M (OpenWebText pretraining)},
  year         = 2026,
  howpublished = {\url{https://huggingface.co/juliannunezb/llm-training-v1-checkpoints}}
}
