Mixture-of-Experts Language Models

A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).

Models

This repository contains two MoE architectures:

1. Sparse MoE (Top-K Routing)

Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.
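
A minimal sketch of top-k routing, for illustration only (the names router and experts are assumptions, not this repository's exact modules):

import torch
import torch.nn.functional as F

def top_k_routing(x, router, experts, k=2):
    # x: [batch, seq, d_model]; router: nn.Linear(d_model, num_experts); experts: list of FFN modules
    probs = F.softmax(router(x), dim=-1)                         # [batch, seq, num_experts]
    topk_probs, topk_idx = probs.topk(k, dim=-1)                 # keep the k most likely experts per token
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)   # renormalize the gate weights
    out = torch.zeros_like(x)
    for slot in range(k):
        idx = topk_idx[..., slot]                                # [batch, seq] expert index for this slot
        gate = topk_probs[..., slot].unsqueeze(-1)               # [batch, seq, 1] gate weight for this slot
        for e, expert in enumerate(experts):
            mask = idx == e                                      # tokens assigned to expert e in this slot
            if mask.any():
                out[mask] += gate[mask] * expert(x[mask])
    return out

Each token's output is the gate-weighted sum of its two selected experts, so only 2 of the 8 expert FFNs run per token.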

2. Dynamic MoE (Confidence-Based Routing)

Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens use fewer experts, "hard" tokens use more.
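
A minimal sketch of one way to implement such a threshold rule (assuming experts are added in order of decreasing router probability until their cumulative mass reaches τ; the repository's exact rule may differ):

import torch

def dynamic_expert_mask(probs, tau=0.8):
    # probs: [num_tokens, num_experts] softmax router outputs
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # keep expert i (in sorted order) if the experts before it have not yet reached tau
    keep = cum < tau
    keep[..., 1:] = keep[..., :-1].clone()
    keep[..., 0] = True
    # scatter the decisions back to the original expert positions
    return torch.zeros_like(probs, dtype=torch.bool).scatter(-1, sorted_idx, keep)

Confident tokens clear the threshold with a single expert, while uncertain tokens pull in two, three, or all four.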

Model Details

Parameter         Sparse MoE   Dynamic MoE
Layers            4            4
Hidden Dim        512          512
FFN Dim           2048         2048
Attention Heads   8            8
Experts           8            4
Routing           Top-2        τ=0.8 threshold
Context Length    256          256
Vocab Size        10,000       10,000

Architecture

Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
  └─ RMSNorm → Multi-Head Self-Attention → Residual
  └─ RMSNorm → MoE Layer → Residual

MoE Layer:
  └─ Router (softmax gating)
  └─ Expert Selection (Top-K or Dynamic)
  └─ Weighted Expert Outputs
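
A sketch of how the block wires together, assuming standard pre-norm residual connections (class and argument names here are illustrative, not the repository's exact API):

import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, moe_layer):
        super().__init__()
        self.attn_norm = nn.RMSNorm(d_model)  # nn.RMSNorm needs a recent PyTorch; older versions need a custom RMSNorm
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.moe_norm = nn.RMSNorm(d_model)
        self.moe = moe_layer                  # router + experts, e.g. the routing sketches above

    def forward(self, x, attn_mask=None):
        # pre-norm self-attention sub-layer with residual (attn_mask would be causal for language modeling)
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out
        # pre-norm MoE sub-layer with residual
        return x + self.moe(self.moe_norm(x))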

Training

Both models were trained with:

  • Optimizer: AdamW (β1=0.9, β2=0.95)
  • Learning Rate: 3e-4 with cosine decay
  • Warmup Steps: 2,000
  • Weight Decay: 0.1
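
A minimal sketch of a matching optimizer and schedule (the training script itself is not included here; linear warmup into cosine decay is assumed):

import math
import torch

def build_optimizer(model, total_steps, warmup_steps=2000, peak_lr=3e-4, weight_decay=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay)

    def lr_lambda(step):
        # linear warmup for the first warmup_steps, then cosine decay toward zero
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler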

Loss Functions

Sparse MoE:

L = L_CE + α * L_balance

Dynamic MoE:

L = L_CE + β * L_balance + γ * L_entropy

Where:

  • L_CE: Cross-entropy loss
  • L_balance: Load balancing loss (encourages uniform expert utilization)
  • L_entropy: Entropy regularization (encourages sparse routing)
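
For reference, the auxiliary terms commonly take the following form (this is the Switch-Transformer-style balance loss and a simple routing-entropy penalty; the exact formulation used in this repository may differ):

import torch

def load_balance_loss(router_probs, expert_mask):
    # router_probs: [num_tokens, num_experts] softmax outputs
    # expert_mask:  [num_tokens, num_experts] 0/1 token-to-expert assignments
    num_experts = router_probs.shape[-1]
    fraction_tokens = expert_mask.float().mean(dim=0)  # share of assignments each expert receives
    mean_probs = router_probs.mean(dim=0)              # average router probability per expert
    return num_experts * torch.sum(fraction_tokens * mean_probs)  # smallest when both are uniform

def entropy_loss(router_probs, eps=1e-9):
    # penalizes indecisive (high-entropy) routing distributions, pushing toward sparse routing
    return -(router_probs * (router_probs + eps).log()).sum(dim=-1).mean()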

Usage

import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt", map_location="cpu"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt", map_location="cpu"))
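
A forward pass can then be run as a quick check; this assumes the models take token IDs of shape [batch, seq] and return logits of shape [batch, seq, vocab_size] (check the repository code for the exact signature):

sparse_model.eval()
input_ids = torch.randint(0, 10000, (1, 256))   # dummy batch of token IDs
with torch.no_grad():
    logits = sparse_model(input_ids)            # expected shape: [1, 256, 10000]
next_token = logits[:, -1, :].argmax(dim=-1)    # greedy next-token prediction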

Files

File                      Description
sparse_moe_final.pt       Sparse MoE model weights
dynamic_moe_final.pt      Dynamic MoE model weights
sparse_moe_config.json    Sparse MoE configuration
dynamic_moe_config.json   Dynamic MoE configuration

Citation

@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}

Reference

The dynamic routing approach is based on "Harder Tasks Need More Experts: Dynamic Routing in MoE Models" (Huang et al., 2024).

License

MIT
