Mixture-of-Experts Language Models

A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).

Models

This repository contains two MoE architectures:

1. Sparse MoE (Top-K Routing)

Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.
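
A minimal sketch of top-k routing, for illustration only (the names router and experts are assumptions, not this repository's exact modules):

import torch
import torch.nn.functional as F

def top_k_routing(x, router, experts, k=2):
    # x: [batch, seq, d_model]; router: nn.Linear(d_model, num_experts); experts: list of FFN modules
    probs = F.softmax(router(x), dim=-1)                         # [batch, seq, num_experts]
    topk_probs, topk_idx = probs.topk(k, dim=-1)                 # keep the k most likely experts per token
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)   # renormalize the gate weights
    out = torch.zeros_like(x)
    for slot in range(k):
        idx = topk_idx[..., slot]                                # [batch, seq] expert index for this slot
        gate = topk_probs[..., slot].unsqueeze(-1)               # [batch, seq, 1] gate weight for this slot
        for e, expert in enumerate(experts):
            mask = idx == e                                      # tokens assigned to expert e in this slot
            if mask.any():
                out[mask] += gate[mask] * expert(x[mask])
    return out

Each token's output is the gate-weighted sum of its two selected experts, so only 2 of the 8 expert FFNs run per token.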

2. Dynamic MoE (Confidence-Based Routing)

Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens use fewer experts, "hard" tokens use more.
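
A minimal sketch of one way to implement such a threshold rule (assuming experts are added in order of decreasing router probability until their cumulative mass reaches τ; the repository's exact rule may differ):

import torch

def dynamic_expert_mask(probs, tau=0.8):
    # probs: [num_tokens, num_experts] softmax router outputs
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # keep expert i (in sorted order) if the experts before it have not yet reached tau
    keep = cum < tau
    keep[..., 1:] = keep[..., :-1].clone()
    keep[..., 0] = True
    # scatter the decisions back to the original expert positions
    return torch.zeros_like(probs, dtype=torch.bool).scatter(-1, sorted_idx, keep)

Confident tokens clear the threshold with a single expert, while uncertain tokens pull in two, three, or all four.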

Model Details

Parameter         Sparse MoE   Dynamic MoE
Layers            4            4
Hidden Dim        512          512
FFN Dim           2048         2048
Attention Heads   8            8
Experts           8            4
Routing           Top-2        τ=0.8 threshold
Context Length    256          256
Vocab Size        10,000       10,000

Architecture

Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
  └─ RMSNorm → Multi-Head Self-Attention → Residual
  └─ RMSNorm → MoE Layer → Residual

MoE Layer:
  └─ Router (softmax gating)
  └─ Expert Selection (Top-K or Dynamic)
  └─ Weighted Expert Outputs
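
A sketch of how the block wires together, assuming standard pre-norm residual connections (class and argument names here are illustrative, not the repository's exact API):

import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, moe_layer):
        super().__init__()
        self.attn_norm = nn.RMSNorm(d_model)  # nn.RMSNorm needs a recent PyTorch; older versions need a custom RMSNorm
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.moe_norm = nn.RMSNorm(d_model)
        self.moe = moe_layer                  # router + experts, e.g. the routing sketches above

    def forward(self, x, attn_mask=None):
        # pre-norm self-attention sub-layer with residual (attn_mask would be causal for language modeling)
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out
        # pre-norm MoE sub-layer with residual
        return x + self.moe(self.moe_norm(x))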

Training

Both models were trained with:

  • Optimizer: AdamW (β1=0.9, β2=0.95)
  • Learning Rate: 3e-4 with cosine decay
  • Warmup Steps: 2,000
  • Weight Decay: 0.1
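
A minimal sketch of a matching optimizer and schedule (the training script itself is not included here; linear warmup into cosine decay is assumed):

import math
import torch

def build_optimizer(model, total_steps, warmup_steps=2000, peak_lr=3e-4, weight_decay=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay)

    def lr_lambda(step):
        # linear warmup for the first warmup_steps, then cosine decay toward zero
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler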

Loss Functions

Sparse MoE:

L = L_CE + α * L_balance

Dynamic MoE:

L = L_CE + β * L_balance + γ * L_entropy

Where:

  • L_CE: Cross-entropy loss
  • L_balance: Load balancing loss (encourages uniform expert utilization)
  • L_entropy: Entropy regularization (encourages sparse routing)
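
For reference, the auxiliary terms commonly take the following form (this is the Switch-Transformer-style balance loss and a simple routing-entropy penalty; the exact formulation used in this repository may differ):

import torch

def load_balance_loss(router_probs, expert_mask):
    # router_probs: [num_tokens, num_experts] softmax outputs
    # expert_mask:  [num_tokens, num_experts] 0/1 token-to-expert assignments
    num_experts = router_probs.shape[-1]
    fraction_tokens = expert_mask.float().mean(dim=0)  # share of assignments each expert receives
    mean_probs = router_probs.mean(dim=0)              # average router probability per expert
    return num_experts * torch.sum(fraction_tokens * mean_probs)  # smallest when both are uniform

def entropy_loss(router_probs, eps=1e-9):
    # penalizes indecisive (high-entropy) routing distributions, pushing toward sparse routing
    return -(router_probs * (router_probs + eps).log()).sum(dim=-1).mean()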

Usage

import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt", map_location="cpu"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt", map_location="cpu"))
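
A forward pass can then be run as a quick check; this assumes the models take token IDs of shape [batch, seq] and return logits of shape [batch, seq, vocab_size] (check the repository code for the exact signature):

sparse_model.eval()
input_ids = torch.randint(0, 10000, (1, 256))   # dummy batch of token IDs
with torch.no_grad():
    logits = sparse_model(input_ids)            # expected shape: [1, 256, 10000]
next_token = logits[:, -1, :].argmax(dim=-1)    # greedy next-token prediction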

Files

File                      Description
sparse_moe_final.pt       Sparse MoE model weights
dynamic_moe_final.pt      Dynamic MoE model weights
sparse_moe_config.json    Sparse MoE configuration
dynamic_moe_config.json   Dynamic MoE configuration

Citation

@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}

Reference

The dynamic routing approach is based on "Harder Tasks Need More Experts: Dynamic Routing in MoE Models" (Huang et al., 2024).

License

MIT
