+---
+tags:
+- mixture-of-experts
+- moe
+- transformer
+- language-model
+- pytorch
+- conditional-computation
+datasets:
+- custom
+pipeline_tag: text-generation
+license: mit
+---
+# Mixture-of-Experts Language Models
+A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).
+## Models
+This repository contains two MoE architectures:
+### 1. Sparse MoE (Top-K Routing)
+Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.
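A minimal sketch of top-2 routing, assuming the router is a single linear layer followed by a softmax and that the kept gate weights are renormalized to sum to 1 per token. The class name `TopKRouter`, the attribute `gate`, and the renormalization step are illustrative assumptions, not taken from this repository's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Hypothetical top-k router: picks k experts per token."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model) -> probs: (num_tokens, num_experts)
        probs = F.softmax(self.gate(x), dim=-1)
        # Keep the k largest probabilities and their expert indices.
        weights, indices = torch.topk(probs, self.top_k, dim=-1)
        # Assumption: renormalize so the kept weights sum to 1 per token.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, indices


torch.manual_seed(0)
router = TopKRouter(d_model=512, num_experts=8, top_k=2)
tokens = torch.randn(5, 512)
weights, indices = router(tokens)
print(weights.shape, indices.shape)  # torch.Size([5, 2]) torch.Size([5, 2])
```

Only the two selected experts run for each token, so per-token compute stays roughly constant as more experts are added.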
+### 2. Dynamic MoE (Confidence-Based Routing)
+Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens, where the router is confident, use fewer experts, while "hard" tokens use more.
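One way to realize this, sketched here on the assumption that the selection rule follows the cumulative-probability scheme of the paper listed under Reference: experts are taken in descending order of router probability until their cumulative probability reaches the threshold. The function name `dynamic_expert_mask` is hypothetical, and the repository's actual rule may differ.

```python
import torch


def dynamic_expert_mask(probs: torch.Tensor, threshold: float = 0.8) -> torch.Tensor:
    """Return a boolean mask (tokens, experts) marking the selected experts.

    Assumed rule: take experts in descending probability order until the
    cumulative probability reaches `threshold`.
    """
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep an expert if the mass accumulated *before* it is still below the
    # threshold; the top-1 expert is therefore always kept.
    keep_sorted = (cumulative - sorted_probs) < threshold
    # Map the mask from sorted order back to the original expert order.
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(-1, sorted_idx, keep_sorted)
    return mask


probs = torch.tensor([
    [0.90, 0.05, 0.03, 0.02],  # confident router: one expert is enough
    [0.40, 0.30, 0.20, 0.10],  # uncertain router: three experts are needed
])
mask = dynamic_expert_mask(probs, threshold=0.8)
print(mask.sum(dim=-1))  # tensor([1, 3])
```

Because the top-1 expert is always kept, every token is processed by at least one expert.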
+## Model Details
+| Parameter | Sparse MoE | Dynamic MoE |
+|-----------|------------|-------------|
+| Layers | 4 | 4 |
+| Hidden Dim | 512 | 512 |
+| FFN Dim | 2048 | 2048 |
+| Attention Heads | 8 | 8 |
+| Experts | 8 | 4 |
+| Routing | Top-2 | Confidence threshold (τ=0.8) |
+| Context Length | 256 | 256 |
+| Vocab Size | 10,000 | 10,000 |
+## Architecture
+```
+Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output
+Transformer Block:
+  └─ RMSNorm → Multi-Head Self-Attention → Residual
+  └─ RMSNorm → MoE Layer → Residual
+MoE Layer:
+  └─ Router (softmax gating)
+  └─ Expert Selection (Top-K or Dynamic)
+  └─ Weighted Expert Outputs
+```
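The "Weighted Expert Outputs" step in the diagram above can be sketched as a gate-weighted sum over the experts chosen for each token. This is an illustration under assumptions: the helper `combine_expert_outputs`, the two-layer GELU experts, and the tiny dimensions are hypothetical and chosen only to keep the example fast.

```python
import torch
import torch.nn as nn


def combine_expert_outputs(x, weights, indices, experts):
    """Gate-weighted sum of the selected experts' outputs for each token.

    x: (tokens, d_model); weights, indices: (tokens, k); experts: list of modules.
    """
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # Find the (token, slot) pairs that were routed to expert e.
        token_ids, slot_ids = torch.where(indices == e)
        if token_ids.numel() == 0:
            continue  # no token selected this expert
        gate = weights[token_ids, slot_ids].unsqueeze(-1)
        # index_add_ accumulates each contribution into its token's row.
        out.index_add_(0, token_ids, gate * expert(x[token_ids]))
    return out


torch.manual_seed(0)
d_model, d_ff, num_experts = 16, 64, 4  # tiny sizes, for illustration only
# Assumption: each expert is a plain two-layer feed-forward network.
experts = [
    nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    for _ in range(num_experts)
]
x = torch.randn(6, d_model)
indices = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0], [0, 2], [1, 3]])
weights = torch.full((6, 2), 0.5)
out = combine_expert_outputs(x, weights, indices, experts)
print(out.shape)  # torch.Size([6, 16])
```

Looping over experts rather than tokens lets each expert process all of its tokens in one batched call.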
+## Training
+Both models were trained with:
+- **Optimizer**: AdamW (β1=0.9, β2=0.95)
+- **Learning Rate**: 3e-4 with cosine decay
+- **Warmup Steps**: 2,000
+- **Weight Decay**: 0.1
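The learning-rate settings above can be sketched as a plain function of the step number. The linear warmup shape, `total_steps`, and `min_lr` are assumptions made for illustration; only the peak rate (3e-4), the cosine decay, and the 2,000 warmup steps come from the list above.

```python
import math


def lr_at_step(step, max_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    total_steps and min_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        # Ramp up linearly; reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1999, 2000, 51_000, 100_000):
    print(step, lr_at_step(step))
```

With PyTorch, the optimizer settings map onto `torch.optim.AdamW(params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)`, and the function can drive `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at_step(step) / max_lr` as the multiplier.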
+### Loss Functions
+**Sparse MoE:**
+```
+L = L_CE + α * L_balance
+```
+**Dynamic MoE:**
+```
+L = L_CE + β * L_balance + γ * L_entropy
+```
+Where:
+- `L_CE`: Cross-entropy loss
+- `L_balance`: Load balancing loss (encourages uniform expert utilization)
+- `L_entropy`: Entropy regularization (encourages sparse routing)
+- `α`, `β`, `γ`: Scalar weights on the auxiliary loss terms
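The two auxiliary terms can be sketched as follows. The exact formulas used in this repository are not stated here, so the sketch assumes a Switch-Transformer-style balance loss (number of experts times the sum over experts of load share times mean router probability) and the mean per-token entropy of the router distribution. The function names are hypothetical.

```python
import torch


def load_balance_loss(probs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss (assumed form).

    probs: (tokens, experts) router probabilities.
    mask:  (tokens, experts) boolean, True where a token uses an expert.
    """
    num_experts = probs.shape[-1]
    # Load: share of all token-to-expert assignments that went to each expert.
    assignments = mask.float()
    load = assignments.sum(dim=0) / assignments.sum()
    # Importance: average router probability given to each expert.
    importance = probs.mean(dim=0)
    return num_experts * torch.sum(load * importance)


def routing_entropy(probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean per-token entropy of the router distribution."""
    return -(probs * torch.log(probs + eps)).sum(dim=-1).mean()


probs = torch.full((6, 4), 0.25)           # perfectly uniform router
mask = torch.ones(6, 4, dtype=torch.bool)  # every expert used equally
print(load_balance_loss(probs, mask).item())  # 1.0
print(routing_entropy(probs).item())
```

Under these assumed forms, the balance term is 1 when usage is perfectly uniform and grows as routing collapses onto a few experts, while minimizing the entropy term pushes each token's router distribution toward a few confident choices.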
+## Usage
+```python
+import torch
+from moe.moelm import MoeLM, DynamicMOELM
+# Load Sparse MoE
+sparse_model = MoeLM(
+    vocab_size=10000,
+    num_layers=4,
+    context_length=256,
+    d_model=512,
+    d_ff=2048,
+    num_heads=8,
+    num_experts=8,
+    top_k=2
+)
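+# map_location="cpu" lets the checkpoint load on machines without a GPU
+sparse_model.load_state_dict(torch.load("sparse_moe_final.pt", map_location="cpu"))
+sparse_model.eval()  # switch to inference mode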
+# Load Dynamic MoE
+dynamic_model = DynamicMOELM(
+    vocab_size=10000,
+    num_layers=4,
+    context_length=256,
+    d_model=512,
+    d_ff=2048,
+    num_heads=8,
+    num_experts=4,
+    confidence_threshold=0.8
+)
+dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt", map_location="cpu"))
+dynamic_model.eval()  # switch to inference mode
+```
+## Files
+| File | Description |
+|------|-------------|
+| `sparse_moe_final.pt` | Sparse MoE model weights |
+| `dynamic_moe_final.pt` | Dynamic MoE model weights |
+| `sparse_moe_config.json` | Sparse MoE configuration |
+| `dynamic_moe_config.json` | Dynamic MoE configuration |
+## Citation
+```bibtex
+@misc{moe-lm-2024,
+  title={Mixture-of-Experts Language Model},
+  author={Chaitanya},
+  year={2024},
+  url={https://github.com/chaitanya/transformers-and-MOE}
+}
+```
+## Reference
+Based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652)
+## License
+MIT