chatsd committed
Commit a7e5596 · verified · 1 Parent(s): 41dcf48

Update README.md

Files changed (1): README.md (+141 -1)
README.md CHANGED
@@ -1 +1,141 @@
- Model Files for https://github.com/chatrdh/transformers-and-MOE
---
tags:
- mixture-of-experts
- moe
- transformer
- language-model
- pytorch
- conditional-computation
datasets:
- custom
pipeline_tag: text-generation
license: mit
---

# Mixture-of-Experts Language Models

A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).

## Models

This repository contains two MoE architectures:

### 1. Sparse MoE (Top-K Routing)
Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.

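As a rough picture of what top-2 routing does (an illustrative sketch only; the class and argument names below are not taken from `moe/moelm.py`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k gate: each token selects k experts by softmax score."""
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: [batch, seq, d_model]
        probs = F.softmax(self.gate(x), dim=-1)  # [batch, seq, num_experts]
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize so the k selected gate weights sum to 1 per token;
        # expert outputs are then combined with these weights.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_probs, topk_idx

router = TopKRouter(d_model=512, num_experts=8, top_k=2)
weights, experts = router(torch.randn(1, 4, 512))
print(experts.shape)  # torch.Size([1, 4, 2]): two experts per token
```
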
### 2. Dynamic MoE (Confidence-Based Routing)
Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens use fewer experts, "hard" tokens use more.

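One common reading of confidence-based routing, in the spirit of the dynamic-routing paper cited under Reference, is to keep adding experts in descending gate probability until their cumulative mass exceeds the threshold τ. A toy sketch under that assumption (not necessarily the exact rule implemented in this repository):

```python
import torch
import torch.nn.functional as F

def dynamic_expert_selection(gate_logits, tau=0.8):
    """Toy confidence-based selection: per token, keep experts in descending
    gate probability until their cumulative mass reaches tau."""
    probs = F.softmax(gate_logits, dim=-1)               # [tokens, num_experts]
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # An expert is kept if the probability mass *before* it is still below tau.
    keep = (cumulative - sorted_probs) < tau             # boolean mask, same shape
    return sorted_idx, keep

gate_logits = torch.tensor([[4.0, 0.1, 0.1, 0.1],   # confident ("easy") token
                            [1.0, 0.9, 0.8, 0.7]])  # uncertain ("hard") token
idx, keep = dynamic_expert_selection(gate_logits, tau=0.8)
print(keep.sum(dim=-1))  # tensor([1, 4]): the harder token uses more experts
```
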
## Model Details

| Parameter | Sparse MoE | Dynamic MoE |
|-----------|------------|-------------|
| Layers | 4 | 4 |
| Hidden Dim | 512 | 512 |
| FFN Dim | 2048 | 2048 |
| Attention Heads | 8 | 8 |
| Experts | 8 | 4 |
| Routing | Top-2 | τ=0.8 threshold |
| Context Length | 256 | 256 |
| Vocab Size | 10,000 | 10,000 |

## Architecture

```
Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
  └─ RMSNorm → Multi-Head Self-Attention → Residual
  └─ RMSNorm → MoE Layer → Residual

MoE Layer:
  └─ Router (softmax gating)
  └─ Expert Selection (Top-K or Dynamic)
  └─ Weighted Expert Outputs
```

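In code, a block with this layout is a standard pre-norm Transformer block whose feed-forward sublayer is replaced by the MoE layer. A schematic version (module and argument names here are illustrative, not the repository's):

```python
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Pre-norm block: RMSNorm -> attention -> residual, then RMSNorm -> MoE -> residual."""
    def __init__(self, d_model: int, num_heads: int, moe_layer: nn.Module):
        super().__init__()
        self.attn_norm = nn.RMSNorm(d_model)   # nn.RMSNorm requires PyTorch >= 2.4
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.moe_norm = nn.RMSNorm(d_model)
        self.moe = moe_layer                   # maps [batch, seq, d_model] -> same shape

    def forward(self, x, attn_mask=None):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                       # residual around attention
        x = x + self.moe(self.moe_norm(x))     # residual around the MoE layer
        return x
```
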
## Training

Both models were trained with the following settings (a setup sketch follows the list):
- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Learning Rate**: 3e-4 with cosine decay
- **Warmup Steps**: 2,000
- **Weight Decay**: 0.1

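A minimal sketch of that schedule, assuming linear warmup into a cosine decay toward zero (the total step count and the decay floor are not stated above, so they are placeholders):

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=2000,
                                  peak_lr=3e-4, weight_decay=0.1):
    """AdamW with linear warmup followed by cosine decay, matching the settings above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:                               # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```
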
### Loss Functions

**Sparse MoE:**
```
L = L_CE + α * L_balance
```

**Dynamic MoE:**
```
L = L_CE + β * L_balance + γ * L_entropy
```

Where:
- `L_CE`: Cross-entropy loss
- `L_balance`: Load balancing loss (encourages uniform expert utilization)
- `L_entropy`: Entropy regularization (encourages sparse routing)

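Putting these terms together, a simplified combined objective could look like the sketch below. The load-balancing formula is the common Switch-Transformer-style auxiliary loss and the coefficient values are placeholders; the repository's exact formulas and α/β/γ values are not given in this card.

```python
import torch
import torch.nn.functional as F

def moe_loss(logits, targets, router_probs, expert_mask,
             balance_coef=0.01, entropy_coef=0.0):
    """Cross-entropy plus auxiliary MoE terms (illustrative formulas).

    logits:       [tokens, vocab_size] model outputs
    targets:      [tokens] target token IDs
    router_probs: [tokens, num_experts] softmax gate probabilities
    expert_mask:  [tokens, num_experts] 0/1 token-to-expert assignments
    """
    ce = F.cross_entropy(logits, targets)

    # Load balancing: num_experts * sum_e f_e * p_e, where f_e is the fraction of
    # tokens routed to expert e and p_e is the mean gate probability for expert e.
    num_experts = router_probs.size(-1)
    f = expert_mask.float().mean(dim=0)
    p = router_probs.mean(dim=0)
    balance = num_experts * (f * p).sum()

    loss = ce + balance_coef * balance
    if entropy_coef > 0:
        # Mean routing entropy; penalizing it pushes the gate toward sparse choices.
        entropy = -(router_probs * router_probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        loss = loss + entropy_coef * entropy
    return loss
```
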
## Usage

```python
import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt"))
```

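The card does not document the model's forward signature or a generation helper, so the loop below only assumes that calling the model on a `[batch, seq]` tensor of token IDs returns logits of shape `[batch, seq, vocab_size]` (if the model also returns an auxiliary loss, unpack accordingly):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, context_length=256):
    """Autoregressive sampling, assuming model(ids) -> logits of shape [B, T, vocab]."""
    model.eval()
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])        # crop to the context window
        next_logits = logits[:, -1, :] / max(temperature, 1e-6)
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# Example call (token IDs are placeholders; no tokenizer ships with this card):
# out = generate(sparse_model, torch.tensor([[1, 2, 3]]), max_new_tokens=20)
```
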
## Files

| File | Description |
|------|-------------|
| `sparse_moe_final.pt` | Sparse MoE model weights |
| `dynamic_moe_final.pt` | Dynamic MoE model weights |
| `sparse_moe_config.json` | Sparse MoE configuration |
| `dynamic_moe_config.json` | Dynamic MoE configuration |

## Citation

```bibtex
@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}
```

## Reference

Based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652)

## License

MIT