HiMoE: Hierarchical Mixture of Experts
A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.
Author: AG · Year: 2026
Overview
HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A Level-1 router selects one of N MoE blocks; that block's own Level-2 router selects one of M local experts. Only a single expert is ever activated per token, regardless of total model size.
Token
 └──► Level-1 Router (1 of 6 MoE blocks)
       └──► Level-2 Router (1 of 8 experts)
             └──► Expert FFN ──► output
With the default config (N=6, M=8, 2 layers) the model holds ~52M parameters but activates only ~3.3% of them per token, the compute footprint of a ~1.7M-parameter dense model.
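In code, the routing path sketched above boils down to two argmax gates; a minimal illustration (the module names and the 4x FFN expansion inside each expert are assumptions, not the actual train_himoe.py classes):

import torch
import torch.nn as nn

class HierarchicalRouting(nn.Module):
    # Illustrative two-level hard (top-1) routing; names do not mirror train_himoe.py
    def __init__(self, n_embd=256, num_moes=6, num_experts=8):
        super().__init__()
        self.level1 = nn.Linear(n_embd, num_moes)        # Level-1 gate over MoE blocks
        self.level2 = nn.ModuleList(                     # one Level-2 gate per MoE block
            [nn.Linear(n_embd, num_experts) for _ in range(num_moes)])
        self.experts = nn.ModuleList([                   # num_moes x num_experts expert FFNs
            nn.ModuleList([
                nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                              nn.Linear(4 * n_embd, n_embd))
                for _ in range(num_experts)])
            for _ in range(num_moes)])

    def forward(self, x):                                # x: (num_tokens, n_embd)
        out = torch.empty_like(x)
        for i, tok in enumerate(x):                      # token-by-token for clarity
            moe = self.level1(tok).argmax().item()       # choose 1 of num_moes blocks
            exp = self.level2[moe](tok).argmax().item()  # choose 1 of num_experts inside it
            out[i] = self.experts[moe][exp](tok)         # only this expert FFN executes
        return out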
Repository Structure
.
├── train_himoe.py           # Full training script (self-contained)
├── hamlet.txt               # Training corpus (place here before running)
├── README.md
└── model/                   # Created automatically on first save
    ├── config.json          # Hyperparameters + vocab snapshot
    ├── backbone.pt          # Embeddings, attention, LN, LM head
    ├── main_router.pt       # Level-1 gate (or layer_01_main_router.pt for n_layer > 1)
    ├── moe_expert_001/
    │   ├── router.pt        # Level-2 gate for this MoE block
    │   ├── model_001.pt
    │   ├── model_002.pt
    │   └── ... (model_008.pt)
    ├── moe_expert_002/
    │   └── ...
    ├── ...
    ├── moe_expert_006/
    ├── sample.txt           # Generated text after training
    └── routing_log.json     # Expert attribution for first 50 tokens
Each learnable component lives in its own file, making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model.
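For example, the config snapshot and backbone can be loaded on their own, without touching any expert (file paths come from the layout above; the exact keys inside config.json are defined by train_himoe.py):

import json
import torch

# Inspect the saved hyperparameters and load the shared backbone in isolation
with open("model/config.json") as f:
    config = json.load(f)
backbone_state = torch.load("model/backbone.pt", map_location="cpu")
print(config)                      # hyperparameters + vocab snapshot
print(list(backbone_state)[:5])    # first few backbone tensor names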
Quickstart
1. Install dependencies
pip install torch
PyTorch is the only third-party dependency; everything else is standard library.
2. Add your data
Place hamlet.txt (or any plain-text corpus) in the same directory as train_himoe.py.
3. Train
python train_himoe.py
Checkpoints are saved to model/ every eval_interval steps and at the end of training. A sample generation and routing log are written automatically.
4. Resume training
python train_himoe.py --resume
5. Custom config
All hyperparameters are overridable from the command line:
python train_himoe.py \
    --num_moes 8 \
    --num_experts 16 \
    --n_embd 512 \
    --n_layer 4 \
    --max_iters 10000 \
    --lr 2e-4 \
    --data_file my_corpus.txt \
    --model_dir checkpoints/run_01
Architecture
HiMoEConfig defaults
| Parameter | Default | Description |
|---|---|---|
| n_embd | 256 | Embedding / hidden dimension |
| n_layer | 2 | Number of Transformer layers |
| n_head | 4 | Attention heads |
| block_size | 128 | Context window (tokens) |
| num_moes | 6 | Level-1 choices (MoE blocks) |
| num_experts | 8 | Level-2 choices per MoE block |
| dropout | 0.1 | Dropout rate |
| batch_size | 32 | Training batch size |
| max_iters | 3000 | Training steps |
| lr | 3e-4 | Peak learning rate |
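For reference, these defaults map onto a plain config object; a minimal sketch (the actual HiMoEConfig in train_himoe.py may carry additional fields such as eval_interval):

from dataclasses import dataclass

@dataclass
class HiMoEConfig:
    n_embd: int = 256        # embedding / hidden dimension
    n_layer: int = 2         # Transformer layers
    n_head: int = 4          # attention heads
    block_size: int = 128    # context window (tokens)
    num_moes: int = 6        # Level-1 choices (MoE blocks)
    num_experts: int = 8     # Level-2 choices per MoE block
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4         # peak learning rate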
Sparsity
| Routing Level | Active | Total | % Active |
|---|---|---|---|
| Level-1 (MoE blocks) | 1 | 6 | 16.7% |
| Level-2 (experts) | 1 | 48 | 2.1% |
| Full model (params) | ~1.7M | ~52M | ~3.3% |
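The percentages follow directly from the counts in the default config:

num_moes, num_experts = 6, 8
print(f"{1 / num_moes:.1%}")                   # 16.7% of MoE blocks active
print(f"{1 / (num_moes * num_experts):.1%}")   # 2.1% of the 48 experts active
print(f"{1.7e6 / 52e6:.1%}")                   # ~3.3% of parameters active per token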
Checkpoint layout for multi-layer models
When n_layer > 1, routers and expert directories are prefixed by layer:
model/
    layer_01_main_router.pt
    layer_01_moe_expert_001/
    layer_01_moe_expert_002/
    ...
    layer_02_main_router.pt
    layer_02_moe_expert_001/
    ...
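A small helper that reproduces this naming scheme, purely for illustration:

from pathlib import Path

def expected_layout(model_dir="model", n_layer=2, num_moes=6):
    # List per-layer router files and expert directories following the scheme above
    paths = []
    for layer in range(1, n_layer + 1):
        paths.append(Path(model_dir) / f"layer_{layer:02d}_main_router.pt")
        paths += [Path(model_dir) / f"layer_{layer:02d}_moe_expert_{m:03d}"
                  for m in range(1, num_moes + 1)]
    return paths

print(expected_layout())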
Training Details
- Optimiser: AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms
- LR schedule: Cosine decay with 100-step linear warmup, minimum LR = 10% of peak (see the sketch after this list)
- Gradient clipping: 1.0
- Weight tying: Token embedding matrix and LM head share weights
- Routing: Hard top-1 at both levels (no auxiliary load-balancing loss required)
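A sketch of this setup; splitting the parameter groups by tensor dimensionality is an assumption about how train_himoe.py distinguishes matrices from biases and norms:

import math
import torch

def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
    # Decay matrix parameters only; biases and norm parameters get no weight decay
    decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr)

def lr_at(step, peak_lr=3e-4, max_iters=3000, warmup=100, min_ratio=0.1):
    # 100-step linear warmup, then cosine decay down to 10% of the peak LR
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_iters - warmup)
    return peak_lr * (min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))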
Modular Deployment
Because every component is a separate file, you can:
Load only what you need:
import torch
# Load just one expert for inspection or fine-tuning
expert_weights = torch.load("model/moe_expert_003/model_005.pt")
Swap a router:
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
Fine-tune a single MoE block without touching the backbone or other experts.
Add a new expert by saving a new model_009.pt and retraining only the corresponding router (see the sketch below).
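For instance, adding a ninth expert could look like this (the Linear-GELU-Linear shape with a 4x hidden expansion is an assumed expert layout, not confirmed by the script):

import torch
import torch.nn as nn

# Assumed expert shape: a standard Linear-GELU-Linear FFN with 4x expansion;
# check the actual expert module in train_himoe.py before relying on this.
n_embd = 256
new_expert = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                           nn.Linear(4 * n_embd, n_embd))
torch.save(new_expert.state_dict(), "model/moe_expert_003/model_009.pt")
# The Level-2 gate for this block (model/moe_expert_003/router.pt) then needs one
# extra output and a short retraining pass before the new expert will be selected.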
Output Files
After training completes:
| File | Contents |
|---|---|
| model/sample.txt | 400-token generation from a blank context |
| model/routing_log.json | Per-token (MoE, expert) routing decisions for the first 50 generated tokens |
| model/config.json | Full config + vocabulary + last saved step |
The training loop also prints an expert utilisation summary: a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts.
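The routing log is plain JSON and can be analysed offline; a sketch that assumes each entry is a (MoE block, expert) pair (the exact structure is defined by train_himoe.py):

import json
from collections import Counter

with open("model/routing_log.json") as f:
    decisions = json.load(f)

# Tally how often each (MoE block, expert) pair appears among the logged tokens
usage = Counter(tuple(d) for d in decisions)
for (moe, expert), count in usage.most_common():
    print(f"MoE {moe} / expert {expert}: {count} tokens")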
Paper
A full write-up of the architecture, sparsity analysis, and experiments is included as himoe_paper.pdf.
Citation
@misc{himoe2026,
title = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling},
author = {AG},
year = {2026}
}