HiMoE: Hierarchical Mixture of Experts
A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.
Author: AG · Year: 2026
Overview
HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A Level-1 router selects one of N MoE blocks; that block's own Level-2 router selects one of M local experts. Only a single expert is ever activated per token, regardless of total model size.
Token
 └──► Level-1 Router (1 of 6 MoE blocks)
       └──► Level-2 Router (1 of 8 experts)
             └──► Expert FFN ──► output
With the default config (N=6, M=8, 2 layers) the model holds ~52M parameters but activates only ~3.3% of them per token, the compute footprint of a ~1.7M-parameter dense model.
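In code, the routing path sketched above boils down to two argmax gates; a minimal illustration (the module names and the 4x FFN expansion inside each expert are assumptions, not the actual train_himoe.py classes):

import torch
import torch.nn as nn

class HierarchicalRouting(nn.Module):
    # Illustrative two-level hard (top-1) routing; names do not mirror train_himoe.py
    def __init__(self, n_embd=256, num_moes=6, num_experts=8):
        super().__init__()
        self.level1 = nn.Linear(n_embd, num_moes)        # Level-1 gate over MoE blocks
        self.level2 = nn.ModuleList(                     # one Level-2 gate per MoE block
            [nn.Linear(n_embd, num_experts) for _ in range(num_moes)])
        self.experts = nn.ModuleList([                   # num_moes x num_experts expert FFNs
            nn.ModuleList([
                nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                              nn.Linear(4 * n_embd, n_embd))
                for _ in range(num_experts)])
            for _ in range(num_moes)])

    def forward(self, x):                                # x: (num_tokens, n_embd)
        out = torch.empty_like(x)
        for i, tok in enumerate(x):                      # token-by-token for clarity
            moe = self.level1(tok).argmax().item()       # choose 1 of num_moes blocks
            exp = self.level2[moe](tok).argmax().item()  # choose 1 of num_experts inside it
            out[i] = self.experts[moe][exp](tok)         # only this expert FFN executes
        return out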
Repository Structure
.
├── train_himoe.py           # Full training script (self-contained)
├── hamlet.txt               # Training corpus (place here before running)
├── README.md
└── model/                   # Created automatically on first save
    ├── config.json          # Hyperparameters + vocab snapshot
    ├── backbone.pt          # Embeddings, attention, LN, LM head
    ├── main_router.pt       # Level-1 gate (or layer_01_main_router.pt for n_layer > 1)
    ├── moe_expert_001/
    │   ├── router.pt        # Level-2 gate for this MoE block
    │   ├── model_001.pt
    │   ├── model_002.pt
    │   └── ... (model_008.pt)
    ├── moe_expert_002/
    │   └── ...
    ├── ...
    ├── moe_expert_006/
    ├── sample.txt           # Generated text after training
    └── routing_log.json     # Expert attribution for first 50 tokens
Each learnable component lives in its own file, making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model.
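For example, the config snapshot and backbone can be loaded on their own, without touching any expert (file paths come from the layout above; the exact keys inside config.json are defined by train_himoe.py):

import json
import torch

# Inspect the saved hyperparameters and load the shared backbone in isolation
with open("model/config.json") as f:
    config = json.load(f)
backbone_state = torch.load("model/backbone.pt", map_location="cpu")
print(config)                      # hyperparameters + vocab snapshot
print(list(backbone_state)[:5])    # first few backbone tensor names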
Quickstart
1. Install dependencies
pip install torch
PyTorch is the only third-party dependency; everything else is standard library.
2. Add your data
Place hamlet.txt (or any plain-text corpus) in the same directory as train_himoe.py.
3. Train
python train_himoe.py
Checkpoints are saved to model/ every eval_interval steps and at the end of training. A sample generation and routing log are written automatically.
4. Resume training
python train_himoe.py --resume
5. Custom config
All hyperparameters are overridable from the command line:
python train_himoe.py \
    --num_moes 8 \
    --num_experts 16 \
    --n_embd 512 \
    --n_layer 4 \
    --max_iters 10000 \
    --lr 2e-4 \
    --data_file my_corpus.txt \
    --model_dir checkpoints/run_01
Architecture
HiMoEConfig defaults
| Parameter | Default | Description |
|---|---|---|
| n_embd | 256 | Embedding / hidden dimension |
| n_layer | 2 | Number of Transformer layers |
| n_head | 4 | Attention heads |
| block_size | 128 | Context window (tokens) |
| num_moes | 6 | Level-1 choices (MoE blocks) |
| num_experts | 8 | Level-2 choices per MoE block |
| dropout | 0.1 | Dropout rate |
| batch_size | 32 | Training batch size |
| max_iters | 3000 | Training steps |
| lr | 3e-4 | Peak learning rate |
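For reference, these defaults map onto a plain config object; a minimal sketch (the actual HiMoEConfig in train_himoe.py may carry additional fields such as eval_interval):

from dataclasses import dataclass

@dataclass
class HiMoEConfig:
    n_embd: int = 256        # embedding / hidden dimension
    n_layer: int = 2         # Transformer layers
    n_head: int = 4          # attention heads
    block_size: int = 128    # context window (tokens)
    num_moes: int = 6        # Level-1 choices (MoE blocks)
    num_experts: int = 8     # Level-2 choices per MoE block
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4         # peak learning rate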
Sparsity
| Routing Level | Active | Total | % Active |
|---|---|---|---|
| Level-1 (MoE blocks) | 1 | 6 | 16.7% |
| Level-2 (experts) | 1 | 48 | 2.1% |
| Full model (params) | ~1.7M | ~52M | ~3.3% |
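The percentages follow directly from the counts in the default config:

num_moes, num_experts = 6, 8
print(f"{1 / num_moes:.1%}")                   # 16.7% of MoE blocks active
print(f"{1 / (num_moes * num_experts):.1%}")   # 2.1% of the 48 experts active
print(f"{1.7e6 / 52e6:.1%}")                   # ~3.3% of parameters active per token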
Checkpoint layout for multi-layer models
When n_layer > 1, routers and expert directories are prefixed by layer:
model/
    layer_01_main_router.pt
    layer_01_moe_expert_001/
    layer_01_moe_expert_002/
    ...
    layer_02_main_router.pt
    layer_02_moe_expert_001/
    ...
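A small helper that reproduces this naming scheme, purely for illustration:

from pathlib import Path

def expected_layout(model_dir="model", n_layer=2, num_moes=6):
    # List per-layer router files and expert directories following the scheme above
    paths = []
    for layer in range(1, n_layer + 1):
        paths.append(Path(model_dir) / f"layer_{layer:02d}_main_router.pt")
        paths += [Path(model_dir) / f"layer_{layer:02d}_moe_expert_{m:03d}"
                  for m in range(1, num_moes + 1)]
    return paths

print(expected_layout())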
Training Details
- Optimiser: AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms
- LR schedule: Cosine decay with 100-step linear warmup, minimum LR = 10% of peak (see the sketch after this list)
- Gradient clipping: 1.0
- Weight tying: Token embedding matrix and LM head share weights
- Routing: Hard top-1 at both levels (no auxiliary load-balancing loss required)
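A sketch of this setup; splitting the parameter groups by tensor dimensionality is an assumption about how train_himoe.py distinguishes matrices from biases and norms:

import math
import torch

def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
    # Decay matrix parameters only; biases and norm parameters get no weight decay
    decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr)

def lr_at(step, peak_lr=3e-4, max_iters=3000, warmup=100, min_ratio=0.1):
    # 100-step linear warmup, then cosine decay down to 10% of the peak LR
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_iters - warmup)
    return peak_lr * (min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))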
Modular Deployment
Because every component is a separate file, you can:
Load only what you need:
import torch
# Load just one expert for inspection or fine-tuning
expert_weights = torch.load("model/moe_expert_003/model_005.pt")
Swap a router:
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
Fine-tune a single MoE block without touching the backbone or other experts.
Add a new expert by saving a new model_009.pt and retraining only the corresponding router (see the sketch below).
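For instance, adding a ninth expert could look like this (the Linear-GELU-Linear shape with a 4x hidden expansion is an assumed expert layout, not confirmed by the script):

import torch
import torch.nn as nn

# Assumed expert shape: a standard Linear-GELU-Linear FFN with 4x expansion;
# check the actual expert module in train_himoe.py before relying on this.
n_embd = 256
new_expert = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                           nn.Linear(4 * n_embd, n_embd))
torch.save(new_expert.state_dict(), "model/moe_expert_003/model_009.pt")
# The Level-2 gate for this block (model/moe_expert_003/router.pt) then needs one
# extra output and a short retraining pass before the new expert will be selected.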
Output Files
After training completes:
| File | Contents |
|---|---|
| model/sample.txt | 400-token generation from a blank context |
| model/routing_log.json | Per-token (MoE, expert) routing decisions for the first 50 generated tokens |
| model/config.json | Full config + vocabulary + last saved step |
The training loop also prints an expert utilisation summary: a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts.
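The routing log is plain JSON and can be analysed offline; a sketch that assumes each entry is a (MoE block, expert) pair (the exact structure is defined by train_himoe.py):

import json
from collections import Counter

with open("model/routing_log.json") as f:
    decisions = json.load(f)

# Tally how often each (MoE block, expert) pair appears among the logged tokens
usage = Counter(tuple(d) for d in decisions)
for (moe, expert), count in usage.most_common():
    print(f"MoE {moe} / expert {expert}: {count} tokens")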
Paper
A full write-up of the architecture, sparsity analysis, and experiments is included as himoe_paper.pdf.
Citation
@misc{himoe2026,
title = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling},
author = {AG},
year = {2026}
}