HiMoE β€” Hierarchical Mixture of Experts

A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.

Author: AG  ·  Year: 2026


Overview

HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A Level-1 router selects one of N MoE blocks; that block's own Level-2 router selects one of M local experts. Only a single expert is ever activated per token, regardless of total model size.

Token
  └─► Level-1 Router  (1 of 6 MoE blocks)
          └─► Level-2 Router  (1 of 8 experts)
                  └─► Expert FFN  ──► output

With the default config (N=6, M=8, 2 layers) the model holds ~52M parameters but activates only ~3.3% of them per token, giving the compute footprint of a ~1.7M-parameter dense model.
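
The routing logic can be sketched in PyTorch roughly as follows. This is a minimal, forward-only illustration of hard top-1 routing at both levels; the class and attribute names (HierarchicalMoE, ExpertFFN, level1_router, level2_routers) are illustrative rather than the identifiers used in train_himoe.py, and the actual script may handle router gradients differently.

import torch
import torch.nn as nn

class ExpertFFN(nn.Module):
    # One expert: a small position-wise feed-forward network.
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class HierarchicalMoE(nn.Module):
    # Two-level hard top-1 routing: pick 1 of num_moes blocks, then 1 of num_experts.
    def __init__(self, n_embd=256, num_moes=6, num_experts=8):
        super().__init__()
        self.level1_router = nn.Linear(n_embd, num_moes)
        self.level2_routers = nn.ModuleList(
            [nn.Linear(n_embd, num_experts) for _ in range(num_moes)]
        )
        self.experts = nn.ModuleList(
            [nn.ModuleList([ExpertFFN(n_embd) for _ in range(num_experts)])
             for _ in range(num_moes)]
        )

    def forward(self, x):
        # x: (batch, seq, n_embd); each token is routed independently
        flat = x.reshape(-1, x.size(-1))
        out = torch.zeros_like(flat)
        moe_idx = self.level1_router(flat).argmax(dim=-1)            # Level-1 choice per token
        for b, block_experts in enumerate(self.experts):
            positions = (moe_idx == b).nonzero(as_tuple=True)[0]     # tokens sent to block b
            if positions.numel() == 0:
                continue
            exp_idx = self.level2_routers[b](flat[positions]).argmax(dim=-1)  # Level-2 choice
            for e, expert in enumerate(block_experts):
                sel = positions[exp_idx == e]
                if sel.numel() > 0:
                    out[sel] = expert(flat[sel])                     # only this expert runs
        return out.reshape_as(x)

# y = HierarchicalMoE()(torch.randn(2, 16, 256))   # (batch, seq, n_embd)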


Repository Structure

.
├── train_himoe.py       # Full training script (self-contained)
├── hamlet.txt           # Training corpus (place here before running)
├── README.md
└── model/               # Created automatically on first save
    ├── config.json                  # Hyperparameters + vocab snapshot
    ├── backbone.pt                  # Embeddings, attention, LN, LM head
    ├── main_router.pt               # Level-1 gate  (or layer_01_main_router.pt for n_layer > 1)
    ├── moe_expert_001/
    │   ├── router.pt                # Level-2 gate for this MoE block
    │   ├── model_001.pt
    │   ├── model_002.pt
    │   └── ...  (model_008.pt)
    ├── moe_expert_002/
    │   └── ...
    ├── ...
    ├── moe_expert_006/
    ├── sample.txt                   # Generated text after training
    └── routing_log.json             # Expert attribution for first 50 tokens

Each learnable component lives in its own file, making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model.
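
As an illustration of how this layout maps onto torch.save calls, a hypothetical saving routine might look like the following. The function name and the block attributes .router / .experts are assumptions made for the sketch, not code from train_himoe.py.

import os
import torch

def save_components(model_dir, backbone, main_router, moe_blocks):
    # Write each learnable component to its own file, mirroring the tree above.
    os.makedirs(model_dir, exist_ok=True)
    torch.save(backbone.state_dict(), os.path.join(model_dir, "backbone.pt"))
    torch.save(main_router.state_dict(), os.path.join(model_dir, "main_router.pt"))
    for b, block in enumerate(moe_blocks, start=1):
        block_dir = os.path.join(model_dir, f"moe_expert_{b:03d}")
        os.makedirs(block_dir, exist_ok=True)
        torch.save(block.router.state_dict(), os.path.join(block_dir, "router.pt"))
        for e, expert in enumerate(block.experts, start=1):
            torch.save(expert.state_dict(), os.path.join(block_dir, f"model_{e:03d}.pt"))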


Quickstart

1. Install dependencies

pip install torch

PyTorch is the only third-party dependency; everything else uses the Python standard library.

2. Add your data

Place hamlet.txt (or any plain-text corpus) in the same directory as train_himoe.py.

3. Train

python train_himoe.py

Checkpoints are saved to model/ every eval_interval steps and at the end of training. A sample generation and routing log are written automatically.

4. Resume training

python train_himoe.py --resume

5. Custom config

All hyperparameters are overridable from the command line:

python train_himoe.py \
  --num_moes 8 \
  --num_experts 16 \
  --n_embd 512 \
  --n_layer 4 \
  --max_iters 10000 \
  --lr 2e-4 \
  --data_file my_corpus.txt \
  --model_dir checkpoints/run_01

Architecture

HiMoEConfig defaults

Parameter     Default   Description
n_embd        256       Embedding / hidden dimension
n_layer       2         Number of Transformer layers
n_head        4         Attention heads
block_size    128       Context window (tokens)
num_moes      6         Level-1 choices (MoE blocks)
num_experts   8         Level-2 choices per MoE block
dropout       0.1       Dropout rate
batch_size    32        Training batch size
max_iters     3000      Training steps
lr            3e-4      Peak learning rate
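
These defaults correspond to a config object along the following lines. This is a sketch whose field names mirror the table; the actual HiMoEConfig in train_himoe.py may carry additional fields such as eval_interval or the vocabulary snapshot.

from dataclasses import dataclass

@dataclass
class HiMoEConfig:
    n_embd: int = 256         # embedding / hidden dimension
    n_layer: int = 2          # Transformer layers
    n_head: int = 4           # attention heads
    block_size: int = 128     # context window (tokens)
    num_moes: int = 6         # Level-1 choices (MoE blocks)
    num_experts: int = 8      # Level-2 choices per MoE block
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4          # peak learning rate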

Sparsity

Routing Level          Active   Total   % Active
Level-1 (MoE blocks)   1        6       16.7%
Level-2 (experts)      1        48      2.1%
Full model (params)    ~1.7M    ~52M    ~3.3%
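
The percentages follow directly from the default configuration; as a quick check:

num_moes, num_experts = 6, 8
print(f"{1 / num_moes:.1%}")                  # Level-1: 16.7% of MoE blocks active
print(f"{1 / (num_moes * num_experts):.1%}")  # Level-2: 2.1% of the 48 experts active
print(f"{1.7 / 52:.1%}")                      # ~3.3% of the ~52M parameters active per token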

Checkpoint layout for multi-layer models

When n_layer > 1, routers and expert directories are prefixed by layer:

model/
  layer_01_main_router.pt
  layer_01_moe_expert_001/
  layer_01_moe_expert_002/
  ...
  layer_02_main_router.pt
  layer_02_moe_expert_001/
  ...
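
A hypothetical naming helper that mirrors this convention (illustrative only, not taken from train_himoe.py):

def router_path(model_dir: str, layer: int, n_layer: int) -> str:
    # Single-layer models use plain names; multi-layer models get a layer prefix.
    prefix = f"layer_{layer:02d}_" if n_layer > 1 else ""
    return f"{model_dir}/{prefix}main_router.pt"

# router_path("model", 2, 4)  ->  "model/layer_02_main_router.pt"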

Training Details

  • Optimiser: AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms
  • LR schedule: Cosine decay with 100-step linear warmup, minimum LR = 10% of peak (sketched after this list)
  • Gradient clipping: 1.0
  • Weight tying: Token embedding matrix and LM head share weights
  • Routing: Hard top-1 at both levels (no auxiliary load-balancing loss required)
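
A minimal sketch of that schedule, assuming the warmup counts from step 0 and the decay runs to max_iters; the exact function in train_himoe.py may differ in the details.

import math

def lr_at(step, max_iters=3000, peak_lr=3e-4, warmup=100, min_ratio=0.1):
    # Linear warmup for the first `warmup` steps, then cosine decay to min_ratio * peak_lr.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_iters - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)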

Modular Deployment

Because every component is a separate file, you can:

Load only what you need:

import torch
# Load just one expert for inspection or fine-tuning
expert_weights = torch.load("model/moe_expert_003/model_005.pt")

Swap a router:

torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")

Fine-tune a single MoE block without touching the backbone or other experts.
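
For example, one could freeze everything except the chosen block before fine-tuning. The parameter-name substring below is an assumption made for the sketch; check model.named_parameters() for the names train_himoe.py actually uses.

import torch

def freeze_all_but(model: torch.nn.Module, keep_substring: str = "moe_expert_003"):
    # Freeze every parameter whose name does not contain keep_substring,
    # then return the remaining trainable parameters for the optimiser.
    for name, param in model.named_parameters():
        param.requires_grad = keep_substring in name
    return [p for p in model.parameters() if p.requires_grad]

# trainable = freeze_all_but(model)                  # model: the assembled HiMoE module
# optimizer = torch.optim.AdamW(trainable, lr=1e-4)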

Add a new expert by saving a new model_009.pt and retraining only the corresponding router.


Output Files

After training completes:

File                     Contents
model/sample.txt         400-token generation from a blank context
model/routing_log.json   Per-token (MoE, expert) routing decisions for the first 50 generated tokens
model/config.json        Full config + vocabulary + last saved step

The training loop also prints an expert utilisation summary: a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts.
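
The script's exact output format isn't reproduced here, but a chart of this kind can be rendered from the per-token routing decisions with a few lines (a sketch, not the script's actual code):

from collections import Counter

def print_utilisation(choices, width=40):
    # choices: iterable of (moe_idx, expert_idx) routing decisions, one per token
    counts = Counter(choices)
    total = sum(counts.values())
    for (moe, expert), n in sorted(counts.items()):
        bar = "#" * round(width * n / total)
        print(f"MoE {moe:>2} / expert {expert:>2} | {bar} {n}")

# print_utilisation([(0, 3), (0, 3), (1, 7), (5, 0)])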


Paper

A full write-up of the architecture, sparsity analysis, and experiments is included as himoe_paper.pdf.


Citation

@misc{himoe2026,
  title   = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling},
  author  = {AG},
  year    = {2026}
}