# Kimi-K2-Instruct Vindex
A vindex (vector index of learned features) for moonshotai/Kimi-K2-Instruct, Moonshot AI's 1T-parameter Mixture-of-Experts transformer.
Built with LarQL using the MoE-aware vindex builder at `notebooks/moe_vindex_builder.py`.
Status (2026-04-23): Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8×H100) pending. The core finding (a flat SVD spectrum consistent with 1-bit models) is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
This card updates in-place as phases land. See changelog at the bottom.
## What this is
This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices: specifically, the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd, and the singular-value statistics are aggregated into a queryable feature database (a minimal query sketch follows the list below).
The vindex enables:
- C1–C5 universal constant measurement across the model
- Cross-architecture CKA (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- Feature-level entity association lookup (which expert/feature direction activates for a given input)
- Knowledge editing via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)
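Because the Phase 1 output is plain JSON, a layer-level dissolution check takes only a few lines. The top-level structure and key names below are assumptions based on the file descriptions in this card (e.g. `median_var64`), not a documented schema:

```python
import json

# Load the Phase 1 per-layer SVD statistics (schema assumed; see lead-in).
with open("phase1_moe_svd.json") as f:
    phase1 = json.load(f)

for layer in phase1["layers"]:  # hypothetical top-level key
    v = layer["median_var64"]   # median per-expert var@64 for this layer
    verdict = "dissolved" if v < 0.15 else "non-dissolved"  # bimodal-gap threshold
    print(f"L{layer['layer']:02d}  var@64={v:.3f}  {verdict}")
```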
## Key finding: fp8-native training causes spectral dissolution
Training precision, not storage precision, determines spectral structure.
| Model | Training precision | var@64 (median) | Spectral class |
|---|---|---|---|
| Gemma 4 E2B-it | fp16/bf16 | 0.041 | non-dissolved |
| Ministral-3B | fp16 → post-quant fp8 | ~0.85 | non-dissolved |
| Kimi-K2-Instruct | fp8 native | 0.088 (MoE, 6/61 layers) | dissolved |
| Bonsai 8B | fp16 → post-quant 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | 1-bit native | 0.111 | dissolved |
Dissolved: var@64 < 0.15 (a bimodal gap; no model yet observed between 0.15 and 0.50).
The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is also stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used at training time (gradient noise and limited numerical range) or applied after fp16 training (post-quantization, which preserves the fp16-shaped spectrum).
Hypothesis: Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism is untested.
What this does not claim:
- fp8-native training is not equivalent to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8: the spectra differ; behavioral quality is a separate question
- We have no data on fp4, bf16, or other low-precision training regimes
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof
Testable prediction: Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.
## What this is not
- This is not an inference endpoint. You cannot run generation with this artifact.
- This is not the full model weights, only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.
## Model architecture (Kimi-K2 specifics)
| Parameter | Value |
|---|---|
| Architecture | DeepSeek-V3 style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |
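The table above can be read straight out of the checkpoint's config.json. A minimal sketch, assuming the DeepSeek-V3-style field names that Kimi-K2 ships with (verify against the actual config):

```python
from transformers import AutoConfig

# Fetch the config only (no weights); Kimi-K2 uses custom modeling code.
cfg = AutoConfig.from_pretrained("moonshotai/Kimi-K2-Instruct",
                                 trust_remote_code=True)
print(cfg.num_hidden_layers)      # 61
print(cfg.hidden_size)            # 7168
print(cfg.n_routed_experts)       # 384
print(cfg.num_experts_per_tok)    # 8
print(cfg.n_shared_experts)       # 1
print(cfg.moe_intermediate_size)  # 2048
print(cfg.first_k_dense_replace)  # 1 (layer 0 is a dense MLP)
```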
## Vindex files

| File | Description |
|---|---|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight`, [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
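Phase 2 is a single (non-batched) SVD per layer over the router gate. A toy sketch of the statistic, with a random tensor standing in for the real [384, 7168] `router.weight`:

```python
import torch

gate_w = torch.randn(384, 7168)   # stand-in for one layer's router.weight
s = torch.linalg.svdvals(gate_w)  # 384 singular values, descending
var = s.pow(2)
print("router var@64:", (var[:64].sum() / var.sum()).item())
```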
## Universal constants (C1–C5): spot-check results (6 layers)
Phase 1 spot-check confirmed on Modal (L4 GPU); full 61-layer run in progress. Phases 2+3 pending.
| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|---|---|---|---|
| C1 | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| C2 | Top-8 output concentration | 99.7% (at MoE scale) | 99.7% † |
| C3 | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| C4 | Layer temperature (var@64) | 0.80–0.90 (fp16) | 0.037–0.10 ‡ |
| C5 | Circuit stage count | 4 (fp16) | pending (Phase 3) |
† *num_experts_per_tok: 8 confirmed from config.json.*
‡ Key finding: fp8-native training produces a flat weight spectrum. Spot-check results:
| Layer | Type | var@64 | IQR / note |
|---|---|---|---|
| L00 | Dense MLP | 0.037 | S[:3]=[10.7, 8.6, 8.1], flat spectrum |
| L01 | MoE (384 experts) | 0.082 | IQR=[0.076, 0.087] |
| L15 | MoE | 0.100 | IQR=[0.090, 0.110] |
| L30 | MoE | pending (full run) | – |
| L45 | MoE | pending | – |
| L60 | MoE | pending | – |
This result is surprising but genuine: the fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93, a realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 ≈ 0.80–0.90 with a clear power-law singular-value spectrum (S[0]/S[1] ≈ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 ≈ 1.24, i.e. nearly uniform.
Hypothesis: fp8 native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or due to the extreme training scale (14.7T tokens at 1T params).
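The pending constants (C1, C3, C5) come from the remaining phases. As an illustration of the Phase 3 style measurement, here is a hedged sketch of computing top-8 router mass (C2) from live routing. It assumes the forward pass accepts `output_router_logits=True` as described in this card and returns one [tokens, 384] logit tensor per MoE layer (the `router_logits` attribute name follows the Mixtral-style convention and is an assumption); substitute a small MoE for local testing, since Kimi-K2 itself needs 8×H100:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "moonshotai/Kimi-K2-Instruct"  # needs 8xH100; swap in a small MoE to test
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, device_map="auto", torch_dtype="auto", trust_remote_code=True)

ids = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model(**ids, output_router_logits=True)
for layer_logits in out.router_logits:    # assumed: [tokens, 384] per MoE layer
    p = layer_logits.sigmoid()            # Kimi-K2 scores experts with sigmoid
    p = p / p.sum(dim=-1, keepdim=True)   # normalize to a distribution
    c2 = p.topk(8, dim=-1).values.sum(-1).mean()
    print(f"top-8 mass: {c2.item():.4f}") # C2: expect ~0.997
```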
## MoE-specific methodology
Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE, each layer is handled as follows (code sketch after the list):
- All 384 expert `down_proj` matrices per layer are loaded: shape [hidden_size=7168, moe_intermediate_size=2048]
- Stacked into a batch tensor: [384, 7168, 2048]
- Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
- Per-expert var@64 values are aggregated: median, Q25, Q75 reported as the layer statistic
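A minimal sketch of that per-layer computation, assuming var@64 is the fraction of squared singular-value mass captured by the top 64 directions (the function name and returned keys are illustrative, not the builder's actual API):

```python
import torch

def layer_var64_stats(experts: torch.Tensor, top_k: int = 64,
                      chunk: int = 64) -> dict:
    """Batch-SVD stacked expert weights [384, 7168, 2048] and aggregate
    per-expert var@64 into median/Q25/Q75 for the layer."""
    vals = []
    for i in range(0, experts.shape[0], chunk):   # chunk to fit GPU VRAM
        batch = experts[i:i + chunk].float()      # SVD wants fp32/fp64
        s = torch.linalg.svdvals(batch)           # [chunk, 2048], descending
        var = s.pow(2)
        vals.append(var[:, :top_k].sum(-1) / var.sum(-1))
    v = torch.cat(vals)
    q = torch.quantile(v, torch.tensor([0.25, 0.5, 0.75], device=v.device))
    return {"q25": q[0].item(), "median_var64": q[1].item(), "q75": q[2].item()}
```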
fp8 handling: block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular value directions dominate). The singular value ratios and variance-fraction statistics are scale-invariant.
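Here is a minimal sketch of that cast, assuming the DeepSeek-V3-style layout in which `weight_scale_inv` stores one multiplier per [128, 128] weight block (names and ragged-edge handling are illustrative):

```python
import torch

def dequant_fp8_block(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Dequantize a block-quantized fp8 weight and cast to bf16 for SVD."""
    w = w_fp8.float()                   # fp8 -> fp32 for the multiply
    # Expand the per-block scale grid to the full weight shape.
    s = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    s = s[: w.shape[0], : w.shape[1]]   # trim padding on ragged edges
    return (w * s).to(torch.bfloat16)
```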
## Cross-architecture CKA prediction
Given that Gemma4-E2B ↔ Qwen3-8B achieves 99.2% CKA at matched normalized depth (entity layer, ~55%), we predict:
- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at the entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range
If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
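For reference, matched-depth CKA between two models reduces to a few lines once activations are collected at the same tokens. A generic linear-CKA sketch (not necessarily the exact CKA variant used in this series):

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between activations x: [n, d1] and y: [n, d2],
    collected at the same tokens and matched normalized depth."""
    x = x - x.mean(dim=0, keepdim=True)   # center features
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.T @ x).norm() ** 2          # ||Y^T X||_F^2
    denom = (x.T @ x).norm() * (y.T @ y).norm()
    return (hsic / denom).item()
```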
## Reproduce it
```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server

# Install Modal
pip install modal

# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60

# Full Phase 1: all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct

# Phase 2: router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2

# Phase 3: routing statistics (requires 8xH100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3

# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```
## License
CC-BY-NC 4.0: free for non-commercial research use. For commercial use, contact mike@divinci.ai.
## Citation

```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```
## Changelog
| Date | Update |
|---|---|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |
Part of the Interpretability Diaries research series. Working in public at github.com/Divinci-AI.