Kimi-K2-Instruct Vindex

A vindex (vector index of learned features) for moonshotai/Kimi-K2-Instruct β€” Moonshot AI's 1T-parameter Mixture-of-Experts transformer.

Built with LarQL using the MoE-aware vindex builder at notebooks/moe_vindex_builder.py.

Status (2026-04-23): Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8Γ—H100) pending. The core finding β€” flat SVD spectrum consistent with 1-bit models β€” is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.

This card updates in-place as phases land. See changelog at the bottom.

What this is

This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices β€” specifically the down_proj weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd and the singular value statistics are aggregated into a queryable feature database.

The vindex enables:

  • C1–C5 universal constant measurement across the model
  • Cross-architecture CKA (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
  • Feature-level entity association lookup (which expert/feature direction activates for a given input)
  • Knowledge editing via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)

Key finding: fp8-native training causes spectral dissolution

Training precision, not storage precision, determines spectral structure.

Model Training precision var@64 (median) Spectral class
Gemma 4 E2B-it fp16/bf16 0.041 non-dissolved
Ministral-3B fp16 β†’ post-quant fp8 ~0.85 non-dissolved
Kimi-K2-Instruct fp8 native 0.088 (MoE, 6/61 layers) dissolved
Bonsai 8B fp16 β†’ post-quant 1-bit 0.093 dissolved
BitNet b1.58-2B-4T 1-bit native 0.111 dissolved

Dissolved: var@64 < 0.15 (bimodal gap β€” no model yet observed between 0.15 and 0.50).

The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is also stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used at training time (gradient noise and limited numerical range) or applied after fp16 training (post-quantization, which preserves the fp16-shaped spectrum).

Hypothesis: Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented β€” so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); mechanism is untested.

What this does not claim:

  • fp8-native training is not equivalent to 1-bit in capability or behavior
  • Post-quantized fp8 is not "worse" than native fp8 β€” the spectra differ, behavioral quality is a separate question
  • We have no data on fp4, bf16, or other low-precision training regimes
  • n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof

Testable prediction: Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.

What this is not

  • This is not an inference endpoint. You cannot run generation with this artifact.
  • This is not the full model weights β€” only SVD statistics and feature metadata.
  • Phase 3 routing stats (live inference with output_router_logits=True) are stored separately.

Model architecture (Kimi-K2 specifics)

Parameter Value
Architecture DeepSeek-V3 style MoE
Total parameters ~1T
Active parameters per forward pass ~32B
Layers 61
Hidden size 7168
Routed experts per layer 384
Top-K routing 8 (num_experts_per_tok=8)
Shared experts per layer 1
MoE intermediate size 2048
First K dense layers 1 (layer 0 is dense MLP)
Weight precision fp8 block-quantized (weight_block_size=[128,128])
Scoring function sigmoid

Vindex files

File Description
phase1_moe_svd.json Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios)
phase1_moe_svd_agg.json Aggregated C1–C4 constants across all layers
phase2_router_svd.json Router gate weight SVD per layer (router.weight [384, 7168])
phase3_routing_stats.json Live routing statistics from output_router_logits=True inference (256 diverse prompts)
moe_config.json Detected MoE architecture config (expert layout, layer types, routing params)

Universal constants (C1–C5) β€” spot-check results (6 layers)

Phase 1 confirmed on Modal L4 (all 61 layers running). Phase 2+3 pending.

Constant Description Expected (fp16 dense) Kimi-K2 (MoE)
C1 FFN activation sparsity 0.06–0.39 pending (Phase 3)
C2 Top-8 output concentration 99.7% (at MoE scale) 99.7%†
C3 Gate coherence 0.53–0.81 pending (Phase 1 full)
C4 Layer temperature (var@64) 0.80–0.90 (fp16) 0.037–0.10 ‑
C5 Circuit stage count 4 (fp16) pending (Phase 3)

†*num_experts_per_tok: 8 confirmed from config.json.*

‑Key finding: fp8-native training produces a flat weight spectrum. Spot-check results:

Layer Type var@64 IQR / note
L00 Dense MLP 0.037 S[:3]=[10.7, 8.6, 8.1] β€” flat spectrum
L01 MoE (384 experts) 0.082 IQR=[0.076, 0.087]
L15 MoE 0.100 IQR=[0.090, 0.110]
L30 MoE pending (full run) β€”
L45 MoE pending β€”
L60 MoE pending β€”

This result is surprising and genuine. The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93 β€” realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.

For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 β‰ˆ 0.80–0.90 with a clear power-law singular value spectrum (S[0]/S[1] β‰ˆ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 β‰ˆ 1.24 β€” nearly uniform.

Hypothesis: fp8 native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or due to the extreme training scale (14.7T tokens at 1T params).

MoE-specific methodology

Standard vindex builders (for dense models) SVD a single down_proj matrix per layer. For Kimi-K2's MoE:

  1. All 384 expert down_proj matrices per layer are loaded: shape [hidden_size=7168, moe_intermediate_size=2048]
  2. Stacked into a batch tensor: [384, 7168, 2048]
  3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
  4. Per-expert var@64 values are aggregated: median, Q25, Q75 reported as the layer statistic

fp8 handling: block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular value directions dominate). The singular value ratios and variance-fraction statistics are scale-invariant.

Cross-architecture CKA prediction

Given that Gemma4-E2B ↔ Qwen3-8B achieves 99.2% CKA at matched normalized depth (entity layer, ~55%), we predict:

  • Kimi-K2 ↔ Gemma4: ~0.97–0.99 at entity layer (if the four-stage circuit is intact)
  • Kimi-K2 ↔ Qwen3: similar range

If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.

Reproduce it

# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server

# Install Modal
pip install modal

# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60

# Full Phase 1 β€” all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct

# Phase 2 β€” router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2

# Phase 3 β€” routing statistics (requires 8Γ—H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3

# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/

License

CC-BY-NC 4.0 β€” free for non-commercial research use. For commercial use, contact mike@divinci.ai.

Citation

@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}

Changelog

Date Update
2026-04-23 Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running.

Part of the Interpretability Diaries research series. Working in public at github.com/Divinci-AI.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Divinci-AI/kimi-k2-vindex

Finetuned
(17)
this model