Qwen3.6-27B Β· Paper-Grade Sparse Autoencoders
Three TopK SAEs trained in parallel on the Qwen3.6-27B residual stream at L11 · L31 · L55. Gemma-Scope-27B-parity recipe, 200M tokens per layer, AuxK dead-feature mitigation, sae_lens-compatible export.
Training complete: 200M tokens per layer · 2026-04-24
The flagship SAE of the OpenInterpretability ecosystem
This is the third-tier "paper-grade" notebook output from the OpenInterpretability training ladder. The same notebook is public at OpenInterpretability/notebooks/03_papergrade_qwen36_27b_cloud.ipynb; anyone with a 96 GB GPU can reproduce this run in ~35h for ~$30-60.
Final metrics @ 200M tokens
Held-out validation (1M tokens, official eval)
| Metric | L11 | L31 | L55 |
|---|---|---|---|
| var_exp | 0.8433 | 0.7135 | 0.8157 |
| L0 (mean TopK) | 128.0 | 128.0 | 128.0 |
| Alive features | 65 174 | 64 138 | 58 204 |
| Dead fraction | 0.55% | 2.13% | 11.19% |
| d_sae | 65 536 | 65 536 | 65 536 |
| k (TopK) | 128 | 128 | 128 |
| k_aux (AuxK) | 2 560 | 2 560 | 2 560 |
| α_aux | 1/32 | 1/32 | 1/32 |
Headline: L11 beat projections (0.8433 vs the 0.83-0.86 target range). L31 sits at the lower end: middle-stack residuals need a larger dictionary for the same reconstruction quality (consistent with the Lieberum et al. Gemma-Scope findings). L55's held-out dead fraction is 11.19%, reported honestly; the training-time dead fraction was lower (6.81%) because the held-out eval uses a 1M-token window, so features that did not fire in that window are counted dead.
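For reference, var_exp here is the explained variance of the SAE reconstruction over held-out activations. A minimal sketch of one common convention (illustrative only; the official numbers come from `val_report.json`):

```python
import torch

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of activation variance captured by the SAE reconstruction.

    x, x_hat: [n_tokens, d_model] residual-stream activations and reconstructions.
    """
    err_var = (x - x_hat).pow(2).sum()
    total_var = (x - x.mean(dim=0)).pow(2).sum()
    return (1.0 - err_var / total_var).item()
```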
Architecture spec
- Base model: Qwen/Qwen3.6-27B (dense, 64 layers, d_model=5120)
- SAE type: TopK (Gao et al. 2024) with AuxK auxiliary loss
- Dictionary size: n = 65 536 (≈13× expansion factor, Gemma-Scope-27B parity)
- Sparsity: k = 128
- Dead-feature mitigation: AuxK with k_aux = d_model/2 = 2 560, α = 1/32, dead threshold = 10M tokens (loss term sketched after this list)
- Initialization: W_dec rows unit-normed, W_enc = W_dec.T (tied init), b_dec = Weiszfeld geometric median over 16 384 samples
- Decoder renorm: decoder feature directions projected to unit norm every optimizer step
- Hook site: residual stream AFTER the decoder block at layer N (post-resid)
- Data type: fp32 SAE params, bf16 base model
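A rough sketch of how the pieces above fit together in one loss computation: TopK encode/decode plus the AuxK term, which lets currently-dead features reconstruct the residual error so they keep receiving gradient. Names such as `dead_mask` are illustrative; the actual training loop lives in the public notebook.

```python
import torch
import torch.nn.functional as F

def topk_auxk_loss(sae, x, dead_mask, k_aux=2560, alpha_aux=1 / 32):
    """Illustrative loss for a TopK SAE with AuxK dead-feature mitigation.

    x:         [batch, d_model] residual-stream activations
    dead_mask: [n] bool, True for features that have not fired in 10M tokens
    """
    z = sae.encode(x)                       # TopK-sparse codes, k = 128
    x_hat = sae.decode(z)
    recon_loss = F.mse_loss(x_hat, x)

    # AuxK: the dead features try to explain the reconstruction error.
    err = x - x_hat
    pre = (x - sae.b_dec) @ sae.W_enc + sae.b_enc
    pre_dead = pre.masked_fill(~dead_mask, float("-inf"))
    k_dead = min(k_aux, int(dead_mask.sum()))
    aux_loss = torch.zeros((), device=x.device)
    if k_dead > 0:
        top_v, top_i = pre_dead.topk(k_dead, dim=-1)
        z_aux = torch.zeros_like(pre)
        z_aux.scatter_(-1, top_i, F.relu(top_v))
        err_hat = z_aux @ sae.W_dec         # no b_dec: reconstruct the error only
        aux_loss = F.mse_loss(err_hat, err)

    return recon_loss + alpha_aux * aux_loss
```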
Training details
- Token budget: 200M per layer (2× the minimum-viable floor)
- Corpus: 70% FineWeb-Edu sample-10BT + 20% OpenThoughts-114k + 10% OpenMathInstruct-2 (reasoning-aware mix)
- Sequences: 1 024 tokens, packed from the stream, batch of 2 sequences per forward pass
- Shared forward pass: one Qwen3.6-27B forward → three SAE training steps simultaneously (one per layer). Cuts activation-extraction compute 3×.
- Optimizer: Adam (β=0.9/0.999, ε=1e-8), grad clip 1.0
- LR schedule: 5 000 warmup steps → cosine decay, peak 2e-4 → floor 6e-5 (sketched after this list)
- Checkpoint cadence: every 10M tokens to this HF repo (crash-safe; a crash loses ≤10 min)
- Hardware: single NVIDIA RTX 6000 Pro Blackwell 96 GB
- Wall-time (actual): ~35h end-to-end (overran the projected 22-27h by ~50%: Colab throughput variance plus GDN fallback cost on the dense 27B; one kernel crash at 188M tokens resumed cleanly from the 180M checkpoint)
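A sketch of the schedule above as a `LambdaLR`. The total step count is an assumption derived from 200M tokens at 2 × 1 024 tokens per step (~98k steps), not a number taken from the run log.

```python
import math
import torch

PEAK_LR, FLOOR_LR, WARMUP_STEPS = 2e-4, 6e-5, 5_000
TOTAL_STEPS = 98_000  # assumption: 200M tokens / (2 seqs * 1024 tokens per step)

def lr_lambda(step: int) -> float:
    """Multiplier on PEAK_LR: linear warmup, then cosine decay to the floor."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return (FLOOR_LR + (PEAK_LR - FLOOR_LR) * cosine) / PEAK_LR

# Usage with the Adam settings from the list above:
# opt = torch.optim.Adam(sae.parameters(), lr=PEAK_LR, betas=(0.9, 0.999), eps=1e-8)
# sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# ... torch.nn.utils.clip_grad_norm_(sae.parameters(), 1.0) before each opt.step()
```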
var_exp trajectory (full log)
| layer | 10M | 40M | 70M | 90M | 200M |
|---|---|---|---|---|---|
| L11 | 0.716 | 0.794 | 0.815 | 0.825 | 0.842 |
| L31 | 0.573 | 0.653 | 0.675 | 0.683 | 0.706 |
| L55 | 0.679 | 0.760 | 0.778 | 0.787 | 0.808 |
The last 110M tokens (90M → 200M) added +0.017 / +0.023 / +0.021 var_exp respectively: diminishing but non-zero returns. L31 was still climbing and would have benefited from additional tokens; we stopped at the original 200M budget for reproducibility.
Dead-feature story (AuxK recovery)
Training-time dead-feature counts per milestone:
| layer | 40M | 70M | 90M | 200M | Δ (90M → 200M) |
|---|---|---|---|---|---|
| L11 | 159 | 160 | 227 | 30 | -197 (recovered) |
| L31 | 39 | 511 | 1 161 | 319 | -842 (recovered) |
| L55 | 147 | 2 563 | 2 633 | 4 465 | +1 832 (grew) |
L11 and L31 recovered dead features over the last 110M tokens, with AuxK reviving them as the LR decayed through the cosine schedule. L55's dead count continued to grow, but more slowly than its pre-90M pace. At 90M we flagged L31 as a concern; that turned out to be a mid-training transient resolved by AuxK.
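The counts above follow from a simple counter: a feature is dead once it has gone 10M tokens without firing (the threshold from the architecture spec). An illustrative sketch of that bookkeeping, not the notebook's exact code:

```python
import torch

N_FEATURES, DEAD_THRESHOLD_TOKENS = 65_536, 10_000_000

tokens_since_fired = torch.zeros(N_FEATURES, dtype=torch.long)

def update_dead_mask(z: torch.Tensor) -> torch.Tensor:
    """z: [n_tokens, n_features] TopK codes for one batch.

    Returns a bool mask of features currently considered dead.
    """
    global tokens_since_fired
    n_tokens = z.shape[0]
    fired = (z != 0).any(dim=0)        # which features fired in this batch
    tokens_since_fired += n_tokens     # every feature ages by the batch size
    tokens_since_fired[fired] = 0      # firing resets the counter
    return tokens_since_fired >= DEAD_THRESHOLD_TOKENS
```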
Files
| File | Size | Purpose |
|---|---|---|
| `sae_L{11,31,55}_latest.safetensors` | 2.68 GB each | Weights in sae_lens format (W_enc, W_dec, b_enc, b_dec) |
| `sae_L{11,31,55}_cfg.json` | ~300 B | Architecture + hyperparameters + hook_name |
| `val_report.json` | – | Held-out 1M-token validation numbers (the table above) |
Total: ~8 GB for the full three-layer release. Training optimizer states (resume.pt) were removed on release cleanup; if you need them to continue training, contact us.
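The JSON sidecars can be fetched the same way as the weights; a small sketch (inspect the parsed dicts rather than relying on any particular key names):

```python
import json
from huggingface_hub import hf_hub_download

REPO = "caiovicentino1/qwen36-27b-sae-papergrade"

with open(hf_hub_download(REPO, "sae_L11_cfg.json")) as f:
    cfg = json.load(f)   # architecture + hyperparameters + hook_name
with open(hf_hub_download(REPO, "val_report.json")) as f:
    val = json.load(f)   # the held-out validation table above

print(cfg)
print(val)
```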
Usage
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch
import torch.nn as nn
import torch.nn.functional as F

LAYER = 11  # pick 11, 31, or 55
REPO = "caiovicentino1/qwen36-27b-sae-papergrade"

# Load weights
weights_path = hf_hub_download(REPO, f"sae_L{LAYER}_latest.safetensors")
weights = load_file(weights_path)

# Reconstruct the SAE
class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=65536, k=128):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(d_in, n))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.W_dec = nn.Parameter(torch.empty(n, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.k = k

    def encode(self, x):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(self.k, dim=-1)
        z = torch.zeros_like(pre)
        z.scatter_(-1, top_i, F.relu(top_v))
        return z

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

sae = TopKSAE()
sae.load_state_dict(weights, strict=True)
sae.eval()

# Extract activations from Qwen3.6-27B and run through the SAE
from transformers import AutoTokenizer, AutoModelForImageTextToText

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
    trust_remote_code=True,
)
# hook the residual at layer N and feed through `sae`
```
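The last comment is the part the notebook automates; here is a minimal sketch of doing it by hand, continuing from the snippet above (`tok`, `model`, `sae`, `LAYER` already defined): register a forward hook on the decoder block at `LAYER`, capture the post-block residual, and run it through `sae`. The `model.model.layers[...]` path is an assumption about this checkpoint's module layout.

```python
captured = {}

def grab_resid(module, inputs, output):
    # Most HF decoder blocks return a tuple whose first element is the
    # post-block hidden state, i.e. the post-resid stream this SAE was trained on.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden.detach()

# Assumption: the decoder blocks live at model.model.layers[i]
handle = model.model.layers[LAYER].register_forward_hook(grab_resid)

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)
handle.remove()

acts = captured["resid"].squeeze(0).float().cpu()  # [seq, 5120]; the SAE is fp32 on CPU
z = sae.encode(acts)                               # [seq, 65536] sparse codes, <=128 active each
recon = sae.decode(z)
print("active features per token:", (z != 0).sum(-1).tolist())
```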
For a full working example, including the prompt-to-trace pipeline, see `05_build_shareable_trace.ipynb`.
InterpScore v0.0.1 (composite eval)
Computed via notebooks/18b_interpscore_qwen36_27b_papergrade.ipynb on 250k held-out C4 tokens, with probes on SetFit/toxic_conversations + sst2 and TPP at 0.5% of the dictionary (k=327).
| Layer | InterpScore | loss_recovered | alive | l0_score | sparse_probing | tpp |
|---|---|---|---|---|---|---|
| L11 | 0.7788 | 0.998 | 0.999 | 0.625 | 0.862 | 0.135 |
| L31 | 0.7600 | 0.994 | 0.892 | 0.625 | 0.867 | 0.117 |
| L55 | 0.7507 | 0.988 | 0.780 | 0.625 | 0.829 | 0.242 |
Weights (v0.0.1): 0.30 × loss_recovered + 0.15 × alive + 0.15 × l0_score + 0.25 × sparse_probing + 0.15 × tpp. See openinterp.org/interpscore for the formula.
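The composite is just that weighted sum; a tiny sketch reproducing the L11 row from the component values in the table:

```python
WEIGHTS = {"loss_recovered": 0.30, "alive": 0.15, "l0_score": 0.15,
           "sparse_probing": 0.25, "tpp": 0.15}

def interpscore(components: dict) -> float:
    """InterpScore v0.0.1: weighted sum of the five component scores."""
    return sum(w * components[name] for name, w in WEIGHTS.items())

l11 = {"loss_recovered": 0.998, "alive": 0.999, "l0_score": 0.625,
       "sparse_probing": 0.862, "tpp": 0.135}
print(interpscore(l11))  # ~0.7788, the L11 row above
```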
Notable findings:
- `loss_recovered` saturated at 0.988-0.998 across all three layers, higher than the published Gemma-Scope-9b reference (0.95). The SAE preserves nearly all of the model's predictive information.
- L55 has the highest TPP (0.242) despite lower `alive`: late-stack features are more concentrated (ablating the top 0.5% drops the toxicity probe AUROC by 24.2%).
- L31 (middle stack) shows the lowest `loss_recovered` of the three but the highest sparse_probing AUROC: middle-layer residuals encode concepts well even when reconstruction is harder, consistent with Lieberum / Gemma-Scope findings.
Per-layer JSON reports: interpscore_L11.json · interpscore_L31.json · interpscore_L55.json · combined interpscore_papergrade.json.
Honest caveats
- Wall time overran ~50% (35h vs projected 22-27h): Colab throughput variance + GDN fallback cost
- L31 var_exp (0.706) is noticeably lower than L11 (0.842): middle-layer residuals are harder in the dense 27B (consistent with Lieberum / Gemma-Scope)
- L55 training-time dead fraction is 6.8% (11.19% on the 1M-token held-out window), reported honestly, not hidden. Below the Gao et al. 2024 baseline of 7% at n=16M, but higher than L11/L31
- One Colab kernel crash at ~188M tokens resumed cleanly from the 180M checkpoint: the resume tooling worked as designed
Reproduction
This exact SAE can be retrained end-to-end via the public notebook `OpenInterpretability/notebooks/03_papergrade_qwen36_27b_cloud.ipynb`; a quick pre-flight check is sketched after the list below.
Requirements:
- β₯96 GB VRAM (RTX 6000 Pro / H100 80GB with some offload / B200)
- `HF_TOKEN` env var with write access
- ~35h wall time
- ~$30-60 in cloud rental (Vast.ai / Lambda / RunPod)
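A quick pre-flight check before launching the notebook (a sketch; the exact VRAM headroom needed depends on offload settings):

```python
import os
import torch

assert os.environ.get("HF_TOKEN"), "set HF_TOKEN (write access) before launching"
assert torch.cuda.is_available(), "a CUDA GPU is required"

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, {vram_gb:.0f} GB VRAM")
if vram_gb < 90:
    print("warning: below ~96 GB; expect to need offload (e.g. an 80 GB H100)")
```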
After you have the SAEs
- Discover features → auto-label with Claude / GPT-4
- Build a Trace → ship to the Trace Theater
- Steer the model → live feature intervention (see the sketch below)
- Stage Gate G1 → validate a feature pack against GSM8K
- InterpScore eval → compute a public leaderboard score
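For the steering item above, a minimal sketch of a live feature intervention, continuing from the Usage snippet (`tok`, `model`, `sae`, `LAYER` already defined): add a scaled decoder direction for one feature into the residual stream at the hook layer. `FEATURE_ID`, `STRENGTH`, and the `model.model.layers[...]` path are all illustrative assumptions.

```python
FEATURE_ID, STRENGTH = 12345, 8.0   # illustrative values, not from this release
direction = sae.W_dec[FEATURE_ID].detach().to(model.device, dtype=torch.bfloat16)

def steer(module, inputs, output):
    # Add the feature direction to every token's post-block residual.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = tok("Tell me about Paris.", return_tensors="pt").to(model.device)
out = model.generate(**prompt, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```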
Citation
```bibtex
@misc{vicentino2026qwen27bpapergrade,
  author = {Vicentino, Caio and OpenInterpretability},
  title  = {Qwen3.6-27B Paper-Grade Sparse Autoencoders at L11/L31/L55},
  year   = {2026},
  url    = {https://huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade},
  note   = {OpenInterpretability project, 200M tokens, TopK + AuxK},
}
```
Related
- `qwen36-27b-sae-multilayer`: the n=4k precursor (3 SAEs trained on ~143k tokens each for rapid exploration)
- `qwen36-deepconf-probe`: +6pp SuperGPQA via probe-weighted majority voting
- `qwen36-feature-circuits`: honest negative, feature circuits failed replication
License
Apache-2.0. Base model under its own terms at Qwen/Qwen3.6-27B.
Built with the OpenInterpretability training stack · openinterp.org.