Qwen3.6-27B Β· Paper-Grade Sparse Autoencoders
Three TopK SAEs trained in parallel on the Qwen3.6-27B residual stream at L11 · L31 · L55. Gemma-Scope-27B-parity recipe, 200M tokens per layer, AuxK dead-feature mitigation, sae_lens-compatible export.
Training complete: 200M tokens per layer · 2026-04-24
The flagship SAE of the OpenInterpretability ecosystem
This is the third-tier "paper-grade" notebook output from the OpenInterpretability training ladder. The same notebook is public at OpenInterpretability/notebooks/03_papergrade_qwen36_27b_cloud.ipynb; anyone with a 96 GB GPU can reproduce this run in ~35h for ~$30-60.
Final metrics @ 200M tokens
Held-out validation (1M tokens, official eval)
| Metric | L11 | L31 | L55 |
|---|---|---|---|
| var_exp | 0.8433 | 0.7135 | 0.8157 |
| L0 (mean TopK) | 128.0 | 128.0 | 128.0 |
| Alive features | 65 174 | 64 138 | 58 204 |
| Dead fraction | 0.55% | 2.13% | 11.19% |
| d_sae | 65 536 | 65 536 | 65 536 |
| k (TopK) | 128 | 128 | 128 |
| k_aux (AuxK) | 2 560 | 2 560 | 2 560 |
| α_aux | 1/32 | 1/32 | 1/32 |
Headline: L11 beat projections (0.8433 vs the 0.83-0.86 target range). L31 sits at the lower end: middle-stack residuals need a larger dictionary for the same reconstruction quality (consistent with the Lieberum et al. Gemma-Scope findings). L55's held-out dead fraction is 11.19%, reported honestly; the training-time dead fraction was lower (6.81%) because the held-out eval uses a 1M-token window, so features that did not fire in that window are counted dead.
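For reference, var_exp here is the explained variance of the SAE reconstruction over held-out activations. A minimal sketch of one common convention (illustrative only; the official numbers come from `val_report.json`):

```python
import torch

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of activation variance captured by the SAE reconstruction.

    x, x_hat: [n_tokens, d_model] residual-stream activations and reconstructions.
    """
    err_var = (x - x_hat).pow(2).sum()
    total_var = (x - x.mean(dim=0)).pow(2).sum()
    return (1.0 - err_var / total_var).item()
```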
Architecture spec
- Base model: Qwen/Qwen3.6-27B (dense, 64 layers, d_model=5120)
- SAE type: TopK (Gao et al. 2024) with AuxK auxiliary loss
- Dictionary size: n = 65 536 (≈13× expansion factor, Gemma-Scope-27B parity)
- Sparsity: k = 128
- Dead-feature mitigation: AuxK with k_aux = d_model/2 = 2 560, α = 1/32, dead threshold = 10M tokens (loss term sketched after this list)
- Initialization: W_dec rows unit-normed, W_enc = W_dec.T (tied init), b_dec = Weiszfeld geometric median over 16 384 samples
- Decoder renorm: decoder feature directions projected to unit norm every optimizer step
- Hook site: residual stream AFTER the decoder block at layer N (post-resid)
- Data type: fp32 SAE params, bf16 base model
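A rough sketch of how the pieces above fit together in one loss computation: TopK encode/decode plus the AuxK term, which lets currently-dead features reconstruct the residual error so they keep receiving gradient. Names such as `dead_mask` are illustrative; the actual training loop lives in the public notebook.

```python
import torch
import torch.nn.functional as F

def topk_auxk_loss(sae, x, dead_mask, k_aux=2560, alpha_aux=1 / 32):
    """Illustrative loss for a TopK SAE with AuxK dead-feature mitigation.

    x:         [batch, d_model] residual-stream activations
    dead_mask: [n] bool, True for features that have not fired in 10M tokens
    """
    z = sae.encode(x)                       # TopK-sparse codes, k = 128
    x_hat = sae.decode(z)
    recon_loss = F.mse_loss(x_hat, x)

    # AuxK: the dead features try to explain the reconstruction error.
    err = x - x_hat
    pre = (x - sae.b_dec) @ sae.W_enc + sae.b_enc
    pre_dead = pre.masked_fill(~dead_mask, float("-inf"))
    k_dead = min(k_aux, int(dead_mask.sum()))
    aux_loss = torch.zeros((), device=x.device)
    if k_dead > 0:
        top_v, top_i = pre_dead.topk(k_dead, dim=-1)
        z_aux = torch.zeros_like(pre)
        z_aux.scatter_(-1, top_i, F.relu(top_v))
        err_hat = z_aux @ sae.W_dec         # no b_dec: reconstruct the error only
        aux_loss = F.mse_loss(err_hat, err)

    return recon_loss + alpha_aux * aux_loss
```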
Training details
- Token budget: 200M per layer (2× the minimum-viable floor)
- Corpus: 70% FineWeb-Edu sample-10BT + 20% OpenThoughts-114k + 10% OpenMathInstruct-2 (reasoning-aware mix)
- Sequences: 1 024 tokens, packed from the stream, batch of 2 sequences per forward pass
- Shared forward pass: one Qwen3.6-27B forward → three SAE training steps simultaneously (one per layer). Cuts activation-extraction compute 3×.
- Optimizer: Adam (β=0.9/0.999, ε=1e-8), grad clip 1.0
- LR schedule: 5 000 warmup steps → cosine decay, peak 2e-4 → floor 6e-5 (sketched after this list)
- Checkpoint cadence: every 10M tokens to this HF repo (crash-safe; a crash loses ≤10 min)
- Hardware: single NVIDIA RTX 6000 Pro Blackwell 96 GB
- Wall-time (actual): ~35h end-to-end (overran the projected 22-27h by ~50%: Colab throughput variance plus GDN fallback cost on the dense 27B; one kernel crash at 188M tokens resumed cleanly from the 180M checkpoint)
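A sketch of the schedule above as a `LambdaLR`. The total step count is an assumption derived from 200M tokens at 2 × 1 024 tokens per step (~98k steps), not a number taken from the run log.

```python
import math
import torch

PEAK_LR, FLOOR_LR, WARMUP_STEPS = 2e-4, 6e-5, 5_000
TOTAL_STEPS = 98_000  # assumption: 200M tokens / (2 seqs * 1024 tokens per step)

def lr_lambda(step: int) -> float:
    """Multiplier on PEAK_LR: linear warmup, then cosine decay to the floor."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return (FLOOR_LR + (PEAK_LR - FLOOR_LR) * cosine) / PEAK_LR

# Usage with the Adam settings from the list above:
# opt = torch.optim.Adam(sae.parameters(), lr=PEAK_LR, betas=(0.9, 0.999), eps=1e-8)
# sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# ... torch.nn.utils.clip_grad_norm_(sae.parameters(), 1.0) before each opt.step()
```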
var_exp trajectory (full log)
| layer | 10M | 40M | 70M | 90M | 200M |
|---|---|---|---|---|---|
| L11 | 0.716 | 0.794 | 0.815 | 0.825 | 0.842 |
| L31 | 0.573 | 0.653 | 0.675 | 0.683 | 0.706 |
| L55 | 0.679 | 0.760 | 0.778 | 0.787 | 0.808 |
The last 110M tokens (90M → 200M) added +0.017 / +0.023 / +0.021 var_exp respectively: diminishing but non-zero returns. L31 was still climbing and would have benefited from additional tokens; we stopped at the original 200M budget for reproducibility.
Dead-feature story (AuxK recovery)
Training-time dead-feature counts per milestone:
| layer | 40M | 70M | 90M | 200M | Δ (90M → 200M) |
|---|---|---|---|---|---|
| L11 | 159 | 160 | 227 | 30 | -197 (recovered) |
| L31 | 39 | 511 | 1 161 | 319 | -842 (recovered) |
| L55 | 147 | 2 563 | 2 633 | 4 465 | +1 832 (grew) |
L11 and L31 recovered dead features over the last 110M tokens, with AuxK reviving them as the LR decayed through the cosine schedule. L55's dead count continued to grow, but more slowly than its pre-90M pace. At 90M we flagged L31 as a concern; that turned out to be a mid-training transient resolved by AuxK.
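The counts above follow from a simple counter: a feature is dead once it has gone 10M tokens without firing (the threshold from the architecture spec). An illustrative sketch of that bookkeeping, not the notebook's exact code:

```python
import torch

N_FEATURES, DEAD_THRESHOLD_TOKENS = 65_536, 10_000_000

tokens_since_fired = torch.zeros(N_FEATURES, dtype=torch.long)

def update_dead_mask(z: torch.Tensor) -> torch.Tensor:
    """z: [n_tokens, n_features] TopK codes for one batch.

    Returns a bool mask of features currently considered dead.
    """
    global tokens_since_fired
    n_tokens = z.shape[0]
    fired = (z != 0).any(dim=0)        # which features fired in this batch
    tokens_since_fired += n_tokens     # every feature ages by the batch size
    tokens_since_fired[fired] = 0      # firing resets the counter
    return tokens_since_fired >= DEAD_THRESHOLD_TOKENS
```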
Files
| File | Size | Purpose |
|---|---|---|
| `sae_L{11,31,55}_latest.safetensors` | 2.68 GB each | Weights in sae_lens format (W_enc, W_dec, b_enc, b_dec) |
| `sae_L{11,31,55}_cfg.json` | ~300 B | Architecture + hyperparameters + hook_name |
| `val_report.json` | – | Held-out 1M-token validation numbers (the table above) |
Total: ~8 GB for the full three-layer release. Training optimizer states (resume.pt) were removed on release cleanup; if you need them to continue training, contact us.
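The JSON sidecars can be fetched the same way as the weights; a small sketch (inspect the parsed dicts rather than relying on any particular key names):

```python
import json
from huggingface_hub import hf_hub_download

REPO = "caiovicentino1/qwen36-27b-sae-papergrade"

with open(hf_hub_download(REPO, "sae_L11_cfg.json")) as f:
    cfg = json.load(f)   # architecture + hyperparameters + hook_name
with open(hf_hub_download(REPO, "val_report.json")) as f:
    val = json.load(f)   # the held-out validation table above

print(cfg)
print(val)
```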
Usage
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch
import torch.nn as nn
import torch.nn.functional as F

LAYER = 11  # pick 11, 31, or 55
REPO = "caiovicentino1/qwen36-27b-sae-papergrade"

# Load weights
weights_path = hf_hub_download(REPO, f"sae_L{LAYER}_latest.safetensors")
weights = load_file(weights_path)

# Reconstruct the SAE
class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=65536, k=128):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(d_in, n))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.W_dec = nn.Parameter(torch.empty(n, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.k = k

    def encode(self, x):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(self.k, dim=-1)
        z = torch.zeros_like(pre)
        z.scatter_(-1, top_i, F.relu(top_v))
        return z

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

sae = TopKSAE()
sae.load_state_dict(weights, strict=True)
sae.eval()

# Extract activations from Qwen3.6-27B and run through the SAE
from transformers import AutoTokenizer, AutoModelForImageTextToText

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
    trust_remote_code=True,
)
# hook the residual at layer N and feed through `sae`
```
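The last comment is the part the notebook automates; here is a minimal sketch of doing it by hand, continuing from the snippet above (`tok`, `model`, `sae`, `LAYER` already defined): register a forward hook on the decoder block at `LAYER`, capture the post-block residual, and run it through `sae`. The `model.model.layers[...]` path is an assumption about this checkpoint's module layout.

```python
captured = {}

def grab_resid(module, inputs, output):
    # Most HF decoder blocks return a tuple whose first element is the
    # post-block hidden state, i.e. the post-resid stream this SAE was trained on.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden.detach()

# Assumption: the decoder blocks live at model.model.layers[i]
handle = model.model.layers[LAYER].register_forward_hook(grab_resid)

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)
handle.remove()

acts = captured["resid"].squeeze(0).float().cpu()  # [seq, 5120]; the SAE is fp32 on CPU
z = sae.encode(acts)                               # [seq, 65536] sparse codes, <=128 active each
recon = sae.decode(z)
print("active features per token:", (z != 0).sum(-1).tolist())
```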
For a full working example, including the prompt-to-trace pipeline, see `05_build_shareable_trace.ipynb`.
InterpScore v0.0.1 (composite eval)
Computed via notebooks/18b_interpscore_qwen36_27b_papergrade.ipynb on 250k held-out C4 tokens, with probes on SetFit/toxic_conversations + sst2 and TPP at 0.5% of the dictionary (k=327).
| Layer | InterpScore | loss_recovered | alive | l0_score | sparse_probing | tpp |
|---|---|---|---|---|---|---|
| L11 | 0.7788 | 0.998 | 0.999 | 0.625 | 0.862 | 0.135 |
| L31 | 0.7600 | 0.994 | 0.892 | 0.625 | 0.867 | 0.117 |
| L55 | 0.7507 | 0.988 | 0.780 | 0.625 | 0.829 | 0.242 |
Weights (v0.0.1): 0.30 × loss_recovered + 0.15 × alive + 0.15 × l0_score + 0.25 × sparse_probing + 0.15 × tpp. See openinterp.org/interpscore for the formula.
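The composite is just that weighted sum; a tiny sketch reproducing the L11 row from the component values in the table:

```python
WEIGHTS = {"loss_recovered": 0.30, "alive": 0.15, "l0_score": 0.15,
           "sparse_probing": 0.25, "tpp": 0.15}

def interpscore(components: dict) -> float:
    """InterpScore v0.0.1: weighted sum of the five component scores."""
    return sum(w * components[name] for name, w in WEIGHTS.items())

l11 = {"loss_recovered": 0.998, "alive": 0.999, "l0_score": 0.625,
       "sparse_probing": 0.862, "tpp": 0.135}
print(interpscore(l11))  # ~0.7788, the L11 row above
```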
Notable findings:
- `loss_recovered` saturated at 0.988-0.998 across all three layers, higher than the published Gemma-Scope-9b reference (0.95). The SAE preserves nearly all of the model's predictive information.
- L55 has the highest TPP (0.242) despite lower `alive`: late-stack features are more concentrated (ablating the top 0.5% drops the toxicity probe AUROC by 24.2%).
- L31 (middle stack) shows the lowest `loss_recovered` of the three but the highest sparse_probing AUROC: middle-layer residuals encode concepts well even when reconstruction is harder, consistent with Lieberum / Gemma-Scope findings.
Per-layer JSON reports: interpscore_L11.json · interpscore_L31.json · interpscore_L55.json · combined interpscore_papergrade.json.
Honest caveats
- Wall time overran ~50% (35h vs projected 22-27h): Colab throughput variance + GDN fallback cost
- L31 var_exp (0.706) is noticeably lower than L11 (0.842): middle-layer residuals are harder in the dense 27B (consistent with Lieberum / Gemma-Scope)
- L55 training-time dead fraction is 6.8% (11.19% on the 1M-token held-out window), reported honestly, not hidden. Below the Gao et al. 2024 baseline of 7% at n=16M, but higher than L11/L31
- One Colab kernel crash at ~188M tokens resumed cleanly from the 180M checkpoint: the resume tooling worked as designed
Reproduction
This exact SAE can be retrained end-to-end via the public notebook `OpenInterpretability/notebooks/03_papergrade_qwen36_27b_cloud.ipynb`; a quick pre-flight check is sketched after the list below.
Requirements:
- β₯96 GB VRAM (RTX 6000 Pro / H100 80GB with some offload / B200)
- `HF_TOKEN` env var with write access
- ~35h wall time
- ~$30-60 in cloud rental (Vast.ai / Lambda / RunPod)
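A quick pre-flight check before launching the notebook (a sketch; the exact VRAM headroom needed depends on offload settings):

```python
import os
import torch

assert os.environ.get("HF_TOKEN"), "set HF_TOKEN (write access) before launching"
assert torch.cuda.is_available(), "a CUDA GPU is required"

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, {vram_gb:.0f} GB VRAM")
if vram_gb < 90:
    print("warning: below ~96 GB; expect to need offload (e.g. an 80 GB H100)")
```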
After you have the SAEs
- Discover features → auto-label with Claude / GPT-4
- Build a Trace → ship to the Trace Theater
- Steer the model → live feature intervention (see the sketch below)
- Stage Gate G1 → validate a feature pack against GSM8K
- InterpScore eval → compute a public leaderboard score
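For the steering item above, a minimal sketch of a live feature intervention, continuing from the Usage snippet (`tok`, `model`, `sae`, `LAYER` already defined): add a scaled decoder direction for one feature into the residual stream at the hook layer. `FEATURE_ID`, `STRENGTH`, and the `model.model.layers[...]` path are all illustrative assumptions.

```python
FEATURE_ID, STRENGTH = 12345, 8.0   # illustrative values, not from this release
direction = sae.W_dec[FEATURE_ID].detach().to(model.device, dtype=torch.bfloat16)

def steer(module, inputs, output):
    # Add the feature direction to every token's post-block residual.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = tok("Tell me about Paris.", return_tensors="pt").to(model.device)
out = model.generate(**prompt, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```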
Citation
```bibtex
@misc{vicentino2026qwen27bpapergrade,
  author = {Vicentino, Caio and OpenInterpretability},
  title  = {Qwen3.6-27B Paper-Grade Sparse Autoencoders at L11/L31/L55},
  year   = {2026},
  url    = {https://huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade},
  note   = {OpenInterpretability project, 200M tokens, TopK + AuxK},
}
```
Related
- `qwen36-27b-sae-multilayer`: the n=4k precursor (3 SAEs trained on ~143k tokens each for rapid exploration)
- `qwen36-deepconf-probe`: +6pp SuperGPQA via probe-weighted majority voting
- `qwen36-feature-circuits`: honest negative, feature circuits failed replication
License
Apache-2.0. Base model under its own terms at Qwen/Qwen3.6-27B.
Built with the OpenInterpretability training stack · openinterp.org.