Qwen3.6-27B Β· Paper-Grade Sparse Autoencoders

Three TopK SAEs trained in parallel on the Qwen3.6-27B residual stream at L11 Β· L31 Β· L55. Gemma-Scope-27B-parity recipe, 200M tokens per layer, AuxK dead-feature mitigation, sae_lens-compatible export.

Status (openinterp.org/train)

βœ… Training complete β€” 200M tokens per layer Β· 2026-04-24


The flagship SAE of the OpenInterpretability ecosystem

This is the 3rd-tier "paper-grade" notebook output from the OpenInterpretability training ladder. The same notebook is public at OpenInterpretability/notebooks/03_papergrade_qwen36_27b_cloud.ipynb β€” anyone with a 96 GB GPU can reproduce this run in ~35h for ~$30-60.


Final metrics @ 200M tokens

Held-out validation (1M tokens — official eval)

Metric            L11       L31       L55
var_exp           0.8433    0.7135    0.8157
L0 (mean TopK)    128.0     128.0     128.0
Alive features    65 174    64 138    58 204
Dead fraction     0.55%     2.13%     11.19%
d_sae             65 536    65 536    65 536
k (TopK)          128       128       128
k_aux (AuxK)      2 560     2 560     2 560
α_aux             1/32      1/32      1/32

Headline: L11 came in solidly within the 0.83-0.86 target range at 0.8433. L31 sits at the lower end; middle-stack residuals need a larger dictionary for the same reconstruction quality (consistent with the Lieberum et al. Gemma Scope findings). L55's held-out dead fraction is 11.19%, reported honestly; the training-time figure was lower (6.81%) because the held-out eval uses a 1M-token window and counts any feature that never fired in it as dead.
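
The held-out numbers above can be recomputed from the released weights. Below is a minimal sketch of such a validation pass, assuming the TopKSAE interface from the Usage section and an iterator of residual-stream activations; the streaming per-batch centering and the function name are illustrative, not the official eval code.

import torch

@torch.no_grad()
def heldout_eval(sae, acts_stream, n_features=65_536):
    """acts_stream yields [batch, d_model] residual activations from ~1M held-out tokens."""
    sse, ss_tot, n_tokens = 0.0, 0.0, 0
    fired = torch.zeros(n_features, dtype=torch.bool)
    for x in acts_stream:
        x = x.float()
        z = sae.encode(x)                    # [batch, n_features], at most k nonzero per row
        x_hat = sae.decode(z)
        sse += (x - x_hat).pow(2).sum().item()
        ss_tot += (x - x.mean(0)).pow(2).sum().item()    # per-batch centering as a streaming approximation
        fired |= (z != 0).any(dim=0).cpu()
        n_tokens += x.shape[0]
    var_exp = 1.0 - sse / ss_tot
    dead_fraction = 1.0 - fired.float().mean().item()    # never fired in the window => counted dead
    return var_exp, dead_fraction, n_tokens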


Architecture spec

  • Base model: Qwen/Qwen3.6-27B (dense, 64 layers, d_model=5120)
  • SAE type: TopK (Gao et al. 2024) with AuxK auxiliary loss
  • Dictionary size: n = 65 536 (13Γ— expansion factor β€” Gemma-Scope-27B parity)
  • Sparsity: k = 128
  • Dead-feature mitigation: AuxK with k_aux = d_model/2 = 2 560, Ξ± = 1/32, dead threshold = 10M tokens
  • Initialization: W_dec rows unit-normed, W_enc = W_dec.T (tied init), b_dec = Weiszfeld geometric median over 16 384 samples
  • Decoder renorm: W_dec rows (the dictionary directions) renormalized to unit norm after every optimizer step; see the sketch after this list
  • Hook site: residual stream AFTER the decoder block at layer N (post-resid)
  • Data type: fp32 SAE params, bf16 base model
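
For concreteness, here is a minimal sketch of the TopK + AuxK objective and the decoder renormalization described above, following the Gao et al. 2024 recipe and the hyperparameters listed in this section. It assumes the TopKSAE interface from the Usage section; the dead-feature bookkeeping (tokens_since_fired) and function names are illustrative, and the exact training code lives in the public notebook.

import torch
import torch.nn.functional as F

def sae_loss(sae, x, tokens_since_fired, k_aux=2_560, alpha_aux=1 / 32, dead_after=10_000_000):
    """TopK reconstruction loss plus AuxK: dead features are trained to reconstruct the residual error."""
    z = sae.encode(x)                                    # TopK latents, k = 128 active
    x_hat = sae.decode(z)
    recon = (x - x_hat).pow(2).mean()

    # AuxK: pick the top-k_aux pre-activations among features that have not fired for >10M tokens
    pre = (x - sae.b_dec) @ sae.W_enc + sae.b_enc
    dead = tokens_since_fired > dead_after               # [n] bool mask of currently-dead features
    if dead.any():
        pre_dead = pre.masked_fill(~dead, float("-inf"))
        v, i = pre_dead.topk(min(k_aux, int(dead.sum())), dim=-1)
        z_aux = torch.zeros_like(pre).scatter_(-1, i, F.relu(v))
        err_hat = z_aux @ sae.W_dec                      # reconstruct the main reconstruction error
        aux = (x - x_hat.detach() - err_hat).pow(2).mean()
    else:
        aux = torch.zeros((), device=x.device)
    return recon + alpha_aux * aux

@torch.no_grad()
def renorm_decoder(sae):
    # project each dictionary direction (row of W_dec) back onto the unit sphere after the optimizer step
    sae.W_dec.div_(sae.W_dec.norm(dim=-1, keepdim=True).clamp_min(1e-8))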

Training details

  • Token budget: 200M per layer (2Γ— the minimum-viable floor)
  • Corpus: 70% FineWeb-Edu sample-10BT + 20% OpenThoughts-114k + 10% OpenMathInstruct-2 (reasoning-aware mix)
  • Sequences: 1 024 tokens, packed from stream, batch of 2 sequences per forward pass
  • Shared forward pass: one Qwen3.6-27B forward → three SAE training steps simultaneously (one per layer); see the sketch after this list. Cuts activation-extraction compute 3×.
  • Optimizer: Adam (Ξ²=0.9/0.999, Ξ΅=1e-8), grad clip 1.0
  • LR schedule: 5 000 warmup steps β†’ cosine decay, peak 2e-4 β†’ floor 6e-5
  • Checkpoint cadence: every 10M tokens to this HF repo (crash-safe; a crash loses at most the last 10M-token interval of progress)
  • Hardware: single NVIDIA RTX 6000 Pro Blackwell 96 GB
  • Wall-time (actual): ~35h end-to-end (overran the projected 22-27h by ~50% β€” Colab throughput variance + GDN fallback cost on dense 27B; one kernel crash at 188M resumed cleanly from the 180M checkpoint)
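
A minimal sketch of the shared-forward-pass loop: three forward hooks capture the post-block residuals at L11/L31/L55 from a single base-model forward, and each SAE takes its own optimizer step on the same activations. It reuses sae_loss / renorm_decoder from the architecture sketch above; the module path (model.model.layers[i]) and the saes / opts / tokens_since_fired containers are illustrative assumptions, not the notebook's exact code.

import torch

LAYERS = [11, 31, 55]
acts = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # post-block residual [batch, seq, d_model]
        acts[layer_idx] = hidden.detach().float().reshape(-1, hidden.shape[-1])
    return hook

# keep the handles so the hooks can be removed later
handles = [model.model.layers[i].register_forward_hook(make_hook(i)) for i in LAYERS]

def train_step(batch_ids, saes, opts, tokens_since_fired):
    with torch.no_grad():                       # one 27B forward pass, no LM gradients
        model(input_ids=batch_ids)
    for i in LAYERS:                            # three SAE updates from the same activations
        loss = sae_loss(saes[i], acts[i], tokens_since_fired[i])
        opts[i].zero_grad(set_to_none=True)
        loss.backward()
        opts[i].step()
        renorm_decoder(saes[i])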

var_exp trajectory (full log)


layer    10M      40M      70M      90M      200M
L11      0.716    0.794    0.815    0.825    0.842
L31      0.573    0.653    0.675    0.683    0.706
L55      0.679    0.760    0.778    0.787    0.808

The last 110M tokens (90M → 200M) added +0.017 / +0.023 / +0.021 for L11 / L31 / L55: diminishing but non-zero returns. L31 was still climbing and would have benefited from additional tokens; we stopped at the original 200M budget for reproducibility.

Dead-feature story (AuxK recovery)

Dead-feature evolution

layer    40M      70M      90M      200M     Δ (90M → 200M)
L11      159      160      227      30       -197 (recovered)
L31      39       511      1 161    319      -842 (recovered)
L55      147      2 563    2 633    4 465    +1 832 (grew)

L11 and L31 recovered dead features over the last 110M tokens, with AuxK reviving them as the learning rate decayed along the cosine schedule. L55's dead count kept growing, though more slowly than its pre-90M pace. At 90M we flagged L31 as a concern; that turned out to be a mid-training transient resolved by AuxK.


Files

File                                  Size           Purpose
sae_L{11,31,55}_latest.safetensors    2.68 GB each   Weights, sae_lens format (W_enc, W_dec, b_enc, b_dec)
sae_L{11,31,55}_cfg.json              ~300 B         Architecture + hyperparameters + hook_name
val_report.json                       —              Held-out 1M-token validation numbers (the table above)

Total: ~8 GB for the full three-layer release. Training optimizer states (resume.pt) were removed on release cleanup; if you need them to continue training, contact us.


Usage

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch
import torch.nn as nn
import torch.nn.functional as F

LAYER = 11  # pick 11, 31, or 55
REPO = "caiovicentino1/qwen36-27b-sae-papergrade"

# Load weights
weights_path = hf_hub_download(REPO, f"sae_L{LAYER}_latest.safetensors")
weights = load_file(weights_path)

# Reconstruct the SAE
class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=65536, k=128):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(d_in, n))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.W_dec = nn.Parameter(torch.empty(n, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.k = k

    def encode(self, x):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(self.k, dim=-1)
        z = torch.zeros_like(pre)
        z.scatter_(-1, top_i, F.relu(top_v))
        return z

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

sae = TopKSAE()
sae.load_state_dict(weights, strict=True)
sae.eval()

# Extract activations from Qwen3.6-27B and run through the SAE
from transformers import AutoTokenizer, AutoModelForImageTextToText
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
    trust_remote_code=True,
)
# hook the residual at layer N and feed through `sae`
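
A minimal sketch of that last step: register a forward hook on the block at LAYER, run one prompt, and encode the captured residual with the SAE. The model.model.layers[LAYER] path is a guess at the usual transformers layout; print(model) to confirm the module names for this checkpoint before relying on it.

captured = {}

def grab_residual(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output   # post-block residual [batch, seq, d_model]
    captured["resid"] = hidden.detach()

handle = model.model.layers[LAYER].register_forward_hook(grab_residual)

ids = tok("The geometric median minimizes the sum of distances.", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**ids)
handle.remove()

resid = captured["resid"].float().reshape(-1, 5120)   # SAE weights are fp32
feats = sae.encode(resid.cpu())                       # [n_tokens, 65536], at most 128 nonzero per token
print(feats.shape, (feats != 0).sum(dim=-1))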

For a full working example including prompt-to-trace pipeline, see 05_build_shareable_trace.ipynb.


InterpScore v0.0.1 β€” composite eval

Computed via notebooks/18b_interpscore_qwen36_27b_papergrade.ipynb on 250k held-out C4 tokens, with sparse-probing datasets SetFit/toxic_conversations and sst2, and TPP evaluated over the top 0.5% of the dictionary (k = 327 features).
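
For orientation, the sparse-probing component typically works like this: pool SAE feature activations per example for a labeled dataset, fit a logistic probe, and report held-out AUROC. The sketch below is a generic version under those assumptions; the pooling, probe regularization, and split are illustrative, not the notebook's exact settings.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def sparse_probing_auroc(feature_acts, labels, test_frac=0.2, seed=0):
    """feature_acts: [n_examples, d_sae] pooled SAE activations per text (numpy); labels: binary 0/1 (numpy)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_test = int(test_frac * len(labels))
    test, train = idx[:n_test], idx[n_test:]
    probe = LogisticRegression(max_iter=1_000, C=0.1)
    probe.fit(feature_acts[train], labels[train])
    scores = probe.predict_proba(feature_acts[test])[:, 1]
    return roc_auc_score(labels[test], scores)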

InterpScore per layer

Layer    InterpScore    loss_recovered    alive    l0_score    sparse_probing    tpp
L11      0.7788         0.998             0.999    0.625       0.862             0.135
L31      0.7600         0.994             0.892    0.625       0.867             0.117
L55      0.7507         0.988             0.780    0.625       0.829             0.242

Weights (v0.0.1): 0.30 Γ— loss_recovered + 0.15 Γ— alive + 0.15 Γ— l0_score + 0.25 Γ— sparse_probing + 0.15 Γ— tpp. See openinterp.org/interpscore for the formula.
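
Since the composite is a plain weighted sum, each row of the table can be reproduced directly from its sub-metrics. A quick check against the L11 row:

weights = {"loss_recovered": 0.30, "alive": 0.15, "l0_score": 0.15,
           "sparse_probing": 0.25, "tpp": 0.15}
l11 = {"loss_recovered": 0.998, "alive": 0.999, "l0_score": 0.625,
       "sparse_probing": 0.862, "tpp": 0.135}
interp_score = sum(weights[m] * l11[m] for m in weights)
print(interp_score)   # ≈ 0.77875, reported as 0.7788 in the table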

Notable findings:

  • loss_recovered saturated at 0.988-0.998 across all 3 layers β€” higher than the published Gemma-Scope-9b reference (0.95). The SAE preserves nearly all of the model's predictive information.
  • L55 has the highest TPP (0.242) despite lower alive β€” late-stack features are more concentrated (ablating top-0.5% drops the toxicity probe AUROC by 24.2%).
  • L31 (middle stack) shows the lowest loss_recovered of the three but the highest sparse_probing AUROC β€” middle-layer residuals encode concepts well even when reconstruction is harder, consistent with Lieberum / Gemma-Scope findings.

Per-layer JSON reports: interpscore_L11.json Β· interpscore_L31.json Β· interpscore_L55.json Β· combined interpscore_papergrade.json.


Honest caveats

  1. Wall time overran ~50% (35h vs projected 22-27h) β€” Colab throughput variance + GDN fallback cost
  2. L31 var_exp (0.706) is noticeably lower than L11 (0.842) β€” middle-layer residuals are harder in 27B dense (consistent with Lieberum / Gemma-Scope)
  3. L55 dead fraction is 6.8% at training time and 11.2% on the held-out 1M-token eval (the held-out criterion counts any feature that never fired in the window as dead). Reported honestly, not hidden; the training-time figure is below the ~7% Gao et al. 2024 report at n=16M but higher than L11/L31
  4. One Colab kernel crash at ~188M resumed cleanly from the 180M checkpoint β€” resume tooling worked as designed

Reproduction

This exact SAE can be retrained end-to-end via the public notebook:

Open in GitHub

Requirements:

  • β‰₯96 GB VRAM (RTX 6000 Pro / H100 80GB with some offload / B200)
  • HF_TOKEN env var with write access
  • ~35h wall time
  • ~$30-60 in cloud rental (Vast.ai / Lambda / RunPod)

After you have the SAEs


Citation

@misc{vicentino2026qwen27bpapergrade,
  author = {Vicentino, Caio and OpenInterpretability},
  title  = {Qwen3.6-27B Paper-Grade Sparse Autoencoders at L11/L31/L55},
  year   = {2026},
  url    = {https://huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade},
  note   = {OpenInterpretability project, 200M tokens, TopK + AuxK},
}

Related


License

Apache-2.0. Base model under its own terms at Qwen/Qwen3.6-27B.

Built with the OpenInterpretability training stack Β· openinterp.org.
