Chimère: A Self-Improving MoE Inference System for Consumer Hardware

35B parameters. 80 tokens/second on the production HTTP path. One GPU. $0.10/day. The model improves while you sleep.

πŸ†• Latest update β€” April 2026: Step 7 multi-architecture dispatch. The same chimere-server runtime now also runs Mamba-2 / Nemotron-H MoE hybrid SSM models end-to-end, on top of a custom backport of upstream llama.cpp's Mamba-2 work into our ik_llama.cpp fork (offered upstream as PR #1593). NVIDIA Nemotron-3-Nano-30B-A3B Q4_0 measured at ~45 tok/s on RTX 5060 Ti (sm_120, NCMOE=30, ctx 2048). Qwen3.5 production path is byte-for-byte unchanged. See Multi-architecture support below.


What is Chimère?

Chimère is a complete inference system — not just a model or a runtime, but an integrated stack where every component feeds the others. It runs Qwen3.5-35B-A3B (35B total parameters, 3.5B active per token, 256 experts) on a single RTX 5060 Ti (16 GB VRAM) at **80 tok/s on the chimere-server HTTP production path** (the bare ik_llama backend reaches ~93 tok/s; the Rust HTTP / sampling layer adds the difference). A nightly quality loop further improves the system from production traffic.

This is the kind of system NVIDIA builds for enterprise deployments, except it runs on a desktop in the south of France.

User request
     β”‚
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ODO β€” Unified Orchestrator (port 8084)                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Intent      β”‚  β”‚ Entropy      β”‚  β”‚ Confidence    β”‚  β”‚
β”‚  β”‚ Classifier  β”‚  β”‚ Router       β”‚  β”‚ RAG Trigger   β”‚  β”‚
β”‚  β”‚ (3-cascade) β”‚  β”‚ (fast/qual/  β”‚  β”‚ (logprob      β”‚  β”‚
β”‚  β”‚             β”‚  β”‚  ultra)      β”‚  β”‚  probe)       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚         β”‚                β”‚                  β”‚           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Enrichment Pipeline                               β”‚  β”‚
│  │ ‒ Web Search SOTA (8-stage: expand→search→RRF→    │  │
│  │   fetch→chunk→rerank→CRAG→synthesize)             │  │
β”‚  β”‚ β€’ ChromaDB RAG (dense + BM25 + RRF + cross-enc)  β”‚  β”‚
β”‚  β”‚ β€’ FAISS Semantic Few-shot (per domain)            β”‚  β”‚
β”‚  β”‚ β€’ Dynamic Engram (web β†’ n-gram logit bias)        β”‚  β”‚
β”‚  β”‚ β€’ Tool Injection (auto, from pipeline YAML)       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚         β”‚                                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚  β”‚ DVTS Tree     β”‚  β”‚ ABF + CGRS     β”‚                  β”‚
β”‚  β”‚ Search (K=2,  β”‚  β”‚ (thinking      β”‚                  β”‚
β”‚  β”‚ ThinkPRM)     β”‚  β”‚  budget mgmt)  β”‚                  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚                  β”‚
          β–Ό                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  chimere-server (port 8081) β€” Rust Runtime              β”‚
β”‚  β€’ ik_llama FFI backend (93 tok/s)                      β”‚
β”‚  β€’ Multi-tier Engram (Cuckoo <10ns β†’ hash O(1) β†’ FAISS)β”‚
β”‚  β€’ Logprobs (top-5 log-softmax, real values)            β”‚
β”‚  β€’ ABF token 248069 forcing at budget threshold         β”‚
β”‚  β€’ IQ3_S custom-mix / RAMP-v2 (15.2 GB, 3.78 BPW)     β”‚
β”‚  β€’ KV cache: q8_0 keys + q4_0 values (sweet spot)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Quality & Self-Improvement Loop                        β”‚
β”‚  β€’ ThinkPRM-1.5B (CPU): step-level verification         β”‚
β”‚  β€’ quality_scores.jsonl (104 scores, mean 3.04/5)       β”‚
β”‚  β€’ training_pairs.jsonl (68 pairs, score β‰₯ 4)           β”‚
│  ‒ 03:00 — Nightly LoRA (MeZO, stops GGUF→train→restart)│
β”‚  β€’ 04:00 β€” Engram WRITE (quality-gated, decay >30d)     β”‚
β”‚  β€’ Mon 02:00 β€” DSPy MIPROv2 prompt optimization         β”‚
β”‚  β€’ 6h β€” ChromaDB RAG reindex                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why This Is SOTA for Consumer Hardware (March 2026)

vs. Existing Solutions

System What it does What Chimère adds
llama.cpp / ik_llama Serve GGUF models We use ik_llama as backend (+23% vs stock), add Engram, ABF, quality loop
KTransformers CPU/GPU co-serving for MoE We go further: per-tensor mixed-precision (RAMP), self-improving quality
Ollama / LM Studio Easy local LLM UI No orchestration, no quality loop, no domain memory, no nightly improvement
vLLM / TensorRT-LLM High-throughput serving Requires A100+, no consumer GPU support for 35B MoE
OpenRouter / Together AI API access to MoE models Cloud, not local. $0.60/M tokens vs $0.10/day
Autoresearch (Karpathy) Self-improving research agent Concept paper, not deployed system. No runtime, no quantization
DeepSeek Engram Conditional memory for MoE Paper only. We implemented multi-tier version with quality-gated write

What No One Else Has Combined

  1. Rust runtime + custom CUDA sm_120 kernels for Blackwell consumer GPUs β€” 56K lines, the only MoE runtime in Rust
  2. RAMP data-free quantization β€” per-tensor mixed-precision without calibration data, 15.2 GB GGUF
  3. Multi-tier Engram β€” Cuckoo filter (<10ns) β†’ N-gram hash (O(1)) β†’ FAISS semantic (~5ms), with quality-gated nightly writes
  4. Adaptive Budget Forcing β€” thinking budget management for quantized reasoning (IQ3_S produces less coherent thinking than BF16, so budget must be shorter)
  5. The quality loop β€” ThinkPRM scores every response, high-quality pairs feed nightly LoRA + Engram + DSPy. The system improves from production traffic.
  6. Honest negative results β€” we tried and documented why speculative decoding (DFlash Ο„=6.06, wall-clock 0.73Γ—), MTP (84.8% acceptance, 0.51Γ—), and expert prefetch (86.65% hit, +1.1%) don't help on this hardware.

The Components

1. ODO β€” Unified Orchestrator

chimere-odo | 17K lines Python

A single proxy between the user and the model that adds intelligence:

  • Intent Classification: 3-strategy cascade (regex 99% <1ms β†’ filetype β†’ LLM GBNF <50ms). Routes to: code, kine, cyber, research, default.
  • Entropy Router: Measures query complexity β†’ fast (no-think, 0.7 temp), quality (think, 2048 budget, ABF 0.55), ultra (DVTS K=2, ThinkPRM).
  • Enrichment: Web search (8-stage SOTA pipeline), ChromaDB RAG (dense+BM25+RRF+cross-encoder), semantic few-shot, dynamic Engram.
  • Quality Gate: ThinkPRM-1.5B verifies step-level reasoning. Score ≀ 2 β†’ retry with reflection.
  • Pipeline YAML: Define multi-step agent workflows with hot-reload. 5 pipelines shipped (code: architectβ†’coder, cyber: triageβ†’correlateβ†’remediate).

Why not LangChain/LlamaIndex? Too heavy, too many abstractions, designed for cloud APIs. ODO is 1,525 lines doing exactly what we need with zero external framework dependencies.

2. Engram β€” Multi-Tier Domain Memory

chimere-engram-tables

Inspired by DeepSeek Engram but implemented as a lightweight lookup system:

  • Tier 0 β€” Cuckoo filter (<10ns): skips 97% of lookups for tokens not in any table
  • Tier 1 β€” FNV-1a hash tables (O(1)): core N-gram matching, binary format compatible with Rust and Python
  • Tier 2 β€” FAISS semantic (~5ms): embedding-based few-shot example retrieval

Ablation results (measured on 10-question benchmark):

Engram v1 (Ξ±=0.35, think+response):  77%  ← biases thinking, DEGRADES
Engram OFF (Ξ±=0):                    85%  ← baseline
Engram v2 (Ξ±=0.1, response-only):   88%  ← PRODUCTION

Key insight: applying Engram bias during the thinking phase constrains reasoning with domain patterns. Response-only bias with low Ξ± is the sweet spot.

Why not RAG alone? RAG injects knowledge via context (expensive, limited by context window). Engram injects at the logit level (zero context cost, unlimited knowledge). They're complementary β€” RAG for long-form retrieval, Engram for factual bias.

3. RAMP β€” Data-Free Quantization Pipeline

ramp-quant | 9K lines Python + C

Produces hardware-optimized GGUF without calibration data:

  1. NSDS sensitivity β€” kurtosis + SVD rank per tensor, data-free
  2. Proxy model β€” round-trip quantization error Γ— sensitivity = instant loss estimate
  3. Evolutionary search β€” 128 population, 200 generations, under VRAM budget
  4. Build β€” generates llama-quantize --custom-q command

GDN sensitivity hierarchy discovered:

SSM gates (Ξ±, Ξ²)  >  Attention Q/K  >  Shared experts  >  Routed experts
     Q8_0               Q5_K/Q6_K         Q5_K              IQ3_S

What we tried that failed:

Method Result Why
QuaRot (Hadamard rotation) PPL = 49,524 Incompatible with GDN recurrent state
OptRot (Givens rotation) PPL = 49,524 No cross-layer absorption
ParoQuant (pairwise rotation) OOM 256 experts Γ— grouped_mm > 16 GB
EvoPress (KL fitness) OOM Full model needed in RAM (70 GB)
GPTQ/AWQ layer-wise Complex MoE expert handling not supported

4. chimere-server β€” Rust Inference Runtime

chimere | 56K lines Rust + 2.6K CUDA

The only MoE inference runtime written in Rust, with:

  • Custom CUDA kernels for sm_120 (IQ3_S dequant, Q8_0+dp4a GEMV, flash attention, fused MoE)
  • Three backends: libllama FFI (93 tok/s), cudarc (57 tok/s), Candle (18 tok/s)
  • GDN state save/restore (impossible in llama.cpp β€” this enables speculative decoding on hybrid architectures)
  • OpenAI-compatible /v1/chat/completions API with streaming logprobs

Performance journey (5 days, March 14-19):

9.1 β†’ 21 β†’ 30.5 β†’ 42.5 β†’ 57 β†’ 93 tok/s

Through 9 Candle optimizations, Q8_1+dp4a kernels, fused operations, and finally the libllama FFI breakthrough.

5. Self-Improvement Loop

The system improves while idle:

Timer What How
03:00 daily LoRA training MeZO zeroth-order on quality-filtered pairs (score β‰₯ 4). Stops GGUF server β†’ trains β†’ restarts (try/finally).
04:00 daily Engram WRITE Add validated responses to domain tables. Decay: halve weight >30d, delete >90d.
Mon 02:00 DSPy MIPROv2 Bayesian prompt optimization per domain. Tested: code +8% on benchmark.
Every 6h RAG reindex ChromaDB re-ingestion of knowledge base.

Quality scoring: ThinkPRM-1.5B runs on CPU, scores every response 1-5 with step-level chain-of-thought verification. 104 scores logged (mean 3.04/5), 68 training pairs generated, 72 SPIN DPO pairs accumulated.

Quality scoring uses a 9B model on CPU (qwen9b-scorer, port 8085) β€” runs alongside the production 35B with zero VRAM impact. Previous 27B scorer required stopping production; the 9B CPU scorer eliminated all nightly downtime.

Why not RLHF/DPO on cloud? We can't afford cloud GPU time. MeZO trains at inference cost β€” the script stops the GGUF server, trains for ~1 min, restarts. Quality is lower than full DPO but it's free and runs every night.

6. DFlash β€” Block Diffusion Drafter

Paper + code

8 architectures over 27 days. Best holdout result: Ο„=6.06 (comparable to original DFlash paper). But wall-clock = 0.73Γ— (slowdown) because the target model is too fast (93 tok/s) for speculative decoding to help.

The GDN State Barrier: GDN recurrent layers cannot be rolled back after draft rejection. This is a structural incompatibility affecting all hybrid SSM-attention models (Jamba, RWKV, Qwen3.5). No amount of drafter improvement fixes this β€” it requires a new runtime (chimere-server provides one).

7. MTP β€” Multi-Token Prediction (Negative Result)

Model | Patches

First MTP implementation for Qwen3.5 MoE in ik_llama.cpp (5 patches, 8 bugs fixed). 84.8% acceptance but 0.51Γ— speedup β€” the MTP layer is itself MoE (256 experts on CPU), costing as much as a main forward.

8. Expert Prefetch (Negative Result)

Models

MLP predictor achieves 86.65% hit@8 accuracy. Zero speedup because ggml's multi-threaded CPU loop makes GPU prefetch serialize what was parallel work.


Quantified Results

Metric Value
Generation throughput (Qwen3.5-35B-A3B prod path) 80 tok/s chimere-server HTTP, ~93 tok/s bare ik_llama backend
Generation throughput (Nemotron-3-Nano-30B-A3B, NEW) ~45 tok/s chimere-server HTTP, NCMOE=30, ctx 2048
Model Qwen3.5-35B-A3B, RAMP-v2 15.2 GB (3.78 BPW)
VRAM usage ~14 GB / 16 GB
Benchmark 10/10 (code, math, tools, domain)
Engram ablation v1: 77% β†’ OFF: 85% β†’ v2: 88%
Quality scores 104 entries, mean 3.04/5
DFlash Ο„ (holdout) 6.06 (comparable to paper's 6.4)
DFlash wall-clock 0.73Γ— (negative β€” honest result)
MTP acceptance 84.8% (but 0.51Γ— throughput)
Expert prefetch 86.65% hit@8 (but +1.1% throughput)
Code size 121K lines (Rust + Python + CUDA + C)
Cost $0.10/day electricity
Hardware RTX 5060 Ti 16GB, i5-14600KF, 32GB DDR5

Multi-architecture support

As of April 2026 (Step 7 of the chimere-server multi-arch refactor), the same chimere-server runtime dispatches between two code paths based on the GGUF's general.architecture metadata:

Path Architectures Features
Qwen3.5 (prod) qwen35moe Full stack: MTP, MRoPE, Engram, multi-agent, cudarc / Candle / libllama backends, fast C++ sampler
Generic (libllama) mamba2, nemotron_h_moe, mamba libllama-only: forward via LlamaForward FFI, no MTP, no Engram, single-agent at Step 7

The Generic path was unblocked by a 12-commit Phase 3.x backport of upstream llama.cpp's Mamba-2 / Nemotron-H MoE support into our ik_llama.cpp fork, offered upstream as PR #1593. Validated end-to-end on:

  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF Q4_0: 45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048, via bin/test-nemotron and through HTTP /v1/chat/completions
  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF UD-IQ3_XXS: same path, coherent text on CPU

Models that should run via the same Generic path (untested at the chimere level β€” your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B, Hymba-1.5B-Base, Zamba2-7B.

The technical doc lives at chimere-server/docs/STEP7_MULTI_ARCH.md.


All Repositories

Code

Repository Lines What
chimere 96K Rust runtime + DFlash + MTP patches + 5 papers + Step 7 multi-arch dispatch
chimere-odo 17K Orchestrator + Engram + search + quality loop
ik_llama.cpp (fork) C++ backend fork β€” branch mamba2-nemotron-h-backport + PR #1593
ramp-quant 9K Quantization pipeline

Models

Model Size What
RAMP-v2-15G 15.2 GB Production GGUF (automated pipeline)
IQ3_S-custom-mix 14.7 GB Hand-crafted 317-override GGUF
IQ3_S-MTP 18.4 GB First MTP-enabled GGUF for Qwen3.5 MoE
MeZO LoRA 340 KB Zeroth-order LoRA proof-of-concept

Data

Dataset What
chimere-dflash-data DFlash training prompts (3,927)
chimere-quality-scores Quality scores + training pairs
chimere-engram-tables N-gram domain tables
chimere-expert-predictor 4 MLP predictor variants
chimere-calibration imatrix calibration corpus

Papers (5 drafts, arXiv pending endorsement)

  1. Block Diffusion Drafting for Hybrid MoE Models β€” 8 architectures, GDN State Barrier, wall-clock 0.73Γ—
  2. Chimère System Paper — the complete self-improving stack
  3. RAMP: Data-Free Mixed-Precision Quantization β€” 7 builds, QuaRot failure, RAMP-v2
  4. MTP on Qwen3.5 MoE β€” 84.8% acceptance, 0.51Γ— (negative result)
  5. Expert Prefetch β€” 86.65% hit@8, zero speedup (negative result)

All LaTeX sources: chimere/paper/latex/


Author

Kévin Rémondière — Independent ML researcher, Oloron-Sainte-Marie, France

ORCID: 0009-0008-2443-7166

Built in 7 weeks on a desktop. Everything open-source. The model improves in its sleep.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Paper for Kevletesteur/chimere-system