A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the Julia SLM family of models exploring alternative sequence mixing architectures.
JuliaSLM is the baseline Transformer in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
| +-- ln1: RMSNorm(256)
| +-- attn: CausalSelfAttention(4 heads, 64 dim each)
| | +-- wq, wk, wv: Dense(256 -> 256)
| | +-- wo: Dense(256 -> 256)
| +-- ln2: RMSNorm(256)
| +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
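To make the `ln2` and `ffn` entries in the tree above concrete, here is a minimal plain-Julia sketch of RMSNorm followed by a SwiGLU feed-forward layer at the same sizes (256 -> 640 -> 256). It is an illustration, not the JuliaGPT source: the function names, the bias-free projections, the epsilon value, the random weights, and the pre-norm residual wiring are all assumptions.

```julia
using Random

# RMSNorm: divide each column (one token) by its root-mean-square, then apply a learned gain.
# `epsilon` is an assumed stabiliser value, not taken from the model card.
function rmsnorm(x::AbstractMatrix, gain::AbstractVector; epsilon=1f-6)
    rms = sqrt.(sum(abs2, x; dims=1) ./ size(x, 1) .+ epsilon)
    return gain .* (x ./ rms)
end

# SiLU (swish) activation used as the gate non-linearity.
silu(z) = z / (1 + exp(-z))

# SwiGLU: the gate branch modulates the up branch elementwise before the down projection.
function swiglu(x, w_gate, w_up, w_down)
    return w_down * (silu.(w_gate * x) .* (w_up * x))
end

rng = Xoshiro(0)
d, h, t = 256, 640, 8                # embedding dim, FFN hidden dim, tokens in a toy batch
x      = randn(rng, Float32, d, t)   # stand-in activations, one column per token
gain   = ones(Float32, d)
w_gate = randn(rng, Float32, h, d) .* 0.02f0
w_up   = randn(rng, Float32, h, d) .* 0.02f0
w_down = randn(rng, Float32, d, h) .* 0.02f0

# Assumed pre-norm residual update: x + ffn(ln2(x)).
y = x .+ swiglu(rmsnorm(x, gain), w_gate, w_up, w_down)
println(size(y))
```

Under these assumptions the three projection matrices hold 3 x 256 x 640 = 491,520 weights per block, which matches the per-layer share of the FFN parameter count.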
| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| Total | 5.04M | 100% |
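The breakdown above can be reproduced from the hyperparameters with a few lines of arithmetic. The sketch below uses a hypothetical helper, `count_params`, and assumes bias-free Dense layers, one gain vector per RMSNorm, and no learned parameters in RoPE. Under those assumptions the sum comes to 5,037,312, the total reported in the hyperparameter table.

```julia
# Hypothetical helper: recompute the parameter breakdown from the hyperparameters.
# Assumes bias-free Dense layers and a single gain vector per RMSNorm.
function count_params(; vocab=2000, d=256, layers=6, ffn_hidden=640)
    embedding = vocab * d                     # shared with the output head (weight tying)
    attention = layers * 4 * d * d            # wq, wk, wv, wo
    ffn       = layers * 3 * d * ffn_hidden   # gate, up, and down projections
    norms     = (2 * layers + 1) * d          # ln1 + ln2 per block, plus ln_f
    return (; embedding, attention, ffn, norms,
            total = embedding + attention + ffn + norms)
end

counts = count_params()
for (name, n) in pairs(counts)
    share = round(100n / counts.total; digits=1)
    println(rpad(name, 10), lpad(n, 10), "  ", share, "%")
end
```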
| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
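As a rough illustration of the optimizer settings above, the following sketch implements linear warmup followed by cosine decay and checks the token budget. The exact schedule shape used in training is not stated in this card, so `lr_at` is a hypothetical helper built on the common convention that decay runs from the end of warmup to the final step. The token count assumes each of the 32 sequences in a batch is a full 256-token context.

```julia
# Hypothetical schedule: linear warmup to the peak rate, then cosine decay to min_lr.
function lr_at(step; peak=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step <= warmup && return peak * step / warmup        # linear warmup
    progress = (step - warmup) / (max_steps - warmup)     # 0 after warmup, 1 at the last step
    return min_lr + 0.5 * (peak - min_lr) * (1 + cos(pi * progress))
end

# Token budget implied by steps x batch size x context length (assumed full-length sequences).
tokens_seen      = 12_305 * 32 * 256
tokens_per_param = tokens_seen / 5_037_312

for step in (250, 500, 6_000, 12_305)
    println("step ", step, ": lr = ", lr_at(step))
end
println("tokens seen: ", tokens_seen, " (", round(tokens_per_param; digits=2), " per parameter)")
```

Under these assumptions the run sees 100,802,560 tokens, which is consistent with the ~100M figure and the 20 tokens-per-parameter target in the table.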
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
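The Val PPL column is the exponential of the validation loss, assuming the loss is a mean cross-entropy measured in nats. A short check on the rounded losses from the table (small differences from the reported perplexities are expected, since those were presumably computed from unrounded losses):

```julia
# Perplexity is the exponential of the mean cross-entropy loss (in nats).
perplexity(loss) = exp(loss)

val_losses = [5.01, 4.02, 3.70, 3.57, 3.54]   # rounded values copied from the table
for loss in val_losses
    println("val loss ", loss, " -> PPL ", round(perplexity(loss); digits=1))
end
```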
Built entirely in Julia:
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
Served via JuliaSLM Space:
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux
tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
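The `temperature` and `top_k` arguments control how the next token is drawn from the model's output logits. The internals of `generate` are not shown in this card, so the sketch below is only an assumed, conventional implementation: `sample_top_k` is a hypothetical helper that keeps the `top_k` largest logits, divides them by the temperature, applies a softmax, and samples from the result. The random logits stand in for one step of model output over the 2,000-token vocabulary.

```julia
using Random

# Hypothetical sampler: top-k filtering with temperature scaling.
# Not the JuliaGPT `generate` internals, which are not shown here.
function sample_top_k(rng::AbstractRNG, logits::AbstractVector{<:Real};
                      temperature=0.8, top_k=40)
    k = min(top_k, length(logits))
    keep = partialsortperm(logits, 1:k; rev=true)   # indices of the k largest logits
    scaled = logits[keep] ./ temperature            # lower temperature sharpens the distribution
    probs = exp.(scaled .- maximum(scaled))         # numerically stable softmax
    probs ./= sum(probs)
    # Inverse-CDF sampling over the k surviving tokens; fall back to the last
    # kept token if rounding leaves the cumulative sum slightly below r.
    r = rand(rng)
    idx = something(findfirst(>=(r), cumsum(probs)), k)
    return keep[idx]
end

rng = Xoshiro(42)
logits = randn(rng, 2000)              # stand-in for real model logits
token_id = sample_top_k(rng, logits)   # index into the vocabulary, 1-based
println(token_id)
```

With `top_k=1` this reduces to greedy decoding, and raising the temperature above 1 flattens the distribution over the kept tokens.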
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
@misc{juliaslm2026,
title={JuliaSLM: A Small Language Model in Pure Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
MIT