Sunbird Tutor · Gemma 4 E2B
A Gemma 4 E2B fine-tune for spoken question-answering in English and Ugandan languages. Audio in, text answer out, no intermediate transcription step.
This is the multilingual SFT checkpoint described in the Sunbird Tutor writeup (Kaggle Gemma 4 Good Hackathon, 2026). It powers the Sunflower educational assistant app, an Android app that runs the model fully offline so a child in a Ugandan classroom can ask a science question in Luganda and get an answer back in Luganda, without an internet connection.
Training code: SunbirdAI/sunbird-tutor-modelling contains the multilingual training pipeline, data prep, and evaluation harness used to produce this model.
What it does
The model accepts 16 kHz mono PCM audio directly through Gemma 4's native audio tower, with no separate ASR step. The audio understanding lives inside the same context window that produces the answer, so a single forward pass goes from a spoken question to a written reply. Depending on the prompt, the model can answer the question, transcribe the speech, translate it into another supported language, or explain what was said.
Vision was not exercised during fine-tuning. Text plus audio input only.
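Because the model expects 16 kHz mono PCM, it can help to check a recording before passing it to the processor. The snippet below is a minimal sketch using only Python's standard `wave` module; the helper name `wav_format_problems` is hypothetical and not part of this repository, and the check covers only sample rate and channel count.

```python
import wave

EXPECTED_RATE = 16_000   # Hz, as stated in this model card
EXPECTED_CHANNELS = 1    # mono


def wav_format_problems(path):
    """Return a list of format problems for a PCM WAV file; empty means OK.

    Hypothetical helper for illustration. `wave.open` raises `wave.Error`
    for files it cannot parse as PCM WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS} (mono)")
    return problems
```

A file that fails the check needs resampling or downmixing with an audio tool of your choice before inference.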
Languages
| ISO 639-3 | Language | Status |
|---|---|---|
| eng | English | strong |
| lug | Luganda | strongest non-English. chrF ~0.51 on the project eval. |
| ach | Acholi | second tier. chrF ~0.40, classroom-usable for shorter responses. |
| nyn | Runyankole | transcription and short translation reliable; QA degrades. |
| xog | Lusoga | transcription and short translation reliable; QA degrades. |
| nyo | Lunyoro | transcription and short translation reliable; QA degrades. |
| teo | Ateso | transcription and short translation reliable; QA degrades. |
Quality scales with training data volume, so Luganda is meaningfully ahead. The broader Sunbird Tutor project targets 12 Ugandan languages across multiple checkpoints; see the training repo for the larger picture.
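The chrF figures in the table are character n-gram F-scores. As a rough illustration of what the metric measures, here is a simplified pure-Python sketch. It is not the project's evaluation harness, which lives in the training repo. The n-gram orders 1 to 6, beta = 2, and whitespace stripping are assumptions borrowed from common chrF defaults, so scores from this sketch will not necessarily match the reported numbers.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0 to 1 scale (illustrative, assumed defaults)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders that either string is too short for
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the score works on characters rather than words, it gives partial credit for shared stems and affixes, which suits morphologically rich languages such as Luganda.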
How to use
Transformers (Python)
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Sunbird/sunbirdtutor-gemma-4-e2b")
model = AutoModelForImageTextToText.from_pretrained(
    "Sunbird/sunbirdtutor-gemma-4-e2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Answer mode: system prompt plus audio only; the audio carries the question.
messages = [
    {
        "role": "system",
        "content": (
            "You are an educational assistant that can give explanations, "
            "transcriptions and translations in Ugandan languages."
        ),
    },
    {
        "role": "user",
        "content": [{"type": "audio", "audio": "path/to/16khz_mono.wav"}],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding for reproducible answers.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated answer is decoded.
answer = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(answer)
The exact auto-class for Gemma 4 audio depends on your transformers version. If AutoModelForImageTextToText does not pick up the audio tower, the loader scripts in sunbird-tutor-modelling have the working class names.
On-device
For mobile, use the Cactus INT4 quantization: Sunbird/sunflower-qa-cactus-int4. About 3.8 GB on disk, fully offline. The audio tower stays at FP16 so Luganda speech recognition survives quantization; the decoder is pushed to INT4.
Prompt format
Use this system prompt verbatim. The fine-tune saw it on every training example, and drift here measurably degrades quality:
You are an educational assistant that can give explanations, transcriptions and translations in Ugandan languages.
The user content varies by mode:
| Mode | User content |
|---|---|
| Answer (default) | empty string. Audio carries the question. |
| Transcribe | "Transcribe this audio." |
| Translate | "Translate this audio into {target language}." |
| Explain | "Explain what was said in this audio." |
The canonical runtime strings live in lib/model_settings_sheet.dart inside the Sunflower app.
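The mode table above can be turned into a small message builder. This is a sketch, not code from the Sunflower app: the function name `build_messages` and the mode keys are hypothetical, and placing the text item after the audio item is an assumption. For Answer mode it sends the audio item alone, matching the Transformers example in this card.

```python
SYSTEM_PROMPT = (
    "You are an educational assistant that can give explanations, "
    "transcriptions and translations in Ugandan languages."
)

# User text per mode, copied from the table; keys are illustrative names.
USER_TEXT = {
    "answer": "",
    "transcribe": "Transcribe this audio.",
    "translate": "Translate this audio into {target_language}.",
    "explain": "Explain what was said in this audio.",
}


def build_messages(audio_path, mode="answer", target_language=None):
    """Build a chat-template message list for one of the four modes."""
    if mode not in USER_TEXT:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "translate" and not target_language:
        raise ValueError("translate mode needs a target_language")

    text = USER_TEXT[mode].format(target_language=target_language)
    content = [{"type": "audio", "audio": audio_path}]
    if text:
        # Assumption: the instruction follows the audio item.
        content.append({"type": "text", "text": text})

    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": content},
    ]
```

The returned list can be passed to `processor.apply_chat_template` in place of the hand-written `messages`, for example `build_messages("question.wav", "translate", "Luganda")`.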
Training
Fine-tuned from google/gemma-4-e2b. The pipeline, documented in full in the training repository, has three stages:
- Continued pretraining on around 600M characters of Ugandan-language text from web articles, books, translations, and synthetic instruction-following examples.
- Multilingual SFT on transcription using speech data from SALT, Google's Waxal, FLEURS, and the Makerere speech benchmark.
- A short final fine-tune on speech QA, built from the Ugandan primary school curriculum, machine-translated to Ugandan languages and converted to speech with a TTS model the team had previously trained (Orpheus 3B based).
The audio tower was preserved throughout. The decoder adapts to Ugandan-language QA while keeping Gemma 4's native speech understanding intact.
For exact configs, training scripts, per-language eval numbers, and the writeup that frames the work, see the training repo.
Intended use
Primary school science Q&A in Ugandan classrooms. The demo curriculum covers six Primary 5 to Primary 7 topics: photosynthesis, the water cycle, the life cycle of an insect, malaria prevention, digestion, and the solar system. Beyond Q&A, the model handles speech transcription, translation between supported languages from spoken input, and short spoken explanations. It is also useful as a research artifact for adapting multimodal foundation models to low-resource languages.
Out of scope
High-stakes domains, including medical, legal, and financial advice. Languages outside the seven listed. Image input. Long-form generation past a few hundred tokens, which drifts from the single-turn QA distribution the fine-tune was optimised for.
Limitations
Only Luganda reaches the strongest tier of QA quality. Acholi is classroom-usable for shorter responses. The other four Ugandan languages are present in the model but full question-answering degrades outside Luganda and Acholi; transcription and short translation remain reliable. Background-noise robustness has not been formally benchmarked in classroom environments. Audio inference assumes 16 kHz mono PCM input.
Related artifacts
- SunbirdAI/sunbird-tutor-modelling: training code, data pipeline, evaluation harness.
- SunbirdAI/sunflower-app: Android app that runs this model on-device.
- ak3ra/sunflower-qa-cactus-int4: Cactus INT4 quantization, ~3.8 GB.
Acknowledgements
Built by the Sunbird AI team. Foundation model: Google's Gemma 4 E2B. Inference engine: Cactus.
Citation
@misc{sunbird-tutor-gemma-4-e2b-2026,
  author = {Sunbird AI},
  title  = {Sunbird Tutor: Gemma 4 E2B for spoken question-answering in Ugandan languages},
  year   = {2026},
  url    = {https://huggingface.co/Sunbird/sunbirdtutor-gemma-4-e2b}
}
Submitted to the Kaggle Gemma 4 Good Hackathon, 2026.