# Qari-OCR-0.4.0-VL-4B-Instruct
A vision-language model fine-tuned for OCR on Islamic books and Arabic manuscripts. Based on Qwen/Qwen3-VL-4B-Instruct, trained on 45,000 image-text pairs from the seemorg/books-ocr dataset.
## Results

| Model | CER ↓ | WER ↓ | BLEU ↑ |
|-------|-------|-------|--------|
| Qari-OCR-0.4.0 | 0.1222 | 0.2562 | 68.41 |
| Qwen/Qwen3-VL-4B-Instruct | 0.4922 | 0.6966 | 34.61 |
| Qwen/Qwen3-VL-8B-Instruct | 0.6876 | 0.8954 | 23.89 |
| NAMAA/Qari-0.2.2.1 | 0.6448 | 0.5126 | 21.97 |
| MBZUAI/AIN | 1.2843 | 1.2697 | 3.50 |
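For reference, CER and WER are normalized edit distances at the character and word level. The sketch below shows one common way to compute them (a plain Levenshtein distance divided by reference length); it is an illustration only, not the exact evaluation script behind the table above.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate: word edits over whitespace-tokenized reference."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

Lower is better for both metrics; a CER of 0.1222 means roughly one character error per eight reference characters.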
## Usage

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

model_name = "NAMAA-Space/Qari-OCR-0.4.0-VL-4B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

image_path = "./image.jpg"  # path to the page image to transcribe
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Free OCR."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Strip the prompt tokens so only the generated transcription remains.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
result = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(result)
```
## Training
- Base model: Qwen/Qwen3-VL-4B-Instruct
- Dataset: seemorg/books-ocr
- Training samples: 45,000 image-text pairs
- Domain: Islamic books and Arabic religious texts
## Limitations
- Optimized for printed Islamic texts; performance may vary on modern Arabic fonts or handwritten text.
- Requires reasonable image quality (300+ DPI recommended).
- Arabic script only.
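The 300 DPI guideline can be sanity-checked from an image's pixel width when the physical page width is known. The helper below is a minimal sketch (the function names are mine, not part of the model's API), assuming you know the source page size:

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Approximate scan resolution from image width and physical page width."""
    return pixel_width / page_width_inches

def meets_ocr_quality(pixel_width: int, page_width_inches: float,
                      min_dpi: int = 300) -> bool:
    """True if the scan meets the recommended 300 DPI threshold."""
    return effective_dpi(pixel_width, page_width_inches) >= min_dpi

# Example: a 2480-pixel-wide scan of an A4 page (8.27 in) is roughly 300 DPI.
```

For scans below the threshold, re-scanning at a higher resolution generally works better than software upscaling.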
## Citation

```bibtex
@misc{qari-ocr-0.4.0,
  author       = {NAMAA-Space},
  title        = {Qari-OCR-0.4.0-VL-4B-Instruct},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/Qari-OCR-0.4.0-VL-4B-Instruct}}
}
```