# Garnet-OCR-7B-0422
The Garnet-OCR-7B-0422 model is a refined and highly optimized evolution of Gliese-OCR-7B-Post2.0-final, built upon the Qwen2.5-VL architecture. This release focuses on high-precision mathematical formula extraction, structured markdown generation, and accurate table reconstruction, making it especially effective for technical and scientific documents. Fine-tuned with enhanced datasets targeting mathematical notation, document structure, and layout fidelity, the model delivers superior performance across complex documents including research papers, scanned PDFs, handwritten equations, structured forms, and analytical reports.
GGUF: https://huggingface.co/prithivMLmods/Garnet-OCR-7B-Post-0422-GGUF
## Key Enhancements
- Advanced Mathematical Formula Extraction: Significantly improved recognition and LaTeX conversion of complex equations, symbols, and multi-line expressions.
- High-Precision Markdown Generation: Produces clean, structured, and semantically accurate markdown outputs for documents.
- Robust Table Reconstruction: Enhanced detection and reconstruction of tables with correct alignment, hierarchy, and formatting.
- Optimized Document Visualization and OCR Pipeline: Improved understanding of text, layout, and embedded visuals for structured document parsing.
- Context-Aware Multimodal Linking: Stronger alignment between textual, visual, and spatial elements within documents.
- High-Fidelity Content Extraction: Accurate extraction of structured, semi-structured, and unstructured data with normalization.
- Analytical Recognition: Improved reasoning over charts, graphs, tables, and mathematical content.
- Enhanced Layout Awareness: Better spatial and semantic comprehension for complex document structures.
- Extended Multimodal Sequence Support: Handles long document sequences and extended visual inputs.
- Stable Final Optimization: Consolidated improvements for reliable and consistent outputs.
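Because the model emits LaTeX inside markdown, a lightweight sanity check on its output can catch truncated or malformed math spans before downstream rendering. The following helper is a hypothetical sketch, not part of the Garnet-OCR release; it assumes the common `$...$` / `$$...$$` delimiter convention in generated markdown.

```python
import re

def check_math_delimiters(markdown: str) -> bool:
    """Return True if $$...$$ and $...$ math spans are balanced.

    Hypothetical sanity check for model output; assumes standard
    dollar-sign LaTeX delimiters, which the model card does not guarantee.
    """
    # Display-math fences ($$) must come in pairs.
    if len(re.findall(r"\$\$", markdown)) % 2 != 0:
        return False
    # Strip complete display-math blocks, then pair up the remaining
    # single $ signs as inline math.
    inline_only = re.sub(r"\$\$.*?\$\$", "", markdown, flags=re.DOTALL)
    return inline_only.count("$") % 2 == 0
```

A failed check is a cheap signal to re-run generation with a larger `max_new_tokens` budget.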
## Quick Start with Transformers
Quick start with a Colab notebook: Garnet-OCR-7B-0422-4bit-Demo
```bash
pip install transformers==5.6.0
# or
pip install git+https://github.com/huggingface/transformers.git
```
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Garnet-OCR-7B-0422", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Garnet-OCR-7B-0422")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Extract equations, reconstruct tables, and convert the document into structured markdown."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
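Once the generated markdown is decoded, the display-math blocks can be pulled out for separate validation or rendering. This is an illustrative post-processing step only; the `$$...$$` delimiter convention is an assumption about typical model output, not a documented guarantee.

```python
import re

def extract_equations(markdown: str) -> list[str]:
    """Collect display-math blocks ($$...$$) from generated markdown.

    Illustrative post-processing sketch; assumes the model wraps
    display equations in $$ delimiters.
    """
    return [m.strip() for m in re.findall(r"\$\$(.*?)\$\$", markdown, flags=re.DOTALL)]
```

Each returned string is a raw LaTeX body that can be fed to a renderer such as KaTeX or MathJax.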
## Training Details
| Parameter | Details |
|---|---|
| Model Name | Garnet-OCR-7B-0422 |
| Model State | Experimental |
| Model Architecture | Qwen2_5_VLForConditionalGeneration |
| Base Model | prithivMLmods/Gliese-OCR-7B-Post2.0-final |
| Dataset Size | ~30K samples (Modular combination of datasets) |
| Dataset Composition | ~20% from allenai/olmOCR-mix-0225, remaining from mixed OCR and document datasets |
| Training Time | 9,240 (±250) seconds (~2.57 hours) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Training Framework | TRL (Transformer Reinforcement Learning) |
| Transformers Version | 5.6.0 |
| Compute Service | Hugging Face Jobs / Spaces |
| Warmup Steps | 750 |
| Precision | bfloat16 |
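For orientation, the table's hyperparameters map onto a TRL training configuration roughly as sketched below. Only `warmup_steps` and `bf16` come from the table above; the output directory and every omitted argument are illustrative assumptions, not the actual training recipe.

```python
# Hypothetical sketch of the TRL fine-tuning configuration.
# Only warmup_steps and bf16 are taken from the training table;
# output_dir and all defaults are assumptions.
from trl import SFTConfig

config = SFTConfig(
    output_dir="garnet-ocr-7b-0422",  # assumed name
    bf16=True,                        # Precision: bfloat16
    warmup_steps=750,                 # Warmup Steps: 750
)
```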
## Intended Use
- Mathematical formula extraction and LaTeX generation.
- High-quality markdown conversion from documents.
- Accurate table detection and reconstruction.
- Document OCR and structured data extraction.
- Scientific, academic, and technical document processing.
- Multimodal document understanding and reasoning.
- Summarization and QA over structured documents.
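For the table-reconstruction use case, the model's pipe-delimited markdown tables can be converted back into row data for downstream analysis. This is a minimal sketch under the assumption that the model emits standard `| a | b |` rows with a `|---|---|` separator; it is not part of the Garnet-OCR release.

```python
def parse_markdown_table(table: str) -> list[list[str]]:
    """Parse a simple pipe-delimited markdown table into rows of cells.

    Assumes standard GitHub-style tables; ignores lines that do not
    start with a pipe and skips the header separator row.
    """
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row (---, :---:, etc.).
        if all(c and set(c) <= set(":- ") for c in cells):
            continue
        rows.append(cells)
    return rows
```

The resulting rows can be handed to `csv.writer` or a DataFrame constructor as needed.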
## Limitations
- Reduced accuracy on heavily degraded or low-quality scans.
- High computational requirements for large-scale inference.
- Limited optimization for edge devices.
- Occasional layout misalignment in highly complex documents.
- Performance sensitivity to visual token configuration and long context inputs.
Model tree for prithivMLmods/Garnet-OCR-7B-0422: base model Qwen/Qwen2.5-VL-7B-Instruct.