How to use merve/smol-vision with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="merve/smol-vision")
```

```python
# Load the model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("merve/smol-vision", dtype="auto")
```

How to use merve/smol-vision with vLLM:
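The image-text-to-text pipeline takes chat-style messages that mix image and text content. A minimal sketch of such a payload (the image URL and prompt below are illustrative placeholders, not from the model card; the actual generation call is left commented out since it downloads the model):

```python
# Chat-style input for an image-text-to-text pipeline; the URL and
# prompt are illustrative placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/document.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With the pipeline loaded as above (requires downloading the model):
# outputs = pipe(text=messages, max_new_tokens=128)
# print(outputs[0]["generated_text"])
```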
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "merve/smol-vision"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
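The curl call above can be reproduced from Python with only the standard library. A sketch, assuming the vLLM server from the previous step is listening on port 8000 (the actual request is left commented out since it needs the server running):

```python
import json
from urllib import request

# Same completion request as the curl example, built with the stdlib.
payload = {
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires the server to be up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```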
How to use merve/smol-vision with SGLang:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "merve/smol-vision" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

```shell
# Alternatively, run the SGLang server in Docker:
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "merve/smol-vision" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

How to use merve/smol-vision with Docker Model Runner:

```shell
docker model run hf.co/merve/smol-vision
```
Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models. The original GitHub repository is here; it was migrated to Hugging Face since the notebooks there aren't rendered 🥲
Latest examples 👇🏻
Note: The script and notebook are updated to fix a few issues related to QLoRA!
| Topic | Notebook | Description |
|---|---|---|
| Quantization/ONNX | Faster and Smaller Zero-shot Object Detection with Optimum | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools. |
| VLM Fine-tuning | Fine-tune PaliGemma | Fine-tune state-of-the-art vision language backbone PaliGemma using transformers. |
| Intro to Optimum/ORT | Optimizing DETR with π€ Optimum | A soft introduction to exporting vision models to ONNX and quantizing them. |
| Model Shrinking | Knowledge Distillation for Computer Vision | Knowledge distillation for image classification. |
| Quantization | Fit in vision models using Quanto | Fit vision models into smaller hardware using quanto. |
| Speed-up | Faster foundation models with torch.compile | Improve latency for foundation models using torch.compile. |
| VLM Fine-tuning | Fine-tune Florence-2 | Fine-tune Florence-2 on the DocVQA dataset. |
| VLM Fine-tuning | QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2 | QLoRA/full fine-tune IDEFICS3 or SmolVLM on the VQAv2 dataset. |
| VLM Fine-tuning (Script) | QLoRA Fine-tune IDEFICS3 on VQAv2 | QLoRA/full fine-tune IDEFICS3 or SmolVLM on the VQAv2 dataset, as a standalone script. |
| Multimodal RAG | Multimodal RAG using ColPali and Qwen2-VL | Learn to retrieve documents and build a RAG pipeline without heavy document processing, using ColPali through Byaldi, and generate with Qwen2-VL. |
| Multimodal Retriever Fine-tuning | Fine-tune ColPali for Multimodal RAG | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case. |
| VLM Fine-tuning | Fine-tune Gemma-3n for all modalities (audio-text-image) | Fine-tune Gemma-3n model to handle any modality: audio, text, and image. |
| Multimodal RAG | Any-to-Any (Video) RAG with OmniEmbed and Qwen | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
| Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with text-generation-inference. |
| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum. |
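As a taste of the knowledge-distillation recipe listed above: the core idea is matching a student's temperature-softened output distribution to a teacher's. A dependency-free sketch of one common formulation of the loss, with illustrative logits (the actual notebook applies this to image-classifier outputs):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```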