How to use merve/smol-vision with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="merve/smol-vision")
```

```python
# Load the model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("merve/smol-vision", dtype="auto")
```

How to use merve/smol-vision with vLLM:
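The image-text-to-text pipeline takes chat-style messages that mix image and text content. A minimal sketch of such a payload (the image URL and prompt below are illustrative placeholders, not from the model card; the actual generation call is left commented out since it downloads the model):

```python
# Chat-style input for an image-text-to-text pipeline; the URL and
# prompt are illustrative placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/document.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With the pipeline loaded as above (requires downloading the model):
# outputs = pipe(text=messages, max_new_tokens=128)
# print(outputs[0]["generated_text"])
```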
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "merve/smol-vision"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
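The curl call above can be reproduced from Python with only the standard library. A sketch, assuming the vLLM server from the previous step is listening on port 8000 (the actual request is left commented out since it needs the server running):

```python
import json
from urllib import request

# Same completion request as the curl example, built with the stdlib.
payload = {
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires the server to be up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```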
How to use merve/smol-vision with SGLang:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "merve/smol-vision" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

```shell
# Alternatively, run the SGLang server in Docker:
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "merve/smol-vision" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "merve/smol-vision",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

How to use merve/smol-vision with Docker Model Runner:

```shell
docker model run hf.co/merve/smol-vision
```
Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models. The original GitHub repository is here; it was migrated to Hugging Face since the notebooks there aren't rendered 🥲
Latest examples 👇🏻
Note: The script and notebook are updated to fix a few issues related to QLoRA!
| Topic | Notebook | Description |
|---|---|---|
| Quantization/ONNX | Faster and Smaller Zero-shot Object Detection with Optimum | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools. |
| VLM Fine-tuning | Fine-tune PaliGemma | Fine-tune state-of-the-art vision language backbone PaliGemma using transformers. |
| Intro to Optimum/ORT | Optimizing DETR with π€ Optimum | A soft introduction to exporting vision models to ONNX and quantizing them. |
| Model Shrinking | Knowledge Distillation for Computer Vision | Knowledge distillation for image classification. |
| Quantization | Fit in vision models using Quanto | Fit vision models into smaller hardware using quanto. |
| Speed-up | Faster foundation models with torch.compile | Improve latency for foundation models using torch.compile. |
| VLM Fine-tuning | Fine-tune Florence-2 | Fine-tune Florence-2 on the DocVQA dataset. |
| VLM Fine-tuning | QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2 | QLoRA/full fine-tune IDEFICS3 or SmolVLM on the VQAv2 dataset. |
| VLM Fine-tuning (Script) | QLoRA Fine-tune IDEFICS3 on VQAv2 | QLoRA/full fine-tune IDEFICS3 or SmolVLM on the VQAv2 dataset, as a standalone script. |
| Multimodal RAG | Multimodal RAG using ColPali and Qwen2-VL | Learn to retrieve documents and build a RAG pipeline without heavy document processing, using ColPali through Byaldi, and generate with Qwen2-VL. |
| Multimodal Retriever Fine-tuning | Fine-tune ColPali for Multimodal RAG | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case. |
| VLM Fine-tuning | Fine-tune Gemma-3n for all modalities (audio-text-image) | Fine-tune Gemma-3n model to handle any modality: audio, text, and image. |
| Multimodal RAG | Any-to-Any (Video) RAG with OmniEmbed and Qwen | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
| Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with text-generation-inference. |
| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum. |
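As a taste of the knowledge-distillation recipe listed above: the core idea is matching a student's temperature-softened output distribution to a teacher's. A dependency-free sketch of one common formulation of the loss, with illustrative logits (the actual notebook applies this to image-classifier outputs):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```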