LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
Paper: [arXiv:2601.14706](https://arxiv.org/abs/2601.14706)
GR-Lite is a fashion image retrieval model fine-tuned from DINOv3-ViT-L/16. It extracts 1024-dimensional L2-normalized embeddings optimized for fashion product search and retrieval.
GR-Lite achieves state-of-the-art performance on LookBench and other fashion retrieval benchmarks. See the paper for detailed metrics.
| Property | Value |
|---|---|
| Architecture | ViT-L/16 (DINOv2-style) |
| Parameters | 303M |
| Input | 336 x 336 RGB |
| Output | 1024-dim L2-normalized embedding |
| Framework | PyTorch / Transformers |
```python
from transformers import AutoModel
import torch
from PIL import Image
from torchvision import transforms

# Load model
model = AutoModel.from_pretrained("srpone/gr-lite", trust_remote_code=True)
model.eval()

# Preprocess: resize to 336 x 336 and apply ImageNet normalization
transform = transforms.Compose([
    transforms.Resize((336, 336)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0)  # [1, 3, 336, 336]

# Extract embedding
with torch.no_grad():
    output = model(pixel_values)
    embedding = output.pooler_output  # [1, 1024], L2-normalized

print(f"Embedding shape: {embedding.shape}")
```
```python
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    def __init__(self, image_paths, transform):
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        return self.transform(image)

# Create dataloader
dataset = ImageDataset(your_image_paths, transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Extract all embeddings
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

all_embeddings = []
with torch.no_grad():
    for batch in loader:
        output = model(batch.to(device))
        all_embeddings.append(output.pooler_output.cpu())

embeddings = torch.cat(all_embeddings, dim=0)  # [N, 1024]
```
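The card does not cover persistence, but for a large gallery you will usually want to store the embeddings together with their image paths rather than recompute them; a minimal sketch using `torch.save` (file name is illustrative):

```python
# Save gallery embeddings alongside the image paths so retrieval indices
# can be mapped back to files later.
torch.save(
    {"embeddings": embeddings, "image_paths": list(your_image_paths)},
    "gallery_embeddings.pt",
)

# Reload later:
gallery = torch.load("gallery_embeddings.pt")
gallery_embs = gallery["embeddings"]   # [N, 1024]
gallery_paths = gallery["image_paths"]
```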
```python
# query_emb: [1, 1024], gallery_embs: [N, 1024]
similarity = query_emb @ gallery_embs.T  # cosine similarity (embeddings are already L2-normalized)
top_k_indices = similarity.argsort(descending=True)[0, :10]  # indices of the 10 most similar gallery images
```
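For galleries that are too large for a single in-memory matrix multiply, a nearest-neighbour index works as well. FAISS is not mentioned in the card; the following is a sketch assuming `faiss` is installed, using an exact inner-product index (equivalent to cosine similarity here because the embeddings are already L2-normalized):

```python
import faiss  # assumption: faiss-cpu or faiss-gpu is installed
import numpy as np

index = faiss.IndexFlatIP(1024)  # exact inner-product search
index.add(gallery_embs.numpy().astype(np.float32))

scores, indices = index.search(query_emb.numpy().astype(np.float32), k=10)
print(indices[0])  # top-10 gallery indices for the query
```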
The model returns a `BaseModelOutputWithPooling`:

| Field | Shape | Description |
|---|---|---|
| `pooler_output` | `[B, 1024]` | L2-normalized CLS token embedding (use this for retrieval) |
| `last_hidden_state` | `[B, 446, 1024]` | Full sequence output (CLS + 4 registers + 441 patches) |
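If you need spatial features (e.g., for region-level matching or visualization), the patch tokens can be sliced out of `last_hidden_state` following the token layout described above (CLS token first, then 4 register tokens, then the 441 patch tokens); a minimal sketch, assuming that ordering:

```python
# Token layout per the table above: [CLS] + 4 registers + 441 patches.
hidden = output.last_hidden_state      # [B, 446, 1024]
cls_token = hidden[:, 0]               # [B, 1024] (pooler_output is its L2-normalized form)
patch_tokens = hidden[:, 5:]           # [B, 441, 1024]

# Reshape patches to a 21 x 21 feature map (336 / 16 = 21 patches per side).
B, N, D = patch_tokens.shape
feature_map = patch_tokens.reshape(B, 21, 21, D)
```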
```bibtex
@article{gao2026lookbench,
  title={LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval},
  author={Chao Gao and Siqiao Xue and Yimin Peng and Jiwen Fu and Tingyi Gu and Shanshan Li and Fan Zhou},
  journal={arXiv preprint arXiv:2601.14706},
  year={2026},
  url={https://arxiv.org/abs/2601.14706},
}
```
Base model: facebook/dinov3-vit7b16-pretrain-lvd1689m