GR-Lite: Fashion Image Retrieval Model

GR-Lite is a fashion image retrieval model fine-tuned from DINOv3-ViT-L/16. It extracts 1024-dimensional L2-normalized embeddings optimized for fashion product search and retrieval.

GR-Lite achieves state-of-the-art performance on LookBench and other fashion retrieval benchmarks; see the LookBench paper cited below for detailed metrics.

Property      Value
Architecture  ViT-L/16 (DINOv3-style)
Parameters    303M
Input         336 x 336 RGB
Output        1024-dim L2-normalized embedding
Framework     PyTorch / Transformers

Quick Start

from transformers import AutoModel
import torch
from PIL import Image
from torchvision import transforms

# Load model
model = AutoModel.from_pretrained("srpone/gr-lite", trust_remote_code=True)
model.eval()

# Preprocess
transform = transforms.Compose([
    transforms.Resize((336, 336)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet mean/std
])

image = Image.open("your_image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0)  # [1, 3, 336, 336]

# Extract embedding
with torch.no_grad():
    output = model(pixel_values)
    embedding = output.pooler_output  # [1, 1024], L2-normalized

print(f"Embedding shape: {embedding.shape}")

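Because the pooler output is L2-normalized, a quick sanity check is to confirm that each embedding has unit norm:

# An L2-normalized embedding should have unit norm (up to floating-point error)
print(embedding.norm(dim=-1))  # expected: tensor([1.0000])
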
Feature Extraction (Batch)

from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    def __init__(self, image_paths, transform):
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        return self.transform(image)

# Create dataloader
dataset = ImageDataset(your_image_paths, transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Extract all embeddings
all_embeddings = []
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

with torch.no_grad():
    for batch in loader:
        output = model(batch.to(device))
        all_embeddings.append(output.pooler_output.cpu())

embeddings = torch.cat(all_embeddings, dim=0)  # [N, 1024]

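For a fixed gallery, it is usually worth computing embeddings once and persisting them together with their image paths. A minimal sketch; the gallery_index.pt filename is just an example:

# Save embeddings alongside their source paths for later retrieval
torch.save({"embeddings": embeddings, "paths": your_image_paths}, "gallery_index.pt")

# Reload later
saved = torch.load("gallery_index.pt")
gallery_embs, gallery_paths = saved["embeddings"], saved["paths"]
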
Image Retrieval

# query_emb: [1, 1024], gallery_embs: [N, 1024]
similarity = query_emb @ gallery_embs.T  # cosine similarity (embeddings are already L2-normalized)
top_k_indices = similarity.topk(k=10, dim=1).indices  # [1, 10] indices of the 10 closest gallery images

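For large galleries, the brute-force matrix product above can be replaced by a dedicated nearest-neighbor index. A minimal sketch using FAISS (an assumed extra dependency, e.g. faiss-cpu); because the embeddings are L2-normalized, inner product equals cosine similarity:

import faiss

# Exact inner-product search; equivalent to cosine similarity for unit-norm vectors
index = faiss.IndexFlatIP(1024)
index.add(gallery_embs.numpy())  # FAISS expects float32 numpy arrays
scores, indices = index.search(query_emb.numpy(), 10)  # top-10 per query
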
Model Outputs

The model returns a BaseModelOutputWithPooling:

Field              Shape            Description
pooler_output      [B, 1024]        L2-normalized CLS token embedding (use this for retrieval)
last_hidden_state  [B, 446, 1024]   Full sequence output (CLS + 4 registers + 441 patches)

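If you need patch-level features (e.g., for region-based matching), they can be sliced out of last_hidden_state. A sketch continuing from the Quick Start example, assuming the usual token order of CLS first, then the 4 register tokens, then the 441 patch tokens:

with torch.no_grad():
    output = model(pixel_values)

patch_tokens = output.last_hidden_state[:, 5:]        # [B, 441, 1024], skips CLS + 4 registers
patch_grid = patch_tokens.reshape(-1, 21, 21, 1024)   # 336 / 16 = 21 patches per side
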
Citation

@article{gao2026lookbench,
  title   = {LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval},
  author  = {Chao Gao and Siqiao Xue and Yimin Peng and Jiwen Fu and Tingyi Gu and Shanshan Li and Fan Zhou},
  year    = {2026},
  journal = {arXiv preprint arXiv:2601.14706},
  url     = {https://arxiv.org/abs/2601.14706},
}