LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
Paper: [arXiv:2601.14706](https://arxiv.org/abs/2601.14706)
GR-Lite is a fashion image retrieval model fine-tuned from DINOv3-ViT-L/16. It extracts 1024-dimensional L2-normalized embeddings optimized for fashion product search and retrieval.
GR-Lite achieves state-of-the-art performance on LookBench and other fashion retrieval benchmarks. See the paper for detailed metrics.
| Property | Value |
|---|---|
| Architecture | ViT-L/16 (DINOv2-style) |
| Parameters | 303M |
| Input | 336 x 336 RGB |
| Output | 1024-dim L2-normalized embedding |
| Framework | PyTorch / Transformers |
```python
from transformers import AutoModel
import torch
from PIL import Image
from torchvision import transforms

# Load model
model = AutoModel.from_pretrained("srpone/gr-lite", trust_remote_code=True)
model.eval()

# Preprocess: resize to 336 x 336 and apply ImageNet normalization
transform = transforms.Compose([
    transforms.Resize((336, 336)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0)  # [1, 3, 336, 336]

# Extract embedding
with torch.no_grad():
    output = model(pixel_values)
    embedding = output.pooler_output  # [1, 1024], L2-normalized

print(f"Embedding shape: {embedding.shape}")
```
```python
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    def __init__(self, image_paths, transform):
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        return self.transform(image)

# Create dataloader
dataset = ImageDataset(your_image_paths, transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Extract all embeddings
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

all_embeddings = []
with torch.no_grad():
    for batch in loader:
        output = model(batch.to(device))
        all_embeddings.append(output.pooler_output.cpu())

embeddings = torch.cat(all_embeddings, dim=0)  # [N, 1024]
```
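The card does not cover persistence, but for a large gallery you will usually want to store the embeddings together with their image paths rather than recompute them; a minimal sketch using `torch.save` (file name is illustrative):

```python
# Save gallery embeddings alongside the image paths so retrieval indices
# can be mapped back to files later.
torch.save(
    {"embeddings": embeddings, "image_paths": list(your_image_paths)},
    "gallery_embeddings.pt",
)

# Reload later:
gallery = torch.load("gallery_embeddings.pt")
gallery_embs = gallery["embeddings"]   # [N, 1024]
gallery_paths = gallery["image_paths"]
```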
```python
# query_emb: [1, 1024], gallery_embs: [N, 1024]
similarity = query_emb @ gallery_embs.T  # cosine similarity (embeddings are already L2-normalized)
top_k_indices = similarity.argsort(descending=True)[0, :10]  # indices of the 10 most similar gallery images
```
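For galleries that are too large for a single in-memory matrix multiply, a nearest-neighbour index works as well. FAISS is not mentioned in the card; the following is a sketch assuming `faiss` is installed, using an exact inner-product index (equivalent to cosine similarity here because the embeddings are already L2-normalized):

```python
import faiss  # assumption: faiss-cpu or faiss-gpu is installed
import numpy as np

index = faiss.IndexFlatIP(1024)  # exact inner-product search
index.add(gallery_embs.numpy().astype(np.float32))

scores, indices = index.search(query_emb.numpy().astype(np.float32), k=10)
print(indices[0])  # top-10 gallery indices for the query
```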
The model returns a `BaseModelOutputWithPooling`:

| Field | Shape | Description |
|---|---|---|
| `pooler_output` | `[B, 1024]` | L2-normalized CLS token embedding (use this for retrieval) |
| `last_hidden_state` | `[B, 446, 1024]` | Full sequence output (CLS + 4 registers + 441 patches) |
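If you need spatial features (e.g., for region-level matching or visualization), the patch tokens can be sliced out of `last_hidden_state` following the token layout described above (CLS token first, then 4 register tokens, then the 441 patch tokens); a minimal sketch, assuming that ordering:

```python
# Token layout per the table above: [CLS] + 4 registers + 441 patches.
hidden = output.last_hidden_state      # [B, 446, 1024]
cls_token = hidden[:, 0]               # [B, 1024] (pooler_output is its L2-normalized form)
patch_tokens = hidden[:, 5:]           # [B, 441, 1024]

# Reshape patches to a 21 x 21 feature map (336 / 16 = 21 patches per side).
B, N, D = patch_tokens.shape
feature_map = patch_tokens.reshape(B, 21, 21, D)
```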
```bibtex
@article{gao2026lookbench,
  title={LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval},
  author={Chao Gao and Siqiao Xue and Yimin Peng and Jiwen Fu and Tingyi Gu and Shanshan Li and Fan Zhou},
  journal={arXiv preprint arXiv:2601.14706},
  year={2026},
  url={https://arxiv.org/abs/2601.14706},
}
```
Base model: facebook/dinov3-vit7b16-pretrain-lvd1689m