Xenova/quickdraw-small
A text-conditional diffusion model for generating Google QuickDraw-style sketches from text prompts. This model uses DDPM (Denoising Diffusion Probabilistic Models) with CLIP text encoding and classifier-free guidance to generate 64x64 grayscale sketches.
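The DDPM forward process that this model learns to invert can be sketched in a few lines (a minimal illustration, not code from this repo; the linear β-schedule and its endpoints are assumptions):

```python
import torch

# DDPM forward (noising) process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product a_bar_t

def add_noise(x0, t, noise):
    """Noise a clean 64x64 sketch x0 directly to timestep t in one step."""
    a = alpha_bars[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

x0 = torch.rand(1, 1, 64, 64)              # stand-in for a clean grayscale sketch
xt = add_noise(x0, t=999, noise=torch.randn_like(x0))  # near-pure noise at the last step
```

During training the U-Net is given `xt`, the timestep, and the CLIP text embedding, and is asked to predict the added noise; sampling runs this process in reverse.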
This is a U-Net based diffusion model that generates sketches conditioned on text prompts. It uses a CLIP text encoder (`openai/clip-vit-base-patch32`) for text conditioning.

Install the dependencies with:

```bash
pip install torch torchvision transformers diffusers datasets matplotlib pillow tqdm
```
```python
import torch
from model import TextConditionedUNet
from scheduler import SimpleDDPMScheduler
from text_encoder import CLIPTextEncoder

# Load checkpoint
checkpoint_path = "text_diffusion_final_epoch_100.pt"
checkpoint = torch.load(checkpoint_path)

# Initialize model
model = TextConditionedUNet(text_dim=512).cuda()
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Initialize the frozen CLIP text encoder
text_encoder = CLIPTextEncoder(model_name="openai/clip-vit-base-patch32", freeze=True).cuda()
text_encoder.eval()

# Generate samples
scheduler = SimpleDDPMScheduler(1000)
prompt = "a drawing of a cat"
num_samples = 4
guidance_scale = 5.0

with torch.no_grad():
    text_embedding = text_encoder(prompt)
    text_embeddings = text_embedding.repeat(num_samples, 1)  # one copy per sample
    shape = (num_samples, 1, 64, 64)
    samples = scheduler.sample_text(model, shape, text_embeddings, 'cuda', guidance_scale)
```
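To inspect the result, the `samples` tensor can be converted to images. A hypothetical helper (not part of this repo), assuming the samples lie in [-1, 1]:

```python
import torch
from PIL import Image

# Hypothetical helper (not part of this repo): convert a (N, 1, 64, 64)
# tensor, assumed to lie in [-1, 1], into grayscale PIL images.
def to_images(samples):
    # Map [-1, 1] -> [0, 255] as uint8, then drop the channel dimension
    imgs = ((samples.clamp(-1, 1) + 1) / 2 * 255).to(torch.uint8)
    return [Image.fromarray(t.squeeze(0).cpu().numpy(), mode="L") for t in imgs]

images = to_images(torch.randn(4, 1, 64, 64))  # demo with random noise
```

Each returned image can then be saved with `im.save("sample.png")` or displayed with matplotlib.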
```bash
# Generate samples
python generate.py --checkpoint text_diffusion_final_epoch_100.pt \
    --prompt "a drawing of a fire truck" \
    --num-samples 4 \
    --guidance-scale 5.0

# Visualize denoising process
python visualize_generation.py --checkpoint text_diffusion_final_epoch_100.pt \
    --prompt "a drawing of a cat" \
    --num-steps 10
```
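One way the visualization script could capture intermediate states is to snapshot `x_t` at evenly spaced timesteps during the reverse loop (a structural sketch under assumed names, not the repo's `visualize_generation.py`):

```python
import torch

num_steps = 10   # how many intermediate frames to keep
T = 1000         # total diffusion timesteps
snapshots = []

x = torch.randn(1, 1, 64, 64)  # start from pure noise
for t in reversed(range(T)):
    # x = scheduler.step(model_output, t, x)  # one reverse-diffusion step (elided)
    if t % (T // num_steps) == 0:
        snapshots.append(x.clone())  # keep a frame every T // num_steps steps
```

Plotting the snapshots side by side shows the sketch emerging from noise.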
Note: The model is trained on a limited set of QuickDraw classes, so it works best with simple object descriptions in the format "a drawing of a [object]" (e.g. "a drawing of a cat" or "a drawing of a fire truck").
The model supports classifier-free guidance to improve text-image alignment:
- `guidance_scale = 1.0`: No guidance (pure conditional generation)
- `guidance_scale = 3.0-7.0`: Recommended range (default: 5.0)

```
Input: (batch, 1, 64, 64)
├── Down Block 1: 1 → 256 channels
├── Down Block 2: 256 → 512 channels
├── Down Block 3: 512 → 512 channels
├── Middle Block: 512 channels
├── Up Block 3: 1024 → 512 channels (with skip connections)
├── Up Block 2: 768 → 256 channels (with skip connections)
└── Up Block 1: 512 → 1 channel (with skip connections)
Output: (batch, 1, 64, 64) - predicted noise
```
If you use this model, please cite:
```bibtex
@misc{quickdraw-text-diffusion,
  title={Text-Conditional QuickDraw Diffusion Model},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/YOUR_USERNAME/quickdraw-text-diffusion}}
}
```
MIT License