What Happened
OpenAI introduced DALL·E, a 12-billion-parameter version of GPT-3 trained to generate images from text captions. Its name is a portmanteau of Salvador Dalí and Pixar's WALL·E. The model could create plausible images from natural-language descriptions, including surreal combinations like "an armchair in the shape of an avocado."
Why It Matters
DALL·E demonstrated that the same autoregressive Transformer approach used for text could extend to image generation, proving that large language models could bridge modalities. It captured public imagination and sparked intense interest in text-to-image AI, paving the way for DALL·E 2, Midjourney, Stable Diffusion, and the broader generative AI art movement.
Technical Details
- Architecture: GPT-3-style autoregressive Transformer (12B parameters)
- Approach: Treats text and image tokens as a single stream — text is encoded as BPE tokens, images are encoded as discrete tokens via a dVAE (discrete variational autoencoder)
- Training: Trained on 250 million text-image pairs from the internet
- Capabilities: Could generate images from complex prompts, combine unrelated concepts, render text, apply transformations, and generate multiple variations
- Limitations: Output was limited to 256×256 resolution, and image quality varied significantly from prompt to prompt
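The single-stream idea above can be sketched in a few lines. This is an illustrative toy, not OpenAI's code: the constants (a 16,384-entry BPE text vocabulary, an 8,192-entry dVAE codebook, 256 text tokens, and a 32×32 grid of image tokens) reflect the DALL·E paper's reported setup, and the pad-token id and function name are assumptions made for the example.

```python
# Illustrative sketch of DALL·E-style input construction (not OpenAI's code).
# Text and image tokens are concatenated into one sequence so a single
# autoregressive Transformer can model both modalities.

TEXT_VOCAB = 16384   # BPE text vocabulary size (per the paper)
IMAGE_VOCAB = 8192   # dVAE codebook size
MAX_TEXT_LEN = 256   # text is padded/truncated to this many tokens
IMAGE_GRID = 32      # dVAE compresses a 256x256 image to 32x32 tokens

def build_stream(text_tokens, image_tokens):
    """Concatenate BPE text tokens and dVAE image tokens into one stream.

    Image token ids are offset by TEXT_VOCAB so both modalities can share
    a single embedding table without id collisions.
    """
    assert len(text_tokens) <= MAX_TEXT_LEN
    assert len(image_tokens) == IMAGE_GRID * IMAGE_GRID
    assert all(t < IMAGE_VOCAB for t in image_tokens)
    # Pad text to a fixed length (pad id 0 here, chosen for illustration).
    padded = text_tokens + [0] * (MAX_TEXT_LEN - len(text_tokens))
    return padded + [TEXT_VOCAB + t for t in image_tokens]

# A caption of three token ids plus a uniform dummy image grid:
stream = build_stream([17, 42, 9], [5] * (32 * 32))
# Training is standard next-token prediction over the combined stream:
inputs, targets = stream[:-1], stream[1:]
```

Because the whole sequence is modeled left to right, generation conditions the image tokens on the caption: sample the 1,024 image positions one at a time, then decode them back to pixels with the dVAE decoder.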