What Happened
OpenAI introduced DALL·E, a 12-billion-parameter version of GPT-3 trained to generate images from text captions. Its name is a portmanteau of Salvador Dalí and Pixar's WALL·E. The model could create plausible images from natural-language descriptions, including surreal combinations like "an armchair in the shape of an avocado."
Why It Matters
DALL·E demonstrated that the same autoregressive Transformer approach used for text could extend to image generation, proving that large language models could bridge modalities. It captured public imagination and sparked intense interest in text-to-image AI, paving the way for DALL·E 2, Midjourney, Stable Diffusion, and the broader generative AI art movement.
Technical Details
- Architecture: GPT-3-style autoregressive Transformer (12B parameters)
- Approach: Treats text and image tokens as a single stream — text is encoded as BPE tokens, images are encoded as discrete tokens via a dVAE (discrete variational autoencoder)
- Training: Trained on 250 million text-image pairs from the internet
- Capabilities: Could generate images from complex prompts, combine unrelated concepts, render text, apply transformations, and generate multiple variations
- Limitations: Output was limited to 256×256 resolution, and image quality varied significantly from prompt to prompt
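The single-stream idea above can be sketched in a few lines. This is an illustrative toy, not OpenAI's code: the constants (a 16,384-entry BPE text vocabulary, an 8,192-entry dVAE codebook, 256 text tokens, and a 32×32 grid of image tokens) reflect the DALL·E paper's reported setup, and the pad-token id and function name are assumptions made for the example.

```python
# Illustrative sketch of DALL·E-style input construction (not OpenAI's code).
# Text and image tokens are concatenated into one sequence so a single
# autoregressive Transformer can model both modalities.

TEXT_VOCAB = 16384   # BPE text vocabulary size (per the paper)
IMAGE_VOCAB = 8192   # dVAE codebook size
MAX_TEXT_LEN = 256   # text is padded/truncated to this many tokens
IMAGE_GRID = 32      # dVAE compresses a 256x256 image to 32x32 tokens

def build_stream(text_tokens, image_tokens):
    """Concatenate BPE text tokens and dVAE image tokens into one stream.

    Image token ids are offset by TEXT_VOCAB so both modalities can share
    a single embedding table without id collisions.
    """
    assert len(text_tokens) <= MAX_TEXT_LEN
    assert len(image_tokens) == IMAGE_GRID * IMAGE_GRID
    assert all(t < IMAGE_VOCAB for t in image_tokens)
    # Pad text to a fixed length (pad id 0 here, chosen for illustration).
    padded = text_tokens + [0] * (MAX_TEXT_LEN - len(text_tokens))
    return padded + [TEXT_VOCAB + t for t in image_tokens]

# A caption of three token ids plus a uniform dummy image grid:
stream = build_stream([17, 42, 9], [5] * (32 * 32))
# Training is standard next-token prediction over the combined stream:
inputs, targets = stream[:-1], stream[1:]
```

Because the whole sequence is modeled left to right, generation conditions the image tokens on the caption: sample the 1,024 image positions one at a time, then decode them back to pixels with the dVAE decoder.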