What Happened
In January 2021, OpenAI published CLIP, a model trained on 400 million (image, text) pairs that learns aligned image and text representations, enabling zero-shot image classification via natural-language prompts.
Why It Matters
CLIP helped normalize a “language as interface” paradigm for vision tasks and contributed to the rapid growth of multimodal systems that combine text and images.
Technical Details
CLIP uses a contrastive objective that pulls matched image-text pairs together and pushes mismatched pairs apart in a shared embedding space. At inference time, an image is classified by comparing its embedding against embeddings of text prompts (e.g. "a photo of a dog"), so retrieval and classification work via cosine similarity without task-specific retraining.
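The zero-shot step can be sketched as follows. This is a minimal illustration, not CLIP's actual code: the embeddings below are random stand-ins for the outputs of CLIP's image and text encoders, and the temperature value of 100 is the assumed scale from the paper's released logit scale.

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so dot products equal cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings: in CLIP these come from an image encoder
# and a text encoder; here they are random placeholders.
rng = np.random.default_rng(0)
image_emb = normalize(rng.normal(size=(1, 512)))   # one image
text_emb = normalize(rng.normal(size=(3, 512)))    # one prompt per class

# Zero-shot classification: temperature-scaled cosine similarity
# between the image and each class prompt, softmaxed into scores.
logits = 100.0 * image_emb @ text_emb.T
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
pred = int(probs.argmax())   # index of the best-matching class prompt
```

Because the text prompts define the label set at inference time, swapping in a new list of prompts re-targets the classifier to a new task with no retraining.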