What Happened
In January 2021, OpenAI published CLIP, a model trained on 400 million (image, text) pairs that learns aligned image and text representations, enabling zero-shot image classification via natural-language prompts.
Why It Matters
CLIP helped normalize a “language as interface” paradigm for vision tasks and contributed to the rapid growth of multimodal systems that combine text and images.
Technical Details
CLIP uses a contrastive objective that pulls matched image-text pairs together and pushes mismatched pairs apart in a shared embedding space. At inference time, an image is classified by comparing its embedding against embeddings of text prompts (e.g. "a photo of a dog"), so retrieval and classification work via cosine similarity without task-specific retraining.
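The zero-shot step can be sketched as follows. This is a minimal illustration, not CLIP's actual code: the embeddings below are random stand-ins for the outputs of CLIP's image and text encoders, and the temperature value of 100 is the assumed scale from the paper's released logit scale.

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so dot products equal cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings: in CLIP these come from an image encoder
# and a text encoder; here they are random placeholders.
rng = np.random.default_rng(0)
image_emb = normalize(rng.normal(size=(1, 512)))   # one image
text_emb = normalize(rng.normal(size=(3, 512)))    # one prompt per class

# Zero-shot classification: temperature-scaled cosine similarity
# between the image and each class prompt, softmaxed into scores.
logits = 100.0 * image_emb @ text_emb.T
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
pred = int(probs.argmax())   # index of the best-matching class prompt
```

Because the text prompts define the label set at inference time, swapping in a new list of prompts re-targets the classifier to a new task with no retraining.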