What Happened
Google AI Language researchers published "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." BERT (Bidirectional Encoder Representations from Transformers) introduced a new approach to pre-training language representations by jointly conditioning on both left and right context in all layers.
Why It Matters
BERT set new state-of-the-art results on 11 NLP benchmarks and fundamentally changed how the field approached language understanding tasks. It demonstrated that bidirectional pre-training was significantly more effective than left-to-right approaches such as the original GPT. Google integrated BERT into its search engine in 2019, where it affected an estimated 10% of English-language queries. BERT quickly became one of the most widely used and cited models in both NLP research and industry applications.
Technical Details
- Architecture: Transformer encoder (BERT-Base: 12 layers, 110M params; BERT-Large: 24 layers, 340M params)
- Pre-training objectives:
  - Masked Language Model (MLM): Randomly masks 15% of input tokens and predicts them, enabling bidirectional context
  - Next Sentence Prediction (NSP): Binary classification of whether two sentences follow each other
- Training data: BooksCorpus + English Wikipedia (~3.3B words)
- Key insight: Unlike GPT's left-to-right decoder, BERT's encoder attends to the full input at every layer, so each token's representation is conditioned on both its left and right context
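The MLM objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it works on word strings instead of WordPiece IDs, and `mlm_mask`, `MASK`, and the toy `VOCAB` are names invented here. It does follow the paper's corruption recipe: of the ~15% of positions selected as prediction targets, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mlm_mask(tokens, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption (sketch): pick ~15% of positions as
    prediction targets; of those, 80% -> [MASK], 10% -> random token,
    10% -> kept unchanged. Returns (corrupted tokens, {pos: original})."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: token stays as-is but remains a prediction target

    return corrupted, targets
```

Note that every unmasked position in the corrupted sequence stays visible to the model, so the encoder can use context on both sides of each `[MASK]` when predicting it; a left-to-right model would only see the tokens before it.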