What Happened
In December 2023, Google DeepMind launched Gemini, a family of natively multimodal AI models available in three sizes: Ultra, Pro, and Nano. Unlike GPT-4, which was primarily a text model with image understanding added later, Gemini was designed from the ground up to process and reason across text, images, audio, video, and code. Google reported that Gemini Ultra was the first model to surpass human-expert performance on the MMLU benchmark.
Why It Matters
Gemini represented Google's most serious response to OpenAI's GPT-4 and the ChatGPT phenomenon. It signaled that the AI race was intensifying among major tech companies. The model's native multimodality — being trained on multiple modalities from the start rather than bolting them together — pointed toward the future of AI systems that could seamlessly understand and generate across different types of content.
Technical Details
- Architecture: Multimodal Transformer-based model (architectural details were not publicly disclosed)
- Sizes:
- Ultra: Most capable, for complex tasks
- Pro: Best balance of capability and efficiency
- Nano: Efficient models for on-device use (Nano-1: 1.8B, Nano-2: 3.25B)
- Benchmark results: Gemini Ultra achieved 90.0% on MMLU using chain-of-thought prompting with 32 samples (its 5-shot score was 83.7%), surpassing the human-expert threshold (89.8%) and GPT-4's reported 5-shot score of 86.4%
- Multimodal capabilities: Native processing of text, images, audio, and video in a single model
- Deployment: Integrated into Google products (Bard → Gemini chatbot, Pixel phones, Google Cloud)
- Controversy: Google's initial demo video was criticized for being misleadingly edited to exaggerate real-time multimodal capabilities
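To make the MMLU comparison concrete, the sketch below shows how a few-shot MMLU prompt is typically assembled: several worked multiple-choice examples precede the question being scored, and the model is asked to continue after "Answer:". The helper names and the placeholder questions are illustrative, not from the Gemini evaluation harness.

```python
def format_item(question, choices, answer=None):
    """Render one multiple-choice item in the common MMLU style.

    If `answer` is None, the item ends with a bare "Answer:" line,
    which the model is expected to complete with a letter A-D.
    """
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples, target):
    """Join k solved examples and one unsolved target into a single prompt.

    examples: list of (question, choices, answer) tuples
    target:   (question, choices) tuple with no answer
    """
    shots = [format_item(q, c, a) for q, c, a in examples]
    shots.append(format_item(*target))
    return "\n\n".join(shots)


# Placeholder items standing in for real MMLU questions.
demo_examples = [
    (f"Placeholder question {i}?",
     ["option 1", "option 2", "option 3", "option 4"],
     "A")
    for i in range(5)
]
prompt = build_few_shot_prompt(
    demo_examples,
    ("Target question?", ["w", "x", "y", "z"]),
)
```

With five solved examples this is the "5-shot" setting quoted in the benchmark bullet above; the chain-of-thought variant additionally includes worked reasoning before each answer.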
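The Nano parameter counts translate directly into on-device memory requirements, which depend on numeric precision. A back-of-envelope sketch, using the published parameter counts; the precision options are illustrative (the deployed Nano models were reported to use 4-bit quantization):

```python
def weight_memory_gb(params, bits_per_param):
    """Approximate weight storage in GB: params * bits / 8 bits-per-byte / 1e9."""
    return params * bits_per_param / 8 / 1e9


# Parameter counts from the Gemini launch materials.
nano_models = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

for name, params in nano_models.items():
    for bits in (16, 8, 4):  # fp16, int8, int4
        print(f"{name} @ {bits}-bit: {weight_memory_gb(params, bits):.2f} GB")
```

At 4 bits per weight, Nano-1 needs roughly 0.9 GB and Nano-2 roughly 1.6 GB for weights alone, which is why sub-4B models were the practical choice for phones like the Pixel.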