What Happened
In December 2023, Google DeepMind launched Gemini, a family of natively multimodal AI models available in three sizes: Ultra, Pro, and Nano. Unlike GPT-4, which was primarily a text model with image understanding added later, Gemini was designed from the ground up to process and reason across text, images, audio, video, and code. Google reported that Gemini Ultra was the first model to surpass human-expert performance on the MMLU benchmark.
Why It Matters
Gemini represented Google's most serious response to OpenAI's GPT-4 and the ChatGPT phenomenon. It signaled that the AI race was intensifying among major tech companies. The model's native multimodality — being trained on multiple modalities from the start rather than bolting them together — pointed toward the future of AI systems that could seamlessly understand and generate across different types of content.
Technical Details
- Architecture: Multimodal Transformer-based model (architectural details were not publicly disclosed)
- Sizes:
- Ultra: Most capable, for complex tasks
- Pro: Best balance of capability and efficiency
- Nano: Efficient models for on-device use (Nano-1: 1.8B, Nano-2: 3.25B)
- Benchmark results: Gemini Ultra achieved 90.0% on MMLU using chain-of-thought prompting with 32 samples (its 5-shot score was 83.7%), surpassing the human-expert threshold (89.8%) and GPT-4's reported 5-shot score of 86.4%
- Multimodal capabilities: Native processing of text, images, audio, and video in a single model
- Deployment: Integrated into Google products (Bard → Gemini chatbot, Pixel phones, Google Cloud)
- Controversy: Google's initial demo video was criticized for being misleadingly edited to exaggerate real-time multimodal capabilities
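To make the MMLU comparison concrete, the sketch below shows how a few-shot MMLU prompt is typically assembled: several worked multiple-choice examples precede the question being scored, and the model is asked to continue after "Answer:". The helper names and the placeholder questions are illustrative, not from the Gemini evaluation harness.

```python
def format_item(question, choices, answer=None):
    """Render one multiple-choice item in the common MMLU style.

    If `answer` is None, the item ends with a bare "Answer:" line,
    which the model is expected to complete with a letter A-D.
    """
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples, target):
    """Join k solved examples and one unsolved target into a single prompt.

    examples: list of (question, choices, answer) tuples
    target:   (question, choices) tuple with no answer
    """
    shots = [format_item(q, c, a) for q, c, a in examples]
    shots.append(format_item(*target))
    return "\n\n".join(shots)


# Placeholder items standing in for real MMLU questions.
demo_examples = [
    (f"Placeholder question {i}?",
     ["option 1", "option 2", "option 3", "option 4"],
     "A")
    for i in range(5)
]
prompt = build_few_shot_prompt(
    demo_examples,
    ("Target question?", ["w", "x", "y", "z"]),
)
```

With five solved examples this is the "5-shot" setting quoted in the benchmark bullet above; the chain-of-thought variant additionally includes worked reasoning before each answer.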
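The Nano parameter counts translate directly into on-device memory requirements, which depend on numeric precision. A back-of-envelope sketch, using the published parameter counts; the precision options are illustrative (the deployed Nano models were reported to use 4-bit quantization):

```python
def weight_memory_gb(params, bits_per_param):
    """Approximate weight storage in GB: params * bits / 8 bits-per-byte / 1e9."""
    return params * bits_per_param / 8 / 1e9


# Parameter counts from the Gemini launch materials.
nano_models = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

for name, params in nano_models.items():
    for bits in (16, 8, 4):  # fp16, int8, int4
        print(f"{name} @ {bits}-bit: {weight_memory_gb(params, bits):.2f} GB")
```

At 4 bits per weight, Nano-1 needs roughly 0.9 GB and Nano-2 roughly 1.6 GB for weights alone, which is why sub-4B models were the practical choice for phones like the Pixel.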