Direct Preference Optimization

DPO was proposed as a simpler alternative to RLHF: it optimizes a preference objective directly, without explicit reinforcement learning.

What Happened

In May 2023, the DPO paper introduced a method for aligning language models on preference data without the full RLHF loop of training a separate reward model and then optimizing a policy against it.

Why It Matters

DPO helped popularize “RLHF without RL” framing and contributed to a growing ecosystem of preference-based alignment methods used in open and proprietary model training.

Technical Details

DPO reformulates preference learning as a supervised objective applied directly to the policy's parameters. Its key observation is that, under the standard KL-regularized RLHF objective, the reward is recoverable from the policy itself, so the policy's own log-probabilities (relative to a frozen reference model) can act as an implicit reward inside a Bradley-Terry preference loss. This removes the separate reward-model and RL-optimization stages of multi-stage RLHF pipelines.
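The resulting per-pair loss can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation; the four log-probability inputs (summed over the tokens of the chosen and rejected responses, under the policy and the frozen reference model) and the name `dpo_loss` are assumptions for the sketch:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (chosen preferred over rejected).

    Each argument is a total sequence log-probability; beta scales the
    implicit reward, mirroring the KL penalty strength in RLHF.
    """
    # Implicit rewards: how much the policy has moved away from the
    # reference model on each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry preference loss: negative log-sigmoid of the margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference model, every margin is zero, and the loss is log 2; gradient descent then pushes the policy to raise the chosen response's likelihood relative to the rejected one, with beta controlling how far it may drift from the reference.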