What Happened
In May 2023, Rafailov et al. released the Direct Preference Optimization (DPO) paper, which introduced a method for aligning language models directly on preference data, without the full RLHF loop of training a separate reward model and then optimizing a policy against it with reinforcement learning.
Why It Matters
DPO helped popularize the "RLHF without RL" framing and contributed to a growing ecosystem of preference-based alignment methods used in both open and proprietary model training.
Technical Details
DPO reformulates preference learning as a supervised classification objective applied directly to the policy's parameters. Under a Bradley-Terry preference model, the KL-regularized RLHF objective admits a closed-form optimal policy, which lets the implicit reward be written as a scaled log-probability ratio between the policy and a frozen reference model. Training then amounts to increasing that ratio on preferred responses relative to rejected ones, eliminating the separate reward-model and RL stages and reducing pipeline complexity compared to multi-stage RLHF setups.
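The objective above can be sketched as a per-example loss. This is a minimal illustration, not the paper's reference implementation: it assumes each response is summarized by a scalar sequence log-probability, and the function and variable names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the total log-probability of a response under either the
    trainable policy or the frozen reference model; beta scales the implicit reward.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)) = log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# When policy and reference agree exactly, the loss is log(2) ~ 0.693;
# it shrinks as the policy widens the gap in favor of the chosen response.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

In practice the log-probabilities would come from summing per-token logits over each response, and the loss would be averaged over a batch; the scalar form here just makes the gradient direction easy to see.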