What Happened
In May 2023, Rafailov et al. released the Direct Preference Optimization (DPO) paper, which introduced a method for aligning language models directly on preference data, without the full RLHF loop of training a separate reward model and then optimizing a policy against it with reinforcement learning.
Why It Matters
DPO helped popularize the "RLHF without RL" framing and contributed to a growing ecosystem of preference-based alignment methods used in both open and proprietary model training.
Technical Details
DPO reformulates preference learning as a supervised classification objective applied directly to the policy's parameters. Under a Bradley-Terry preference model, the KL-regularized RLHF objective admits a closed-form optimal policy, which lets the implicit reward be written as a scaled log-probability ratio between the policy and a frozen reference model. Training then amounts to increasing that ratio on preferred responses relative to rejected ones, eliminating the separate reward-model and RL stages and reducing pipeline complexity compared to multi-stage RLHF setups.
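The objective above can be sketched as a per-example loss. This is a minimal illustration, not the paper's reference implementation: it assumes each response is summarized by a scalar sequence log-probability, and the function and variable names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the total log-probability of a response under either the
    trainable policy or the frozen reference model; beta scales the implicit reward.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)) = log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# When policy and reference agree exactly, the loss is log(2) ~ 0.693;
# it shrinks as the policy widens the gap in favor of the chosen response.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

In practice the log-probabilities would come from summing per-token logits over each response, and the loss would be averaged over a batch; the scalar form here just makes the gradient direction easy to see.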