What Happened
In March 2022, OpenAI published "Training language models to follow instructions with human feedback" (the InstructGPT paper), outlining a practical pipeline that combines supervised fine-tuning on human demonstrations with reinforcement learning from human feedback (RLHF).
Why It Matters
RLHF-style alignment became a standard approach for making large language models more usable in products, influencing instruction tuning, preference optimization, and evaluation practices across the industry.
Technical Details
The pipeline has three stages: supervised fine-tuning (SFT) on human-written demonstrations, training a reward model on pairwise preference comparisons between model outputs, and optimizing the policy with reinforcement learning (PPO in the paper) to maximize the learned reward, typically with a KL penalty that keeps the policy close to the SFT model.
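The two learned objectives in that pipeline can be sketched compactly. Below is a minimal, illustrative Python sketch (not the paper's implementation): a pairwise Bradley-Terry loss of the kind commonly used for reward-model training, and a per-sample KL-penalized reward of the kind used in the RL stage; the function names and the `beta` coefficient are hypothetical.

```python
import math


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to score the human-preferred
    response above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def penalized_reward(reward: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.1) -> float:
    """KL-penalized reward used as the RL optimization target:
    r - beta * (log pi(y|x) - log pi_ref(y|x)).
    The penalty (a sample-based KL estimate) discourages the policy from
    drifting far from the SFT reference model; beta is a hypothetical
    coefficient chosen for illustration."""
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate
```

When the two candidate scores are equal, the loss is log 2 (the model is indifferent); a larger margin in favor of the chosen response drives it toward zero, which is what gradient descent on preference pairs exploits.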