RLHF
A training method that uses human feedback to teach an AI model how to behave.
What it actually means
RLHF stands for Reinforcement Learning from Human Feedback. After an AI model is first trained on text, human raters rank its responses from best to worst. Those rankings are used to train a separate reward model that predicts which responses people prefer, and the AI is then fine-tuned with reinforcement learning to produce responses the reward model scores highly.
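To make the ranking step concrete, here is a minimal sketch of how a reward model can learn from ranked pairs of responses. It assumes PyTorch, uses random vectors as stand-ins for real response embeddings, and the `RewardModel` class and the Bradley-Terry-style pairwise loss are illustrative choices, not any particular system's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy reward model: maps a response embedding to a single score.
class RewardModel(nn.Module):
    def __init__(self, embedding_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(embedding_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in data: each row pairs a response raters preferred with one they
# rejected. A real pipeline would use embeddings of actual model responses.
preferred = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for _ in range(100):
    # Bradley-Terry-style pairwise loss: push the preferred response's
    # score above the rejected response's score.
    loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In real systems the reward model is typically a large language model itself, and the final fine-tuning step usually uses a reinforcement learning algorithm such as PPO rather than this bare training loop.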
Real-world analogy
Imagine training a dog with treats. Every time it does something you like, you reward it. Over thousands of repetitions, it learns what behaviour earns approval. RLHF does the same — except the dog is a language model and the treats are positive signals from human raters.
Common misconception
RLHF doesn't make an AI model "understand" what is good. It makes the model produce outputs that resemble what human raters scored highly. If the raters have biases, the model learns those biases too.