6 lessons

Reinforcement Learning

Learn from reward signals — the algorithms behind AlphaGo and RLHF

Lessons

01
Markov Decision Processes
States, actions, rewards, transitions — the RL contract.
Medium
02
Q-Learning
Learn a value function from experience, one update at a time.
Medium
03
Policy Gradients
Optimize the policy directly via gradient ascent on expected reward.
Hard
04
REINFORCE
The simplest policy-gradient algorithm, and its high-variance problem.
Hard
05
Actor-Critic
Combine policy learning with a value baseline.
Hard
06
Proximal Policy Optimization
A stable policy-gradient algorithm widely used for RLHF.
Hard