6 lessons
Reinforcement Learning
Learn from reward signals — the algorithms behind AlphaGo and RLHF
Lessons
- 01 Markov Decision Processes (Medium)
  States, actions, rewards, transitions: the RL contract.
- 02 Q-Learning (Medium)
  Learn a value function from experience, one update at a time.
- 03 Policy Gradients (Hard)
  Optimize the policy directly via gradient ascent on expected reward.
- 04 REINFORCE (Hard)
  The cleanest policy-gradient algorithm, and its variance problem.
- 05 Actor-Critic (Hard)
  Combine policy learning with a value baseline.
- 06 Proximal Policy Optimization (Hard)
  The stable RL algorithm behind RLHF.