6 lessons
Fine-Tuning & RLHF
From a base model to an aligned, instruction-following assistant
Lessons
- 01 Supervised Fine-Tuning (Medium)
  Turn a base model into an instruction-follower.
- 02 LoRA (Hard)
  Low-rank adapters — fine-tune 0.1% of the parameters.
- 03 QLoRA (Hard)
  LoRA on 4-bit weights — fine-tune a 70B on a single GPU.
- 04 Reward Modeling (Hard)
  Train a preference model from human pairwise comparisons.
- 05 PPO for RLHF (Hard)
  Policy optimization against a learned reward model.
- 06 Direct Preference Optimization (Hard)
  RLHF without a separate reward model — the elegant alternative.
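The "fine-tune 0.1% of the parameters" claim in the LoRA lesson comes from the low-rank structure of the adapter: the frozen weight W is augmented by a product B·A of two thin matrices, and only A and B are trained. A minimal NumPy sketch, with hypothetical layer sizes and rank (not taken from the course material):

```python
import numpy as np

d_in, d_out, r = 4096, 4096, 8  # hypothetical layer width and LoRA rank

W = np.random.randn(d_out, d_in) * 0.02  # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.02      # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init

def lora_forward(x, alpha=16.0):
    # Base path plus scaled low-rank update; in training,
    # only A and B would receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(d_in)
y = lora_forward(x)  # equals W @ x at init, since B is zero

trainable = A.size + B.size
frozen = W.size
print(f"trainable fraction for this layer: {trainable / frozen:.4%}")
```

For this single square layer the trainable fraction is r·(d_in + d_out)/(d_in·d_out) ≈ 0.39%; across a full model, where embeddings and most layers carry no adapters, the overall fraction drops toward the ~0.1% figure the lesson quotes.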