UCLA ECE RLHF Reading Group
A group of UCLA ECE grad students studying Reinforcement Learning from Human Feedback by Nathan Lambert.
Goal: Understand LLM post-training deeply enough to do LLM post-training research.
- Website: shreyasrajesh0308.github.io/RLHF-Reading-Group
- Book: rlhfbook.com | Code: natolambert/rlhf-book | Reference runs: wandb.ai/natolambert/rlhf-book
- When: Thursdays, 5:00 PM | Where: One of the rooms in E4, 5th or 6th floor
Schedule
| Week | Date | Chapter(s) | Topic | Exercise | Presenter |
|---|---|---|---|---|---|
| 1 | Feb 12 @Faraday | Ch 1-3 | Introduction, history, training overview | Setup: clone repo, uv sync, run one training job | Shreyas |
| 2 | Feb 20 @Maxwell | Ch 6 (fundamentals) | RL fundamentals & policy gradients | Explore policy_gradients/ code structure | Rushabha |
| 3 | Feb 26 @Faraday | Ch 4-5 | Instruction tuning + Reward models | Train ORM vs PRM, discuss tradeoffs | Merve |
| 4 | Mar 5 @Tesla | Ch 5 cont’d | Reward models contd. — discussion + hands-on | Train ORM vs PRM, discuss tradeoffs | Merve |
| 5 | Mar 12 | Ch 6 cont'd | Policy gradients (REINFORCE, RLOO, PPO, GRPO) | Implement REINFORCE loss by hand, run RLOO | TBD |
| 6 | Mar 19 | Ch 7 | Reasoning & inference-time scaling | RLVR experiments | TBD |
| 7 | Mar 26 | Ch 8 | DPO, IPO | Implement DPO loss by hand, run IPO | TBD |
| 8 | Apr 2 | Ch 8 cont'd | SimPO, KTO, ORPO | Compare reference-free methods | TBD |
| 9 | Apr 9 | Paper | DeepSeek R1 (Guo et al. 2025) — reasoning via RL | Discuss R1-Zero emergence, GRPO at scale | TBD |
| 10 | Apr 16 | Paper | TBD — frontier paper (OLMo 2, or most relevant release at the time) | Full pipeline deep-dive | TBD |
| 11 | Apr 23 | Ch 9-10 | Rejection sampling, nature of preferences | Discussion-heavy week | TBD |
| 12+ | Apr 30+ | Ch 11-17 / papers | Advanced topics + research directions | Pick a question, design an experiment | TBD |
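The week 5 exercise asks for REINFORCE implemented by hand. As a warm-up before touching the book's `policy_gradients/` code, here is a minimal, framework-free sketch of the vanilla REINFORCE loss for one sampled completion (function and argument names are illustrative, not from the book repo):

```python
import math

def reinforce_loss(token_logprobs, reward, baseline=0.0):
    """Vanilla REINFORCE loss for a single sampled completion.

    token_logprobs: log-probabilities of the tokens the policy sampled
    reward: scalar reward for the whole completion
    baseline: optional variance-reduction term (e.g. batch mean reward)
    """
    advantage = reward - baseline
    # Policy gradient: minimize -(sum of log-probs) * advantage,
    # which pushes up the probability of high-reward completions.
    return -sum(token_logprobs) * advantage

# Toy example: a two-token completion with reward 1.0
loss = reinforce_loss([math.log(0.5), math.log(0.25)], reward=1.0)
print(loss)  # = -ln(0.125) * 1.0 ≈ 2.0794
```

In a real training loop the log-probs come from the model and the gradient flows through them; this scalar version is just for checking the sign conventions before moving to RLOO.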
Setup
- Fork this repo for your own implementations and experiments
- Clone the upstream book code and install dependencies:

  ```shell
  git clone https://github.com/natolambert/rlhf-book.git
  cd rlhf-book/code
  uv sync
  ```

- Verify it works:

  ```shell
  uv run python -m reward_models.train_orm --samples 100 --epochs 1
  ```
See CLAUDE.md for full architecture notes and development commands.
Format
- Weekly, ~1.5-2 hours
- One person presents the chapter (~20 min summary + 2-3 discussion questions)
- Everyone reads beforehand
- Second half is hands-on: run or modify code together
- Implement core algorithms by hand: it's important to understand them well before handing them over to agents
- All implementations go in your personal fork — compare approaches during meetings
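As an example of the "by hand" spirit, the week 7 DPO exercise reduces to a few lines once the sequence log-probabilities are in hand. A minimal sketch (names and the toy numbers are illustrative assumptions, not from the book repo):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    All four inputs are total sequence log-probs: the policy's and the
    frozen reference model's, for the chosen and rejected responses.
    Loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(m)) written as softplus(-m) for numerical clarity
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0
# and the loss starts at log(2) ≈ 0.693.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))

# If the policy already favors the chosen response more than the
# reference does, the margin is positive and the loss drops below log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Comparing this scalar version against the repo's batched implementation is a quick sanity check before running the full IPO comparison.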
Meeting Notes
- Week 1 — Feb 12: Introduction & Setup
- Week 2 — Feb 20: RL Fundamentals & Policy Gradients
- Week 3 — Feb 26: Instruction Tuning & Reward Models
- Week 4 — Mar 5: Reward Models cont’d
Papers & Resources
Papers referenced in discussion, supplementary reading, and useful blog posts.
| Paper / Resource | Relevant Chapter | Added By |
|---|---|---|
Research Questions
When something sparks a “what if…” — write it here. These become experiment ideas.
-
Week 1 Goals
- Everyone: fork repo, clone book code, install dependencies, confirm training runs
- Read chapters 1-3
- Assign Ch 4+5 presenter for week 2
Join Us
Interested in joining? Email us at shreyasrajesh38@ucla.edu.