UCLA ECE RLHF Reading Group
A group of UCLA ECE grad students studying Reinforcement Learning from Human Feedback by Nathan Lambert.
Goal: Understand LLM post-training deeply enough to do LLM post-training research.
- Website: shreyasrajesh0308.github.io/RLHF-Reading-Group
- Book: rlhfbook.com | Code: natolambert/rlhf-book | Reference runs: wandb.ai/natolambert/rlhf-book
- When: Thursdays, 5:00 PM | Where: One of the rooms in E4, 5th or 6th floor
Schedule
| Week | Date | Chapter(s) | Topic | Exercise | Presenter |
|---|---|---|---|---|---|
| 1 | Feb 12 @Faraday | Ch 1-3 | Introduction, history, training overview | Setup: clone repo, uv sync, run one training job | Shreyas |
| 2 | Feb 20 @Maxwell | Ch 6 (fundamentals) | RL fundamentals & policy gradients | Explore policy_gradients/ code structure | Rushabha |
| 3 | Feb 26 @Faraday | Ch 4-5 | Instruction tuning + Reward models | Train ORM vs PRM, discuss tradeoffs | Merve |
| 4 | Mar 5 @Tesla | Ch 5 cont’d | Reward models contd. — discussion + hands-on | Train ORM vs PRM, discuss tradeoffs | Merve |
| 5 | Mar 12 | Ch 6 cont'd | Policy gradients (REINFORCE, RLOO, PPO, GRPO) | Implement REINFORCE loss by hand, run RLOO | TBD |
| 6 | Mar 19 | Ch 7 | Reasoning & inference-time scaling | RLVR experiments | TBD |
| 7 | Mar 26 | Ch 8 | DPO, IPO | Implement DPO loss by hand, run IPO | TBD |
| 8 | Apr 2 | Ch 8 cont'd | SimPO, KTO, ORPO | Compare reference-free methods | TBD |
| 9 | Apr 9 | Paper | DeepSeek R1 (Guo et al. 2025) — reasoning via RL | Discuss R1-Zero emergence, GRPO at scale | TBD |
| 10 | Apr 16 | Paper | TBD — frontier paper (OLMo 2, or most relevant release at the time) | Full pipeline deep-dive | TBD |
| 11 | Apr 23 | Ch 9-10 | Rejection sampling, nature of preferences | Discussion-heavy week | TBD |
| 12+ | Apr 30+ | Ch 11-17 / papers | Advanced topics + research directions | Pick a question, design an experiment | TBD |
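The week 5 exercise asks for REINFORCE implemented by hand. As a warm-up before touching the book's `policy_gradients/` code, here is a minimal, framework-free sketch of the vanilla REINFORCE loss for one sampled completion (function and argument names are illustrative, not from the book repo):

```python
import math

def reinforce_loss(token_logprobs, reward, baseline=0.0):
    """Vanilla REINFORCE loss for a single sampled completion.

    token_logprobs: log-probabilities of the tokens the policy sampled
    reward: scalar reward for the whole completion
    baseline: optional variance-reduction term (e.g. batch mean reward)
    """
    advantage = reward - baseline
    # Policy gradient: minimize -(sum of log-probs) * advantage,
    # which pushes up the probability of high-reward completions.
    return -sum(token_logprobs) * advantage

# Toy example: a two-token completion with reward 1.0
loss = reinforce_loss([math.log(0.5), math.log(0.25)], reward=1.0)
print(loss)  # = -ln(0.125) * 1.0 ≈ 2.0794
```

In a real training loop the log-probs come from the model and the gradient flows through them; this scalar version is just for checking the sign conventions before moving to RLOO.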
Setup
- Fork this repo for your own implementations and experiments
- Clone the upstream book code and install dependencies:

  ```shell
  git clone https://github.com/natolambert/rlhf-book.git
  cd rlhf-book/code
  uv sync
  ```

- Verify it works:

  ```shell
  uv run python -m reward_models.train_orm --samples 100 --epochs 1
  ```
See CLAUDE.md for full architecture notes and development commands.
Format
- Weekly, ~1.5-2 hours
- One person presents the chapter (~20 min summary + 2-3 discussion questions)
- Everyone reads beforehand
- Second half is hands-on: run or modify code together
- Implement core algorithms by hand: it's important to understand them well before handing them over to agents
- All implementations go in your personal fork — compare approaches during meetings
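As an example of the "by hand" spirit, the week 7 DPO exercise reduces to a few lines once the sequence log-probabilities are in hand. A minimal sketch (names and the toy numbers are illustrative assumptions, not from the book repo):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    All four inputs are total sequence log-probs: the policy's and the
    frozen reference model's, for the chosen and rejected responses.
    Loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(m)) written as softplus(-m) for numerical clarity
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0
# and the loss starts at log(2) ≈ 0.693.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))

# If the policy already favors the chosen response more than the
# reference does, the margin is positive and the loss drops below log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Comparing this scalar version against the repo's batched implementation is a quick sanity check before running the full IPO comparison.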
Meeting Notes
- Week 1 — Feb 12: Introduction & Setup
- Week 2 — Feb 20: RL Fundamentals & Policy Gradients
- Week 3 — Feb 26: Instruction Tuning & Reward Models
- Week 4 — Mar 5: Reward Models cont’d
Papers & Resources
Papers referenced in discussion, supplementary reading, and useful blog posts.
| Paper / Resource | Relevant Chapter | Added By |
|---|---|---|
Research Questions
When something sparks a “what if…” — write it here. These become experiment ideas.
-
Week 1 Goals
- Everyone: fork repo, clone book code, install dependencies, confirm training runs
- Read chapters 1-3
- Assign Ch 4+5 presenter for week 2
Join Us
Interested in joining? Email us at shreyasrajesh38@ucla.edu.