# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Context
This is a reading group repository for studying Nathan Lambert’s Reinforcement Learning from Human Feedback (RLHF) book. The goal is to deeply understand LLM post-training methods — reward modeling, policy gradient RL, and direct alignment — well enough to do research.
The upstream code lives at natolambert/rlhf-book. The book is at rlhfbook.com.
## Upstream Repository Setup
To work with the book’s code examples:
```bash
git clone https://github.com/natolambert/rlhf-book.git
cd rlhf-book/code
uv sync
```
Always use `uv run python` instead of bare `python` to ensure the correct virtual environment:
```bash
uv run python -m policy_gradients.train --config policy_gradients/configs/grpo.yaml
uv run python -m direct_alignment.train --config direct_alignment/configs/dpo.yaml
uv run python -m reward_models.train_orm --samples 400
```
Requires Python 3.12+. Flash Attention is installed by default on x86_64; ARM64 systems fall back to PyTorch SDPA automatically.
## Code Architecture (upstream `code/` directory)
Three independent packages, each corresponding to a book chapter:
### `policy_gradients/` — Chapter 6: Policy Gradient Methods
- Algorithms: REINFORCE, RLOO, PPO, GRPO, Dr. GRPO, GSPO, CISPO
- Config: Pydantic `BaseModel` in `config.py`, loaded from YAML files in `configs/`
- Core abstractions: `Experience` dataclass (trajectory data), `ReplayBuffer` (collection), loss classes in `loss.py` (all `nn.Module` with `forward(log_probs, experience)`)
- Training: `train.py` — rollout phase (generate completions, compute rewards via `reasoning_gym` procedural tasks) → training phase (policy gradient updates with gradient accumulation)
- Entry: `uv run python -m policy_gradients.train --config configs/<algo>.yaml`
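For orientation, here is a minimal sketch of that loss-class pattern. The `Experience` field names (`advantages`, `action_mask`) and the `ReinforceLoss` class are illustrative assumptions, not the upstream definitions in `loss.py`:

```python
# Hypothetical sketch of the policy-gradient loss interface; field and class
# names are assumed for illustration and may not match the upstream code.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class Experience:
    """Illustrative trajectory container (assumed fields)."""
    advantages: torch.Tensor   # (batch, seq_len) per-token advantage estimates
    action_mask: torch.Tensor  # (batch, seq_len) 1 for generated tokens, 0 otherwise


class ReinforceLoss(nn.Module):
    """Plain REINFORCE objective: minimize negative advantage-weighted log-probs."""

    def forward(self, log_probs: torch.Tensor, experience: Experience) -> torch.Tensor:
        # Weight each generated token's log-prob by its advantage, mask out
        # prompt/padding positions, and average over valid tokens.
        per_token = -experience.advantages * log_probs * experience.action_mask
        return per_token.sum() / experience.action_mask.sum().clamp(min=1)
```

PPO/GRPO-style losses keep the same interface but add importance ratios and clipping inside `forward`.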
### `direct_alignment/` — Chapter 8: Direct Alignment (DPO family)
- Algorithms: DPO, cDPO, IPO, SimPO, ORPO, KTO
- Config: Python
@dataclassinconfig.pywith loss-specific defaults in__post_init__ - Core abstractions:
PreferenceBatchdataclass (chosen/rejected pairs), loss classes inloss.py(allnn.Modulewithforward(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps)returning(loss, metrics_dict)) - Data:
data.py— loads UltraFeedback or Anthropic HH preference datasets, normalizes to[prompt, chosen, rejected]format - Key function:
compute_logprobs(logits, labels, mask)— shared log-probability computation used by all losses - Training:
train.py— forward pass through policy + frozen reference model, compute loss, gradient accumulation - Entry:
uv run python -m direct_alignment.train --config direct_alignment/configs/<algo>.yaml
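As a reference for that interface, here is a hedged sketch of standard DPO math plus one possible `compute_logprobs` implementation. The causal logit-shifting convention and the metric key are assumptions; the upstream `loss.py` may differ:

```python
# Sketch only: standard DPO math under the forward() signature described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


def compute_logprobs(logits, labels, mask):
    """Sum of per-token log-probs for each sequence (assumes a causal shift)."""
    logits, labels, mask = logits[:, :-1, :], labels[:, 1:], mask[:, 1:]
    per_token = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(-1)
    ).squeeze(-1)
    return (per_token * mask).sum(-1)


class DPOLoss(nn.Module):
    """Negative log-sigmoid of the implicit reward margin (standard DPO)."""

    def __init__(self, beta: float = 0.1):
        super().__init__()
        self.beta = beta

    def forward(self, policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps):
        # Implicit rewards: how far the policy moves from the frozen reference.
        chosen_rewards = self.beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = self.beta * (policy_rejected_logps - ref_rejected_logps)
        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
        metrics = {"reward_margin": (chosen_rewards - rejected_rewards).mean().item()}
        return loss, metrics
```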
### `reward_models/` — Chapter 5: Reward Models
- Model types: ORM (Outcome), PRM (Process), Preference RM (Bradley-Terry)
- Base class: `base.py` — `BaseRewardModel(nn.Module)` with backbone + linear reward head, shared training utilities (`training_loop`, `create_optimizer`, wandb helpers)
- Datasets: GSM8K (ORM/PRM), UltraFeedback (Preference RM), PRM800K (PRM)
- Entry: `uv run python -m reward_models.train_orm` (likewise `train_prm` and `train_preference_rm`)
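The core objective for the preference RM is a Bradley-Terry loss over scalar scores from the reward head. A minimal sketch (not the upstream `BaseRewardModel` code):

```python
# Illustrative Bradley-Terry preference loss; the upstream utilities in
# base.py wrap this with batching, the optimizer, and wandb logging.
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    Both inputs are (batch,) scalar scores from the linear reward head.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```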
## How to add a new algorithm
- New policy gradient: add a loss class to `policy_gradients/loss.py`, register it in `get_loss_objective()` in `train.py`, create a config YAML
- New direct alignment: add a loss class to `direct_alignment/loss.py`, register it in the `LOSS_FUNCTIONS` dict + `get_loss_function()`, create a config YAML (see the sketch after this list)
- New reward model: subclass `BaseRewardModel` in a new `train_*.py`, add an entry point in `pyproject.toml`
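As a concrete (hypothetical) example of the direct-alignment path, the sketch below adds a reference-free margin loss and registers it by name; the exact registry structure in `direct_alignment/loss.py` may differ:

```python
# Hypothetical new loss + registration; adapt names to the actual
# LOSS_FUNCTIONS dict and get_loss_function() in direct_alignment/loss.py.
import torch.nn as nn
import torch.nn.functional as F


class MarginLoss(nn.Module):
    """Toy reference-free margin loss (SimPO-flavored, for illustration only)."""

    def __init__(self, beta: float = 2.0, gamma: float = 0.5):
        super().__init__()
        self.beta, self.gamma = beta, gamma

    def forward(self, policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps):
        # Reference log-probs are accepted but ignored, keeping the shared signature.
        margin = self.beta * (policy_chosen_logps - policy_rejected_logps) - self.gamma
        loss = -F.logsigmoid(margin).mean()
        return loss, {"margin": margin.mean().item()}


# Register under a name so a config YAML can select it (illustrative).
LOSS_FUNCTIONS = {"margin": MarginLoss}
```

A matching config YAML would then presumably set the loss name to `margin` along with its hyperparameters.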
## Environment Variables
```bash
export WANDB_API_KEY="..."       # Required for experiment logging
export WANDB_PROJECT="rlhf-book" # Optional override
export WANDB_MODE="disabled"     # To disable logging entirely
export HF_TOKEN="..."            # For gated HuggingFace models
```
## Model Sizing & Memory
| Model | GPU Memory (full fine-tune) |
|---|---|
| Qwen3-0.6B | ~4-6 GB |
| Qwen3-1.7B | ~10-15 GB |
| Qwen2.5-3B | ~20-25 GB |
Default models: Qwen3-0.6B (reward models), Qwen3-1.7B (policy gradients), OLMo-2-0425-1B-SFT (direct alignment). Learning rates: roughly 1e-5 to 5e-6 for full fine-tuning (about 10x smaller than typical LoRA learning rates).
## Linting
```bash
uv run ruff check .       # Lint
uv run ruff check --fix . # Auto-fix
uv run ruff format .      # Format
```
Config: `pyproject.toml` — targets Python 3.12, line length 100, selects the E/F/I/W/B/C4 rule sets.
## Reference Runs
All example training runs are publicly viewable at wandb.ai/natolambert/rlhf-book — use these to verify your training curves look reasonable.
## Book Chapter → Code Mapping
| Chapter | Code Package | Key Concept |
|---|---|---|
| Ch 5: Reward Models | `reward_models/` | Bradley-Terry loss, ORM vs PRM |
| Ch 6: Policy Gradients | `policy_gradients/` | REINFORCE → PPO → GRPO evolution |
| Ch 8: Direct Alignment | `direct_alignment/` | DPO and why it works without RL |
## Supplementary Reading — Bridging Book to Frontier
The book covers foundational algorithms. After completing each major section, Claude should proactively surface the key papers, technical reports, and blog posts needed to connect that section’s content to current frontier practice.
### What Claude should do
- After a chapter or topic is completed, remind the reading group and provide a curated list of supplementary readings
- Prioritize: foundational papers the book builds on, landmark results that extend the book’s methods, and recent technical reports showing how these ideas are deployed at scale
- Include brief (1-2 sentence) annotations explaining why each reading matters and how it connects to the chapter just covered
- Organize by relevance, not chronology
### Key areas the book doesn’t fully cover (supplement these)
- Reasoning via RL: DeepSeek-R1, OpenAI o1/o3-style long chain-of-thought RL, test-time compute scaling
- RLHF infrastructure at scale: Distributed rollout systems, async training (veRL, OpenRLHF)
- Synthetic data and self-play: Distillation, iterative DPO/RLHF, constitutional AI
- Frontier post-training recipes: Llama 3 post-training, Gemini, Claude training methodology (what’s public)
- Evaluation and red-teaming: How labs measure alignment, safety evaluations, reward hacking/overoptimization
## Reading Group → Research Roadmap
### Phase 1: Foundations (Book + Supplementary Readings)
- Work through the book chapters and code examples
- Run the training scripts, study the wandb curves, build intuition
- After each chapter, go through the supplementary papers Claude surfaces
### Phase 2: Replication (Bridge to Research)
- Pick 1-2 papers and replicate key results at small scale (1B-7B) using the book’s codebase as a starting point
- Good replication targets:
- DeepSeek-R1-Zero: emergent reasoning with GRPO on a small model (extends Ch 6)
- DPO overoptimization dynamics (extends Ch 8)
- Process reward model vs outcome reward model comparison (extends Ch 5)
- Success = matching qualitative trends, not exact numbers
### Phase 3: Original Research
- Identify a gap or question from the replication work
- Run controlled experiments varying one thing at a time
- Publishable research at 1B-7B scale is common — algorithmic insights transfer up
## Compute Resources
- 4x NVIDIA RTX A6000 (48GB VRAM each), 503GB system RAM, 128 CPU cores
- Comfortable for full fine-tuning up to 7B, feasible for 13B with multi-GPU
- Sufficient for all phases of this roadmap