CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Context

This is a reading group repository for studying Nathan Lambert’s Reinforcement Learning from Human Feedback (RLHF) book. The goal is to deeply understand LLM post-training methods — reward modeling, policy gradient RL, and direct alignment — well enough to do research.

The upstream code lives at natolambert/rlhf-book. The book is at rlhfbook.com.

Upstream Repository Setup

To work with the book’s code examples:

git clone https://github.com/natolambert/rlhf-book.git
cd rlhf-book/code
uv sync

Always use uv run python instead of bare python so that commands execute inside the project's virtual environment:

uv run python -m policy_gradients.train --config policy_gradients/configs/grpo.yaml
uv run python -m direct_alignment.train --config direct_alignment/configs/dpo.yaml
uv run python -m reward_models.train_orm --samples 400
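The grpo.yaml run above uses GRPO's group-relative advantage estimates. As a reminder of what that computes, here is a generic sketch (not the upstream implementation): each completion's reward is normalized against the other completions sampled for the same prompt, replacing a learned value baseline.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each reward against its group.

    GRPO samples several completions per prompt and uses the group's
    mean/std as the baseline instead of a learned value function.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions for one prompt, scored by a reward model.
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
```

Advantages within a group always sum to (approximately) zero, so above-average completions are reinforced and below-average ones are penalized.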

Requires Python 3.12+. Flash Attention is installed by default on x86_64; ARM64 systems fall back to PyTorch SDPA automatically.
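One way a script can implement that fallback (a sketch; the upstream code may detect this differently):

```python
import importlib.util
import platform

def pick_attn_implementation() -> str:
    """Prefer Flash Attention when the package is importable (x86_64 wheels);
    otherwise fall back to PyTorch's built-in SDPA kernels."""
    if platform.machine() == "x86_64" and importlib.util.find_spec("flash_attn"):
        return "flash_attention_2"
    return "sdpa"

# Pass the result as attn_implementation= when loading a HuggingFace model.
print(pick_attn_implementation())
```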

Code Architecture (upstream code/ directory)

Three independent packages, each corresponding to a book chapter:

policy_gradients/ — Chapter 6: Policy Gradient Methods

direct_alignment/ — Chapter 8: Direct Alignment (DPO family)

reward_models/ — Chapter 5: Reward Models

How to add a new algorithm

Environment Variables

export WANDB_API_KEY="..."           # Required for experiment logging
export WANDB_PROJECT="rlhf-book"     # Optional override
export WANDB_MODE="disabled"         # To disable logging entirely
export HF_TOKEN="..."                # For gated HuggingFace models
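For reference, this is roughly how those variables are interpreted (a sketch using only os.environ; wandb reads these variables itself, so training scripts rarely need explicit handling):

```python
import os

def wandb_settings() -> dict:
    """Resolve logging settings from the environment variables above.
    wandb picks these up automatically; this just makes the logic explicit."""
    return {
        "mode": os.environ.get("WANDB_MODE", "online"),   # "disabled" turns logging off
        "project": os.environ.get("WANDB_PROJECT", "rlhf-book"),
        "has_key": bool(os.environ.get("WANDB_API_KEY")),
    }

os.environ["WANDB_MODE"] = "disabled"
print(wandb_settings()["mode"])  # disabled
```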

Model Sizing & Memory

| Model | GPU Memory (full fine-tune) |
|---|---|
| Qwen3-0.6B | ~4-6 GB |
| Qwen3-1.7B | ~10-15 GB |
| Qwen2.5-3B | ~20-25 GB |

Default models: Qwen3-0.6B (reward models), Qwen3-1.7B (policy gradients), OLMo-2-0425-1B-SFT (direct alignment). Learning rates for full fine-tuning: roughly 5e-6 to 1e-5, about 10x smaller than you would use with LoRA.
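The memory figures above are consistent with a back-of-envelope rule of roughly 8 bytes per parameter for bf16 full fine-tuning; treat that multiplier as an assumption, since the optimizer-state precision and activation memory both shift the real number:

```python
def estimate_finetune_gb(params_billion: float, bytes_per_param: float = 8.0) -> float:
    """Rough GPU memory floor for full fine-tuning.

    bytes_per_param ~= 8 assumes bf16 weights (2) + bf16 grads (2) +
    bf16 Adam moments (4); fp32 optimizer state would roughly double it.
    Activations are excluded, so real usage lands above this floor.
    """
    return params_billion * bytes_per_param

for name, b in [("Qwen3-0.6B", 0.6), ("Qwen3-1.7B", 1.7), ("Qwen2.5-3B", 3.0)]:
    print(f"{name}: ~{estimate_finetune_gb(b):.1f} GB")
```

The estimates (~4.8, ~13.6, ~24.0 GB) land inside the ranges in the table.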

Linting

uv run ruff check .          # Lint
uv run ruff check --fix .    # Auto-fix
uv run ruff format .         # Format

Config: pyproject.toml — targets Python 3.12, line length 100, selects E/F/I/W/B/C4 rules.
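Based on that description, the relevant pyproject.toml section likely looks something like this (a sketch, not copied from upstream):

```toml
[tool.ruff]
target-version = "py312"
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I", "W", "B", "C4"]
```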

Reference Runs

All example training runs are publicly viewable at wandb.ai/natolambert/rlhf-book — use these to verify your training curves look reasonable.

Book Chapter → Code Mapping

| Chapter | Code Package | Key Concept |
|---|---|---|
| Ch 5: Reward Models | reward_models/ | Bradley-Terry loss, ORM vs PRM |
| Ch 6: Policy Gradients | policy_gradients/ | REINFORCE → PPO → GRPO evolution |
| Ch 8: Direct Alignment | direct_alignment/ | DPO and why it works without RL |
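To ground the Ch 5 row: the Bradley-Terry reward-model loss is the negative log-probability that the chosen response outranks the rejected one. A minimal scalar sketch (the upstream reward_models/ code presumably applies this over batches of model logits):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): minimized by pushing the
    chosen response's scalar reward above the rejected one's."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(x)) = log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# A wider positive margin means a smaller loss.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 0.0))  # True
```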

Supplementary Reading — Bridging Book to Frontier

The book covers foundational algorithms. After completing each major section, Claude should proactively surface the key papers, technical reports, and blog posts needed to connect that section’s content to current frontier practice.

What Claude should do

Key areas the book doesn’t fully cover (supplement these)

Reading Group → Research Roadmap

Phase 1: Foundations (Book + Supplementary Readings)

Phase 2: Replication (Bridge to Research)

Phase 3: Original Research

Compute Resources