Week 1 — Feb 12: Introduction, History & Training Overview

UCLA ECE RLHF Reading Group

Chapters: 1-3
Presenter: Shreyas

Note: Content, figures, and examples in these notes are drawn from Nathan Lambert’s RLHF Book and referenced papers. These are reading group notes, not original work.


Chapter 1: Introduction — Key Takeaways

What RLHF actually does

Three types of post-training

  1. SFT/IFT — teaches format, instruction-following. Learns features in language
  2. Preference Fine-tuning (PreFT) — aligns to human preferences. Learns style and subtle preferences. This is where RLHF lives
  3. RLVR — RL with verifiable rewards for reasoning domains. Newest, fastest-evolving

The Elicitation Theory of Post-Training

Why this matters


Phase 1: Origins (pre-2018)

Phase 2: Language Models (2019-2022)

Phase 3: ChatGPT Era (2023+)


Chapter 3: Training Overview — Core Concepts

The RLHF objective

Standard RL: maximize expected reward over trajectories.
RLHF simplification: no state transitions, response-level (bandit) rewards, and a learned reward model.

\[J(\pi) = \mathbb{E}[r_\theta(x, y)] - \beta \cdot D_{\text{KL}}(\pi \parallel \pi_{\text{ref}})\]

Three key differences from standard RL:

  1. Reward function → reward model (learned, not environmental)
  2. No state transitions (prompt in, completion out — single step)
  3. Response-level rewards (bandit-style, not per-timestep)
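The objective above can be sketched numerically. Below is a minimal single-sample version; the function name, toy reward, and log-prob values are illustrative assumptions, not from the book:

```python
def rlhf_objective(reward: float, logprob_policy: float,
                   logprob_ref: float, beta: float) -> float:
    """Per-sample estimate of J(pi) = E[r_theta(x, y)] - beta * KL(pi || pi_ref).

    Uses the single-sample KL estimate log pi(y|x) - log pi_ref(y|x),
    where each log-prob is summed over the response tokens.
    """
    kl_estimate = logprob_policy - logprob_ref
    return reward - beta * kl_estimate

# Toy numbers: the policy is slightly more confident than the reference,
# so it pays a small KL penalty against its reward.
j = rlhf_objective(reward=1.2, logprob_policy=-10.0, logprob_ref=-10.5, beta=0.1)
# kl_estimate = 0.5, penalty = 0.05, so j = 1.15
```

Note that the reward comes from the learned reward model r_theta, while the KL term only needs log-probs from the policy and the frozen reference.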

The KL penalty
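A common way to put a number on this penalty in practice is the single-sample estimate log pi(y|x) - log pi_ref(y|x), accumulated over response tokens. A minimal sketch (the function name and toy log-probs are hypothetical):

```python
def sequence_kl_estimate(policy_logprobs, ref_logprobs):
    """Single-sample KL estimate for one sampled response:
    sum over tokens t of log pi(a_t | s_t) - log pi_ref(a_t | s_t)."""
    assert len(policy_logprobs) == len(ref_logprobs)
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

# Three response tokens; positive terms mean the policy has drifted
# toward higher confidence than the reference on those tokens.
kl = sequence_kl_estimate([-1.0, -2.0, -0.5], [-1.2, -2.1, -0.4])
# 0.2 + 0.1 - 0.1 = 0.2
```

The estimate is positive when the policy drifts away from the reference, which is exactly what beta scales in the objective; larger beta keeps the policy closer to pi_ref at the cost of less reward improvement.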

Three canonical recipes (increasing complexity)

InstructGPT (2022): SFT (10K) → Reward Model (100K pairs) → RLHF (100K prompts). Simple, three-stage; the template everything builds on.

Tülu 3 (2024): SFT (1M synthetic) → On-policy DPO (1M pairs) → RLVR (10K prompts). Much more data, synthetic data-heavy; adds RLVR for reasoning.

DeepSeek R1 (2025): Cold-start SFT (100K+ reasoning) → Large-scale RLVR → Rejection sampling → Mixed RL. Reasoning-first: RL is the centerpiece, not an afterthought. Represents the current frontier.

The trend: more stages, more data, more RL, reasoning-centric.


Discussion Questions

  1. The Elicitation Theory vs. Superficial Alignment: Lambert argues post-training extracts deep capabilities, not just surface style. But LIMA showed you can get surprisingly far with 1K examples. Where’s the truth? Is there a threshold where more preference data stops helping and you need a fundamentally different signal (like RLVR)?

  2. Why did it take so long for DPO to work? The math was published in May 2023 but the first good models weren’t until fall 2023 — and the fix was just a lower learning rate. What does this tell us about the gap between theory and practice in post-training? What other “obvious in hindsight” practical details might be hiding in current methods?

  3. The three recipes (InstructGPT → Tülu 3 → DeepSeek R1) show a clear trend toward more RL. Is RLHF (preference-based) becoming less important relative to RLVR (verifiable rewards)? Or do they serve fundamentally different purposes — preferences for style/safety, verifiable rewards for capabilities?

  4. The KL penalty is doing a lot of work. It’s the main thing preventing reward hacking and model collapse. But it also limits how far the model can improve. How should we think about setting β? Is there a principled way, or is it mostly empirical?

  5. Open vs. closed gap: Lambert notes companies that embraced RLHF early (Anthropic) built lasting advantages, and that open-source was stuck in a “SFT is enough” phase. As of 2025/2026, has the open-source community closed this gap? What’s still missing?


Key Equations to Know

The RLHF objective — maximize reward while staying close to the reference policy:

\[J(\pi) = \mathbb{E}[r_\theta(x,y)] - \beta \, D_{\text{KL}}(\pi \parallel \pi_{\text{ref}})\]

Trajectory distribution in standard RL — contrast with the simplified RLHF bandit setup:

\[p_\pi(\tau) = \rho_0(s_0) \prod_{t} \pi(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)\]
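The factorization above can be made concrete with a toy sketch that assumes the per-step probabilities are known (all names and numbers below are illustrative). Setting every transition probability to 1 recovers the RLHF bandit simplification, where the product collapses to a single policy term:

```python
import math

def trajectory_logprob(rho0, policy_probs, transition_probs):
    """log p_pi(tau) = log rho0(s0)
                      + sum_t [ log pi(a_t | s_t) + log p(s_{t+1} | s_t, a_t) ]."""
    logp = math.log(rho0)
    for pi_t, p_t in zip(policy_probs, transition_probs):
        logp += math.log(pi_t) + math.log(p_t)
    return logp

# Hypothetical two-step trajectory with deterministic transitions,
# i.e. the degenerate "prompt in, completion out" case.
lp = trajectory_logprob(rho0=1.0, policy_probs=[0.5, 0.25], transition_probs=[1.0, 1.0])
# log(0.5) + log(0.25) = log(0.125)
```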

Notes

The early three-stage RLHF process: SFT, reward model, then RL optimization
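A hypothetical skeleton of that three-stage flow; every function here is a placeholder stub (the dict-based "models" stand in for real checkpoints, and none of this is a real training API):

```python
def supervised_finetune(model, sft_data):
    # Stage 1 stand-in: instruction tuning on demonstrations.
    return {**model, "sft_examples": len(sft_data)}

def train_reward_model(policy, preference_pairs):
    # Stage 2 stand-in: fit r_theta on human preference comparisons.
    return {"base": policy, "pairs": len(preference_pairs)}

def rl_optimize(policy, reward_model, prompts):
    # Stage 3 stand-in: optimize the policy against the reward model.
    return {**policy, "rl_prompts": len(prompts)}

def rlhf_pipeline(base_model, sft_data, preference_pairs, rl_prompts):
    policy = supervised_finetune(base_model, sft_data)
    rm = train_reward_model(policy, preference_pairs)
    return rl_optimize(policy, rm, rl_prompts)

final = rlhf_pipeline({"name": "base"}, ["ex"] * 3, [("a", "b")] * 2, ["p"])
```

The point of the skeleton is only the data flow: each stage consumes the previous stage's policy, and the reward model is trained once and then frozen during RL.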


Action Items