Week 2 — Feb 20: RL Fundamentals & Policy Gradients

UCLA ECE RLHF Reading Group

Chapter: 6 (Fundamentals). Presenter: TBD

Note: Content, figures, and examples in these notes are drawn from Nathan Lambert’s RLHF Book and referenced papers. These are reading group notes, not original work.


Understanding RL Basics in the LLM/RLHF Context

The simplified LLM bandit setting

Standard RL involves state transitions: the agent takes an action in a state, the environment transitions to a new state, and the loop repeats. RLHF typically simplifies this to a bandit problem (a single-step decision), with the whole completion treated as one action:

| Component | Standard RL | LLM RLHF |
| --- | --- | --- |
| State | Current game/environment state | Prompt/instruction (fixed) |
| Action | Single move | Generated completion/response |
| Reward | From environment | From reward model (or verifiable signal) |
| Trajectory | Sequence of state-action pairs over time | Single prompt → completion pair |

What is a “reward” in RLHF? A scalar score for a prompt–completion pair, usually produced by a learned reward model trained on human preference data (or by a verifiable signal such as a unit test or answer checker).

What is a “state” and “action” for an LLM? At the token level, the state is the prompt plus the tokens generated so far and the action is the next token; in the bandit view above, the state is just the prompt and the action is the entire completion.


Introduction to Policy Gradients

The core insight: maximize expected reward by following the gradient

The goal: maximize expected reward over trajectories from the policy \(J(\pi_\theta) = \mathbb{E}_{y \sim \pi_\theta(y|x)}[r(x, y)]\)

We want to adjust the policy parameters $\theta$ to increase this objective. The policy gradient theorem tells us the gradient direction:

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}[r(x, y) \cdot \nabla_\theta \log \pi_\theta(y|x)]\]

Key insight: The gradient pulls the policy toward high-reward actions and away from low-reward ones. We can estimate this gradient by:

  1. Sample completions $y$ from the current policy
  2. Score each with the reward model $r(x, y)$
  3. Update: \(\theta \leftarrow \theta + \alpha \cdot r(x, y) \cdot \nabla_\theta \log \pi_\theta(y|x)\)
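The three steps above can be sketched numerically. A minimal sketch with NumPy, assuming a toy softmax "policy" over three possible completions and hand-picked reward scores (both are illustrative stand-ins, not anything from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: a softmax "policy" over 3 possible completions, with
# hand-picked "reward model" scores (both are assumptions for illustration).
theta = np.zeros(3)
rewards = np.array([0.1, 0.9, 0.5])

probs = np.exp(theta) / np.exp(theta).sum()

# 1. Sample completions y ~ pi_theta
ys = rng.choice(3, p=probs, size=10_000)
# 2. Score each with the "reward model"
rs = rewards[ys]
# 3. Form the score-function gradient r * grad log pi(y); for a softmax,
#    grad_theta log pi(y) = onehot(y) - probs.
grad = (rs[:, None] * (np.eye(3)[ys] - probs)).mean(axis=0)

# The estimate points toward the high-reward completion (index 1)
# and away from the low-reward one (index 0).
```

Step 4 would then apply `theta += alpha * grad` and repeat with fresh samples.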

Why this works (and why it’s noisy)

The policy gradient relies on the likelihood ratio trick: the gradient of the expected reward can be written in terms of the policy’s log-probabilities, so the reward function’s gradient is never needed:

\[\nabla_\theta \mathbb{E}[r(x, y)] = \mathbb{E}[\nabla_\theta \log \pi_\theta(y|x) \cdot r(x, y)]\]

This is powerful because:

  - The reward can be a black box (a reward model, a unit test, a human label); no gradient of $r$ is ever required.
  - Sampling from the policy is enough to form an unbiased estimate of the gradient.

But it’s high-variance:

  - A single scalar reward must account for an entire completion, possibly thousands of tokens.
  - The estimate comes from a handful of Monte Carlo samples, so small batches can point the update in the wrong direction.
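The likelihood-ratio identity can be checked numerically: for a small discrete policy we can compute the exact gradient of \(J\) by finite differences and compare it to the Monte Carlo score-function estimate. A sketch, assuming a 3-outcome softmax policy with made-up rewards:

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = np.array([0.0, 1.0, 0.25])   # made-up reward values
theta = np.array([0.2, -0.1, 0.3])     # arbitrary logits

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def J(theta):
    return policy(theta) @ rewards     # exact expected reward (enumerable here)

# Exact gradient via central finite differences.
eps = 1e-5
fd = np.array([(J(theta + eps * np.eye(3)[i]) - J(theta - eps * np.eye(3)[i]))
               / (2 * eps) for i in range(3)])

# Monte Carlo score-function estimate: E[r(y) * grad log pi(y)].
probs = policy(theta)
ys = rng.choice(3, p=probs, size=200_000)
mc = (rewards[ys][:, None] * (np.eye(3)[ys] - probs)).mean(axis=0)

# The two agree up to sampling noise, even though the MC estimate never
# differentiated r itself -- that's the likelihood-ratio trick.
```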


Mathematical Definitions: Value Functions

State Value Function $V^\pi(s)$

The state value function under policy $\pi$ is the expected cumulative discounted reward starting from state $s$ and following the policy:

\[V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]\]

where:

  - $\gamma \in [0, 1]$ is the discount factor that down-weights future rewards,
  - $r_t$ is the reward received at step $t$,
  - the expectation is over trajectories generated by following $\pi$ from $s$.

Interpretation: “How good is it to be in state $s$ if I follow the policy from now on?”

Action Value Function $Q^\pi(s, a)$

The action value function (also called Q-function) is the expected cumulative reward starting from state $s$, taking action $a$, then following the policy:

\[Q^\pi(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]

Interpretation: “How good is it to take action $a$ in state $s$, and then follow the policy?”

Bellman Equations (Recursive Relationship)

Value functions satisfy recursive relationships called Bellman equations:

\[V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s'}\left[r(s, a) + \gamma V^\pi(s')\right]\]

\[Q^\pi(s, a) = \mathbb{E}_{s'}\left[r(s, a) + \gamma V^\pi(s')\right]\]

Key insight: The value of a state is the immediate reward plus the discounted value of the next state. This allows dynamic programming and bootstrapping.
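The Bellman equation also makes \(V^\pi\) directly computable for small MDPs, since it is a linear system in the state values. A sketch, assuming a hypothetical 2-state MDP where the policy's expected per-state rewards and induced transition matrix are given:

```python
import numpy as np

# A tiny 2-state MDP under a fixed policy: the policy induces expected
# per-state rewards r_pi and a transition matrix P_pi (both assumed here).
gamma = 0.9
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])   # state 1 is absorbing

# Bellman equation in matrix form: V = r_pi + gamma * P_pi @ V,
# i.e. (I - gamma * P_pi) V = r_pi, solvable directly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Consistency check: each state's value equals the immediate reward plus
# the discounted expected value of the next state.
residual = V - (r_pi + gamma * P_pi @ V)
```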

Advantage Function $A^\pi(s, a)$

The advantage function measures how much better action $a$ is compared to the average action under policy $\pi$:

\[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]

Interpretation:

  - $A^\pi(s, a) > 0$: action $a$ is better than the policy’s average behavior in state $s$.
  - $A^\pi(s, a) < 0$: action $a$ is worse than average.
  - $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$: advantages are centered by construction.

In the LLM/RLHF Context

For language models, the mapping is:

| Standard RL | LLM RLHF |
| --- | --- |
| State $s$ | Prompt $x$ (fixed) |
| Action $a$ | Generated token $y_t$ or full response $y$ |
| Reward $r(s,a)$ | Reward model output $r_\theta(x, y)$ |
| $V(s)$ | Expected reward for prompt: $V(x) = \mathbb{E}_{y \sim \pi}[r_\theta(x,y)]$ |
| $Q(s,a)$ | Expected reward after taking action: $Q(x,y) = r_\theta(x,y)$ (one-shot) |
| $A(s,a)$ | How much better response $y$ is: $A(x,y) = r_\theta(x,y) - V(x)$ |

In RLHF (bandit setting):

  - Each episode is a single decision, so discounting is irrelevant (effectively $\gamma = 1$).
  - $Q(x, y)$ collapses to the reward model score itself.
  - $V(x)$ is the average reward over completions the policy produces for $x$, and the advantage is reward minus that average.
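In the bandit setting, the advantage reduces to reward minus a per-prompt baseline; one common empirical choice is the mean reward over a group of completions sampled for the same prompt. A minimal sketch with made-up reward-model scores:

```python
import numpy as np

# Hypothetical reward-model scores for 4 completions sampled for one prompt.
r = np.array([0.8, 0.3, 0.6, 0.5])

# Bandit-setting advantage: A(x, y) = r(x, y) - V(x), with V(x) estimated
# by the mean reward of the sampled group.
V_hat = r.mean()
A = r - V_hat

# Advantages are centered: above-average completions get positive weight,
# below-average ones negative, and the group sums to zero.
```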


Tricks to Reduce Bias and Variance: Advantage Functions

The problem: pure reward has high variance

If we use raw reward $r(x, y)$ as the gradient weight, we’re saying “apply the same weight to all tokens in the response.” But not all tokens are equally important — some tokens generated the good part, others were neutral.

Also, if reward is always positive (e.g., 0.5 to 1.0), we push every token upward equally, wasting gradient signal on actions that didn’t matter.

Solution 1: Baseline subtraction (reduce variance, no bias)

Use a baseline $b(x)$ — typically the value function $V(x)$ — to center rewards:

\[A(x, y) = r(x, y) - V(x)\]

Effect:

  - Above-average responses receive positive weight; below-average responses receive negative weight.
  - Even if every raw reward is positive, roughly half of the centered advantages become negative, so gradient signal is no longer wasted pushing all actions up.

Why it works: Mathematically, subtracting a baseline doesn’t change the gradient in expectation (it’s zero-mean), but it drastically reduces variance — the advantage $A$ concentrates the signal on genuinely good vs. bad actions.
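This variance reduction is easy to demonstrate: weight the same sampled score-function terms by the raw reward versus the centered reward and compare estimator variances. A sketch, assuming a toy 3-action softmax policy and all-positive rewards as in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
rewards = np.array([0.5, 1.0, 0.6])   # all positive, as in the text
theta = np.zeros(3)

e = np.exp(theta)
probs = e / e.sum()
n = 100_000
ys = rng.choice(3, p=probs, size=n)
score = np.eye(3)[ys] - probs                      # grad log pi per sample

g_raw = rewards[ys][:, None] * score               # weight = r
b = rewards @ probs                                # baseline: mean reward
g_base = (rewards[ys] - b)[:, None] * score        # weight = r - b

# Same gradient in expectation...
mean_gap = np.abs(g_raw.mean(axis=0) - g_base.mean(axis=0)).max()
# ...but the baseline-subtracted estimator has much lower variance.
var_raw = g_raw.var(axis=0).sum()
var_base = g_base.var(axis=0).sum()
```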

Solution 2: Per-token advantages (decompose the trajectory)

The response $y = [y_1, \ldots, y_T]$ is a sequence. A single end-of-sequence reward $r(x, y)$ assigns credit to the entire trajectory, but we want to know which tokens deserve credit.

Approach: Use temporal advantage decomposition, i.e., compute an advantage at each step rather than a single one for the whole sequence:

  - Train a critic $V_\phi$ to predict the expected final reward from each prefix $(x, y_{<t})$.
  - Form per-token advantages such as $A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ (the temporal-difference form defined under Solution 3).

This is more granular than a single reward signal — it tells the policy which specific tokens were good moves.

Solution 3: Advantage estimation from value functions (bootstrapping)

If we train a separate value function $V_\phi(x)$ to predict expected reward, we can estimate advantages more efficiently:

\[A(x, y) = r(x, y) - V_\phi(x)\]

or temporal advantages:

\[A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\]

Benefit: We can use bootstrapping — future rewards are estimated from the value function rather than waiting for the whole trajectory to finish. This reduces variance even more.
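A sketch of the temporal form, assuming hypothetical critic values \(V_\phi\) along a 4-token response with a single end-of-sequence reward (all numbers are made up):

```python
import numpy as np

gamma = 1.0  # a typical choice in the LLM setting
# Hypothetical critic values V_phi at each prefix of a 4-token response,
# plus a terminal value of 0 after the last token.
V = np.array([0.40, 0.55, 0.35, 0.60, 0.0])
# Sparse reward: the reward model scores only the finished response.
r = np.array([0.0, 0.0, 0.0, 0.7])

# TD-style advantage at each token: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
A = r + gamma * V[1:] - V[:-1]

# Tokens whose value estimate rose get positive credit even before the final
# reward arrives: credit assignment is per-token, not per-response.
# With gamma = 1 the advantages telescope: their sum is R - V(s_0).
```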

Value function training: on-policy rollouts are scored by the reward model, the value network is fit by regressing its predictions onto those observed rewards, and the resulting advantages weight the policy gradient. From Nathan Lambert's RLHF Book (page 60).


The Simplest Policy Gradient Algorithm: REINFORCE

Algorithm

REINFORCE is the baseline policy gradient algorithm. It’s simple enough to fit on a slide but captures the essential idea:

  1. Collect trajectory: Sample completion $y$ from policy: \(y \sim \pi_\theta(\cdot | x)\)

  2. Compute return: Get reward from RM: \(R = r(x, y)\)

  3. Compute gradient: For each token position $t$ in the trajectory: \(g_t = \nabla_\theta \log \pi_\theta(y_t | x, y_{<t}) \cdot R\)

  4. Update policy: Accumulate gradients and apply SGD: \(\theta \leftarrow \theta + \alpha \sum_t g_t\)
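The four steps can be assembled into a runnable toy loop. A sketch, assuming a 3-arm bandit standing in for prompt/completions, with the per-step gradient averaged over a small batch for stability (a mild departure from the single-sample algorithm above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy bandit: 3 possible "completions" with fixed reward-model scores
# (names and numbers here are illustrative, not from the book).
rewards = np.array([0.2, 1.0, 0.4])
theta = np.zeros(3)
alpha = 0.5

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

J0 = policy(theta) @ rewards  # expected reward before training

for _ in range(200):
    probs = policy(theta)
    ys = rng.choice(3, p=probs, size=32)        # 1. sample from the policy
    Rs = rewards[ys]                            # 2. reward from the "RM"
    scores = np.eye(3)[ys] - probs              # grad log pi(y) per sample
    g = (Rs[:, None] * scores).mean(axis=0)     # 3. R * grad log pi, averaged
    theta += alpha * g                          # 4. gradient ascent step

J1 = policy(theta) @ rewards  # expected reward after training
```

After training, the policy concentrates on the highest-reward completion and the expected reward rises accordingly.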

Why it’s noisy

  - A single scalar reward is spread uniformly across every token of the completion, regardless of which tokens actually mattered.
  - The gradient is a one-sample Monte Carlo estimate; two rollouts of the same prompt can push $\theta$ in opposite directions.
  - If all rewards are positive, every sampled completion is reinforced, just by different amounts.

Why we still care (it’s the foundation)

All modern policy gradient methods (PPO, GRPO, etc.) are refinements of REINFORCE that:

  - subtract a baseline or learned value function to center the reward signal,
  - clip or constrain each update so a single batch cannot move the policy too far,
  - regularize against a reference model (e.g., a KL penalty) to prevent reward over-optimization.

But the core mechanism is always: sample from policy → score with reward → push gradient in reward direction.

REINFORCE with baseline (the next step up)

Add a simple baseline $b(x)$ (often just the mean reward):

  1. Sample: \(y \sim \pi_\theta(\cdot | x)\)

  2. Get reward: \(R = r(x, y)\)

  3. Compute advantage: \(A = R - b(x)\)

  4. Update: \(\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(y|x) \cdot A\)

Effect: Rewards below the baseline now produce negative gradients, concentrating updates on truly good actions. Variance drops significantly.


Key Equations to Know

Policy gradient theorem: \(\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]\)

REINFORCE gradient (using empirical trajectory): \(g = \nabla_\theta \log \pi_\theta(y|x) \cdot r(x, y)\)

Advantage-weighted gradient (with baseline): \(g = \nabla_\theta \log \pi_\theta(y|x) \cdot (r(x, y) - V(x))\)

Why the baseline is unbiased: \(\mathbb{E}_{y \sim \pi_\theta}[V(x) \cdot \nabla_\theta \log \pi_\theta(y|x)] = V(x) \cdot \mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x)] = V(x) \cdot 0 = 0\), since \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x)] = \sum_y \nabla_\theta \pi_\theta(y|x) = \nabla_\theta \sum_y \pi_\theta(y|x) = \nabla_\theta 1 = 0\).
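This zero-expectation fact can be confirmed exactly for a small softmax policy by enumerating all outcomes (the logits here are arbitrary):

```python
import numpy as np

theta = np.array([0.7, -0.3, 0.1])   # arbitrary logits
e = np.exp(theta - theta.max())
probs = e / e.sum()

# For a softmax, grad_theta log pi(y) = onehot(y) - probs. Its expectation
# under the policy is sum_y pi(y) * (onehot(y) - probs), computed exactly here.
expected_score = probs @ (np.eye(3) - probs[None, :])

# It vanishes because sum_y pi(y) = 1 for every theta, so any baseline b(x)
# multiplied by the score term has zero expectation.
```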


Discussion Questions

  1. State-action abstraction in LLMs: We treat the prompt as a fixed “state” and token generation as the “action space.” But in interactive settings, the prompt could evolve over time (like multi-turn dialogue). How does RLHF handle this? Does the policy gradient framework break?

  2. The reward model is the bottleneck: RLHF’s entire training loop depends on the reward model’s quality. A bad RM leads to overoptimization and policy collapse. How should we think about RM uncertainty? Should we be training an ensemble and using disagreement as a signal for when to stop optimizing?

  3. Why does REINFORCE have such high variance? Intuitively, a single reward for a 1000-token response seems like it should be enough information. What exactly is making the gradient estimates so noisy compared to supervised learning (SFT)?

  4. Advantage vs. reward scaling: In REINFORCE with baseline, how sensitive is the training to how we compute $A = R - V(x)$? If $V$ is poorly trained, does the advantage estimate become worse than no baseline at all?

  5. On-policy data collection: Policy gradients require sampling from the current policy. This means we can’t reuse old data — we have to collect new rollouts every step. How much does this on-policy requirement hurt sample efficiency? When would an off-policy method (like DPO, Chapter 8) be preferable?


Notes


Action Items