Post

GRPO

GRPO

Main idea

Key point it to understand the below pictures

Iteration steps

GRPO Iteration

  • for each input, generator G outputs
  • for each output, calculate logits_prob for each token in current, old, reference mdoel
  • calcualte objective value as loss
  • update old model in each step
  • update reference model in each epoch

Objective function

Objective function

  • G is amount of outputs in each group for each input
  • O_i is i-th output in current group
  • t is index of tokens in O_i
  • q is input
  • O_i,t is t-tokens in i-th output
  • pi is model parameter

KL value

KL value

Hyper parameters

Name in huggingface-trl

  • beta weight for KL-value between current model and reference model, increase to avoid over-fitting
  • num_iterations Numbers of iteration per batch, GRPO iterations times in Algorithm 1 picture, similar with LR
  • epsilon for both clip lower_bound and upper_bound
  • epsilon_high repalce epsilon for clip upper_bound when exist

FAQ

Q: How to cold start?

A: In first step, we know advantages for each output, which can push parameters updating to make objective value as much as possible

Q: How to simplify Zoom up/down in objective function?

This post is licensed under CC BY 4.0 by the author.