GRPO
Main idea
The key point is to understand the pictures below.
Iteration steps
- For each input, generate $G$ outputs.
- For each output, calculate the log-probability of each token under the current, old, and reference models.
- Calculate the objective value and use its negative as the loss.
- Update the old model in each step.
- Update the reference model in each epoch.
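The steps above can be sketched as a loop skeleton. This is only a minimal sketch of the update schedule, not a real trainer: `sample`, `score`, and `update` are hypothetical callables standing in for generation, token log-probability computation, and the optimizer step.

```python
import copy


def run_grpo_loop(policy, inputs, num_epochs, group_size, sample, score, update):
    """Skeleton of the iteration steps.

    sample(model, q)          -> one output for input q (hypothetical)
    score(model, q, output)   -> per-token log-probabilities (hypothetical)
    update(policy, q, outputs, logps) -> updated policy (hypothetical)
    """
    old = copy.deepcopy(policy)        # snapshot used for sampling and the ratio
    reference = copy.deepcopy(policy)  # snapshot used for the KL term
    events = []
    for epoch in range(num_epochs):
        for q in inputs:
            # For each input, generate a group of outputs.
            outputs = [sample(old, q) for _ in range(group_size)]
            # For each output, score its tokens under the three models.
            logps = [
                {
                    "current": score(policy, q, o),
                    "old": score(old, q, o),
                    "reference": score(reference, q, o),
                }
                for o in outputs
            ]
            # Compute the objective and take an optimizer step (inside update).
            policy = update(policy, q, outputs, logps)
            # Update the old model in each step.
            old = copy.deepcopy(policy)
            events.append(("step", epoch))
        # Update the reference model in each epoch.
        reference = copy.deepcopy(policy)
        events.append(("ref_sync", epoch))
    return policy, old, reference, events
```

With toy callables (for example a "model" that is just a number and an `update` that adds 1), the skeleton can be run to check when each snapshot is refreshed.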
Objective function
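- $G$ is the number of outputs in each group for each input
- $O_i$ is the $i$-th output in the current group
- $t$ is the index of tokens in $O_i$
- $q$ is the input
- $O_{i,t}$ is the $t$-th token in the $i$-th output
- $\pi$ is the model (the policy defined by the model parameters)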
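As a reference point, the GRPO objective as given in the DeepSeekMath paper can be written with the symbols above. The names $\theta$ (the model parameters), $r_{i,t}$ (the probability ratio between the current and old model) and $\hat{A}_{i,t}$ (the advantage of token $t$ in output $i$) are notation added here for readability; $\varepsilon$ and $\beta$ are the clip range and the KL weight.

```latex
J_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\ \{O_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|O_i|} \sum_{t=1}^{|O_i|}
  \left(
    \min\!\left( r_{i,t}\,\hat{A}_{i,t},\
      \operatorname{clip}\!\left(r_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_{i,t} \right)
    - \beta\, D_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
  \right)
\right],
\qquad
r_{i,t} = \frac{\pi_\theta(O_{i,t} \mid q, O_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(O_{i,t} \mid q, O_{i,<t})}
```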
KL value
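A minimal sketch of how the KL value can be computed per token from log-probabilities. It assumes the estimator form given in the DeepSeekMath paper, `ratio - log(ratio) - 1` with `ratio = p_reference / p_current`; the function name and arguments are hypothetical.

```python
import math


def kl_per_token(current_logp: float, reference_logp: float) -> float:
    """Per-token KL estimate between the current and the reference model.

    Assumed form: ratio - log(ratio) - 1, where ratio = p_reference / p_current.
    The estimate is zero when both models assign the same probability
    and positive otherwise.
    """
    log_ratio = reference_logp - current_logp
    return math.exp(log_ratio) - log_ratio - 1.0
```

The value is weighted by `beta` and subtracted from the objective, so a larger drift from the reference model lowers the objective.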
Hyperparameters

| Name in huggingface-trl | Description |
| --- | --- |
| beta | Weight of the KL value between the current model and the reference model; increase it to avoid over-fitting |
| num_iterations | Number of iterations per batch, i.e. the number of GRPO iterations in the Algorithm 1 picture; its effect is similar to that of the learning rate (LR) |
| epsilon | Used for both the lower bound and the upper bound of the clip |
| epsilon_high | Replaces epsilon for the upper bound of the clip when it is set |
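How `epsilon` and `epsilon_high` interact can be illustrated with a small sketch of the clipped per-token term. The function name and the default values are assumptions for illustration, not taken from any library.

```python
def clipped_term(ratio, advantage, epsilon=0.2, epsilon_high=None):
    """Clipped surrogate for one token.

    ratio     : p_current / p_old for the token
    advantage : advantage assigned to the token
    epsilon   : clip range for the lower bound, and for the upper bound
                when epsilon_high is not given
    """
    lower = 1.0 - epsilon
    upper = 1.0 + (epsilon if epsilon_high is None else epsilon_high)
    clipped_ratio = max(lower, min(ratio, upper))
    # Take the more pessimistic of the unclipped and clipped values.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio above the upper bound stops increasing the term, so raising `epsilon_high` alone lets the model move further on outputs that scored well while leaving the lower bound unchanged.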
FAQ
Q: How to cold start?
A: In the first step, the advantage of each output is already known, so the parameters can be updated in the direction that makes the objective value as large as possible.
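To make the answer concrete, here is a minimal sketch of group-relative advantages, assuming each output in a group gets a scalar reward and the advantage is the reward normalized by the group mean and standard deviation. The reward values, the use of the population standard deviation, and the small `eps` are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of four outputs to the same input.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Even in the first step, when the current and old models are identical and every probability ratio is 1, outputs with different rewards get different advantages, so the gradient already pushes up the better outputs and pushes down the worse ones. If all outputs in a group receive the same reward, the advantages are all zero and that group gives no signal.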
Q: How to simplify the zoom up/down in the objective function?
This post is licensed under CC BY 4.0 by the author.