GRPO

Posted May 20, 2025 Updated May 21, 2025

By Informal



Main idea

The key point is to understand the pictures below.

for each input, generate G outputs
for each output, calculate the log-probability of each token under the current, old, and reference models
calculate the objective value and use its negative as the loss, since the objective is maximized
update the old model at each step
update the reference model at each epoch
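The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes one scalar reward per output, group-normalized advantages, a PPO-style clipped ratio, and the `exp(x) - x - 1` KL estimator. The function names and the values for `epsilon` and `beta` are hypothetical, chosen only for the example.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of the G outputs sampled for one input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_objective(logp_cur, logp_old, logp_ref, advantage,
                    epsilon=0.2, beta=0.04):
    """Per-token objective: clipped surrogate minus beta * KL estimate."""
    ratio = math.exp(logp_cur - logp_old)            # current / old
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    surrogate = min(ratio * advantage, clipped * advantage)
    log_r = logp_ref - logp_cur                      # log(ref / current)
    kl = math.exp(log_r) - log_r - 1.0               # always >= 0
    return surrogate - beta * kl


def grpo_loss(outputs, rewards, epsilon=0.2, beta=0.04):
    """outputs: G lists of (logp_cur, logp_old, logp_ref), one tuple per token."""
    advantages = group_advantages(rewards)
    per_output = []
    for tokens, adv in zip(outputs, advantages):
        values = [token_objective(c, o, r, adv, epsilon, beta)
                  for c, o, r in tokens]
        per_output.append(mean(values))              # average over tokens
    return -mean(per_output)                         # maximize objective = minimize loss


# Two outputs for one input; all three models agree on every token.
outputs = [[(-1.0, -1.0, -1.0)] * 3, [(-2.0, -2.0, -2.0)] * 2]
loss = grpo_loss(outputs, rewards=[1.0, 0.0])
```

In a real training loop the log-probabilities would come from the three models and the loss would be backpropagated through the current model only.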

beta: weight of the KL term between the current model and the reference model; increase it to keep the current model closer to the reference and avoid over-fitting
num_iterations: number of iterations per batch, i.e. the number of GRPO iterations in the Algorithm 1 picture; its effect is similar to that of the learning rate
epsilon: clipping range, used for both the lower bound and the upper bound
epsilon_high: when set, replaces epsilon for the upper clipping bound
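A small sketch of how these parameters could enter the objective, assuming a PPO-style clipped ratio and a KL penalty subtracted with weight `beta`. The functions `clip_ratio` and `penalized_objective` and the default values are hypothetical and not taken from any particular library.

```python
def clip_ratio(ratio, epsilon=0.2, epsilon_high=None):
    """Clip the probability ratio to [1 - epsilon, 1 + upper]."""
    low = 1.0 - epsilon
    # epsilon_high replaces epsilon for the upper bound only when it is set
    upper = epsilon_high if epsilon_high is not None else epsilon
    high = 1.0 + upper
    return min(max(ratio, low), high)


def penalized_objective(ratio, advantage, kl, beta=0.04,
                        epsilon=0.2, epsilon_high=None):
    """Clipped surrogate minus the beta-weighted KL term."""
    clipped = clip_ratio(ratio, epsilon, epsilon_high)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate - beta * kl


symmetric = clip_ratio(1.5)                      # upper bound is 1 + epsilon
asymmetric = clip_ratio(1.5, epsilon_high=0.28)  # upper bound is 1 + epsilon_high
lower = clip_ratio(0.5, epsilon_high=0.28)       # lower bound still uses epsilon
```

With the same KL value, a larger `beta` lowers the objective more, which is what pulls the current model back toward the reference model.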

Q: How to cold start?

A: Even in the first step, the advantage of each output is already known, so the update can push the parameters in the direction that makes the objective value as large as possible.
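The answer above can be checked numerically. At the first step the current and old models are identical, so the ratio is 1, yet the derivative of the clipped surrogate with respect to the current log-probability is non-zero and equals the advantage. The sketch below assumes a PPO-style clipped surrogate and uses a central finite difference; the names `surrogate` and `grad_wrt_logp` are hypothetical.

```python
import math


def surrogate(logp_cur, logp_old, advantage, epsilon=0.2):
    """Clipped surrogate for one token."""
    ratio = math.exp(logp_cur - logp_old)
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)


def grad_wrt_logp(logp_cur, logp_old, advantage, h=1e-6):
    """Central finite-difference derivative with respect to logp_cur."""
    up = surrogate(logp_cur + h, logp_old, advantage)
    down = surrogate(logp_cur - h, logp_old, advantage)
    return (up - down) / (2.0 * h)


# First step: current model == old model, so the ratio is exactly 1.
good = grad_wrt_logp(-1.0, -1.0, advantage=0.7)   # pushes the log-prob up
bad = grad_wrt_logp(-1.0, -1.0, advantage=-0.7)   # pushes the log-prob down
```

Outputs with a positive advantage get their token probabilities raised and outputs with a negative advantage get them lowered, so training can start without any warm-up for the ratio.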

Q: How can the scaling up/down in the objective function be simplified?

This post is licensed under CC BY 4.0 by the author.