
RLHF does change the parameters.

The way to think about it is that backpropagation nudges a model's parameters so its outputs get closer to some desired target.

In pre-training and SFT, the parameters are changed so the model does a better job of replicating the next word in the training data, given the words it has already seen.

In RLHF, the parameters are changed so the model does a better job of outputting the response that aligns to the human's preference (see: the feedback screen in the linked article).
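To make the pre-training/SFT side concrete, here is a toy sketch (illustrative numbers only, not any real training loop) of the next-token objective: the loss is just the negative log probability the model assigns to the word that actually came next, and backpropagation pushes that probability up.

```python
import math

def next_token_loss(predicted_probs, target_word):
    # predicted_probs: dict mapping candidate next words to model probabilities.
    # Cross-entropy for a single step: -log P(actual next word).
    return -math.log(predicted_probs[target_word])

# A model that strongly predicts the correct word gets a low loss...
confident = {"cat": 0.9, "dog": 0.1}
# ...while a weak prediction of the same word gets a high loss.
unsure = {"cat": 0.2, "dog": 0.8}

assert next_token_loss(confident, "cat") < next_token_loss(unsure, "cat")
```

RLHF keeps this same gradient-descent machinery but swaps the quantity being minimized, as the comments below spell out.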



Thanks. That helps.

So how can you update weights without doing back-propagation? Or is it still back propagation but with a different metric?


Both do backpropagation, the difference is what you are backpropagating towards.

Think of it this way: there are roughly equal numbers of rude and polite comments online (actually, probably far more rude ones).

If a model is trained on that data, how do you get it to only respond politely?

You could filter out the rude comments, but that's expensive, and those rude comments may still contain other helpful patterns that teach your model other things.

Alternatively, you could pre-train on the rude comments anyway, and then, after pre-training is done, hire a lot of people in a low-cost region and ask them: 'do you prefer comment 1 (a polite output of the pre-trained model) or comment 2 (a rude output)?'

The model then 'learns' that comment 1 is better because it gets more votes, and adjusts its parameters (through backpropagation) to produce outputs like comment 1 instead of comment 2.

In practice, you can't control what the model outputs, so you ask it for its top N responses and have the humans rank all of them, hoping you get a decent mix of rude and polite.
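One common way those pairwise votes become a training signal (a sketch of the Bradley-Terry-style loss used in many RLHF pipelines, with made-up scores) is to train a reward model so the preferred response scores higher; the loss shrinks as the score gap grows in the right direction.

```python
import math

def preference_loss(score_preferred, score_rejected):
    # -log(sigmoid(gap)): near zero when the preferred response is scored
    # much higher, large when the rejected response is scored higher.
    gap = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# Scoring the human-preferred (polite) comment above the rude one
# yields a much lower loss than getting the ranking backwards.
assert preference_loss(2.0, -1.0) < preference_loss(-1.0, 2.0)
```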


It's still loss being backpropagated, but the loss is calculated over a different criterion.


Ok that makes a lot of sense.

Why do they call it reinforcement learning then? Is it not traditional RL such as Q-learning?


The distinction making it RL is that the model is training on data produced by the model itself.

The benefit of RL in general is that you're training on states the agent is likely to find itself in, and the cost is needing an agent which explores salient states. Which is why we keep seeing RL as a finishing step after imitation (eg AlphaStar first learning StarCraft from replays)


The LLM's output is scored by another model that produces a reward for the entire sequence the LLM emits. The reward model is usually trained on human preferences or some other metric.

The LLM is then trained to increase this reward score (or minimize its negative), which is what makes it RL: we train on the reward rather than on a language-modeling objective.


This implies that any RLHF is introducing human bias into any "thoughts" the model may have?


Yes, but I think your comment has the foundational misconception that it's the first or even main place where bias is put into models.

LLMs are just pattern identifiers and repeaters. They are trained on inherently biased training datasets of inherently biased text written by inherently biased humans. Every single step of training introduces some amount of bias to an LLM.



