
RLHF does change the parameters.

The way to think about it is that backpropagation nudges a model's parameters so its outputs get closer to some desired target.

In pre-training and SFT, the parameters are changed so the model does a better job of replicating the next word in the training data, given the words it has already seen.

In RLHF, the parameters are changed so the model does a better job of outputting the response that aligns to the human's preference (see: the feedback screen in the linked article).
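To make the pre-training/SFT side concrete, here is a toy sketch (illustrative numbers only, not any real training loop) of the next-token objective: the loss is just the negative log probability the model assigns to the word that actually came next, and backpropagation pushes that probability up.

```python
import math

def next_token_loss(predicted_probs, target_word):
    # predicted_probs: dict mapping candidate next words to model probabilities.
    # Cross-entropy for a single step: -log P(actual next word).
    return -math.log(predicted_probs[target_word])

# A model that strongly predicts the correct word gets a low loss...
confident = {"cat": 0.9, "dog": 0.1}
# ...while a weak prediction of the same word gets a high loss.
unsure = {"cat": 0.2, "dog": 0.8}

assert next_token_loss(confident, "cat") < next_token_loss(unsure, "cat")
```

RLHF keeps this same gradient-descent machinery but swaps the quantity being minimized, as the comments below spell out.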



Thanks. That helps.

So how can you update weights without doing back-propagation? Or is it still back propagation but with a different metric?


Both do backpropagation, the difference is what you are backpropagating towards.

Think of it this way: there are roughly equal numbers of rude and polite comments online (actually, probably far more rude ones).

If a model is trained on that data, how do you get it to only respond politely?

You could filter out the rude comments, but that's expensive, and those rude comments may still contain other helpful patterns that teach your model other things.

Alternatively, you could pre-train on the rude comments anyway, and then, after pre-training is done, hire a lot of people in a low-cost region and ask them: 'do you prefer comment 1 (a polite output of the pre-trained model) or comment 2 (a rude output)?'

The model then 'learns' that comment 1 is better because it gets more votes, and adjusts its parameters (through backpropagation) to produce outputs like comment 1 instead of comment 2.

In practice, you can't control what the model outputs, so you ask it for its top N responses and have the humans rank all of them, hoping you get a decent mix of rude and polite.
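One common way those pairwise votes become a training signal (a sketch of the Bradley-Terry-style loss used in many RLHF pipelines, with made-up scores) is to train a reward model so the preferred response scores higher; the loss shrinks as the score gap grows in the right direction.

```python
import math

def preference_loss(score_preferred, score_rejected):
    # -log(sigmoid(gap)): near zero when the preferred response is scored
    # much higher, large when the rejected response is scored higher.
    gap = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# Scoring the human-preferred (polite) comment above the rude one
# yields a much lower loss than getting the ranking backwards.
assert preference_loss(2.0, -1.0) < preference_loss(-1.0, 2.0)
```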


It's still loss being backpropagated, but the loss is calculated over a different criterion.


Ok that makes a lot of sense.

Why do they call it reinforcement learning then? Is it not traditional RL such as Q-learning?


The distinction making it RL is that the model is training on data produced by the model itself.

The benefit of RL in general is that you're training on states the agent is likely to find itself in, and the cost is needing an agent which explores salient states. Which is why we keep seeing RL as a finishing step after imitation (eg AlphaStar first learning StarCraft from replays)


The LLM's output is scored by another model that produces a reward for the entire sequence the LLM emits. The reward model is usually trained on human preferences or some other metric.

The LLM is then trained to increase this reward score (or minimize its negative), which is what makes it RL: we train on the reward rather than on a language-modeling objective.


This implies that any RLHF is introducing human bias into any "thoughts" the model may have?


Yes, but I think your comment has the foundational misconception that it's the first or even main place where bias is put into models.

LLMs are just pattern identifiers and repeaters. They are trained on inherently biased training datasets of inherently biased text written by inherently biased humans. Every single step of training introduces some amount of bias to an LLM.



