
> Trying the abliterated ones makes that embarrassingly clear. You're better off tuning on erotic fanfic for your waifu than using an abliterated one

These are two very different things. Ablation is used to remove the LLM's habit of refusing to answer, but it does not otherwise change the model's replies, much less add knowledge of or suitability for "forbidden" topics: that depends on what the model was trained on, and forbidden topics tend to be underrepresented in the training data. Instead, such models tend to confabulate even more than usual, as if clumsily trying to fill the gaps in their training. If anything, ablation makes it easier to test what an LLM would say if it were jailbroken, which may help mitigate the oft-expressed concern that a "jailbroken" model could say something dangerous. (Of course a random confabulation about the wrong topic can also be quite dangerous, but confabulation in general is a very hard problem to address.)
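For context, the core mechanism behind "abliteration" is just an orthogonal projection: a single "refusal direction" is estimated in activation space (typically as the difference of mean activations on refused vs. answered prompts) and then subtracted out of the hidden states. A minimal numpy sketch of that projection step, with a stand-in direction rather than one estimated from a real model:

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden-state row along `direction`.

    `hidden` has shape (batch, d_model); `direction` has shape (d_model,).
    This is the projection used in refusal-ablation write-ups; in a real
    model it would be applied to residual-stream activations at each layer.
    """
    r = direction / np.linalg.norm(direction)  # unit refusal direction
    # Subtract (h . r) r from each row: orthogonal projection away from r.
    return hidden - np.outer(hidden @ r, r)

# Toy example: 4-dim hidden states, "refusal direction" along the first axis.
h = np.array([[2.0, 1.0, 0.0, 0.0],
              [3.0, 0.0, 1.0, 0.0]])
r = np.array([1.0, 0.0, 0.0, 0.0])
h_ablated = ablate_direction(h, r)
# The component along r is now zero; all other components are untouched.
```

Note what this does and does not do: it zeroes one direction in activation space, which is why refusals disappear, but it adds no new capability; everything else the model "knows" is whatever was already in the weights.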



