Notes on Reinforcement Learning from Human Feedback

The purpose of this post is to document notes related to Reinforcement Learning from Human Feedback (RLHF), an approach for updating a trained LLM to better match human preferences or values. The images and presentation follow the Generative AI with LLMs course by DeepLearning.AI on Coursera (see link in References).

- Aligning models with human values

- Q-Learning is an alternative to PPO for the RL algorithm.
- PPO updates the weights of the LLM based on human feedback captured in a Reward Model (e.g., completions scored as "helpful" or "not helpful"); see the sketch after this group.
- The agent can cheat the system via Reward Hacking (like padding empty stats in basketball).
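
A minimal sketch of that loop, with toy stand-ins for the LLM, the Reward Model, and the PPO step. `generate`, `reward_model`, and `ppo_update` are hypothetical placeholders (not part of any library), just to show where each piece sits:

```python
def generate(prompt: str) -> str:
    # Stand-in for sampling a completion from the RL-updated LLM.
    return prompt + " -- here is an answer meant to be helpful"

def reward_model(prompt: str, completion: str) -> float:
    # Stand-in for a reward model trained on human preference labels
    # (e.g. "helpful" vs. "not helpful"); returns a scalar score.
    return 1.0 if "helpful" in completion else -1.0

def ppo_update(prompt: str, completion: str, reward: float) -> None:
    # Stand-in for the PPO step that nudges the LLM's weights
    # toward completions that earn higher rewards.
    print(f"PPO update: reward={reward:+.1f} for {completion!r}")

for prompt in ["How do I reset my password?", "Summarize this article."]:
    completion = generate(prompt)
    reward = reward_model(prompt, completion)
    ppo_update(prompt, completion, reward)
```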

- Reward Hacking can be avoided by comparing the RL-updated LLM's output against a Reference Model (a frozen copy of the original LLM) via KL Divergence. The resulting penalty is folded back into the Reward Model's score before the PPO update, in what is essentially a control loop (sketched below).
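
A minimal sketch of how that penalty is typically combined with the reward-model score. The probabilities, reward, and `beta` below are made-up illustration values:

```python
import math

# Hypothetical per-token probabilities that the updated policy and the frozen
# reference model assign to the tokens actually sampled (numbers are made up).
policy_probs    = [0.40, 0.25, 0.30]
reference_probs = [0.35, 0.30, 0.20]

# Per-token log-ratio summed over the completion: an estimate of how far the
# updated policy has drifted from the reference model on this sample.
kl_penalty = sum(math.log(p / q) for p, q in zip(policy_probs, reference_probs))

reward_from_rm = 0.8  # made-up score from the Reward Model
beta = 0.1            # penalty strength (a tunable hyperparameter)

# The quantity PPO actually maximizes: reward minus the scaled drift penalty.
total_reward = reward_from_rm - beta * kl_penalty
print(f"kl_penalty={kl_penalty:.3f}, total_reward={total_reward:.3f}")
```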

- KL Divergence is essentially a measure of the difference between two probability distributions. Here it ensures that the updated policy does not drift too far from the reference policy; in other words, it plays the role of a regularization term (worked example below).
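
To make that concrete, here is the textbook definition applied to two tiny made-up distributions:

```python
import math

# KL divergence between two discrete distributions P and Q:
#   KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.70, 0.20, 0.10]  # e.g. next-token probabilities from the updated policy
q = [0.60, 0.25, 0.15]  # e.g. next-token probabilities from the reference model

print(kl_divergence(p, q))  # small positive value; grows as the policies diverge
print(kl_divergence(p, p))  # 0.0 -- identical distributions have zero divergence
```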

- It occurred to me that there are basically two levels of training:
    1.) The foundational original training of a model (where the weights start out random)
    2.) Various additional levels of training (transfer learning, full fine-tuning, PEFT, RLHF)

- The various additional levels of training range from modifying all of the weights across the LLM (full fine-tuning) to updating just a small subset of parameters using PEFT methods such as LoRA (sketched below).
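
A minimal sketch of the LoRA idea for a single linear layer, with made-up dimensions and rank, just to show why so few parameters end up trainable:

```python
import numpy as np

# The pretrained weight matrix W stays frozen and only a low-rank update B @ A
# is trained, so the trainable parameter count is r * (d_in + d_out)
# instead of d_in * d_out. Dimensions and rank are illustration values.
d_in, d_out, r = 512, 512, 8

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weights
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # trainable factor, initialized to zero

x = rng.normal(size=(d_in,))
y = W @ x + B @ (A @ x)                 # forward pass: original + low-rank update

print(f"trainable fraction: {(A.size + B.size) / W.size:.3%}")
```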

A nice cheat sheet describing the different levels of training appears in the course materials.
