Notes on Reinforcement Learning from Human Feedback

The purpose of this post is to document notes related to Reinforcement Learning from Human Feedback (RLHF), an approach for updating a trained LLM to better match human preferences or values. The images and presentation follow the Generative AI with LLMs course by DeepLearning.AI on Coursera (see link in References).

- Aligning models with human values

- Q-Learning is an alternative to PPO for the RL algorithm.
- PPO updates the weights of the LLM based on human feedback captured in a Reward Model (e.g., completions scored as "helpful" or "not helpful"); see the sketch after this group.
- The agent can cheat the system via Reward Hacking (like padding empty stats in basketball).
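
A minimal sketch of that loop, with toy stand-ins for the LLM, the Reward Model, and the PPO step. `generate`, `reward_model`, and `ppo_update` are hypothetical placeholders (not part of any library), just to show where each piece sits:

```python
def generate(prompt: str) -> str:
    # Stand-in for sampling a completion from the RL-updated LLM.
    return prompt + " -- here is an answer meant to be helpful"

def reward_model(prompt: str, completion: str) -> float:
    # Stand-in for a reward model trained on human preference labels
    # (e.g. "helpful" vs. "not helpful"); returns a scalar score.
    return 1.0 if "helpful" in completion else -1.0

def ppo_update(prompt: str, completion: str, reward: float) -> None:
    # Stand-in for the PPO step that nudges the LLM's weights
    # toward completions that earn higher rewards.
    print(f"PPO update: reward={reward:+.1f} for {completion!r}")

for prompt in ["How do I reset my password?", "Summarize this article."]:
    completion = generate(prompt)
    reward = reward_model(prompt, completion)
    ppo_update(prompt, completion, reward)
```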

- Reward Hacking can be avoided by comparing the RL-updated LLM's output against a Reference Model (a frozen copy of the original LLM) via KL Divergence. The resulting penalty is folded back into the Reward Model's score before the PPO update, in what is essentially a control loop (sketched below).
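
A minimal sketch of how that penalty is typically combined with the reward-model score. The probabilities, reward, and `beta` below are made-up illustration values:

```python
import math

# Hypothetical per-token probabilities that the updated policy and the frozen
# reference model assign to the tokens actually sampled (numbers are made up).
policy_probs    = [0.40, 0.25, 0.30]
reference_probs = [0.35, 0.30, 0.20]

# Per-token log-ratio summed over the completion: an estimate of how far the
# updated policy has drifted from the reference model on this sample.
kl_penalty = sum(math.log(p / q) for p, q in zip(policy_probs, reference_probs))

reward_from_rm = 0.8  # made-up score from the Reward Model
beta = 0.1            # penalty strength (a tunable hyperparameter)

# The quantity PPO actually maximizes: reward minus the scaled drift penalty.
total_reward = reward_from_rm - beta * kl_penalty
print(f"kl_penalty={kl_penalty:.3f}, total_reward={total_reward:.3f}")
```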

- KL Divergence is essentially a measure of the difference between two probability distributions. Here it ensures that the updated policy does not drift too far from the reference policy; in other words, it plays the role of a regularization term (worked example below).
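
To make that concrete, here is the textbook definition applied to two tiny made-up distributions:

```python
import math

# KL divergence between two discrete distributions P and Q:
#   KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.70, 0.20, 0.10]  # e.g. next-token probabilities from the updated policy
q = [0.60, 0.25, 0.15]  # e.g. next-token probabilities from the reference model

print(kl_divergence(p, q))  # small positive value; grows as the policies diverge
print(kl_divergence(p, p))  # 0.0 -- identical distributions have zero divergence
```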

- It occurred to me that there are basically two levels of training:
    1.) The foundational original training of a model (where the weights start out random)
    2.) Various additional levels of training (transfer learning, full fine-tuning, PEFT, RLHF)

- The various additional levels of training range from modifying all of the weights across the LLM (full fine-tuning) to updating just a small subset of parameters using PEFT methods such as LoRA (sketched below).
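
A minimal sketch of the LoRA idea for a single linear layer, with made-up dimensions and rank, just to show why so few parameters end up trainable:

```python
import numpy as np

# The pretrained weight matrix W stays frozen and only a low-rank update B @ A
# is trained, so the trainable parameter count is r * (d_in + d_out)
# instead of d_in * d_out. Dimensions and rank are illustration values.
d_in, d_out, r = 512, 512, 8

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weights
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # trainable factor, initialized to zero

x = rng.normal(size=(d_in,))
y = W @ x + B @ (A @ x)                 # forward pass: original + low-rank update

print(f"trainable fraction: {(A.size + B.size) / W.size:.3%}")
```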

A nice cheat sheet describing the different levels of training appears in the course materials.
