arXiv Preprint - Contrastive Preference Learning: Learning from Human Feedback without RL

AI Breakdown · 2023-10-24
03:43

In this episode we discuss "Contrastive Preference Learning: Learning from Human Feedback without RL" by Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Traditional approaches to Reinforcement Learning from Human Feedback (RLHF) assume that human preferences follow reward, but recent research suggests they instead follow regret under the user's optimal policy. Building on the flawed reward assumption also complicates training, because the learned reward function must then be optimized with RL. Contrastive Preference Learning (CPL) is proposed as a new approach that learns optimal policies directly from preferences, without RL, by combining the maximum entropy principle with a contrastive objective. CPL is off-policy, applies to a variety of problems, and can handle high-dimensional and sequential RLHF tasks.
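To make the idea of a contrastive objective over preferences more concrete, here is a minimal sketch in plain Python. It assumes each behavior segment is summarized by the policy's per-step log-probabilities for the actions taken, and it scores the preferred segment against the rejected one with a two-way softmax. The function names, the temperature `alpha`, and the discount `gamma` are illustrative assumptions, not the authors' code or necessarily their exact formulation.

```python
import math


def segment_score(log_probs, alpha=0.1, gamma=1.0):
    """Discounted sum of alpha * log pi(a_t | s_t) over one segment.

    `alpha` and `gamma` are assumed hyperparameters for this sketch.
    """
    return sum((gamma ** t) * alpha * lp for t, lp in enumerate(log_probs))


def contrastive_preference_loss(preferred_log_probs, rejected_log_probs,
                                alpha=0.1, gamma=1.0):
    """Negative log-probability that the preferred segment wins a two-way softmax."""
    s_pos = segment_score(preferred_log_probs, alpha, gamma)
    s_neg = segment_score(rejected_log_probs, alpha, gamma)
    # -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))) equals softplus(s_neg - s_pos).
    diff = s_neg - s_pos
    # Numerically stable softplus: log(1 + exp(diff)).
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

In this sketch the loss shrinks as the policy assigns higher likelihood to the actions in the preferred segment than to those in the rejected one. No reward model or RL loop appears anywhere, which is the property the episode highlights.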

AI Breakdown

The podcast where we use AI to break down recent AI papers and provide simplified explanations of intricate AI topics for educational purposes.

The content presented here is generated automatically using LLM and text-to-speech technologies. While every effort is made to ensure accuracy, any misrepresentations or inaccuracies are unintentional and reflect the evolving state of the technology. We value your feedback to enhance our podcast and provide you with the best possible learning experience.
