In this course, you’ll explore the basics of instruction-tuning with Hugging Face, reward modeling, and how to train a reward model. You’ll also learn about proximal policy optimization (PPO) with Hugging Face, using LLMs as policies, and reinforcement learning from human feedback (RLHF). The course then delves into direct preference optimization (DPO) with Hugging Face, including the role of the partition function.
By the end of the course, you will be able to: