Jul 18, 2024

Proximal Policy Optimization in Machine Learning & Ethical AI: A Detailed Review

By: Eshaan Barkataki

Introduction to Proximal Policy Optimization (PPO) in Machine Learning

Today, we are going to take a look at Proximal Policy Optimization: https://arxiv.org/pdf/1707.06347. This is different from your usual supervised learning, where the neural network tries to predict as closely as possible to a labelled dataset. Instead, reinforcement learning aims to maximize the reward that the model (called the agent in this case) earns from its environment.


This is a good representation of how the RL process typically works. Our "agent" is basically our computer / neural network. It is trying to find the best action in a given environment to maximize the reward.

The "state" in this case is everything that the neural network can see. If we want our good old computer to play pong, for example, our environment would be the pong game itself, our "state" would be stuff like where the ball is, where the current position of our paddle, and what is the current position of our enemy paddle.

Core Concepts in Proximal Policy Optimization

Now, we are going to jump ahead over years and years of research straight to Proximal Policy Optimization (PPO). Bear with me here! Basically, our agent is trained to maximize our reward, but we are going to add a little extra here and there!

Detailed Breakdown of AI Equations


Equation (7) is our "objective". In other words, this is the function that the neural network is trying to maximize. What is E_t? That is basically the average across all the training examples our agent has accumulated. min(r(t) * A_t, clip(r(t), 1 - e, 1 + e) * A_t) is simply a fancy way of saying: hey model, don't get too far ahead of yourself. In other words, do not change your thought process / strategy too much until you have seen more experience.
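Here is a minimal sketch of that clipped objective in PyTorch. The function and variable names are my own (not from the paper or any particular library): ratio stands for r(t), advantages stands for A_t, and eps is the clip range (0.2 in the paper).

    # A minimal sketch of the clipped surrogate objective from Equation (7).
    # Names are illustrative: `ratio` is r(t) = pi_new(a|s) / pi_old(a|s),
    # `advantages` is A_t, and `eps` is the clip range.
    import torch

    def clipped_surrogate_loss(ratio: torch.Tensor, advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
        unclipped = ratio * advantages                                    # r(t) * A_t
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages   # clip(r(t), 1 - e, 1 + e) * A_t
        # E_t[...] becomes a mean over the batch of collected experience; the sign is
        # flipped because optimizers minimize, while PPO wants to maximize this objective.
        return -torch.min(unclipped, clipped).mean()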

A_t is like our reward, but a little bit more: it is a big fancy equation for basically quantifying, "hmmm, is this action good for us or bad for us?"

Pay close attention to Eq (11) and Eq (12). Those are basically the main equations for how it is usually calculated (although... there are various methods in the field of RL).

We see Delta_t, which is r_t + gamma (usually 0.99) * V(s_t+1) - V(s_t). This equation has several pieces in and of itself:

  • r_t is simply our current reward.

  • gamma * V(s_t+1) is saying, "hmmm, what is the value of our next state?" It is trying to look ahead into the future and see whether our future actions are favorable, even if the current reward (r_t) is bad. We can control how much the future influences our policy through gamma.

  • V(s_t) is saying, "what is the value of our current state?" We subtract the value of the current state from the reward plus the discounted value of the next state to find the advantage of acting in this state.

By the way, V is another neural network, which is typically trained on the sum of rewards. (To those who are more interested/inclined to know, the more accurate term is the sum of discounted rewards.)
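To make Eq (12) concrete, here is a tiny sketch of Delta_t in PyTorch. The numbers are made up purely for illustration; in real training, the values tensor would come from the value network described above.

    # A small sketch of Eq (12): delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    import torch

    gamma = 0.99
    rewards = torch.tensor([1.0, 0.0, -1.0])      # r_t for three timesteps (made-up numbers)
    values = torch.tensor([0.5, 0.4, 0.2, 0.0])   # V(s_t) for those three states plus the one after

    deltas = rewards + gamma * values[1:] - values[:-1]  # one delta_t per timestep
    print(deltas)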

Calculating the Advantage: Key Equations and Methods

Oftentimes, Delta_t on its own is enough to calculate the advantage. Equation (11) takes a discounted sum of these delta terms, which is a step ahead of what we are looking at right now.

Don't believe me? Let's prove it:

In fact, the advantage takes a slightly different form here: A_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... - V(s_t). It still does the same thing, though: find the advantage of our current state. This was taken right from https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py, which is a really good codebase for reviewing how PPO works! Credits to the author, and I highly recommend checking out the code :) You will see stuff that I didn't touch upon, like how r(t) is actually calculated (this delves into multivariate probability: you will see the MultivariateNormal class in there, log probs, and stuff like that :).
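Here is a short sketch in the spirit of that codebase (not a verbatim copy; the function and variable names are my own): build the discounted return r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., then subtract V(s_t).

    # A sketch (in the spirit of PPO-PyTorch, not copied from it) of the
    # "discounted returns minus value" form of the advantage.
    import torch

    def advantages_from_returns(rewards, state_values, gamma=0.99):
        # Walk backwards through the episode, accumulating the discounted return
        # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        returns = []
        discounted = 0.0
        for r in reversed(rewards):
            discounted = r + gamma * discounted
            returns.insert(0, discounted)
        returns = torch.tensor(returns)
        # Normalizing the returns (as the linked repo does) helps keep training stable.
        returns = (returns - returns.mean()) / (returns.std() + 1e-7)
        return returns - state_values  # A_t = discounted return - V(s_t)

    # Example usage with made-up numbers:
    advantages = advantages_from_returns([1.0, 0.0, -1.0], torch.tensor([0.5, 0.4, 0.2]))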

We have this big huge function that calculates the advantage of the current state; it is calculated using the reward plus the discounted value of the next state, minus the value of the current state.

Machine Learning Value Function

This uses another neural network -- called the value function -- that estimates the value of the current state and of the next state:

  1. We have min(r(t) * A_t, clip(r(t), 1 - e, 1 + e) * A_t), which says:

    • Hey computer, don't get ahead of yourself! Only change your thought process/parameters a little at a time until you have seen more examples.

  2. We train on a bunch of examples!

Note how we have two copies of the "policy" (another fancy way of saying the neural network that decides our actions; "the computer"). One is the old and one is the new. The old policy can be thought of as the "parent" (stuck on a strategy), and the new policy as the "child" (exploring new things). What keeps the "child" from wandering too far from the parent is our big fancy equation with the min and clip and all of that.
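In code, the gap between "parent" and "child" is measured by the ratio r(t) of the two policies' probabilities, usually computed from log-probabilities. Here is a hedged sketch (names are illustrative, not from a specific library):

    # A sketch of how r(t) is computed in practice: the probability the *new* policy
    # assigns to an action, divided by the probability the *old* policy assigned to it,
    # done with log-probabilities for numerical stability.
    import torch

    def ppo_ratio(new_log_probs: torch.Tensor, old_log_probs: torch.Tensor) -> torch.Tensor:
        # pi_new / pi_old == exp(log pi_new - log pi_old); the old policy is frozen,
        # so its log-probs carry no gradient.
        return torch.exp(new_log_probs - old_log_probs.detach())

    # If the "child" strays too far (ratio outside [1 - e, 1 + e]), the clip in
    # Equation (7) stops it from getting extra credit for the move.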

Applications and Relevance in Ethical AI

Understanding PPO is essential for advanced AI research labs, especially those focused on Nonprofit AI and Ethical AI. PPO is widely used across the field, notably for fine-tuning large language models (LLMs) like GPT via Reinforcement Learning from Human Feedback (RLHF). This combination of reinforcement learning and human feedback exemplifies the cutting-edge advancements in AI research.

Proximal Policy Optimization continues to play a significant role in AI research labs dedicated to ethical AI practices, making substantial contributions to both theoretical and practical advancements in the field.

For further details and an in-depth understanding, refer to the original PPO paper: Proximal Policy Optimization.