Policy Gradients
The problem with Q-learning is that the Q-function can be very complicated. For a problem with a high-dimensional state space, it is hard to learn an accurate Q-value for every state-action pair. The policy itself, however, can be much simpler. The question is: can we learn a policy directly?
We will define a class of parametrized policies
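$$\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\},$$

where each $\pi_\theta(a \mid s)$ maps a state to a distribution over actions and $\theta$ is the vector of policy parameters.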
For each policy, we will define its value as
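$$J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \;\middle|\; \pi_\theta\right],$$

the expected cumulative discounted reward obtained by following $\pi_\theta$, where $r_t$ is the reward at time step $t$ and $\gamma$ is the discount factor.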
We want to find the optimal policy that will give us the best expected reward.
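That is, we look for

$$\theta^* = \arg\max_\theta J(\theta).$$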
Mathematically we can write
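$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[r(\tau)\big] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau,$$

where $\tau = (s_0, a_0, r_0, s_1, \ldots)$ is a state-transition trajectory obtained by running the policy, $p(\tau;\theta)$ is the probability of that trajectory under $\pi_\theta$, and $r(\tau)$ is the reward of the trajectory.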
We want to do gradient ascent to maximize the expected reward from the policy. So we need to differentiate the integral!
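Differentiating under the integral sign gives

$$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau.$$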
However, this is intractable as written. The gradient of an expectation is problematic because the distribution $p(\tau;\theta)$ we are averaging over itself depends on $\theta$: after differentiating, the integrand is no longer an expectation under $p(\tau;\theta)$, so we cannot estimate it by simply sampling trajectories. Here's a trick to deal with this (the log-derivative trick):
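$$\nabla_\theta p(\tau;\theta) = p(\tau;\theta)\, \frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)} = p(\tau;\theta)\, \nabla_\theta \log p(\tau;\theta).$$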
Then we inject it back into the original integral, which turns the gradient back into an expectation:
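$$\nabla_\theta J(\theta) = \int_\tau \big(r(\tau)\, \nabla_\theta \log p(\tau;\theta)\big)\, p(\tau;\theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\big].$$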
We can estimate this expectation with Monte Carlo sampling.
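Concretely, with $N$ trajectories $\tau_1, \ldots, \tau_N$ sampled by running the current policy,

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)\, \nabla_\theta \log p(\tau_i;\theta).$$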
But how can we compute this without knowing the transition probabilities? We know that the probability of a state-transition trajectory is the following:
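$$p(\tau;\theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t).$$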
Now if we take the log of the above expression, we get the following.
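$$\log p(\tau;\theta) = \sum_{t \ge 0} \Big( \log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big).$$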
Once we differentiate this expression with respect to $\theta$, we can see that the result does not depend on the transition probabilities at all:
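$$\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$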
Therefore, when we sample a trajectory $\tau$, we can estimate $\nabla_\theta J(\theta)$ with the following:
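$$\nabla_\theta J(\theta) \approx r(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$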
Now we have defined our gradient estimator.
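To make this concrete, here is a minimal sketch of one gradient-ascent step with this estimator, assuming a tabular softmax policy and a toy one-dimensional environment (the environment, the policy parameterization, and the hyperparameters are illustrative choices only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # policy parameters: one logit per (state, action)

def policy(state):
    """pi_theta(a | s): softmax over the logits of this state."""
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def sample_trajectory(horizon=20):
    """Roll out the current policy in a toy random-walk environment."""
    states, actions, rewards = [], [], []
    s = 0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(s))
        # Toy dynamics: action 1 moves "right", action 0 moves "left";
        # reaching the last state gives reward +1.
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards

def grad_log_pi(state, action):
    """grad_theta log pi_theta(a | s) for the softmax policy."""
    g = np.zeros_like(theta)
    g[state] = -policy(state)
    g[state, action] += 1.0
    return g

def reinforce_gradient():
    """Estimate grad_theta J(theta) from one sampled trajectory:
    r(tau) * sum_t grad_theta log pi_theta(a_t | s_t)."""
    states, actions, rewards = sample_trajectory()
    r_tau = sum(rewards)  # total reward of the trajectory
    score = sum(grad_log_pi(s, a) for s, a in zip(states, actions))
    return r_tau * score

# One gradient-ascent step on J(theta), averaging over a few sampled trajectories.
learning_rate, n_samples = 0.1, 10
grad = sum(reinforce_gradient() for _ in range(n_samples)) / n_samples
theta += learning_rate * grad
```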
Here is the interpretation:

- If the reward from a trajectory is high, i.e. $r(\tau)$ is high, then gradient ascent will increase the probabilities of the actions seen.
- If the reward from a trajectory is low, i.e. $r(\tau)$ is low, then gradient ascent will decrease the probabilities of the actions seen.
It may seem simplistic to say that if a trajectory is good, then all of its actions were good. However, this averages out in expectation.
Suppose you want to train the agent so that it takes the best action at every time step for a given state. The estimator does not do that directly: it only looks at a whole trajectory and judges it as good or bad, so individual actions within a well-rewarded trajectory are not guaranteed to be good choices. This still works out given enough samples: the gradient estimate is unbiased, but it has high variance, so many samples are needed before the average becomes reliable. The challenge is how to reduce variance when we only have a small number of samples. Two standard ideas help:
- Push up the probability of each action only by the cumulative future reward from that state onward, rather than by the reward of the whole trajectory.
- Use a discount factor $\gamma$ to ignore delayed effects.
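With these two changes, a common form of the estimator is

$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left( \sum_{t' \ge t} \gamma^{t' - t} r_{t'} \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$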