Policy Gradients
The problem with Q-learning is that the Q-function can be very complicated. For a problem with a high-dimensional state space, it is hard to learn an exact or accurate Q-value for every state-action pair. However, the policy itself can be much simpler. The question is: can we learn the policy directly?
We will define a class of parametrized policies

$$\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$$
For each policy, we will define its value as

$$J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right]$$
We want to find the optimal policy $\theta^*$ that gives us the best expected reward:

$$\theta^* = \arg\max_\theta J(\theta)$$
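As a concrete illustration (my own sketch, not from the original notes), here is a minimal parametrized policy in NumPy: a softmax over linear scores of the state features. The class name, architecture, and dimensions are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """A parametrized policy pi_theta(a | s): softmax over linear scores of the state."""
    def __init__(self, state_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        # theta holds one weight vector per action; these are the parameters we optimize.
        self.theta = 0.01 * rng.standard_normal((num_actions, state_dim))

    def action_probs(self, state):
        return softmax(self.theta @ state)

    def sample_action(self, state, rng):
        probs = self.action_probs(state)
        return rng.choice(len(probs), p=probs)

# Usage (hypothetical environment with 4 state features and 2 actions):
# policy = SoftmaxPolicy(state_dim=4, num_actions=2)
# action = policy.sample_action(state, np.random.default_rng(1))
```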
REINFORCE Algorithm
Formulation
Mathematically we can write

$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}[r(\tau)] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$$

Here $r(\tau)$ is the total reward of a state-transition trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ sampled by following the policy $\pi_\theta$.
We want to do gradient ascent to maximize the expected reward from the policy. So we need to differentiate the integral!

$$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau$$

However, this is intractable. Taking the gradient of an expectation is problematic here because the probability $p(\tau;\theta)$ itself depends on $\theta$: the gradient falls on $p(\tau;\theta)$, so the result is no longer an expectation that we can estimate by sampling trajectories, and we cannot evaluate $\nabla_\theta p(\tau;\theta)$ directly without knowing the transition probabilities.
Here's a trick to deal with this (the log-derivative trick):

$$\nabla_\theta p(\tau;\theta) = p(\tau;\theta)\,\frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)} = p(\tau;\theta)\,\nabla_\theta \log p(\tau;\theta)$$

Then we inject it back into the original integral:

$$\nabla_\theta J(\theta) = \int_\tau \big(r(\tau)\,\nabla_\theta \log p(\tau;\theta)\big)\, p(\tau;\theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[r(\tau)\,\nabla_\theta \log p(\tau;\theta)\big]$$
Since this is again an expectation, we can estimate it with Monte Carlo sampling of trajectories.
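As a quick sanity check of the trick (my own toy example, not part of the notes): for a categorical distribution with softmax parameters $\theta$ and an arbitrary scalar function $f$, the Monte Carlo estimate $\frac{1}{N}\sum_i f(x_i)\,\nabla_\theta \log p_\theta(x_i)$ should match the exact gradient of $\mathbb{E}_{x \sim p_\theta}[f(x)]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: categorical distribution p_theta = softmax(theta) over 3 outcomes,
# and an arbitrary "reward" f for each outcome (values chosen arbitrarily).
theta = np.array([0.2, -0.5, 1.0])
f = np.array([1.0, 3.0, 0.5])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)

# Exact gradient of E[f(x)] = sum_i p_i f_i with respect to theta (softmax Jacobian).
exact_grad = p * (f - p @ f)

# Score-function (log-derivative trick) estimate:
#   grad ~= mean over samples of f(x) * grad_theta log p_theta(x),
# where grad_theta log p_theta(x) = onehot(x) - p for a softmax categorical.
num_samples = 200_000
xs = rng.choice(3, size=num_samples, p=p)
onehots = np.eye(3)[xs]
estimate = np.mean(f[xs][:, None] * (onehots - p), axis=0)

print("exact   :", exact_grad)
print("estimate:", estimate)  # should be close to the exact gradient
```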
Gradient Estimation
How can we compute $\nabla_\theta \log p(\tau;\theta)$ without knowing the transition probabilities? We know that the probability of a state-transition trajectory is the following.

$$p(\tau;\theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

Now if we take the log of the above expression, we get the following.

$$\log p(\tau;\theta) = \sum_{t \ge 0} \log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)$$

Once we differentiate this expression with respect to $\theta$, the transition terms drop out, and we can see that the gradient does not depend on the transition probabilities!

$$\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Therefore, when we sample a trajectory $\tau$, we can estimate $\nabla_\theta J(\theta)$ with the following.

$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
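Putting the pieces together, here is a minimal sketch (my own, assuming a hypothetical linear-softmax policy with hand-derived $\nabla_\theta \log \pi_\theta$) of the per-trajectory gradient estimate. A practical implementation would instead let an autodiff framework compute the log-probability gradients.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, state, action):
    """grad_theta log pi_theta(action | state) for a linear-softmax policy (assumed here).
    theta: (num_actions, state_dim); the result has the same shape."""
    probs = softmax(theta @ state)
    grad = -np.outer(probs, state)   # -pi(a'|s) * s for every action a'
    grad[action] += state            # +s for the action actually taken
    return grad

def reinforce_gradient(theta, trajectory):
    """Estimate grad_theta J(theta) from one sampled trajectory.
    trajectory: list of (state, action, reward) tuples."""
    total_reward = sum(r for _, _, r in trajectory)   # r(tau), the whole-trajectory reward
    grad = np.zeros_like(theta)
    for state, action, _ in trajectory:
        grad += total_reward * grad_log_pi(theta, state, action)
    return grad

# Gradient ascent step, given a trajectory sampled by following pi_theta:
# theta += learning_rate * reinforce_gradient(theta, trajectory)
```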
Intuition
Now we have defined our gradient estimator.
Here is the interpretation:
- If the reward from the trajectory is high, i.e. $r(\tau)$ is high, then gradient ascent will increase the probabilities of the actions seen.
- If the reward from the trajectory is low, i.e. $r(\tau)$ is low, then gradient ascent will decrease the probabilities of the actions seen.
It may seem simplistic to say that if a trajectory is good, then all of its actions were good. However, this averages out in expectation.
Variance Reduction
Suppose you want to train the agent so that it takes the best action at every time step for a given state. The estimator does not do that directly: it only looks at a whole trajectory and credits every action in it with the same total reward. It does not judge individual decisions at each time step, so even in a high-reward trajectory the individual actions are not guaranteed to be good choices. This still works out with enough samples: the estimator is unbiased, but it has very high variance, so it needs many sampled trajectories to give a useful gradient estimate. The challenge is how to reduce the variance when the number of samples is small.
Approach #1
Push up the probability of an action only by the cumulative future reward from that time step, rather than by the total reward of the whole trajectory:

$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
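A short sketch of this "reward-to-go" weighting (my own illustration of the formula above): each action's gradient term is scaled by the sum of rewards from its own time step onward instead of $r(\tau)$.

```python
import numpy as np

def rewards_to_go(rewards):
    """Cumulative future reward at each time step: G_t = r_t + r_{t+1} + ..."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

# Example: rewards [1, 0, 2] -> rewards-to-go [3, 2, 2].
print(rewards_to_go([1.0, 0.0, 2.0]))

# In the estimator, the per-step term becomes:
# grad += rewards_to_go(rewards)[t] * grad_log_pi(theta, state_t, action_t)
```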
Approach #2
Use a discount factor $\gamma$ to ignore delayed effects, so rewards far in the future contribute less to the weight of an action:

$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
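A minimal sketch of the discounted version (my own, with an arbitrary example value of $\gamma$): the weight for each action becomes $\sum_{t' \ge t} \gamma^{t'-t} r_{t'}$.

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """Discounted cumulative future reward: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Example with gamma = 0.5: rewards [1, 0, 2] -> [1.5, 1.0, 2.0].
print(discounted_rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))
```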