PPO
Proximal Policy Optimization (PPO) is one of the most widely used actor-critic methods. It updates policy parameters stably within an approximate trust region by optimizing a surrogate objective function.
The PPO paper proposes two surrogate objectives. We use the clipped surrogate objective, which is reported to perform better.
Paper: Proximal Policy Optimization Algorithms
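For intuition only, here is a minimal PyTorch sketch of the clipped surrogate objective combined with the value and entropy terms. The function name and argument layout are hypothetical, not this library's internal implementation; the `epsilon_clip`, `value_loss_coef`, and `entropy_coef` arguments correspond to the configuration settings described below.

```python
import torch
import torch.nn.functional as F
from torch import Tensor

def ppo_loss(
    new_log_prob: Tensor,   # log pi_new(a|s), shape (batch_size, 1)
    old_log_prob: Tensor,   # log pi_old(a|s), shape (batch_size, 1)
    advantage: Tensor,      # advantage estimate, shape (batch_size, 1)
    state_value: Tensor,    # V(s) from the critic, shape (batch_size, 1)
    target_value: Tensor,   # value target (e.g., GAE return), shape (batch_size, 1)
    entropy: Tensor,        # policy entropy, shape (batch_size, 1)
    epsilon_clip: float = 0.2,
    value_loss_coef: float = 0.5,
    entropy_coef: float = 0.001,
) -> Tensor:
    # probability ratio pi_new / pi_old
    ratio = torch.exp(new_log_prob - old_log_prob)
    # clipped surrogate objective: take the minimum of the unclipped and
    # clipped terms, then negate it because we minimize a loss
    surrogate = ratio * advantage
    clipped_surrogate = torch.clamp(ratio, 1.0 - epsilon_clip, 1.0 + epsilon_clip) * advantage
    policy_loss = -torch.min(surrogate, clipped_surrogate).mean()
    # critic loss: mean squared error between V(s) and the value target
    value_loss = F.mse_loss(state_value, target_value)
    # entropy bonus encourages exploration
    entropy_bonus = entropy.mean()
    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy_bonus
```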
PPO has the following features:
- Actor-Critic method
- on-policy
- stable
- general
Configuration
PPO has a relatively large set of hyperparameters, but the default values work well in most cases.
Note that any setting with a default value can be omitted.
| Parameter | Description |
|---|---|
| n_steps | (int) The number of time steps to collect experiences before training. The total number of experiences (entire_batch_size) is num_envs * n_steps. Since PPO is an on-policy method, the experiences are discarded after training. |
| epoch | (int) The number of times the entire experience batch is used to update the parameters. |
| mini_batch_size | (int) Mini-batches are selected randomly and independently from the entire experience batch during one epoch. The number of parameter updates per epoch is the integer value of entire_batch_size / mini_batch_size (see the sketch after this table). |
| gamma | (float, default = 0.99) Discount factor \(\gamma\) of future rewards. |
| lam | (float, default = 0.95) Regularization parameter \(\lambda\) which controls the bias-variance trade-off of Generalized Advantage Estimation (GAE). |
| advantage_normalization | (bool, default = False) Whether to normalize advantage estimates within each mini-batch. It may reduce variance and improve stability, but does not seem to affect performance much. |
| epsilon_clip | (float, default = 0.2) Clamps the probability ratio (\(\dfrac{\pi_{\text{new}}}{\pi_{\text{old}}}\)) into the range \([1 - \epsilon, 1 + \epsilon]\). |
| value_loss_coef | (float, default = 0.5) State value loss (critic loss) multiplier. |
| entropy_coef | (float, default = 0.001) Entropy multiplier used to compute the loss. It adjusts the exploration-exploitation trade-off. |
| device | (str \| None, default = None) Device on which the agent works. If this setting is None, the agent uses the same device as your network. Otherwise, the network is moved to this device. Options: None, cpu, cuda, cuda:0 and any other valid torch.device() argument. |
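As a rough sketch of how these settings interact (hypothetical variable names, not this library's API): num_envs * n_steps experiences are collected, advantages are computed with GAE using gamma and lam, and the batch is then split into random mini-batches for epoch passes.

```python
import torch

num_envs, n_steps = 8, 128
epoch, mini_batch_size = 3, 256
gamma, lam = 0.99, 0.95

entire_batch_size = num_envs * n_steps                     # 1024 experiences per training phase
updates_per_epoch = entire_batch_size // mini_batch_size   # 4 parameter updates per epoch

# Generalized Advantage Estimation over a rollout of shape (num_envs, n_steps)
def compute_gae(reward, value, next_value, terminated, gamma=0.99, lam=0.95):
    advantage = torch.zeros_like(reward)
    gae = torch.zeros(reward.shape[0])
    for t in reversed(range(reward.shape[1])):
        not_done = 1.0 - terminated[:, t]
        delta = reward[:, t] + gamma * next_value[:, t] * not_done - value[:, t]
        gae = delta + gamma * lam * not_done * gae
        advantage[:, t] = gae
    return advantage  # the value target is advantage + value

# training phase: sample random, independent mini-batches each epoch
# for _ in range(epoch):
#     perm = torch.randperm(entire_batch_size)
#     for start in range(0, entire_batch_size, mini_batch_size):
#         idx = perm[start:start + mini_batch_size]
#         # optionally normalize the advantages of this mini-batch
#         # (advantage_normalization), then compute the PPO loss on the
#         # mini-batch and step the optimizer
```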
Network
class: PPOSharedNetwork
Note that since it uses an actor-critic architecture with parameter sharing, the encoding layers must be shared between the actor and the critic.
You need to implement the method below.
Forward
@abstractmethod
def forward(
    self,
    obs: Observation
) -> tuple[PolicyDist, Tensor]
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,); details in Observation docs |
Returns:
| Name | Description | Shape |
|---|---|---|
| policy_dist (PolicyDist) | Policy distribution \(\pi(a \vert s)\). | *batch_shape = (batch_size,); details in PolicyDist docs |
| state_value (Tensor) | State value \(V(s)\). | (batch_size, 1) |
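For illustration, the sketch below shows the shared-encoder layout this interface expects, written in plain PyTorch with torch.distributions.Categorical as a stand-in for the library's PolicyDist. In an actual PPOSharedNetwork subclass you would read the tensors out of the Observation tuple and return the library's PolicyDist type instead; the class name and layer sizes here are assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SharedActorCritic(nn.Module):
    """Shared-encoder actor-critic layout for a vector observation and a discrete action space."""

    def __init__(self, obs_dim: int, num_actions: int):
        super().__init__()
        # encoding layers shared by the actor and the critic
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.actor_head = nn.Linear(128, num_actions)  # policy logits
        self.critic_head = nn.Linear(128, 1)           # state value V(s)

    def forward(self, vector_obs: torch.Tensor) -> tuple[Categorical, torch.Tensor]:
        x = self.encoder(vector_obs)
        policy_dist = Categorical(logits=self.actor_head(x))  # stand-in for PolicyDist
        state_value = self.critic_head(x)                     # shape (batch_size, 1)
        return policy_dist, state_value

# usage sketch
net = SharedActorCritic(obs_dim=8, num_actions=4)
dist, value = net(torch.randn(32, 8))  # batch of 32 observations
```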