PPO

Proximal Policy Optimization (PPO) is one of the most powerful actor-critic methods. It stably updates policy parameters within a trust region using a surrogate objective function.

The PPO paper proposes two surrogate objective functions. We use the clipped surrogate objective, which is known to perform better.

Paper: Proximal Policy Optimization Algorithms
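
For reference, the clipped surrogate objective from the paper, with probability ratio \(r_t(\theta) = \dfrac{\pi_{\theta}(a_t \vert s_t)}{\pi_{\theta_{\text{old}}}(a_t \vert s_t)}\) and advantage estimate \(\hat{A}_t\), is:

\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \text{clip}\left(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\right) \hat{A}_t \right) \right]
\]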

PPO has the following features:

  • Actor-Critic method
  • On-policy
  • Stable
  • General

Configuration

PPO has a fairly complex set of hyperparameters, but the default values work well in most cases.

Note that if a setting has a default value, you can skip it.

Parameter Description
n_steps (int) The number of time steps over which experiences are collected before each training phase. The total number of experiences (entire_batch_size) is num_envs * n_steps. Since PPO is an on-policy method, the experiences are discarded after training.
epoch (int) The number of times the entire experience batch is used to update parameters.
mini_batch_size (int) The size of each mini-batch. Mini-batches are sampled randomly and independently from the entire experience batch during one epoch. The number of parameter updates per epoch is the integer value of entire_batch_size / mini_batch_size.
gamma (float, default = 0.99) Discount factor \(\gamma\) of future rewards.
lam (float, default = 0.95) Regularization parameter \(\lambda\) which controls the bias-variance trade-off of Generalized Advantage Estimation (GAE).
advantage_normalization (bool, default = False) Whether or not to normalize advantage estimates across a single mini-batch. It may reduce variance and improve stability, but does not seem to affect performance much.
epsilon_clip (float, default = 0.2) Clamps the probability ratio (\(\dfrac{\pi_{\text{new}}}{\pi_{\text{old}}}\)) into the range \([1 - \epsilon, 1 + \epsilon]\).
value_loss_coef (float, default = 0.5) State value loss (critic loss) multiplier.
entropy_coef (float, default = 0.001) Entropy multiplier used to compute the loss. It adjusts the exploration-exploitation trade-off.
device (str | None, default = None) Device on which the agent works. If this setting is None, the agent uses the same device as your network. Otherwise, the network is moved to this device.

Options: None, cpu, cuda, cuda:0, and any other valid torch.device() argument
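
As a quick sanity check on how these settings interact, here is a small sketch of the batch arithmetic described above, using hypothetical values for num_envs, n_steps, epoch, and mini_batch_size (these are example values, not library defaults):

# Hypothetical example values (not library defaults).
num_envs = 8
n_steps = 128
epoch = 3
mini_batch_size = 256

# Total experiences collected before each training phase.
entire_batch_size = num_envs * n_steps                      # 8 * 128 = 1024

# Parameter updates per epoch (integer value of entire_batch_size / mini_batch_size).
updates_per_epoch = entire_batch_size // mini_batch_size    # 1024 // 256 = 4

# Total parameter updates per training phase.
total_updates = epoch * updates_per_epoch                   # 3 * 4 = 12

print(entire_batch_size, updates_per_epoch, total_updates)  # 1024 4 12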

Network

class: PPOSharedNetwork

Note that since it uses an Actor-Critic architecture with parameter sharing, the encoding layer must be shared between the Actor and the Critic.

You need to implement the methods below.

Forward

@abstractmethod
def forward(
    self, 
    obs: Observation
) -> tuple[PolicyDist, Tensor]

Parameters:

Name Description Shape
obs (Observation) Observation batch tuple. *batch_shape = (batch_size,); see the Observation docs for details.

Returns:

Name Description Shape
policy_dist (PolicyDist) Policy distribution \(\pi(a \vert s)\). *batch_shape = (batch_size,); see the PolicyDist docs for details.
state_value (Tensor) State value \(V(s)\). (batch_size, 1)
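
Below is a minimal sketch of how a PPOSharedNetwork subclass with a shared encoding layer might look. It assumes PPOSharedNetwork is a torch nn.Module subclass, that the observation batch tuple exposes its tensors via a hypothetical obs.items accessor holding a single (batch_size, obs_features) tensor, and that a hypothetical CategoricalDist policy distribution can be built from logits; the constructor arguments obs_features and num_actions are also illustrative. Check the Observation and PolicyDist docs for the actual interfaces.

import torch.nn as nn
from torch import Tensor

class MyPPOSharedNetwork(PPOSharedNetwork):
    def __init__(self, obs_features: int, num_actions: int):
        # Assumption: PPOSharedNetwork is a torch nn.Module subclass.
        super().__init__()
        # Shared encoding layer: used by both the actor and the critic heads.
        self.encoder = nn.Sequential(
            nn.Linear(obs_features, 128),
            nn.ReLU(),
        )
        # Actor head: logits for a categorical policy distribution.
        self.actor_head = nn.Linear(128, num_actions)
        # Critic head: state value V(s) with shape (batch_size, 1).
        self.critic_head = nn.Linear(128, 1)

    def forward(self, obs: Observation) -> tuple[PolicyDist, Tensor]:
        # Hypothetical accessor: assumes the observation batch tuple contains
        # a single (batch_size, obs_features) tensor.
        x = obs.items[0]
        encoding = self.encoder(x)  # shared between actor and critic
        # Hypothetical categorical PolicyDist built from logits.
        policy_dist = CategoricalDist(self.actor_head(encoding))
        state_value = self.critic_head(encoding)  # (batch_size, 1)
        return policy_dist, state_value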