A2C

Advantage Actor-Critic (A2C) is a simple actor-critic method. Unlike REINFORCE, which computes the full return \(G_t\), A2C estimates the state value \(V(s)\) and bootstraps from it. It updates the policy parameters using the advantage function \(A(s,a) = Q(s,a) - V(s)\), where \(Q(s,a)\) is the action value and \(V(s)\) is the state value.

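In practice the action value \(Q(s,a)\) is not learned separately. A common one-step estimate bootstraps it from the critic (this library uses Generalized Advantage Estimation, configured by the lam setting below, which generalizes this idea):

\[
A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)
\]

and the policy parameters \(\theta\) are updated along \(\nabla_\theta \log \pi_\theta(a_t \vert s_t) \, A(s_t, a_t)\).
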
A2C has the following features:

  • policy-based
  • Temporal Difference (TD) method
  • on-policy
  • higher bias, lower variance (compared to Monte Carlo methods such as REINFORCE)

Since A2C is a TD method, you don't need to wait until an episode terminates to update the parameters. Both online and batch learning are possible.
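
For example, the critic's estimate of the last observed state's value can stand in for the rest of the return, so a rollout can be cut off mid-episode and still give training targets. Below is a minimal, self-contained sketch of such bootstrapped n-step targets (the function and the numbers are illustrative only, not this library's API; episode boundaries inside the rollout are ignored for clarity):

import torch

def n_step_targets(rewards, gamma=0.99, last_value=0.0):
    # Bootstrapped n-step value targets for a rollout that need not end the episode:
    # target_t = r_t + gamma * r_{t+1} + ... + gamma^{n-1-t} * r_{n-1} + gamma^{n-t} * V(s_n)
    targets = torch.empty_like(rewards)
    ret = last_value
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        targets[t] = ret
    return targets

# a rollout cut off after 4 steps -- the episode has not terminated,
# so the critic's estimate V(s_4) = 0.7 bootstraps the remaining return
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])
values = torch.tensor([0.9, 0.8, 1.1, 1.0])     # critic outputs V(s_0..s_3)
targets = n_step_targets(rewards, last_value=0.7)
advantages = targets - values                   # advantage estimates A(s_t, a_t)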

You can see the source code in a2c.

Configuration

Since it has simple hyperparameters, you don't need a deep understanding of reinforcement learning to configure it.

Note that if a setting has a default value, you can skip it. A short sketch of how the settings below are typically used follows the table.

Setting Description
n_steps (int) The number of time steps of experience to collect before each training step. The total number of experiences used for training (entire_batch_size) is num_envs x n_steps. Since A2C is an on-policy method, the experiences are discarded after training.
gamma (float, default = 0.99) Discount factor \(\gamma\) of future rewards.
lam (float, default = 0.95) Regularization parameter \(\lambda\) which controls the bias-variance trade-off of Generalized Advantage Estimation (GAE).
value_loss_coef (float, default = 0.5) State value loss (critic loss) multiplier.
entropy_coef (float, default = 0.001) Entropy multiplier used to compute the loss. It adjusts the exploration/exploitation balance.
device (str | None, default = None) Device on which the agent runs. If this setting is None, the agent uses the same device as your network. Otherwise, the network is moved to this device.

Options: None, cpu, cuda, cuda:0, and any other valid torch.device() argument
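
The sketch below shows how these settings typically enter the computation: gamma and lam drive GAE, while value_loss_coef and entropy_coef weight the critic and entropy terms of the loss. The numbers and the placeholder loss terms are made up for illustration, and the exact loss in this library may differ:

import torch

gamma, lam = 0.99, 0.95                  # the gamma and lam settings
value_loss_coef, entropy_coef = 0.5, 0.001

# one rollout of n_steps = 5 from a single environment (made-up numbers,
# episode boundaries ignored for clarity)
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0])
values = torch.tensor([0.5, 0.6, 0.4, 0.5, 0.7])       # V(s_0..s_4)
next_values = torch.tensor([0.6, 0.4, 0.5, 0.7, 0.3])  # V(s_1..s_5)

# Generalized Advantage Estimation:
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
#   A_t = delta_t + gamma * lam * A_{t+1}
deltas = rewards + gamma * next_values - values
advantages = torch.empty_like(deltas)
gae = 0.0
for t in reversed(range(len(deltas))):
    gae = deltas[t] + gamma * lam * gae
    advantages[t] = gae
returns = advantages + values            # regression targets for the critic

# placeholder loss terms, normally computed from the network's outputs
policy_loss = torch.tensor(0.2)          # -E[log pi(a|s) * A(s,a)]
value_loss = torch.tensor(0.1)           # MSE between V(s) and returns
entropy = torch.tensor(1.5)              # E[H(pi(.|s))]
loss = policy_loss + value_loss_coef * value_loss - entropy_coef * entropy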

Network

class: A2CSharedNetwork

Note that since it uses the Actor-Critic architecture with parameter sharing, the encoding layer must be shared between the Actor and the Critic.

You need to implement the methods below. A minimal sketch of such a network is shown at the end of this section.

Forward

@abstractmethod
def forward(
    self, 
    obs: Observation
) -> tuple[PolicyDist, Tensor]

Parameters:

Name Description Shape
obs (Observation) Observation batch tuple. *batch_shape = (batch_size,); see the Observation docs for details

Returns:

Name Description Shape
policy_dist (PolicyDist) Policy distribution \(\pi(a \vert s)\). *batch_shape = (batch_size,); see the PolicyDist docs for details
state_value (Tensor) State value \(V(s)\) (batch_size, 1)
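
For reference, here is a minimal sketch of such a shared network. It assumes a flat float observation and a discrete action space, and it uses torch.distributions.Categorical as a stand-in for this library's Observation and PolicyDist types, so it illustrates the shared-encoder idea rather than being a drop-in A2CSharedNetwork implementation:

import torch
import torch.nn as nn
from torch.distributions import Categorical

class SharedActorCritic(nn.Module):
    # One shared encoder with two heads: the actor head outputs the policy
    # distribution pi(a|s) and the critic head outputs the state value V(s).
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(        # encoding layer shared by both heads
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor_head = nn.Linear(hidden, num_actions)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor) -> tuple[Categorical, torch.Tensor]:
        embedding = self.encoder(obs)
        policy_dist = Categorical(logits=self.actor_head(embedding))
        state_value = self.critic_head(embedding)   # shape (batch_size, 1)
        return policy_dist, state_value

# usage
net = SharedActorCritic(obs_dim=8, num_actions=4)
obs = torch.randn(32, 8)                 # a batch of 32 observations
policy_dist, state_value = net(obs)
actions = policy_dist.sample()           # shape (32,)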