Recurrent PPO RND
Recurrent PPO RND is a modified version of PPO RND that uses a recurrent network such as an LSTM or GRU. You can see why a recurrent network is useful in Recurrent PPO.
Configuration
| Parameter | Description |
|---|---|
| n_steps | (int) The number of time steps to collect experiences before training. The total number of experiences (entire_batch_size) is num_envs * n_steps. Since PPO is an on-policy method, the experiences are discarded after training. |
| epoch | (int) The number of times the entire experience batch is used to update parameters. |
| seq_len | (int) The sequence length of the experience sequence batches used for training. The entire experience batch is split into sequence batches of length seq_len, padded with padding_value. This is why the entire sequence batch size (entire_seq_batch_size) is greater than entire_batch_size. A value of 8 or greater is typically recommended. |
| seq_mini_batch_size | (int) The sequence mini-batches are selected randomly and independently from the entire experience sequence batch during one epoch. The number of parameter updates per epoch is the integer value of entire_seq_batch_size / seq_mini_batch_size. |
| padding_value | (float, default = 0.0) Value used to pad sequences to the same seq_len. |
| ext_gamma | (float, default = 0.999) Discount factor \(\gamma_E\) of future extrinsic rewards. |
| int_gamma | (float, default = 0.99) Discount factor \(\gamma_I\) of future intrinsic rewards. |
| ext_adv_coef | (float, default = 1.0) Extrinsic advantage multiplier. |
| int_adv_coef | (float, default = 1.0) Intrinsic advantage multiplier. |
| lam | (float, default = 0.95) Regularization parameter \(\lambda\) which controls the bias-variance trade-off of Generalized Advantage Estimation (GAE). |
| epsilon_clip | (float, default = 0.2) Clamps the probability ratio (\(\dfrac{\pi_{\text{new}}}{\pi_{\text{old}}}\)) into the range \([1 - \epsilon, 1 + \epsilon]\). |
| value_loss_coef | (float, default = 0.5) State value loss (critic loss) multiplier. |
| entropy_coef | (float, default = 0.001) Entropy multiplier used to compute the loss. It adjusts the exploration-exploitation trade-off. |
| rnd_pred_exp_proportion | (float, default = 0.25) The proportion of experiences used to train the RND predictor, which keeps the effective batch size constant. |
| init_norm_steps | (int \| None, default = 50) The number of initial time steps used to initialize the normalization parameters of both the observation and the hidden state. When the value is None, they are never normalized during training. |
| obs_norm_clip_range | ([float, float], default = [-5.0, 5.0]) Clamps the normalized observation into the range [min, max]. |
| hidden_state_norm_clip_range | ([float, float], default = [-5.0, 5.0]) Clamps the normalized hidden state into the range [min, max]. |
| device | (str \| None, default = None) Device on which the agent works. If None, the agent uses the same device as your network. Otherwise, the network is moved to this device. Options: None, cpu, cuda, cuda:0 and other valid torch.device() arguments. |
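For reference, the sketch below collects these settings in a plain Python dictionary, using the documented defaults plus illustrative (untuned) values for the parameters that have no default. The actual configuration format you use (for example, a YAML file) depends on your training setup, so treat this only as a summary of the parameters above.

```python
# Illustrative hyperparameter values only; the surrounding configuration
# format (e.g., YAML keys and nesting) depends on your training setup.
recurrent_ppo_rnd_config = {
    "n_steps": 128,               # experiences per environment before each training phase
    "epoch": 3,                   # passes over the entire experience batch
    "seq_len": 16,                # length of each padded training sequence (>= 8 recommended)
    "seq_mini_batch_size": 4,     # sequence mini-batch size per parameter update
    "padding_value": 0.0,
    "ext_gamma": 0.999,           # discount factor for extrinsic rewards
    "int_gamma": 0.99,            # discount factor for intrinsic rewards
    "ext_adv_coef": 1.0,
    "int_adv_coef": 1.0,
    "lam": 0.95,                  # GAE lambda
    "epsilon_clip": 0.2,
    "value_loss_coef": 0.5,
    "entropy_coef": 0.001,
    "rnd_pred_exp_proportion": 0.25,
    "init_norm_steps": 50,
    "obs_norm_clip_range": [-5.0, 5.0],
    "hidden_state_norm_clip_range": [-5.0, 5.0],
    "device": None,               # None: follow the network's device
}
```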
Network
class: RecurrentPPORNDNetwork
Since it uses a recurrent network, you must handle the hidden state, which carries the action-observation history.
Note that since PPO uses the Actor-Critic architecture with parameter sharing, the encoding layer must be shared between the Actor and the Critic. Be careful not to share parameters between the PPO and RND networks.
RND uses extrinsic and intrinsic reward streams. Each stream can be episodic or non-episodic and can have a different discount factor. RND consists of a predictor network and a target network. They should have similar (but not necessarily identical) architectures, and their initial parameters must not be the same. The target network is fixed, which means it is never updated.
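For illustration, one simple way to keep the target network fixed is to disable gradients on its parameters right after random initialization. The layer sizes below are placeholder assumptions, not part of the required interface.

```python
import torch.nn as nn

# Hypothetical predictor / target pair for the RND bonus. Both map an input
# encoding to a feature vector of the same size, but they are initialized
# independently and only the predictor is ever trained.
rnd_predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
rnd_target = nn.Sequential(nn.Linear(256, 128))

# Freeze the target network: its parameters never receive gradient updates.
for param in rnd_target.parameters():
    param.requires_grad_(False)
```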
You need to implement the methods below.
Forward Actor-Critic
@abstractmethod
def forward_actor_critic(
    self,
    obs_seq: Observation,
    hidden_state: Tensor
) -> tuple[PolicyDist, Tensor, Tensor, Tensor]
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs_seq (Observation) | Observation sequence batch tuple. | *batch_shape = (seq_batch_size, seq_len), details in Observation docs |
| hidden_state (Tensor) | Hidden states at the beginning of each sequence. | (D x num_layers, seq_batch_size, H) |
Returns:
| Name | Description | Shape |
|---|---|---|
| policy_dist_seq (PolicyDist) | Policy distribution \(\pi(a \vert s)\) sequences. | *batch_shape = (seq_batch_size, seq_len), details in PolicyDist docs |
| ext_state_value_seq (Tensor) | Extrinsic state value \(V_E(s)\) sequences. | (seq_batch_size, seq_len, 1) |
| int_state_value_seq (Tensor) | Intrinsic state value \(V_I(s)\) sequences. | (seq_batch_size, seq_len, 1) |
| next_seq_hidden_state (Tensor) | Hidden states which will be used for the next sequence. | (D x num_layers, seq_batch_size, H) |
Refer to the following explanation:
- seq_batch_size: the number of independent sequences
- seq_len: the length of each sequence
- num_layers: the number of recurrent layers
- D: 2 if bidirectional, otherwise 1
- H: the value depends on the type of the recurrent network
When you use LSTM, H = H_cell + H_out. See details in https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html.
When you use GRU, H = H_out. See details in https://pytorch.org/docs/stable/generated/torch.nn.GRU.html.
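For example, with a single-layer, unidirectional LSTM you can pack the cell and output states into the single hidden-state tensor this interface expects. The sizes below are arbitrary assumptions used only to show the shape convention.

```python
import torch
import torch.nn as nn

hidden_size = 64   # H_out == H_cell in this sketch, so H = 2 * hidden_size
num_layers = 1     # and D = 1, since the LSTM is not bidirectional
lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)

x = torch.randn(8, 16, 32)                     # (seq_batch_size, seq_len, features)
h = torch.zeros(num_layers, 8, hidden_size)    # output (hidden) state
c = torch.zeros(num_layers, 8, hidden_size)    # cell state

output, (h, c) = lstm(x, (h, c))

# Pack (h, c) into a single tensor of shape (D x num_layers, seq_batch_size, H)
# where H = H_cell + H_out, matching the hidden_state convention above.
hidden_state = torch.cat([h, c], dim=-1)

# Split it back into (h, c) when the next forward pass needs the LSTM state.
h_next, c_next = hidden_state.split(hidden_size, dim=-1)
```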
Forward RND
@abstractmethod
def forward_rnd(
    self,
    obs: Observation,
    hidden_state: Tensor
) -> tuple[Tensor, Tensor]
The value of out_features is up to you.
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,), details in Observation docs |
| hidden_state (Tensor) | Hidden state batch with flattened features. | (batch_size, D x num_layers x H) |
Returns:
| Name | Description | Shape |
|---|---|---|
| predicted_feature (Tensor) | Predicted feature \(\hat{f}(s)\) whose gradient flows. | (batch_size, out_features) |
| target_feature (Tensor) | Target feature \(f(s)\) whose gradient doesn't flow. | (batch_size, out_features) |
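Putting the two methods together, here is a minimal structural sketch for a single flat observation tensor and a discrete action space. It simplifies the library types: obs_seq and obs are treated as plain tensors, torch.distributions.Categorical stands in for PolicyDist, and the class inherits from nn.Module rather than RecurrentPPORNDNetwork. All layer sizes are illustrative assumptions, so treat this as a shape guide rather than a drop-in implementation.

```python
import torch
import torch.nn as nn

class MyRecurrentPPORNDNetwork(nn.Module):
    """Sketch of a RecurrentPPORNDNetwork-style module (simplified types)."""

    def __init__(self, obs_features: int, num_actions: int, hidden_size: int = 64):
        super().__init__()
        self.hidden_size = hidden_size  # H_out; with LSTM, H = H_cell + H_out = 2 * hidden_size

        # Shared encoder + recurrent layer for the actor and both critics.
        self.encoder = nn.Sequential(nn.Linear(obs_features, 128), nn.ReLU())
        self.lstm = nn.LSTM(128, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, num_actions)
        self.ext_critic = nn.Linear(hidden_size, 1)
        self.int_critic = nn.Linear(hidden_size, 1)

        # RND networks: completely separate parameters from the PPO part.
        rnd_in = obs_features + 2 * hidden_size  # observation + flattened hidden state
        self.rnd_predictor = nn.Sequential(nn.Linear(rnd_in, 256), nn.ReLU(), nn.Linear(256, 128))
        self.rnd_target = nn.Sequential(nn.Linear(rnd_in, 256), nn.ReLU(), nn.Linear(256, 128))
        for p in self.rnd_target.parameters():
            p.requires_grad_(False)  # the target network is never trained

    def forward_actor_critic(self, obs_seq, hidden_state):
        # obs_seq: (seq_batch_size, seq_len, obs_features)
        # hidden_state: (D x num_layers, seq_batch_size, H) with H = 2 * hidden_size
        h, c = hidden_state.split(self.hidden_size, dim=-1)
        encoded = self.encoder(obs_seq)
        encoded, (h, c) = self.lstm(encoded, (h.contiguous(), c.contiguous()))

        policy_dist_seq = torch.distributions.Categorical(logits=self.actor(encoded))
        ext_state_value_seq = self.ext_critic(encoded)      # (seq_batch_size, seq_len, 1)
        int_state_value_seq = self.int_critic(encoded)      # (seq_batch_size, seq_len, 1)
        next_seq_hidden_state = torch.cat([h, c], dim=-1)   # (D x num_layers, seq_batch_size, H)
        return policy_dist_seq, ext_state_value_seq, int_state_value_seq, next_seq_hidden_state

    def forward_rnd(self, obs, hidden_state):
        # obs: (batch_size, obs_features), hidden_state: (batch_size, D x num_layers x H)
        rnd_input = torch.cat([obs, hidden_state], dim=-1)
        predicted_feature = self.rnd_predictor(rnd_input)     # gradient flows
        target_feature = self.rnd_target(rnd_input).detach()  # gradient doesn't flow
        return predicted_feature, target_feature
```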