None, cpu, cuda, cuda:0 and other devices accepted by the torch.device() argument

class: RecurrentPPORNDNetwork
Since it uses a recurrent network, you must consider the hidden state, which encodes the action-observation history.
Note that since PPO uses the Actor-Critic architecture with parameter sharing, the encoding layer must be shared between the Actor and the Critic. Be careful not to share parameters between the PPO and RND networks.
RND uses extrinsic and intrinsic reward streams. Each stream can be episodic or non-episodic, and each can have its own discount factor. RND consists of a predictor network and a target network. Both should have similar architectures (they don't have to be identical), but their initial parameters must not be the same. The target network is fixed, which means it is never updated.
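For illustration, below is a minimal sketch of how a subclass might allocate its modules so that the Actor and Critic share the encoding and recurrent layers while the RND predictor and target share no parameters with PPO. Everything in it (the MyNetwork class name, the flat-vector observation, the layer sizes, and out_features = 128) is a hypothetical assumption, not part of the API.

```python
import torch.nn as nn

class MyNetwork(nn.Module):  # hypothetical; in practice also inherit RecurrentPPORNDNetwork
    def __init__(self, obs_features: int, num_actions: int, hidden_features: int = 256):
        super().__init__()
        # PPO: encoding + recurrent layers shared by the Actor and the Critic
        self.encoder = nn.Sequential(nn.Linear(obs_features, hidden_features), nn.ReLU())
        self.lstm = nn.LSTM(hidden_features, hidden_features, batch_first=True)
        self.actor_head = nn.Linear(hidden_features, num_actions)
        self.ext_critic_head = nn.Linear(hidden_features, 1)  # V_E(s)
        self.int_critic_head = nn.Linear(hidden_features, 1)  # V_I(s)
        # RND: predictor and target share no parameters with PPO (and none with each other)
        out_features = 128  # the RND feature size is up to you
        self.rnd_predictor = nn.Sequential(
            nn.Linear(obs_features, hidden_features), nn.ReLU(),
            nn.Linear(hidden_features, out_features),
        )
        # the same architecture is fine as long as the initial parameters differ
        # (independent random initialization already guarantees that)
        self.rnd_target = nn.Sequential(
            nn.Linear(obs_features, hidden_features), nn.ReLU(),
            nn.Linear(hidden_features, out_features),
        )
        for param in self.rnd_target.parameters():
            param.requires_grad_(False)  # the target network is never updated
```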
You need to implement the methods below.
@abstractmethod
def forward_actor_critic(
self,
obs_seq: Observation,
hidden_state: Tensor
) -> tuple[PolicyDist, Tensor, Tensor, Tensor]
Parameters:
Name | Description | Shape |
---|---|---|
obs_seq (Observation ) | Observation sequence batch tuple. | *batch_shape = (seq_batch_size, seq_len) details in Observation docs |
hidden_state (Tensor ) | Hidden states at the beginning of each sequence. | (D x num_layers, seq_batch_size, H) |
Returns:
Name | Description | Shape |
---|---|---|
policy_dist_seq (PolicyDist ) | Policy distribution \(\pi(a \vert s)\) sequences. | *batch_shape = (seq_batch_size, seq_len) details in PolicyDist docs |
ext_state_value_seq (Tensor ) | Extrinsic state value \(V_E(s)\) sequences. | (seq_batch_size, seq_len, 1) |
int_state_value_seq (Tensor ) | Intrinsic state value \(V_I(s)\) sequences. | (seq_batch_size, seq_len, 1) |
next_seq_hidden_state (Tensor ) | Hidden states which will be used for the next sequence. | (D x num_layers, seq_batch_size, H) |
Refer to the following explanation:

- seq_batch_size: the number of independent sequences
- seq_len: the length of each sequence
- num_layers: the number of recurrent layers
- D: 2 if bidirectional, otherwise 1
- H: the value depends on the type of the recurrent network
    - When you use LSTM, H = H_cell + H_out. See details in https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html.
    - When you use GRU, H = H_out. See details in https://pytorch.org/docs/stable/generated/torch.nn.GRU.html.
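Continuing the sketch above, forward_actor_critic could look like the following. It assumes a single flat observation tensor (accessed through a hypothetical obs_seq.items[0]), an LSTM with H_cell = H_out, and a concatenation order of (hidden, cell) for the packed hidden state; the make_policy_dist helper is also hypothetical and stands in for however you build your concrete PolicyDist. Adapt all of these to your actual types and conventions.

```python
import torch

# continuing the MyNetwork sketch from above
def forward_actor_critic(self, obs_seq, hidden_state):
    # hypothetical accessor: assume the first observation tensor is a flat vector
    # of shape (seq_batch_size, seq_len, obs_features)
    x = self.encoder(obs_seq.items[0])
    # for LSTM, H = H_cell + H_out; split the packed hidden state into (h, c).
    # The equal split and the (h, c) order assume H_cell == H_out and are a convention choice.
    h, c = hidden_state.chunk(2, dim=2)
    x, (h_n, c_n) = self.lstm(x, (h.contiguous(), c.contiguous()))
    next_seq_hidden_state = torch.cat((h_n, c_n), dim=2)
    # the Actor and Critic heads consume the same recurrent features (shared encoding)
    logits_seq = self.actor_head(x)                      # (seq_batch_size, seq_len, num_actions)
    policy_dist_seq = make_policy_dist(logits_seq)       # hypothetical helper: build your PolicyDist here
    ext_state_value_seq = self.ext_critic_head(x)        # (seq_batch_size, seq_len, 1)
    int_state_value_seq = self.int_critic_head(x)        # (seq_batch_size, seq_len, 1)
    return policy_dist_seq, ext_state_value_seq, int_state_value_seq, next_seq_hidden_state
```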
@abstractmethod
def forward_rnd(
self,
obs: Observation,
hidden_state: Tensor
) -> tuple[Tensor, Tensor]
The value of out_features is up to you.
Parameters:
Name | Description | Shape |
---|---|---|
obs (Observation ) | Observation batch tuple. | *batch_shape = (batch_size,) details in Observation docs |
hidden_state (Tensor ) | Hidden state batch with flattened features. | (batch_size, D x num_layers x H) |
Returns:
Name | Description | Shape |
---|---|---|
predicted_feature (Tensor ) | Predicted feature \(\hat{f}(s)\) whose gradient flows. | (batch_size, out_features) |
target_feature (Tensor ) | Target feature \(f(s)\) whose gradient doesn't flow. | (batch_size, out_features) |
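As a companion sketch under the same assumptions, forward_rnd might look like this. The obs.items[0] accessor is again hypothetical, and hidden_state is ignored here; you could instead concatenate it with the observation before feeding the RND networks. Note that the RND networks take the raw observation rather than the shared PPO encoder output, so no parameters are shared with PPO.

```python
import torch

# continuing the MyNetwork sketch from above
def forward_rnd(self, obs, hidden_state):
    # hypothetical accessor: assume a flat observation tensor of shape (batch_size, obs_features).
    # hidden_state has shape (batch_size, D x num_layers x H) and is unused in this sketch.
    x = obs.items[0]
    predicted_feature = self.rnd_predictor(x)   # gradient flows; this trains the predictor
    with torch.no_grad():                       # the target network is fixed, so no gradient flows
        target_feature = self.rnd_target(x)
    return predicted_feature, target_feature
```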