`None`, `cpu`, `cuda`, `cuda:0` and other devices of the `torch.device()` argument.

class: `RecurrentPPOSharedNetwork`
Since it uses a recurrent network, you must consider the hidden state, which can encode the action-observation history.
Note that since it uses the Actor-Critic architecture with parameter sharing, the encoding layer must be shared between Actor and Critic.
You need to implement the method below.
@abstractmethod
def forward(
self,
obs_seq: Observation,
hidden_state: Tensor
) -> tuple[PolicyDist, Tensor, Tensor]
Parameters:

Name | Description | Shape |
---|---|---|
obs_seq (`Observation`) | Observation sequence batch tuple. | `*batch_shape` = `(seq_batch_size, seq_len)`, details in `Observation` docs |
hidden_state (`Tensor`) | Hidden states at the beginning of each sequence. | `(D x num_layers, seq_batch_size, H)` |
Returns:

Name | Description | Shape |
---|---|---|
policy_dist_seq (`PolicyDist`) | Policy distribution \(\pi(a \vert s)\) sequences. | `*batch_shape` = `(seq_batch_size, seq_len)`, details in `PolicyDist` docs |
state_value_seq (`Tensor`) | State value \(V(s)\) sequences. | `(seq_batch_size, seq_len, 1)` |
next_seq_hidden_state (`Tensor`) | Hidden states which will be used for the next sequence. | `(D x num_layers, seq_batch_size, H)` |
Refer to the following explanation:

- `seq_batch_size`: the number of independent sequences
- `seq_len`: the length of each sequence
- `num_layers`: the number of recurrent layers
- `D`: 2 if bidirectional, otherwise 1
- `H`: the value depends on the type of the recurrent network
    - When you use LSTM, `H` = `H_cell` + `H_out`. See details in https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html.
    - When you use GRU, `H` = `H_out`. See details in https://pytorch.org/docs/stable/generated/torch.nn.GRU.html.