None, cpu, cuda, cuda:0 and other devices accepted by the torch.device() argument.

class: PPORNDNetwork
Note that since it uses the Actor-Critic architecture with parameter sharing, the encoding layer must be shared between the Actor and the Critic.
RND uses two reward streams: extrinsic and intrinsic. Each stream can be episodic or non-episodic, and each can have its own discount factor. RND consists of a predictor network and a target network. The two networks should have similar (though not necessarily identical) architectures, but their initial parameters must not be the same. The target network is deterministic, meaning it is never updated.
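As a minimal sketch of this structure (the feature sizes and layer widths below are illustrative assumptions, not part of the API), the predictor and target networks could be built like this:

```python
import torch.nn as nn

obs_features = 64       # assumed observation feature size (illustrative)
rnd_out_features = 128  # assumed RND output feature size (illustrative)

# Predictor network: trained to match the target network's output.
predictor = nn.Sequential(
    nn.Linear(obs_features, 256),
    nn.ReLU(),
    nn.Linear(256, rnd_out_features),
)

# Target network: similar architecture, but an independent random
# initialization, and frozen so it is never updated.
target = nn.Sequential(
    nn.Linear(obs_features, 256),
    nn.ReLU(),
    nn.Linear(256, rnd_out_features),
)
for param in target.parameters():
    param.requires_grad = False
```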
You need to implement the methods below.
```python
@abstractmethod
def forward_actor_critic(
    self,
    obs: Observation
) -> tuple[PolicyDist, Tensor, Tensor]
```
Parameters:
Name | Description | Shape |
---|---|---|
obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,); details in Observation docs |
Returns:
Name | Description | Shape |
---|---|---|
policy_dist (PolicyDist) | Policy distribution \(\pi(a \vert s)\). | *batch_shape = (batch_size,); details in PolicyDist docs |
ext_state_value (Tensor) | Extrinsic state value \(V_E(s)\). | (batch_size, 1) |
int_state_value (Tensor) | Intrinsic state value \(V_I(s)\). | (batch_size, 1) |
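A minimal sketch of forward_actor_critic is shown below. It is not the library's reference implementation: nn.Module stands in for PPORNDNetwork, torch.distributions.Categorical stands in for PolicyDist (assuming a discrete action space), the observation is assumed to be a single flat feature tensor already extracted from the Observation tuple, and the layer sizes are arbitrary.

```python
from torch import nn, Tensor
from torch.distributions import Categorical

class ActorCriticSketch(nn.Module):  # in practice, subclass PPORNDNetwork
    def __init__(self, obs_features: int, num_actions: int):
        super().__init__()
        # Encoding layer shared between Actor and Critic (parameter sharing).
        self.encoder = nn.Sequential(
            nn.Linear(obs_features, 256),
            nn.ReLU(),
        )
        self.actor_head = nn.Linear(256, num_actions)  # policy logits
        self.ext_critic_head = nn.Linear(256, 1)       # extrinsic state value V_E(s)
        self.int_critic_head = nn.Linear(256, 1)       # intrinsic state value V_I(s)

    def forward_actor_critic(self, obs_tensor: Tensor) -> tuple[Categorical, Tensor, Tensor]:
        # obs_tensor: (batch_size, obs_features), assumed to be the flat
        # feature tensor unpacked from the Observation batch tuple.
        x = self.encoder(obs_tensor)
        policy_dist = Categorical(logits=self.actor_head(x))  # stand-in for PolicyDist
        ext_state_value = self.ext_critic_head(x)              # (batch_size, 1)
        int_state_value = self.int_critic_head(x)              # (batch_size, 1)
        return policy_dist, ext_state_value, int_state_value
```

Because the encoder is shared, a single forward pass produces the policy distribution and both value estimates.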
```python
@abstractmethod
def forward_rnd(
    self,
    obs: Observation,
) -> tuple[Tensor, Tensor]
```
The value of out_features is up to you.
Parameters:
Name | Description | Shape |
---|---|---|
obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,); details in Observation docs |
Returns:
Name | Description | Shape |
---|---|---|
predicted_feature (Tensor) | Predicted feature \(\hat{f}(s)\) whose gradient flows. | (batch_size, out_features) |
target_feature (Tensor) | Target feature \(f(s)\) whose gradient doesn't flow. | (batch_size, out_features) |
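Below is a minimal sketch of forward_rnd under the same assumptions as the earlier sketches (a flat observation tensor, arbitrary layer sizes, nn.Module standing in for PPORNDNetwork); it repeats the predictor/target construction so the example is self-contained.

```python
import torch
from torch import nn, Tensor

class RNDSketch(nn.Module):  # in practice, part of the same PPORNDNetwork subclass
    def __init__(self, obs_features: int, rnd_out_features: int = 128):
        super().__init__()
        # Predictor network: trained to match the frozen target's output.
        self.predictor = nn.Sequential(
            nn.Linear(obs_features, 256), nn.ReLU(), nn.Linear(256, rnd_out_features)
        )
        # Target network: similar architecture, independent random init, never updated.
        self.target = nn.Sequential(
            nn.Linear(obs_features, 256), nn.ReLU(), nn.Linear(256, rnd_out_features)
        )
        for param in self.target.parameters():
            param.requires_grad = False

    def forward_rnd(self, obs_tensor: Tensor) -> tuple[Tensor, Tensor]:
        # obs_tensor: (batch_size, obs_features), assumed unpacked from Observation.
        predicted_feature = self.predictor(obs_tensor)  # gradient flows to the predictor
        with torch.no_grad():
            target_feature = self.target(obs_tensor)    # gradient does not flow
        return predicted_feature, target_feature
```

The intrinsic reward in RND is then typically derived from the prediction error between these two features, which the predictor is trained to reduce.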