`device` (`str | None`, default = `None`)
: Device on which the agent works. If this setting is `None`, the agent device is the same as your network's device; otherwise, the network device is changed to this device. Available values are `None`, `cpu`, `cuda`, `cuda:0` and any other value accepted by `torch.device()`.
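As a minimal sketch (not the library's internals), the rule above can be read as follows; `resolve_device` and its arguments are hypothetical names used only for illustration:

```python
import torch
import torch.nn as nn

# Sketch of the device rule: None keeps the network's current device,
# any other value is passed to torch.device() and the network is moved to it.
def resolve_device(network: nn.Module, device: str | None) -> torch.device:
    if device is None:
        # The agent device follows the network's device.
        return next(network.parameters()).device
    resolved = torch.device(device)  # e.g. "cpu", "cuda", "cuda:0"
    network.to(resolved)             # the network device changes to this device
    return resolved
```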
If both `replace_freq` and `polyak_ratio` are `None`, `replace_freq` is used with a default value of `1`. If both of them are set, `replace_freq` takes precedence.
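This precedence rule can be illustrated with a small sketch. This is not the library's implementation; `TargetNetworkUpdater` is a hypothetical helper, and the assumption that `polyak_ratio` weights the online parameters in the soft update should be checked against the library's docs:

```python
import copy
import torch
import torch.nn as nn

class TargetNetworkUpdater:
    """Hypothetical helper showing how replace_freq and polyak_ratio interact."""

    def __init__(self, online: nn.Module,
                 replace_freq: int | None = None,
                 polyak_ratio: float | None = None):
        if replace_freq is None and polyak_ratio is None:
            replace_freq = 1           # both unset: replace_freq defaults to 1
        if replace_freq is not None:
            polyak_ratio = None        # both set: replace_freq takes precedence
        self.online = online
        self.target = copy.deepcopy(online)
        self.replace_freq = replace_freq
        self.polyak_ratio = polyak_ratio
        self._steps = 0

    def update(self) -> None:
        self._steps += 1
        if self.replace_freq is not None:
            # Hard update: copy the online parameters every replace_freq steps.
            if self._steps % self.replace_freq == 0:
                self.target.load_state_dict(self.online.state_dict())
        else:
            # Soft (Polyak) update, assuming polyak_ratio is the weight on the online parameters.
            with torch.no_grad():
                for t, o in zip(self.target.parameters(), self.online.parameters()):
                    t.mul_(1.0 - self.polyak_ratio).add_(o, alpha=self.polyak_ratio)
```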
class: `DoubleDQNNetwork`

Note that a policy distribution derived from the action value is allowed (e.g., \(\epsilon\)-greedy policy, Boltzmann policy). You need to implement the methods below; a rough implementation sketch is shown after the tables.
```python
@abstractmethod
def forward(
    self,
    obs: Observation
) -> tuple[PolicyDist, ActionValue]
```
Parameters:

|Name|Description|Shape|
|:---|:---|:---|
|`obs` (`Observation`)|Observation batch tuple.|`*batch_shape` = `(batch_size,)`, details in `Observation` docs|
Returns:

|Name|Description|Shape|
|:---|:---|:---|
|`policy_dist` (`PolicyDist`)|Policy distribution \(\pi(a \vert s)\).|`*batch_shape` = `(batch_size,)`, details in `PolicyDist` docs|
|`action_value` (`ActionValue`)|Action value \(Q(s,a)\) batch tuple.|`(batch_size, num_discrete_actions)` x `num_discrete_branches`|
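The following is a rough sketch of what a `forward` implementation could look like, written with plain PyTorch tensors; `MyDoubleDQNNet`, its layer sizes, and the \(\epsilon\)-greedy construction are illustrative assumptions, and the raw tensors stand in for the library's `Observation`, `PolicyDist` and `ActionValue` wrappers described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyDoubleDQNNet(nn.Module):
    """Illustrative network; raw tensors stand in for the library's wrapper types."""

    def __init__(self, obs_dim: int, num_actions: int, epsilon: float = 0.1):
        super().__init__()
        self.q_net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )
        self.epsilon = epsilon
        self.num_actions = num_actions

    def forward(self, obs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # action_value: (batch_size, num_discrete_actions) for a single branch
        action_value = self.q_net(obs)
        # epsilon-greedy policy distribution pi(a|s) derived from Q(s, a)
        greedy = F.one_hot(action_value.argmax(dim=-1), self.num_actions).float()
        policy_dist = (1.0 - self.epsilon) * greedy + self.epsilon / self.num_actions
        return policy_dist, action_value
```

A Boltzmann policy would instead compute `policy_dist` as a softmax over the action values, optionally scaled by a temperature.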