Double DQN
Double Deep Q Network (DQN) is a value-based off-policy TD method. It estimates the action value \(Q(s,a)\) and samples actions from those values using a policy (e.g., an \(\epsilon\)-greedy policy). Double DQN is an improved version of DQN: it applies the Double Q-learning idea, originally introduced in a tabular setting, to reduce the observed overestimations.
Paper: Deep Reinforcement Learning with Double Q-learning
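The key difference from DQN is how the bootstrap target is computed. Standard DQN uses the same network both to select and to evaluate the next action, \(y_t = r_{t+1} + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a)\), which tends to overestimate action values. Double DQN decouples the two: the update network \(\theta\) selects the action and the target network \(\theta^-\) evaluates it, \(y_t = r_{t+1} + \gamma\, Q_{\theta^-}\big(s_{t+1}, \operatorname{argmax}_{a} Q_{\theta}(s_{t+1}, a)\big)\).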
Double DQN has the following features:
- value-based
- Temporal Difference (TD) method
- off-policy
- high bias, low variance
Since DQN is a TD method, you don't need to wait until an episode terminates to update the parameters. It also uses a replay buffer with a fixed size: the buffer stores experiences and samples a subset of them for training. This is possible because DQN is an off-policy method.
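To make the replay buffer idea concrete, here is a minimal sketch of a fixed-size FIFO buffer. It is not this library's implementation; the class and method names are placeholders.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer sketch (placeholder, not this library's implementation)."""

    def __init__(self, capacity: int) -> None:
        # deque with maxlen drops the oldest experience automatically (FIFO)
        self._buffer = deque(maxlen=capacity)

    def store(self, experience) -> None:
        # experience could be an (obs, action, reward, next_obs, terminated) tuple
        self._buffer.append(experience)

    def sample(self, batch_size: int) -> list:
        # off-policy: stored experiences can be sampled and reused many times
        return random.sample(self._buffer, batch_size)

    def __len__(self) -> int:
        return len(self._buffer)
```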
Configuration
Double DQN is simple, but you need to consider some hyperparameters carefully, since they may significantly affect training performance.
Note that if a setting has a default value, you can skip it.
| Setting | Description |
|---|---|
| n_steps | (int) The number of time steps over which experiences are collected before training. The total number of experiences used for training (entire_batch_size) is num_envs x n_steps. Since DQN is an off-policy method, experiences can be reused even after they have been used for training. |
| batch_size | (int) The size of the experience batch sampled from the replay buffer. |
| capacity | (int) The number of experiences that can be stored in the replay buffer. If the capacity is exceeded, the oldest experience is removed (FIFO). |
| epoch | (int) The number of parameter updates performed every n_steps. |
| gamma | (float, default = 0.99) Discount factor \(\gamma\) of future rewards. |
| replace_freq | (int \| None, default = None) The frequency of entirely replacing the target network with the update network. This can stabilize training since the target \(Q\) value stays fixed between replacements. |
| polyak_ratio | (float \| None, default = None) The target network is softly replaced with a weighted average of itself and the update network. The higher the value, the more weight is given to the update network parameters. The value \(\tau\) must satisfy \(0 < \tau \leq 1\). |
| replay_buffer_device | (str \| None, default = None) The device on which the replay buffer is stored. Since the replay buffer may use a lot of memory, you need to consider which device to store the experiences on. Defaults to the agent device. Options: None, cpu, cuda, cuda:0 and other torch.device() arguments |
| device | (str \| None, default = None) The device on which the agent works. If this setting is None, the agent device is the same as your network's device. Otherwise, the network is moved to this device. Options: None, cpu, cuda, cuda:0 and other torch.device() arguments |
If both replace_freq and polyak_ratio are None, replace_freq is set to 1. If both of them are set, replace_freq is used.
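To make the two target-network update schemes concrete, here is a minimal sketch in plain PyTorch; the function names are illustrative and not part of this library's API.

```python
import torch
import torch.nn as nn


def hard_replace(target_net: nn.Module, update_net: nn.Module) -> None:
    # replace_freq-style update: copy the update network into the target network
    # entirely, once every replace_freq training steps.
    target_net.load_state_dict(update_net.state_dict())


def polyak_update(target_net: nn.Module, update_net: nn.Module, tau: float) -> None:
    # polyak_ratio-style update: theta_target <- tau * theta_update + (1 - tau) * theta_target.
    # The higher tau is, the more the target network moves toward the update network.
    with torch.no_grad():
        for t_param, u_param in zip(target_net.parameters(), update_net.parameters()):
            t_param.mul_(1.0 - tau).add_(u_param, alpha=tau)
```

hard_replace corresponds to the replace_freq setting, while polyak_update corresponds to polyak_ratio with \(\tau\) = polyak_ratio.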
Network
class: DoubleDQNNetwork
Note that any policy distribution derived from the action values is allowed (e.g., an \(\epsilon\)-greedy policy or a Boltzmann policy).
You need to implement the methods below.
Forward
@abstractmethod
def forward(
self,
obs: Observation
) -> tuple[PolicyDist, ActionValue]
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,), details in Observation docs |
Returns:
| Name | Description | Shape |
|---|---|---|
| policy_dist (PolicyDist) | Policy distribution \(\pi(a \vert s)\). | *batch_shape = (batch_size,), details in PolicyDist docs |
| action_value (ActionValue) | Action value \(Q(s,a)\) batch tuple. | (batch_size, num_discrete_actions) x num_discrete_branches |
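As an illustration of what forward is expected to produce, here is a minimal sketch using plain PyTorch tensors and a Categorical distribution in place of the library's Observation, PolicyDist, and ActionValue wrappers (those wrappers' constructors are not shown in this document, so this is a conceptual stand-in rather than a drop-in implementation). It maps an observation batch to action values and derives an \(\epsilon\)-greedy distribution from them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetworkSketch(nn.Module):
    """Conceptual stand-in for a Double DQN network (placeholder, not this library's API)."""

    def __init__(self, obs_dim: int, num_discrete_actions: int, epsilon: float = 0.1) -> None:
        super().__init__()
        self.epsilon = epsilon
        self.q_head = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_discrete_actions),
        )

    def forward(self, obs: torch.Tensor) -> tuple[torch.distributions.Categorical, torch.Tensor]:
        # action_value: (batch_size, num_discrete_actions)
        action_value = self.q_head(obs)
        num_actions = action_value.shape[-1]
        # epsilon-greedy probabilities derived from the action values
        greedy = F.one_hot(action_value.argmax(dim=-1), num_actions).float()
        probs = (1.0 - self.epsilon) * greedy + self.epsilon / num_actions
        policy_dist = torch.distributions.Categorical(probs=probs)
        return policy_dist, action_value
```

In an actual DoubleDQNNetwork implementation, the returned distribution and Q-value tensor would be wrapped into PolicyDist and ActionValue as required by the signature above.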