Double DQN
Double Deep Q Network (DQN) is a value-based off-policy TD method. It estimates the action value \(Q(s,a)\) and samples actions from those values using a policy (e.g., an \(\epsilon\)-greedy policy). Double DQN is an improved version of DQN: it applies the Double Q-learning idea, originally introduced in a tabular setting, to reduce the observed overestimations.
Paper: Deep Reinforcement Learning with Double Q-learning
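The key difference from DQN is how the bootstrap target is computed. Standard DQN uses the same network both to select and to evaluate the next action, \(y_t = r_{t+1} + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a)\), which tends to overestimate action values. Double DQN decouples the two: the update network \(\theta\) selects the action and the target network \(\theta^-\) evaluates it, \(y_t = r_{t+1} + \gamma\, Q_{\theta^-}\big(s_{t+1}, \operatorname{argmax}_{a} Q_{\theta}(s_{t+1}, a)\big)\).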
Double DQN has the following features:
- value-based
- Temporal Difference (TD) method
- off-policy
- high bias, low variance
Since DQN is a TD method, you don't need to wait until an episode terminates to update the parameters. It also uses a replay buffer with a fixed size: the buffer stores experiences and samples a subset of them for training. This is possible because DQN is an off-policy method.
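To make the replay buffer idea concrete, here is a minimal sketch of a fixed-size FIFO buffer. It is not this library's implementation; the class and method names are placeholders.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer sketch (placeholder, not this library's implementation)."""

    def __init__(self, capacity: int) -> None:
        # deque with maxlen drops the oldest experience automatically (FIFO)
        self._buffer = deque(maxlen=capacity)

    def store(self, experience) -> None:
        # experience could be an (obs, action, reward, next_obs, terminated) tuple
        self._buffer.append(experience)

    def sample(self, batch_size: int) -> list:
        # off-policy: stored experiences can be sampled and reused many times
        return random.sample(self._buffer, batch_size)

    def __len__(self) -> int:
        return len(self._buffer)
```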
Configuration
Double DQN is simple, but you need to consider some hyperparameters carefully, since they may significantly affect training performance.
Note that if a setting has a default value, you can skip it.
| Setting | Description |
|---|---|
| n_steps | (int) The number of time steps over which experiences are collected before training. The total number of experiences used for training (entire_batch_size) is num_envs x n_steps. Since DQN is an off-policy method, experiences can be reused even after they have been used for training. |
| batch_size | (int) The size of the experience batch sampled from the replay buffer. |
| capacity | (int) The number of experiences that can be stored in the replay buffer. If the capacity is exceeded, the oldest experience is removed (FIFO). |
| epoch | (int) The number of parameter updates performed every n_steps. |
| gamma | (float, default = 0.99) Discount factor \(\gamma\) of future rewards. |
| replace_freq | (int \| None, default = None) The frequency of entirely replacing the target network with the update network. This can stabilize training since the target \(Q\) value stays fixed between replacements. |
| polyak_ratio | (float \| None, default = None) The target network is softly replaced with a weighted average of itself and the update network. The higher the value, the more weight is given to the update network parameters. The value \(\tau\) must satisfy \(0 < \tau \leq 1\). |
| replay_buffer_device | (str \| None, default = None) The device on which the replay buffer is stored. Since the replay buffer may use a lot of memory, you need to consider which device to store the experiences on. Defaults to the agent device. Options: None, cpu, cuda, cuda:0 and other torch.device() arguments |
| device | (str \| None, default = None) The device on which the agent works. If this setting is None, the agent device is the same as your network's device. Otherwise, the network is moved to this device. Options: None, cpu, cuda, cuda:0 and other torch.device() arguments |
If both replace_freq and polyak_ratio are None, replace_freq is set to 1. If both of them are set, replace_freq is used.
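To make the two target-network update schemes concrete, here is a minimal sketch in plain PyTorch; the function names are illustrative and not part of this library's API.

```python
import torch
import torch.nn as nn


def hard_replace(target_net: nn.Module, update_net: nn.Module) -> None:
    # replace_freq-style update: copy the update network into the target network
    # entirely, once every replace_freq training steps.
    target_net.load_state_dict(update_net.state_dict())


def polyak_update(target_net: nn.Module, update_net: nn.Module, tau: float) -> None:
    # polyak_ratio-style update: theta_target <- tau * theta_update + (1 - tau) * theta_target.
    # The higher tau is, the more the target network moves toward the update network.
    with torch.no_grad():
        for t_param, u_param in zip(target_net.parameters(), update_net.parameters()):
            t_param.mul_(1.0 - tau).add_(u_param, alpha=tau)
```

hard_replace corresponds to the replace_freq setting, while polyak_update corresponds to polyak_ratio with \(\tau\) = polyak_ratio.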
Network
class: DoubleDQNNetwork
Note that any policy distribution derived from the action values is allowed (e.g., an \(\epsilon\)-greedy policy or a Boltzmann policy).
You need to implement the methods below.
Forward
@abstractmethod
def forward(
self,
obs: Observation
) -> tuple[PolicyDist, ActionValue]
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,), details in Observation docs |
Returns:
| Name | Description | Shape |
|---|---|---|
| policy_dist (PolicyDist) | Policy distribution \(\pi(a \vert s)\). | *batch_shape = (batch_size,), details in PolicyDist docs |
| action_value (ActionValue) | Action value \(Q(s,a)\) batch tuple. | (batch_size, num_discrete_actions) x num_discrete_branches |
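As an illustration of what forward is expected to produce, here is a minimal sketch using plain PyTorch tensors and a Categorical distribution in place of the library's Observation, PolicyDist, and ActionValue wrappers (those wrappers' constructors are not shown in this document, so this is a conceptual stand-in rather than a drop-in implementation). It maps an observation batch to action values and derives an \(\epsilon\)-greedy distribution from them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetworkSketch(nn.Module):
    """Conceptual stand-in for a Double DQN network (placeholder, not this library's API)."""

    def __init__(self, obs_dim: int, num_discrete_actions: int, epsilon: float = 0.1) -> None:
        super().__init__()
        self.epsilon = epsilon
        self.q_head = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_discrete_actions),
        )

    def forward(self, obs: torch.Tensor) -> tuple[torch.distributions.Categorical, torch.Tensor]:
        # action_value: (batch_size, num_discrete_actions)
        action_value = self.q_head(obs)
        num_actions = action_value.shape[-1]
        # epsilon-greedy probabilities derived from the action values
        greedy = F.one_hot(action_value.argmax(dim=-1), num_actions).float()
        probs = (1.0 - self.epsilon) * greedy + self.epsilon / num_actions
        policy_dist = torch.distributions.Categorical(probs=probs)
        return policy_dist, action_value
```

In an actual DoubleDQNNetwork implementation, the returned distribution and Q-value tensor would be wrapped into PolicyDist and ActionValue as required by the signature above.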