REINFORCE
REINFORCE is a simple, basic policy gradient method based on the Monte Carlo (MC) method. Different from value-based methods, it has a simple idea: the policy itself is a parameterized function!
Paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation
REINFORCE has the following features:
- policy-based
- Monte Carlo (MC) method (episodic)
- on-policy
- no bias, high variance
Since REINFORCE is an MC method, it computes the return \(G_t\), which is used to update the policy parameters. It must wait until an episode terminates before updating the parameters, because only then can the return \(G_t\) be computed.
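For reference, the Monte Carlo return and the textbook REINFORCE update (Sutton & Barto) can be written as follows; the learning rate \(\alpha\) and the \(\gamma^t\) factor are shown for completeness and may differ slightly from this library's exact implementation:

\[
G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k, \qquad \theta \leftarrow \theta + \alpha \, \gamma^t \, G_t \, \nabla_\theta \ln \pi_\theta(A_t \vert S_t)
\]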
REINFORCE's high variance can be reduced by using a baseline \(b(s)\); this variant is called REINFORCE with Baseline. The agent uses \(G_t - b(s)\) instead of just \(G_t\), where the baseline \(b(s)\) is the mean of the returns.
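As a minimal sketch (not the library's actual code), the baseline-adjusted targets could be computed like this, with illustrative variable names:

```python
import torch

def reinforce_targets(rewards: list[float], gamma: float = 0.99) -> torch.Tensor:
    # Compute the return G_t for every time step by iterating backwards.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # REINFORCE with Baseline: subtract the mean of the returns as b(s).
    return returns - returns.mean()
```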
You can see the source code in reinforce.
Configuration
Since it has simple hyperparameters, you don't need a deep understanding of reinforcement learning to configure it.
Note that if a setting has a default value, you can skip it.
| Setting | Description |
|---|---|
| gamma | (float, default = 0.99) Discount factor \(\gamma\) of future rewards. |
| entropy_coef | (float, default = 0.001) Entropy multiplier used to compute the loss. It adjusts the exploration/exploitation balance. |
| device | (str \| None, default = None) Device on which the agent works. If this setting is None, the agent device is the same as your network's. Otherwise, the network device is changed to this one. Options: None, cpu, cuda, cuda:0 and other valid torch.device() arguments. |
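As an illustration only (the actual configuration format depends on how you instantiate the agent), the settings above map to values like these:

```python
# Hypothetical configuration values; keys mirror the table above.
config = {
    "gamma": 0.99,          # discount factor of future rewards
    "entropy_coef": 0.001,  # entropy multiplier in the loss
    "device": None,         # None: follow the network's device
}
```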
Network
class: REINFORCENetwork
You need to implement the methods below.
Forward
@abstractmethod
def forward(
    self,
    obs: Observation
) -> PolicyDist
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,), details in Observation docs |
Returns:
| Name | Description | Shape |
|---|---|---|
| policy_dist (PolicyDist) | Policy distribution \(\pi(a \vert s)\). | *batch_shape = (batch_size,), details in PolicyDist docs |
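Below is a rough sketch of a concrete REINFORCENetwork for a small discrete-action task. The exact interfaces of Observation and PolicyDist (how to read the observation tensor and how to construct the distribution) are assumptions here, so check their docs for the real API:

```python
import torch.nn as nn
# REINFORCENetwork, Observation and PolicyDist are provided by the library;
# the import path is omitted here.

class MyREINFORCENetwork(REINFORCENetwork):
    def __init__(self, obs_features: int = 4, num_actions: int = 2):
        super().__init__()
        self.policy_head = nn.Sequential(
            nn.Linear(obs_features, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, obs: "Observation") -> "PolicyDist":
        # Assumption: the observation batch exposes a single vector tensor
        # of shape (batch_size, obs_features).
        x = obs.items[0]
        logits = self.policy_head(x)  # (batch_size, num_actions)
        # Assumption: a categorical PolicyDist can be built from logits;
        # the real constructor may differ.
        return PolicyDist(logits=logits)
```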