REINFORCE
REINFORCE is a simple, basic policy gradient method based on the Monte Carlo (MC) method. Different from value-based methods, it has a simple idea: the policy itself is a parameterized function!
Paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation
REINFORCE has the following features:
- policy-based
- Monte Carlo (MC) method (episodic)
- on-policy
- no bias, high variance
Since REINFORCE is an MC method, it computes the return \(G_t\), which is used to update the policy parameters. It must wait until an episode terminates before updating the parameters, because only then can the return \(G_t\) be computed.
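For reference, the Monte Carlo return and the textbook REINFORCE update (Sutton & Barto) can be written as follows; the learning rate \(\alpha\) and the \(\gamma^t\) factor are shown for completeness and may differ slightly from this library's exact implementation:

\[
G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k, \qquad \theta \leftarrow \theta + \alpha \, \gamma^t \, G_t \, \nabla_\theta \ln \pi_\theta(A_t \vert S_t)
\]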
REINFORCE's high variance can be reduced by using a baseline \(b(s)\); this variant is called REINFORCE with Baseline. The agent uses \(G_t - b(s)\) instead of just \(G_t\), where the baseline \(b(s)\) is the mean of the returns.
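As a minimal sketch (not the library's actual code), the baseline-adjusted targets could be computed like this, with illustrative variable names:

```python
import torch

def reinforce_targets(rewards: list[float], gamma: float = 0.99) -> torch.Tensor:
    # Compute the return G_t for every time step by iterating backwards.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # REINFORCE with Baseline: subtract the mean of the returns as b(s).
    return returns - returns.mean()
```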
You can see the source code in reinforce.
Configuration
Since it has simple hyperparameters, you don't need a deep understanding of reinforcement learning to configure it.
Note that if a setting has a default value, you can skip it.
| Setting | Description |
|---|---|
| gamma | (float, default = 0.99) Discount factor \(\gamma\) of future rewards. |
| entropy_coef | (float, default = 0.001) Entropy multiplier used to compute the loss. It adjusts the exploration/exploitation balance. |
| device | (str \| None, default = None) Device on which the agent works. If this setting is None, the agent device is the same as your network's. Otherwise, the network device is changed to this one. Options: None, cpu, cuda, cuda:0 and other valid torch.device() arguments. |
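As an illustration only (the actual configuration format depends on how you instantiate the agent), the settings above map to values like these:

```python
# Hypothetical configuration values; keys mirror the table above.
config = {
    "gamma": 0.99,          # discount factor of future rewards
    "entropy_coef": 0.001,  # entropy multiplier in the loss
    "device": None,         # None: follow the network's device
}
```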
Network
class: REINFORCENetwork
You need to implement the methods below.
Forward
@abstractmethod
def forward(
    self,
    obs: Observation
) -> PolicyDist
Parameters:
| Name | Description | Shape |
|---|---|---|
| obs (Observation) | Observation batch tuple. | *batch_shape = (batch_size,), details in Observation docs |
Returns:
| Name | Description | Shape |
|---|---|---|
| policy_dist (PolicyDist) | Policy distribution \(\pi(a \vert s)\). | *batch_shape = (batch_size,), details in PolicyDist docs |
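Below is a rough sketch of a concrete REINFORCENetwork for a small discrete-action task. The exact interfaces of Observation and PolicyDist (how to read the observation tensor and how to construct the distribution) are assumptions here, so check their docs for the real API:

```python
import torch.nn as nn
# REINFORCENetwork, Observation and PolicyDist are provided by the library;
# the import path is omitted here.

class MyREINFORCENetwork(REINFORCENetwork):
    def __init__(self, obs_features: int = 4, num_actions: int = 2):
        super().__init__()
        self.policy_head = nn.Sequential(
            nn.Linear(obs_features, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, obs: "Observation") -> "PolicyDist":
        # Assumption: the observation batch exposes a single vector tensor
        # of shape (batch_size, obs_features).
        x = obs.items[0]
        logits = self.policy_head(x)  # (batch_size, num_actions)
        # Assumption: a categorical PolicyDist can be built from logits;
        # the real constructor may differ.
        return PolicyDist(logits=logits)
```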