Penalty Function Algorithms

`P3O`(env_id, cfgs)	The implementation of the P3O algorithm.
`IPO`(env_id, cfgs)	The implementation of the IPO algorithm.

Penalized Proximal Policy Optimization

Documentation

class omnisafe.algorithms.on_policy.P3O(env_id, cfgs)

The implementation of the P3O algorithm.

References

Title: Penalized Proximal Policy Optimization for Safe Reinforcement Learning
Authors: Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, Dacheng Tao.
URL: P3O

Initialize an instance of the algorithm.

_init_log()

Log the P3O-specific information.

Things to log	Description
Loss/Loss_pi_cost	The loss of the cost performance.

Return type: None

_loss_pi_cost(obs, act, logp, adv_c)

Compute the cost performance of the current policy.

The cost loss of the policy is computed from the real cost, using the cost advantage estimated from sampled data:

\[L = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi^{'} (a|s)}{\pi (a|s)} A^{C}_{\pi_{\theta}} (s, a) \right] \tag{2}\]

where \(A^{C}_{\pi_{\theta}} (s, a)\) is the cost advantage, \(\pi (a|s)\) is the old policy, and \(\pi^{'} (a|s)\) is the current policy.

Parameters:

obs (torch.Tensor) – The observation sampled from buffer.
act (torch.Tensor) – The action sampled from buffer.
logp (torch.Tensor) – The log probability of action sampled from buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from buffer.

Returns:

The loss of the cost performance.

Return type:

Tensor
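
As an illustration of Eq. (2), here is a minimal sketch in plain Python. It is not the OmniSafe implementation: it uses lists of floats rather than `torch.Tensor` to stay dependency-free, and the names `surrogate_cost_loss` and `logp_new` are hypothetical. The probability ratio is recovered from log probabilities as `exp(logp_new - logp_old)`, and the sign follows the equation as printed above.

```python
import math


def surrogate_cost_loss(logp_new, logp_old, adv_c):
    """Return -mean(ratio * adv_c), following Eq. (2) as written.

    logp_new: log pi'(a|s) under the current policy (hypothetical input).
    logp_old: log pi(a|s) under the old policy, as stored in the buffer.
    adv_c:    cost advantages for the same samples.
    """
    # Importance ratio pi'(a|s) / pi(a|s), computed in log space.
    ratios = [math.exp(new - old) for new, old in zip(logp_new, logp_old)]
    # Sample mean approximates the expectation over visited states.
    weighted = [r * a for r, a in zip(ratios, adv_c)]
    return -sum(weighted) / len(weighted)
```

When the current and old policies coincide, every ratio equals 1 and the value reduces to the negated mean cost advantage.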

_update_actor(obs, act, logp, adv_r, adv_c)

Update the policy network in a nested for loop: the outer loop runs over actor iterations and the inner loop over mini-batches.

The pseudocode is shown below:

for _ in range(self.cfgs.actor_iters):
    for _ in range(self.cfgs.num_mini_batches):
        # Get mini-batch data
        # Compute loss
        # Update network
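
The nested loop above can be made concrete with a small sketch. This is an assumption-laden illustration in plain Python, not the library's code: `iterate_minibatches`, `num_samples`, and `seed` are hypothetical names, and the loss computation and network update are left to the caller. Each outer iteration reshuffles the sample indices and splits them into exactly `num_mini_batches` disjoint batches.

```python
import random


def iterate_minibatches(num_samples, actor_iters, num_mini_batches, seed=0):
    """Yield (iteration, indices) pairs for a nested update loop."""
    rng = random.Random(seed)
    for iteration in range(actor_iters):
        # Reshuffle once per outer iteration so batches differ each pass.
        indices = list(range(num_samples))
        rng.shuffle(indices)
        for batch in range(num_mini_batches):
            # Strided slicing gives disjoint batches covering every sample.
            yield iteration, indices[batch::num_mini_batches]


# Usage: the caller would index obs/act/logp/adv_r/adv_c with `idx`,
# compute the loss, and step the optimizer at this point.
for iteration, idx in iterate_minibatches(num_samples=10, actor_iters=2, num_mini_batches=3):
    pass
```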

Warning

For some KL-divergence-based algorithms (e.g. TRPO, CPO), the KL divergence between the old policy and the new policy is computed and used to decide whether the update is successful. If the KL divergence is too large, the update is terminated.
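
One common way to realize such a check is sketched below, under stated assumptions. The sample-based estimator `mean(logp_old - logp_new)` is a standard approximation of the KL divergence from the old policy to the new one, not necessarily what the library computes; `approx_kl`, `should_stop`, and the threshold `target_kl` are hypothetical names and values chosen for illustration.

```python
def approx_kl(logp_old, logp_new):
    """Sample-based estimate of KL(old || new) from log probabilities."""
    diffs = [old - new for old, new in zip(logp_old, logp_new)]
    return sum(diffs) / len(diffs)


def should_stop(logp_old, logp_new, target_kl=0.02):
    """Return True when the estimated KL exceeds the (assumed) threshold."""
    return approx_kl(logp_old, logp_new) > target_kl
```

In an update loop, `should_stop` would be evaluated after each pass over the mini-batches, and the loop would break as soon as it returns `True`.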

Parameters:

obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log_p stored in buffer.
adv_r (torch.Tensor) – reward_advantage stored in buffer.
adv_c (torch.Tensor) – cost_advantage stored in buffer.

Return type:

None

Interior-point Policy Optimization

Documentation

class omnisafe.algorithms.on_policy.IPO(env_id, cfgs)

The implementation of the IPO algorithm.

References

Title: IPO: Interior-point Policy Optimization under Constraints
Authors: Yongshuai Liu, Jiaxin Ding, Xin Liu.
URL: IPO

Initialize an instance of the algorithm.

_compute_adv_surrogate(adv_r, adv_c)

Compute surrogate loss.

IPO uses the following surrogate loss:

\[L = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi_{\theta}^{'} (a_t|s_t)}{\pi_{\theta} (a_t|s_t)} A (s_t, a_t) - \kappa \frac{J^{C}_{\pi_{\theta}} (s_t, a_t)}{C - J^{C}_{\pi_{\theta}} (s_t, a_t) + \epsilon} \right] \tag{4}\]

where \(\kappa\) is the penalty coefficient, \(C\) is the cost limit, and \(\epsilon\) is a small number to avoid division by zero.

Parameters:

adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from buffer.

Returns:

The advantage function combined with reward and cost.

Return type:

Tensor
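
To make the penalty term in Eq. (4) concrete, the following is a minimal sketch in plain Python that evaluates the formula as printed. It is an illustration under assumptions, not the OmniSafe implementation: the function names and the arguments `ep_cost` (standing in for \(J^{C}\)), `ratios`, and `adv` are hypothetical, and the default `eps` is an arbitrary small value.

```python
def ipo_penalty_term(ep_cost, cost_limit, kappa, eps=1e-8):
    """kappa * J_C / (C - J_C + eps), the penalty term of Eq. (4)."""
    return kappa * ep_cost / (cost_limit - ep_cost + eps)


def ipo_surrogate_loss(ratios, adv, ep_cost, cost_limit, kappa, eps=1e-8):
    """Negated (mean(ratio * adv) - penalty), following Eq. (4) as written."""
    # Sample mean approximates the expectation of the ratio-weighted advantage.
    surrogate = sum(r * a for r, a in zip(ratios, adv)) / len(adv)
    # The penalty does not depend on the sample, so it is subtracted once.
    return -(surrogate - ipo_penalty_term(ep_cost, cost_limit, kappa, eps))
```

The term grows as `ep_cost` approaches `cost_limit` from below, which discourages the policy from nearing the constraint boundary; the expression is only meaningful while `ep_cost` stays below `cost_limit`.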

_init_log()

Log the IPO-specific information.

Things to log	Description
Misc/Penalty	The penalty coefficient.

Return type: None