Penalty Function Algorithms
| P3O | The Implementation of the P3O algorithm. |
| IPO | The Implementation of the IPO algorithm. |
Penalized Proximal Policy Optimization
Documentation
- class omnisafe.algorithms.on_policy.P3O(env_id, cfgs)
The Implementation of the P3O algorithm.
References
Title: Penalized Proximal Policy Optimization for Safe Reinforcement Learning
Authors: Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, Dacheng Tao.
URL: P3O
Initialize an instance of the algorithm.
- _init_log()
Log the P3O specific information.
| Things to log | Description |
| Loss/Loss_pi_cost | The loss of the cost performance. |
- Return type:
None
- _loss_pi_cost(obs, act, logp, adv_c)
Compute the cost performance of the policy at the current update step.
The loss is the surrogate cost objective of the policy, estimated from the cost advantages of the sampled data:
(2) \[L = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi^{'} (a|s)}{\pi (a|s)} A^{C}_{\pi_{\theta}} (s, a) \right]\]
where \(A^{C}_{\pi_{\theta}} (s, a)\) is the cost advantage, \(\pi (a|s)\) is the old policy, and \(\pi^{'} (a|s)\) is the current policy.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
act (torch.Tensor) – The action sampled from buffer.
logp (torch.Tensor) – The log probability of action sampled from buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The loss of the cost performance.
- Return type:
Tensor
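As a rough illustration of equation (2), the sketch below evaluates the negated, ratio-weighted mean of the cost advantages on a toy batch. It uses plain Python lists in place of torch.Tensor inputs. The name `surrogate_cost_loss` and the argument `logp_new` (log probabilities under the current policy) are hypothetical and not part of OmniSafe. The code follows the formula as written above and should not be read as the library's implementation.

```python
import math


def surrogate_cost_loss(logp_new, logp_old, adv_c):
    """Monte Carlo estimate of L = -E[(pi'(a|s) / pi(a|s)) * A^C(s, a)].

    All arguments are equal-length lists of floats (hypothetical stand-ins
    for the tensors sampled from the buffer).
    """
    if not (len(logp_new) == len(logp_old) == len(adv_c)):
        raise ValueError("all inputs must have the same length")
    if not adv_c:
        raise ValueError("inputs must not be empty")
    # Importance ratio pi'(a|s) / pi(a|s), recovered from log probabilities.
    ratios = [math.exp(new - old) for new, old in zip(logp_new, logp_old)]
    weighted = [ratio * adv for ratio, adv in zip(ratios, adv_c)]
    return -sum(weighted) / len(weighted)


# Toy batch: when the current policy equals the old one, every ratio is 1,
# so the loss reduces to the negated mean cost advantage.
logp_old = [-1.0, -0.5, -2.0]
adv_c = [0.2, -0.1, 0.5]
loss_same_policy = surrogate_cost_loss(logp_old, logp_old, adv_c)
```

Working with log probabilities and exponentiating their difference avoids dividing two small probabilities directly, which is why the buffer stores logp rather than the raw probability.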
- _update_actor(obs, act, logp, adv_r, adv_c)
Update the policy network in a nested for loop.
The pseudo code is shown below:
    for _ in range(self.cfgs.actor_iters):
        for _ in range(self.cfgs.num_mini_batches):
            # Get mini-batch data
            # Compute loss
            # Update network
Warning
For some KL divergence based algorithms (e.g. TRPO, CPO), the KL divergence between the old policy and the new policy is computed and used to decide whether the update is successful. If the KL divergence is too large, the update is terminated.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log_p stored in buffer.
adv_r (torch.Tensor) – reward_advantage stored in buffer.
adv_c (torch.Tensor) – cost_advantage stored in buffer.
- Return type:
None
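The nested loop and the KL-based stopping rule described above can be sketched as follows. This is a minimal, framework-free illustration under several assumptions: `update_fn`, `kl_fn`, `target_kl`, and `seed` are hypothetical placeholders, the buffer is represented by a list of sample indices rather than tensors, and checking the KL divergence once per full pass is a guess about where such a check would sit, not a statement about OmniSafe's code.

```python
import random


def update_actor(num_samples, actor_iters, num_mini_batches,
                 update_fn, kl_fn=None, target_kl=None, seed=0):
    """Run up to `actor_iters` passes over the data in mini-batches.

    Returns the number of passes actually completed.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    passes_done = 0
    for _ in range(actor_iters):
        rng.shuffle(indices)
        for i in range(num_mini_batches):
            # Get mini-batch data: every num_mini_batches-th index, offset i.
            mini_batch = indices[i::num_mini_batches]
            # Compute loss and update network (delegated to the callback).
            update_fn(mini_batch)
        passes_done += 1
        # Optional early stop when the policy has moved too far (assumption).
        if kl_fn is not None and target_kl is not None and kl_fn() > target_kl:
            break
    return passes_done


# Usage: the second pass reports a KL above the threshold, so the loop stops.
seen = []
kl_values = iter([0.005, 0.03, 0.5])
passes = update_actor(
    num_samples=8, actor_iters=5, num_mini_batches=2,
    update_fn=seen.append, kl_fn=lambda: next(kl_values), target_kl=0.02,
)
```

Strided slicing yields exactly `num_mini_batches` batches per pass even when the sample count is not evenly divisible, at the cost of slightly uneven batch sizes.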
Interior-point Policy Optimization
Documentation
- class omnisafe.algorithms.on_policy.IPO(env_id, cfgs)
The Implementation of the IPO algorithm.
References
Title: IPO: Interior-point Policy Optimization under Constraints
Authors: Yongshuai Liu, Jiaxin Ding, Xin Liu.
URL: IPO
Initialize an instance of the algorithm.
- _compute_adv_surrogate(adv_r, adv_c)
Compute surrogate loss.
IPO uses the following surrogate loss:
(4) \[L = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi_{\theta}^{'} (a_t|s_t)}{\pi_{\theta} (a_t|s_t)} A (s_t, a_t) - \kappa \frac{J^{C}_{\pi_{\theta}} (s_t, a_t)}{C - J^{C}_{\pi_{\theta}} (s_t, a_t) + \epsilon} \right]\]
where \(\kappa\) is the penalty coefficient, \(C\) is the cost limit, and \(\epsilon\) is a small number to avoid division by zero.
- Parameters:
adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
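To make the penalty term in equation (4) concrete, the sketch below evaluates κ · J / (C − J + ε) for two cost values. The function name `ipo_penalty` and the default values for `kappa` and `eps` are hypothetical and chosen only for illustration. How OmniSafe folds this term into the combined advantage it returns is not shown here.

```python
def ipo_penalty(cost, cost_limit, kappa=0.01, eps=1e-8):
    """Penalty term kappa * J / (C - J + eps) from the formula above.

    `cost` plays the role of J^C and `cost_limit` the role of C.
    """
    return kappa * cost / (cost_limit - cost + eps)


# The penalty is small far from the limit and grows sharply near it.
far = ipo_penalty(cost=5.0, cost_limit=25.0)    # about 0.0025
near = ipo_penalty(cost=24.0, cost_limit=25.0)  # about 0.24
```

Because the denominator shrinks as the cost approaches the limit, the term behaves like a barrier that pushes the policy back toward the feasible region. Note that it changes sign once the cost exceeds the limit, so a practical implementation would likely need to clip or otherwise guard that case.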