Off-Policy Algorithms¶
The OmniSafe Safety-Gymnasium Benchmark for off-policy algorithms evaluates the effectiveness of OmniSafe’s off-policy algorithms across multiple environments from the Safety-Gymnasium task suite. For each supported algorithm and environment, we offer the following:
Default hyperparameters used for the benchmark and scripts that enable result replication.
Performance comparison with other open-source implementations.
Graphs and raw data that can be utilized for research purposes.
Detailed logs obtained during training.
Supported algorithms are listed below:
[ICLR 2016] Deep Deterministic Policy Gradient (DDPG)
[ICML 2018] Twin Delayed DDPG (TD3)
[ICML 2018] Soft Actor-Critic (SAC)
[Preprint 2019][1] The Lagrangian version of DDPG (DDPGLag)
[Preprint 2019][1] The Lagrangian version of TD3 (TD3Lag)
[Preprint 2019][1] The Lagrangian version of SAC (SACLag)
[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (DDPGPID)
[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (TD3PID)
[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (SACPID)
Safety-Gymnasium¶
We highly recommend using Safety-Gymnasium to run the following experiments. To install it on a Linux machine, run:
pip install safety_gymnasium
Run the Benchmark¶
You can set the main function of examples/benchmarks/run_experiment_grid.py as:
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='Off-Policy-Benchmarks')

    # Set up the algorithms.
    off_policy = ['DDPG', 'SAC', 'TD3', 'DDPGLag', 'TD3Lag', 'SACLag', 'DDPGPID', 'TD3PID', 'SACPID']
    eg.add('algo', off_policy)

    # You can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # You can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])

    # The default configs are as follows:
    # eg.add('algo_cfgs:steps_per_epoch', [2000])
    # eg.add('train_cfgs:total_steps', [2000 * 500])
    # which reproduce the results of 1e6 steps.
    # To reproduce the results of 3e6 steps, use:
    # eg.add('algo_cfgs:steps_per_epoch', [2000])
    # eg.add('train_cfgs:total_steps', [2000 * 1500])

    # Set the device.
    available_gpus = list(range(torch.cuda.device_count()))
    gpu_id = [0, 1, 2, 3]
    # To use the CPU, set gpu_id = None.
    # gpu_id = None
    if gpu_id and not set(gpu_id).issubset(available_gpus):
        warnings.warn('The GPU ID is not available, use CPU instead.', stacklevel=1)
        gpu_id = None

    # Set up the environments.
    eg.add('env_id', [
        'SafetyHopper',
        'SafetyWalker2d',
        'SafetySwimmer',
        'SafetyAnt',
        'SafetyHalfCheetah',
        'SafetyHumanoid'
    ])
    eg.add('seed', [0, 5, 10, 15, 20])
    eg.run(train, num_pool=5, gpu_id=gpu_id)
After that, you can run the following command to run the benchmark:
cd examples/benchmarks
python run_experiment_grid.py
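Before launching, it can help to estimate how large the grid is. The sketch below assumes that the grid trains one run per combination of algorithm, environment, and seed (a full Cartesian product, which is an assumption about ExperimentGrid rather than something stated here), and it reuses the values from the configuration above. It does not call OmniSafe.

```python
from itertools import product

# Values copied from the grid configuration above.
algos = ['DDPG', 'SAC', 'TD3', 'DDPGLag', 'TD3Lag', 'SACLag', 'DDPGPID', 'TD3PID', 'SACPID']
env_ids = ['SafetyHopper', 'SafetyWalker2d', 'SafetySwimmer',
           'SafetyAnt', 'SafetyHalfCheetah', 'SafetyHumanoid']
seeds = [0, 5, 10, 15, 20]

# Assumption: one training run per (algorithm, environment, seed) combination.
runs = list(product(algos, env_ids, seeds))
num_runs = len(runs)

# Epoch counts implied by the commented-out step settings.
steps_per_epoch = 2000
epochs_1e6 = (2000 * 500) // steps_per_epoch    # 1e6 total steps
epochs_3e6 = (2000 * 1500) // steps_per_epoch   # 3e6 total steps

print(f'{num_runs} runs; {epochs_1e6} epochs for 1e6 steps, {epochs_3e6} epochs for 3e6 steps')
```

Under that assumption the grid contains 270 runs, so with num_pool=5 only a small fraction of them train concurrently.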
You can also plot the results by running the following command:
cd examples
python analyze_experiment_results.py
For detailed usage of the OmniSafe statistics tool, please refer to this tutorial.
Logs are saved in examples/benchmarks/exp-x and can be monitored with tensorboard or wandb.
tensorboard --logdir examples/benchmarks/exp-x
After the experiment is finished, you can use the following command to generate the video of the trained agent:
cd examples
python evaluate_saved_policy.py
Before evaluating, set LOG_DIR in evaluate_saved_policy.py.
For example, after training DDPG in SafetyHumanoid:
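import os

import omnisafe

# Replace with the log directory of your own run.
# expanduser resolves the leading '~', which os.scandir does not do by itself.
LOG_DIR = os.path.expanduser('~/omnisafe/examples/runs/DDPG-<SafetyHumanoid>/seed-000')
play = True
save_replay = True

if __name__ == '__main__':
    evaluator = omnisafe.Evaluator(play=play, save_replay=save_replay)
    # Evaluate every saved checkpoint (.pt file) in the run's torch_save directory.
    for item in os.scandir(os.path.join(LOG_DIR, 'torch_save')):
        if item.is_file() and item.name.split('.')[-1] == 'pt':
            evaluator.load_saved(
                save_dir=LOG_DIR, model_name=item.name, camera_name='track', width=256, height=256
            )
            evaluator.render(num_episodes=1)
            evaluator.evaluate(num_episodes=1)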
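The loop in the script picks up every file in torch_save whose extension is pt. As a minimal standalone sketch of that filter, the helper below (list_checkpoints, a hypothetical name that is not part of OmniSafe) lists the matching files with pathlib. The demo uses a throwaway directory with made-up file names, so it runs without any trained model.

```python
import tempfile
from pathlib import Path


def list_checkpoints(log_dir):
    """Return the sorted names of the .pt files directly inside <log_dir>/torch_save."""
    save_dir = Path(log_dir).expanduser() / 'torch_save'
    return sorted(p.name for p in save_dir.iterdir() if p.is_file() and p.suffix == '.pt')


# Demo on a temporary directory; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp) / 'torch_save'
    save_dir.mkdir()
    for name in ('epoch-0.pt', 'epoch-500.pt', 'notes.txt'):
        (save_dir / name).touch()
    found = list_checkpoints(tmp)

print(found)
```

Checking the list before running the evaluator makes it easy to confirm that LOG_DIR points at a run that actually saved checkpoints.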
OmniSafe Benchmark¶
Classic Reinforcement Learning Algorithms¶
To validate OmniSafe's implementations, we compared its classic reinforcement learning algorithms (DDPG, TD3, and SAC) against two well-established open-source implementations, Tianshou and Stable-Baselines3. The results are reported in Table 1.
| | DDPG | | | TD3 | | | SAC | | |
|---|---|---|---|---|---|---|---|---|---|
| Environment | OmniSafe (Ours) | Tianshou | Stable-Baselines3 | OmniSafe (Ours) | Tianshou | Stable-Baselines3 | OmniSafe (Ours) | Tianshou | Stable-Baselines3 |
| SafetyAntVelocity-v1 | 860.86 ± 198.03 | 308.60 ± 318.60 | 2654.58 ± 1738.21 | 5246.86 ± 580.50 | 5379.55 ± 224.69 | 3079.45 ± 1456.81 | 5456.31 ± 156.04 | 6012.30 ± 102.64 | 2404.50 ± 1152.65 |
| SafetyHalfCheetahVelocity-v1 | 11377.10 ± 75.29 | 12493.55 ± 437.54 | 7796.63 ± 3541.64 | 11246.12 ± 488.62 | 10246.77 ± 908.39 | 8631.27 ± 2869.15 | 11488.86 ± 513.09 | 12083.89 ± 564.51 | 7767.74 ± 3159.07 |
| SafetyHopperVelocity-v1 | 1462.56 ± 591.14 | 2018.97 ± 1045.20 | 2214.06 ± 1219.57 | 3404.41 ± 82.57 | 2682.53 ± 1004.84 | 2542.67 ± 1253.33 | 3597.70 ± 32.23 | 3546.59 ± 76.00 | 2158.54 ± 1343.24 |
| SafetyHumanoidVelocity-v1 | 1537.39 ± 335.62 | 124.96 ± 61.68 | 2276.92 ± 2299.68 | 5798.01 ± 160.72 | 3838.06 ± 1832.90 | 3511.06 ± 2214.12 | 6039.77 ± 167.82 | 5424.55 ± 118.52 | 2713.60 ± 2256.89 |
| SafetySwimmerVelocity-v1 | 139.39 ± 11.74 | 138.98 ± 8.60 | 210.40 ± 148.01 | 98.39 ± 32.28 | 94.43 ± 9.63 | 247.09 ± 131.69 | 46.44 ± 1.23 | 44.34 ± 2.01 | 247.33 ± 122.02 |
| SafetyWalker2dVelocity-v1 | 1911.70 ± 395.97 | 543.23 ± 316.10 | 3917.46 ± 1077.38 | 3034.83 ± 1374.72 | 4267.05 ± 678.65 | 4087.94 ± 755.10 | 4419.29 ± 232.06 | 4619.34 ± 274.43 | 3906.78 ± 795.48 |
Table 1: Performance of OmniSafe compared with published baselines in the Safety-Gymnasium environments. Results are the mean and standard deviation over 10 evaluation iterations across multiple random seeds. Note that Stable-Baselines3 uses parameters tuned for each environment, whereas OmniSafe uses a single parameter set across all environments.
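Each table cell has the form "mean ± standard deviation". As an illustrative sketch of how such a cell can be produced from a list of evaluation returns, the snippet below uses Python's statistics module. The returns are hypothetical numbers, not benchmark data, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev


def summarize(returns):
    """Format episode returns as 'mean ± std' with two decimals."""
    # Assumption: population standard deviation; the benchmark may use another estimator.
    return f'{fmean(returns):.2f} ± {pstdev(returns):.2f}'


# Hypothetical evaluation returns, for illustration only.
hypothetical_returns = [100.0, 110.0, 90.0, 105.0, 95.0]
summary = summarize(hypothetical_returns)
print(summary)
```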
Safe Reinforcement Learning Algorithms¶
To demonstrate the reliability of the implemented algorithms, OmniSafe reports their performance in the Safety-Gymnasium environments. All results were obtained with cost_limit=25.00 and are presented in Table 2 and Figures 1, 2, and 3.
Performance Table¶
| | DDPG | | TD3 | | SAC | |
|---|---|---|---|---|---|---|
| Environment | Reward | Cost | Reward | Cost | Reward | Cost |
| SafetyAntVelocity-v1 | 860.86 ± 198.03 | 234.80 ± 40.63 | 5246.86 ± 580.50 | 912.90 ± 93.73 | 5456.31 ± 156.04 | 943.10 ± 47.51 |
| SafetyHalfCheetahVelocity-v1 | 11377.10 ± 75.29 | 980.93 ± 1.05 | 11246.12 ± 488.62 | 981.27 ± 0.31 | 11488.86 ± 513.09 | 981.93 ± 0.33 |
| SafetyHopperVelocity-v1 | 1462.56 ± 591.14 | 429.17 ± 220.05 | 3404.41 ± 82.57 | 973.80 ± 4.92 | 3537.70 ± 32.23 | 975.23 ± 2.39 |
| SafetyHumanoidVelocity-v1 | 1537.39 ± 335.62 | 48.79 ± 13.06 | 5798.01 ± 160.72 | 255.43 ± 437.13 | 6039.77 ± 167.82 | 41.42 ± 49.78 |
| SafetySwimmerVelocity-v1 | 139.39 ± 11.74 | 200.53 ± 43.28 | 98.39 ± 32.28 | 115.27 ± 44.90 | 46.44 ± 1.23 | 40.97 ± 0.47 |
| SafetyWalker2dVelocity-v1 | 1911.70 ± 395.97 | 318.10 ± 71.03 | 3034.83 ± 1374.72 | 606.47 ± 337.33 | 4419.29 ± 232.06 | 877.70 ± 8.95 |
| SafetyCarCircle1-v0 | 44.64 ± 2.15 | 371.93 ± 38.75 | 44.57 ± 2.71 | 383.37 ± 62.03 | 43.46 ± 4.39 | 406.87 ± 78.78 |
| SafetyCarGoal1-v0 | 36.99 ± 1.66 | 57.13 ± 38.40 | 36.26 ± 2.35 | 69.70 ± 52.18 | 35.71 ± 2.24 | 54.73 ± 46.74 |
| SafetyPointCircle1-v0 | 113.67 ± 1.33 | 421.53 ± 142.66 | 115.15 ± 2.24 | 391.07 ± 38.34 | 115.06 ± 2.04 | 403.43 ± 44.78 |
| SafetyPointGoal1-v0 | 25.55 ± 2.62 | 41.60 ± 37.17 | 27.28 ± 1.21 | 51.43 ± 33.05 | 27.04 ± 1.49 | 67.57 ± 32.13 |
| | DDPGLag | | TD3Lag | | SACLag | |
| Environment | Reward | Cost | Reward | Cost | Reward | Cost |
| SafetyAntVelocity-v1 | 1271.48 ± 581.71 | 33.27 ± 13.34 | 1944.38 ± 759.20 | 63.27 ± 46.89 | 1897.32 ± 1213.74 | 5.73 ± 7.83 |
| SafetyHalfCheetahVelocity-v1 | 2743.06 ± 21.77 | 0.33 ± 0.12 | 2741.08 ± 49.13 | 10.47 ± 14.45 | 2833.72 ± 3.62 | 0.00 ± 0.00 |
| SafetyHopperVelocity-v1 | 1093.25 ± 81.55 | 15.00 ± 21.21 | 928.79 ± 389.48 | 40.67 ± 30.99 | 963.49 ± 291.64 | 20.23 ± 28.47 |
| SafetyHumanoidVelocity-v1 | 2059.96 ± 485.68 | 19.71 ± 4.05 | 5751.99 ± 157.28 | 10.71 ± 23.60 | 5940.04 ± 121.93 | 17.59 ± 6.24 |
| SafetySwimmerVelocity-v1 | 13.18 ± 20.31 | 28.27 ± 32.27 | 15.58 ± 16.97 | 13.27 ± 17.64 | 11.03 ± 11.17 | 22.70 ± 32.10 |
| SafetyWalker2dVelocity-v1 | 2238.92 ± 400.67 | 33.43 ± 20.08 | 2996.21 ± 74.40 | 22.50 ± 16.97 | 2676.47 ± 300.43 | 30.67 ± 32.30 |
| SafetyCarCircle1-v0 | 33.29 ± 6.55 | 20.67 ± 28.48 | 34.38 ± 1.55 | 2.25 ± 3.90 | 31.42 ± 11.67 | 22.33 ± 26.16 |
| SafetyCarGoal1-v0 | 22.80 ± 8.75 | 17.33 ± 21.40 | 7.31 ± 5.34 | 33.83 ± 31.03 | 10.83 ± 11.29 | 22.67 ± 28.91 |
| SafetyPointCircle1-v0 | 70.71 ± 13.61 | 22.00 ± 32.80 | 83.07 ± 3.49 | 7.83 ± 15.79 | 83.68 ± 3.32 | 12.83 ± 19.53 |
| SafetyPointGoal1-v0 | 17.17 ± 10.03 | 20.33 ± 31.59 | 25.27 ± 2.74 | 28.00 ± 15.75 | 21.45 ± 6.97 | 19.17 ± 9.72 |
| | DDPGPID | | TD3PID | | SACPID | |
| Environment | Reward | Cost | Reward | Cost | Reward | Cost |
| SafetyAntVelocity-v1 | 2078.27 ± 704.77 | 18.20 ± 7.21 | 2410.46 ± 217.00 | 44.50 ± 38.39 | 1940.55 ± 482.41 | 13.73 ± 7.24 |
| SafetyHalfCheetahVelocity-v1 | 2737.61 ± 45.93 | 36.10 ± 11.03 | 2695.64 ± 29.42 | 35.93 ± 14.03 | 2689.01 ± 15.46 | 21.43 ± 5.49 |
| SafetyHopperVelocity-v1 | 1034.42 ± 350.59 | 29.53 ± 34.54 | 1225.97 ± 224.71 | 46.87 ± 65.28 | 812.80 ± 381.86 | 92.23 ± 77.64 |
| SafetyHumanoidVelocity-v1 | 1082.36 ± 486.48 | 15.00 ± 19.51 | 6179.38 ± 105.70 | 5.60 ± 6.23 | 6107.36 ± 113.24 | 6.20 ± 10.14 |
| SafetySwimmerVelocity-v1 | 23.99 ± 7.76 | 30.70 ± 21.81 | 28.62 ± 8.48 | 22.47 ± 7.69 | 7.50 ± 10.42 | 7.77 ± 8.48 |
| SafetyWalker2dVelocity-v1 | 1378.75 ± 896.73 | 14.77 ± 13.02 | 2769.64 ± 67.23 | 6.53 ± 8.86 | 1251.87 ± 721.54 | 41.23 ± 73.33 |
| SafetyCarCircle1-v0 | 26.89 ± 11.18 | 31.83 ± 33.59 | 34.77 ± 3.24 | 47.00 ± 39.53 | 34.41 ± 7.19 | 5.00 ± 11.18 |
| SafetyCarGoal1-v0 | 19.35 ± 14.63 | 17.50 ± 21.31 | 27.28 ± 4.50 | 9.50 ± 12.15 | 16.21 ± 12.65 | 6.67 ± 14.91 |
| SafetyPointCircle1-v0 | 71.63 ± 8.39 | 0.00 ± 0.00 | 70.95 ± 6.00 | 0.00 ± 0.00 | 75.15 ± 6.65 | 4.50 ± 4.65 |
| SafetyPointGoal1-v0 | 19.85 ± 5.32 | 22.67 ± 13.73 | 18.76 ± 7.87 | 12.17 ± 9.39 | 15.87 ± 6.73 | 27.50 ± 15.25 |
Table 2: Performance of OmniSafe off-policy algorithms, evaluated with cost_limit=25.00. After 1e6 training steps, the off-policy algorithms did not violate the safety constraint in SafetyHumanoidVelocity-v1, which suggests that the agent had not yet fully learned to run; we therefore report the 3e6-step results for SafetyHumanoidVelocity-v1. In environments with strong stochasticity (SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, and SafetyPointGoal1-v0), off-policy methods need more training steps to estimate the Q-function accurately, so these four environments are also evaluated after 3e6 training steps. All other environments use the results after 1e6 training steps.
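One way to read Table 2 is to compare each mean cost with the cost limit. The sketch below does this for the SafetyAntVelocity-v1 row, using the mean costs copied from the table. It deliberately ignores the standard deviations, so it is only a rough reading aid and not a statistical test.

```python
COST_LIMIT = 25.0  # cost_limit used for all results in Table 2

# Mean costs for SafetyAntVelocity-v1, copied from Table 2.
ant_costs = {
    'DDPG': 234.80, 'TD3': 912.90, 'SAC': 943.10,
    'DDPGLag': 33.27, 'TD3Lag': 63.27, 'SACLag': 5.73,
    'DDPGPID': 18.20, 'TD3PID': 44.50, 'SACPID': 13.73,
}

# Algorithms whose mean cost is at or below the limit.
within_limit = sorted(algo for algo, cost in ant_costs.items() if cost <= COST_LIMIT)
print(within_limit)
```

On this row the unconstrained baselines (DDPG, TD3, SAC) exceed the limit by a wide margin, while the Lagrangian and PID variants bring the mean cost down to, or much closer to, the limit.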
Performance Curves¶
DDPG, TD3, and SAC
[Training-curve plots, one panel per environment: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0]
Figure 1: Training curves in Safety-Gymnasium environments for the classic reinforcement learning algorithms listed in Table 1 (DDPG, TD3, and SAC).
DDPGLag, TD3Lag, and SACLag
[Training-curve plots, one panel per environment: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0]
Figure 2: Training curves in Safety-Gymnasium environments for the Lagrangian safe reinforcement learning algorithms listed in Table 2 (DDPGLag, TD3Lag, and SACLag).
DDPGPID, TD3PID, and SACPID
[Training-curve plots, one panel per environment: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0]
Figure 3: Training curves in Safety-Gymnasium environments for the PID-Lagrangian safe reinforcement learning algorithms listed in Table 2 (DDPGPID, TD3PID, and SACPID).