Off-Policy Algorithms

The OmniSafe Safety-Gymnasium Benchmark for off-policy algorithms evaluates the effectiveness of OmniSafe’s off-policy algorithms across multiple environments from the Safety-Gymnasium task suite. For each supported algorithm and environment, we offer the following:

Default hyperparameters used for the benchmark and scripts that enable result replication.
Performance comparison with other open-source implementations.
Graphs and raw data that can be utilized for research purposes.
Detailed logs obtained during training.

Supported algorithms are listed below:

[ICLR 2016] Deep Deterministic Policy Gradient (DDPG)
[ICML 2018] Twin Delayed DDPG (TD3)
[ICML 2018] Soft Actor-Critic (SAC)
[Preprint 2019]^[1] The Lagrangian version of DDPG (DDPGLag)
[Preprint 2019]^[1] The Lagrangian version of TD3 (TD3Lag)
[Preprint 2019]^[1] The Lagrangian version of SAC (SACLag)
[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (DDPGPID)
[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (TD3PID)
[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (SACPID)

Safety-Gymnasium

We highly recommend using Safety-Gymnasium to run the following experiments. To install it on a Linux machine, run:

pip install safety_gymnasium

Run the Benchmark

You can set the main function of examples/benchmarks/run_experiment_grid.py as:

if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='Off-Policy-Benchmarks')

    # set up the algorithms.
    off_policy = ['DDPG', 'SAC', 'TD3', 'DDPGLag', 'TD3Lag', 'SACLag', 'DDPGPID', 'TD3PID', 'SACPID']
    eg.add('algo', off_policy)

    # you can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # you can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])

    # the default configs here are as follows:
    # eg.add('algo_cfgs:steps_per_epoch', [2000])
    # eg.add('train_cfgs:total_steps', [2000 * 500])
    # which can reproduce results of 1e6 steps.

    # if you want to reproduce results of 3e6 steps, use
    # eg.add('algo_cfgs:steps_per_epoch', [2000])
    # eg.add('train_cfgs:total_steps', [2000 * 1500])

    # set the device.
    available_gpus = list(range(torch.cuda.device_count()))
    gpu_id = [0, 1, 2, 3]
    # if you want to use CPU, please set gpu_id = None
    # gpu_id = None

    if gpu_id and not set(gpu_id).issubset(available_gpus):
        warnings.warn('The GPU ID is not available, use CPU instead.', stacklevel=1)
        gpu_id = None

    # set up the environments.
    eg.add('env_id', [
        'SafetyHopper',
        'SafetyWalker2d',
        'SafetySwimmer',
        'SafetyAnt',
        'SafetyHalfCheetah',
        'SafetyHumanoid'
        ])
    eg.add('seed', [0, 5, 10, 15, 20])
    eg.run(train, num_pool=5, gpu_id=gpu_id)
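As a rough check of the workload before launching, the sketch below counts the training runs implied by the grid above. It uses only the standard library and assumes the grid expands to the full Cartesian product of algorithms, environments, and seeds; it does not call OmniSafe's ExperimentGrid, and the variable names are illustrative.

```python
from itertools import product

# Values copied from the grid definition above.
algos = ['DDPG', 'SAC', 'TD3', 'DDPGLag', 'TD3Lag', 'SACLag', 'DDPGPID', 'TD3PID', 'SACPID']
env_ids = [
    'SafetyHopper',
    'SafetyWalker2d',
    'SafetySwimmer',
    'SafetyAnt',
    'SafetyHalfCheetah',
    'SafetyHumanoid',
]
seeds = [0, 5, 10, 15, 20]

# Assumption: one training run per (algo, env_id, seed) combination.
runs = list(product(algos, env_ids, seeds))

# Default setting from the comments above: 2000 steps per epoch, 1e6 steps in total.
steps_per_epoch = 2000
total_steps = 2000 * 500
epochs = total_steps // steps_per_epoch

print(f'{len(runs)} runs, {epochs} epochs each')
```

With num_pool=5, five of these runs execute concurrently, so the total wall-clock time scales with the number of runs divided by five.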

After that, you can run the following command to run the benchmark:

cd examples/benchmarks
python run_experiment_grid.py

You can also plot the results by running the following command:

cd examples
python analyze_experiment_results.py

For detailed usage of the OmniSafe statistics tool, please refer to this tutorial.

Logs are saved in examples/benchmarks/exp-x and can be monitored with tensorboard or wandb.

tensorboard --logdir examples/benchmarks/exp-x

After the experiment is finished, you can use the following command to generate the video of the trained agent:

cd examples
python evaluate_saved_policy.py

Before evaluating, set LOG_DIR in evaluate_saved_policy.py.

For example, if you trained DDPG on SafetyHumanoid:

import os

import omnisafe

# Directory of the training run to evaluate. '~' is expanded explicitly
# because os.scandir does not expand it.
LOG_DIR = os.path.expanduser('~/omnisafe/examples/runs/DDPG-<SafetyHumanoid>/seed-000')
play = True
save_replay = True
if __name__ == '__main__':
    evaluator = omnisafe.Evaluator(play=play, save_replay=save_replay)
    # Load, render, and evaluate every saved checkpoint (*.pt) in torch_save.
    for item in os.scandir(os.path.join(LOG_DIR, 'torch_save')):
        if item.is_file() and item.name.split('.')[-1] == 'pt':
            evaluator.load_saved(
                save_dir=LOG_DIR, model_name=item.name, camera_name='track', width=256, height=256
            )
            evaluator.render(num_episodes=1)
            evaluator.evaluate(num_episodes=1)
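The loop above picks up every .pt file under LOG_DIR/torch_save. The following standard-library sketch isolates that file-discovery step so it can be checked without loading a model. Note that find_checkpoints is a hypothetical helper written for illustration, not part of OmniSafe. It expands '~' explicitly, since os.scandir does not.

```python
import os


def find_checkpoints(log_dir):
    """Return the sorted names of the .pt files in <log_dir>/torch_save."""
    # Hypothetical helper: os.scandir does not expand '~', so expand it first.
    save_dir = os.path.join(os.path.expanduser(log_dir), 'torch_save')
    with os.scandir(save_dir) as entries:
        return sorted(
            item.name
            for item in entries
            if item.is_file() and item.name.split('.')[-1] == 'pt'
        )
```

Calling find_checkpoints(LOG_DIR) before the evaluation loop is a quick way to confirm that LOG_DIR points at a finished run.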

OmniSafe Benchmark

Classic Reinforcement Learning Algorithms

To verify the correctness of OmniSafe’s algorithm implementations, we compared the performance of the classic reinforcement learning algorithms DDPG, TD3, and SAC against well-established open-source implementations, specifically Tianshou and Stable-Baselines3. The results are provided in Table 1.

	DDPG			TD3			SAC
Environment	OmniSafe (Ours)	Tianshou	Stable-Baselines3	OmniSafe (Ours)	Tianshou	Stable-Baselines3	OmniSafe (Ours)	Tianshou	Stable-Baselines3
SafetyAntVelocity-v1	860.86 ± 198.03	308.60 ± 318.60	2654.58 ± 1738.21	5246.86 ± 580.50	5379.55 ± 224.69	3079.45 ± 1456.81	5456.31 ± 156.04	6012.30 ± 102.64	2404.50 ± 1152.65
SafetyHalfCheetahVelocity-v1	11377.10 ± 75.29	12493.55 ± 437.54	7796.63 ± 3541.64	11246.12 ± 488.62	10246.77 ± 908.39	8631.27 ± 2869.15	11488.86 ± 513.09	12083.89 ± 564.51	7767.74 ± 3159.07
SafetyHopperVelocity-v1	1462.56 ± 591.14	2018.97 ± 1045.20	2214.06 ± 1219.57	3404.41 ± 82.57	2682.53 ± 1004.84	2542.67 ± 1253.33	3597.70 ± 32.23	3546.59 ± 76.00	2158.54 ± 1343.24
SafetyHumanoidVelocity-v1	1537.39 ± 335.62	124.96 ± 61.68	2276.92 ± 2299.68	5798.01 ± 160.72	3838.06 ± 1832.90	3511.06 ± 2214.12	6039.77 ± 167.82	5424.55 ± 118.52	2713.60 ± 2256.89
SafetySwimmerVelocity-v1	139.39 ± 11.74	138.98 ± 8.60	210.40 ± 148.01	98.39 ± 32.28	94.43 ± 9.63	247.09 ± 131.69	46.44 ± 1.23	44.34 ± 2.01	247.33 ± 122.02
SafetyWalker2dVelocity-v1	1911.70 ± 395.97	543.23 ± 316.10	3917.46 ± 1077.38	3034.83 ± 1374.72	4267.05 ± 678.65	4087.94 ± 755.10	4419.29 ± 232.06	4619.34 ± 274.43	3906.78 ± 795.48

Table 1: Performance of OmniSafe compared with published baselines in the Safety-Gymnasium environments. Results are reported as mean ± standard deviation over 10 evaluation iterations across multiple random seeds. Note that Stable-Baselines3 uses hyperparameters tuned for each environment, whereas OmniSafe uses a single hyperparameter set across all environments.
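To make the "mean ± standard deviation" entries concrete, here is a minimal sketch of one way such a summary can be computed. The episode returns are placeholder numbers, not benchmark data. Pooling all episodes across seeds and using the population standard deviation are both assumptions, since the caption does not specify either.

```python
from statistics import mean, pstdev

# Placeholder episode returns, NOT benchmark data:
# two seeds with five evaluation episodes each (10 in total).
returns_by_seed = {
    0: [100.0, 110.0, 90.0, 105.0, 95.0],
    5: [120.0, 80.0, 100.0, 115.0, 85.0],
}

# Assumption: pool every episode across seeds, then summarise.
pooled = [ret for episodes in returns_by_seed.values() for ret in episodes]
summary = f'{mean(pooled):.2f} ± {pstdev(pooled):.2f}'
print(summary)
```

Averaging per seed first and then taking the deviation across seeds would give a different (usually smaller) spread, so the aggregation order matters when comparing against other libraries.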

Safe Reinforcement Learning Algorithms

To demonstrate the reliability of the implemented algorithms, OmniSafe reports their performance in the Safety-Gymnasium environments. All results were obtained with cost_limit=25.00. They are presented in Table 2 and Figures 1, 2, and 3.
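For readers unfamiliar with how a cost limit enters training, the sketch below shows the textbook Lagrange-multiplier update that Lagrangian-style methods are built around: the multiplier rises while the episode cost exceeds the limit and decays otherwise. This is a generic illustration, not OmniSafe's implementation. The function name, learning rate, and cost values are all assumptions chosen for the example.

```python
def update_multiplier(lam, episode_cost, cost_limit=25.0, lr=0.01):
    """One projected gradient-ascent step on the Lagrange multiplier."""
    # The multiplier grows when the cost exceeds the limit and shrinks otherwise.
    # It is clipped at zero so the cost penalty never turns into a bonus.
    return max(0.0, lam + lr * (episode_cost - cost_limit))


lam = 0.0
for cost in [60.0, 40.0, 25.0, 10.0]:  # illustrative costs, not benchmark data
    lam = update_multiplier(lam, cost)
    print(f'cost={cost:.1f} -> lambda={lam:.2f}')
```

The policy objective then becomes reward minus lambda times cost, which is why the constrained algorithms in Table 2 trade some reward for a much lower cost.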

Performance Table

	DDPG		TD3		SAC
Environment	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	860.86 ± 198.03	234.80 ± 40.63	5246.86 ± 580.50	912.90 ± 93.73	5456.31 ± 156.04	943.10 ± 47.51
SafetyHalfCheetahVelocity-v1	11377.10 ± 75.29	980.93 ± 1.05	11246.12 ± 488.62	981.27 ± 0.31	11488.86 ± 513.09	981.93 ± 0.33
SafetyHopperVelocity-v1	1462.56 ± 591.14	429.17 ± 220.05	3404.41 ± 82.57	973.80 ± 4.92	3537.70 ± 32.23	975.23 ± 2.39
SafetyHumanoidVelocity-v1	1537.39 ± 335.62	48.79 ± 13.06	5798.01 ± 160.72	255.43 ± 437.13	6039.77 ± 167.82	41.42 ± 49.78
SafetySwimmerVelocity-v1	139.39 ± 11.74	200.53 ± 43.28	98.39 ± 32.28	115.27 ± 44.90	46.44 ± 1.23	40.97 ± 0.47
SafetyWalker2dVelocity-v1	1911.70 ± 395.97	318.10 ± 71.03	3034.83 ± 1374.72	606.47 ± 337.33	4419.29 ± 232.06	877.70 ± 8.95
SafetyCarCircle1-v0	44.64 ± 2.15	371.93 ± 38.75	44.57 ± 2.71	383.37 ± 62.03	43.46 ± 4.39	406.87 ± 78.78
SafetyCarGoal1-v0	36.99 ± 1.66	57.13 ± 38.40	36.26 ± 2.35	69.70 ± 52.18	35.71 ± 2.24	54.73 ± 46.74
SafetyPointCircle1-v0	113.67 ± 1.33	421.53 ± 142.66	115.15 ± 2.24	391.07 ± 38.34	115.06 ± 2.04	403.43 ± 44.78
SafetyPointGoal1-v0	25.55 ± 2.62	41.60 ± 37.17	27.28 ± 1.21	51.43 ± 33.05	27.04 ± 1.49	67.57 ± 32.13
	DDPGLag		TD3Lag		SACLag
Environment	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	1271.48 ± 581.71	33.27 ± 13.34	1944.38 ± 759.20	63.27 ± 46.89	1897.32 ± 1213.74	5.73 ± 7.83
SafetyHalfCheetahVelocity-v1	2743.06 ± 21.77	0.33 ± 0.12	2741.08 ± 49.13	10.47 ± 14.45	2833.72 ± 3.62	0.00 ± 0.00
SafetyHopperVelocity-v1	1093.25 ± 81.55	15.00 ± 21.21	928.79 ± 389.48	40.67 ± 30.99	963.49 ± 291.64	20.23 ± 28.47
SafetyHumanoidVelocity-v1	2059.96 ± 485.68	19.71 ± 4.05	5751.99 ± 157.28	10.71 ± 23.60	5940.04 ± 121.93	17.59 ± 6.24
SafetySwimmerVelocity-v1	13.18 ± 20.31	28.27 ± 32.27	15.58 ± 16.97	13.27 ± 17.64	11.03 ± 11.17	22.70 ± 32.10
SafetyWalker2dVelocity-v1	2238.92 ± 400.67	33.43 ± 20.08	2996.21 ± 74.40	22.50 ± 16.97	2676.47 ± 300.43	30.67 ± 32.30
SafetyCarCircle1-v0	33.29 ± 6.55	20.67 ± 28.48	34.38 ± 1.55	2.25 ± 3.90	31.42 ± 11.67	22.33 ± 26.16
SafetyCarGoal1-v0	22.80 ± 8.75	17.33 ± 21.40	7.31 ± 5.34	33.83 ± 31.03	10.83 ± 11.29	22.67 ± 28.91
SafetyPointCircle1-v0	70.71 ± 13.61	22.00 ± 32.80	83.07 ± 3.49	7.83 ± 15.79	83.68 ± 3.32	12.83 ± 19.53
SafetyPointGoal1-v0	17.17 ± 10.03	20.33 ± 31.59	25.27 ± 2.74	28.00 ± 15.75	21.45 ± 6.97	19.17 ± 9.72
	DDPGPID		TD3PID		SACPID
Environment	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	2078.27 ± 704.77	18.20 ± 7.21	2410.46 ± 217.00	44.50 ± 38.39	1940.55 ± 482.41	13.73 ± 7.24
SafetyHalfCheetahVelocity-v1	2737.61 ± 45.93	36.10 ± 11.03	2695.64 ± 29.42	35.93 ± 14.03	2689.01 ± 15.46	21.43 ± 5.49
SafetyHopperVelocity-v1	1034.42 ± 350.59	29.53 ± 34.54	1225.97 ± 224.71	46.87 ± 65.28	812.80 ± 381.86	92.23 ± 77.64
SafetyHumanoidVelocity-v1	1082.36 ± 486.48	15.00 ± 19.51	6179.38 ± 105.70	5.60 ± 6.23	6107.36 ± 113.24	6.20 ± 10.14
SafetySwimmerVelocity-v1	23.99 ± 7.76	30.70 ± 21.81	28.62 ± 8.48	22.47 ± 7.69	7.50 ± 10.42	7.77 ± 8.48
SafetyWalker2dVelocity-v1	1378.75 ± 896.73	14.77 ± 13.02	2769.64 ± 67.23	6.53 ± 8.86	1251.87 ± 721.54	41.23 ± 73.33
SafetyCarCircle1-v0	26.89 ± 11.18	31.83 ± 33.59	34.77 ± 3.24	47.00 ± 39.53	34.41 ± 7.19	5.00 ± 11.18
SafetyCarGoal1-v0	19.35 ± 14.63	17.50 ± 21.31	27.28 ± 4.50	9.50 ± 12.15	16.21 ± 12.65	6.67 ± 14.91
SafetyPointCircle1-v0	71.63 ± 8.39	0.00 ± 0.00	70.95 ± 6.00	0.00 ± 0.00	75.15 ± 6.65	4.50 ± 4.65
SafetyPointGoal1-v0	19.85 ± 5.32	22.67 ± 13.73	18.76 ± 7.87	12.17 ± 9.39	15.87 ± 6.73	27.50 ± 15.25

Table 2: Performance of OmniSafe off-policy algorithms, evaluated with cost_limit=25.00. During experimentation, we observed that off-policy algorithms did not violate the safety constraints in SafetyHumanoidVelocity-v1. This suggests that the agent may not have fully learned to run within 1e6 steps; we therefore report the 3e6-step results for off-policy algorithms in SafetyHumanoidVelocity-v1. In highly stochastic environments such as SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, and SafetyPointGoal1-v0, off-policy methods need more training steps to estimate an accurate Q-function, so these four environments were also evaluated after 3e6 training steps. For all other environments, we report evaluation results after 1e6 training steps.
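As a reading aid for Table 2, the sketch below checks which algorithms keep their mean cost within cost_limit on one environment. The mean costs are copied from the SafetyHalfCheetahVelocity-v1 rows of Table 2. The check looks at the mean only and ignores the reported standard deviation, which is a simplification made for this example.

```python
COST_LIMIT = 25.0

# Mean costs for SafetyHalfCheetahVelocity-v1, copied from Table 2.
mean_cost = {
    'DDPG': 980.93, 'TD3': 981.27, 'SAC': 981.93,
    'DDPGLag': 0.33, 'TD3Lag': 10.47, 'SACLag': 0.00,
    'DDPGPID': 36.10, 'TD3PID': 35.93, 'SACPID': 21.43,
}

# Keep the algorithms whose mean cost does not exceed the limit.
within_limit = sorted(name for name, cost in mean_cost.items() if cost <= COST_LIMIT)
print(within_limit)
```

On this environment the unconstrained baselines exceed the limit by a wide margin, while all three Lagrangian variants and SACPID stay below it on average.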

Performance Curves

DDPG, TD3, and SAC

Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0

Figure 1: Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.

DDPGLag, TD3Lag, and SACLag

Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0

Figure 2: Training curves in Safety-Gymnasium environments, covering Lagrangian reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.

DDPGPID, TD3PID, and SACPID

Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0

Figure 3: Training curves in Safety-Gymnasium environments, covering PID-Lagrangian reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.