On-Policy Algorithms

The OmniSafe Safety-Gymnasium Benchmark for on-policy algorithms evaluates the effectiveness of OmniSafe’s on-policy algorithms across multiple environments from the Safety-Gymnasium task suite. For each supported algorithm and environment, we offer the following:

Default hyperparameters used for the benchmark and scripts that enable result replication.
Performance comparison with other open-source implementations.
Graphs and raw data that can be utilized for research purposes.
Detailed logs obtained during training.

Supported algorithms are listed below:

First-Order

[NIPS 1999] Policy Gradient (PG)
[Preprint 2017] Proximal Policy Optimization (PPO)
The Lagrange version of PPO (PPOLag)
[IJCAI 2022] Penalized Proximal Policy Optimization for Safe Reinforcement Learning (P3O)
[NeurIPS 2020] First Order Constrained Optimization in Policy Space (FOCOPS)
[NeurIPS 2022] Constrained Update Projection Approach to Safe Policy Optimization (CUP)

Second-Order

[NeurIPS 2001] A Natural Policy Gradient (NaturalPG)
[PMLR 2015] Trust Region Policy Optimization (TRPO)
The Lagrange version of TRPO (TRPOLag)
[ICML 2017] Constrained Policy Optimization (CPO)
[ICLR 2020] Projection-Based Constrained Policy Optimization (PCPO)
[ICLR 2019] Reward Constrained Policy Optimization (RCPO)

Saute RL

[ICML 2022] Sauté RL: Almost Surely Safe Reinforcement Learning Using State Augmentation (PPOSaute, TRPOSaute)

Simmer

[NeurIPS 2022] Effects of Safety State Augmentation on Safe Exploration (PPOSimmerPID, TRPOSimmerPID)

PID-Lagrangian

[ICML 2020] Responsive Safety in Reinforcement Learning by PID Lagrangian Methods (CPPOPID, TRPOPID)

Early Terminated MDP

[Preprint 2021] Safe Exploration by Solving Early Terminated MDP (PPOEarlyTerminated, TRPOEarlyTerminated)

Safety-Gymnasium

We highly recommend using Safety-Gymnasium to run the following experiments. To install it on a Linux machine, run:

pip install safety_gymnasium

Run the Benchmark

You can set the main function of examples/benchmarks/run_experiment_grid.py as:

if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='On-Policy-Benchmarks')

    # set up the algorithms.
    base_policy = ['PolicyGradient', 'NaturalPG', 'TRPO', 'PPO']
    naive_lagrange_policy = ['PPOLag', 'TRPOLag', 'RCPO']
    first_order_policy = ['CUP', 'FOCOPS', 'P3O']
    second_order_policy = ['CPO', 'PCPO']
    saute_policy = ['PPOSaute', 'TRPOSaute']
    simmer_policy = ['PPOSimmerPID', 'TRPOSimmerPID']
    pid_policy = ['CPPOPID', 'TRPOPID']
    early_mdp_policy = ['PPOEarlyTerminated', 'TRPOEarlyTerminated']

    eg.add(
        'algo',
        base_policy +
        naive_lagrange_policy +
        first_order_policy +
        second_order_policy +
        saute_policy +
        simmer_policy +
        pid_policy +
        early_mdp_policy
    )

    # you can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # you can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])

    # the default configs here are as follows:
    # eg.add('algo_cfgs:steps_per_epoch', [20000])
    # eg.add('train_cfgs:total_steps', [20000 * 500])
    # which reproduce the results of 1e7 steps.

    # if you want to reproduce the results of 1e6 steps, use
    # eg.add('algo_cfgs:steps_per_epoch', [2048])
    # eg.add('train_cfgs:total_steps', [2048 * 500])

    # set the device.
    available_gpus = list(range(torch.cuda.device_count()))
    # to use GPUs, set gpu_id as follows:
    # gpu_id = [0, 1, 2, 3]
    # to use the CPU, set gpu_id = None.
    # we recommend using the CPU to obtain results as consistent
    # as possible with our publicly available results,
    # since the performance of all on-policy algorithms
    # in OmniSafe is tested on CPU.
    gpu_id = None

    if gpu_id and not set(gpu_id).issubset(available_gpus):
        warnings.warn('The GPU ID is not available, use CPU instead.', stacklevel=1)
        gpu_id = None

    # set up the environment.
    eg.add('env_id', [
        'SafetyHopper',
        'SafetyWalker2d',
        'SafetySwimmer',
        'SafetyAnt',
        'SafetyHalfCheetah',
        'SafetyHumanoid'
        ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # the total number of experiments must be divisible by num_pool.
    # choose num_pool according to the resources of your machine.
    eg.run(train, num_pool=5, gpu_id=gpu_id)
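The grid above expands to the Cartesian product of every list passed to eg.add, which is what the num_pool divisibility comment refers to. The following stand-alone sketch counts the runs for this configuration and checks them against num_pool. It is plain Python rather than OmniSafe's ExperimentGrid, and the `expand` helper is hypothetical.

```python
from itertools import product

# Value lists copied from the configuration above.
grid = {
    'algo': [
        'PolicyGradient', 'NaturalPG', 'TRPO', 'PPO',
        'PPOLag', 'TRPOLag', 'RCPO',
        'CUP', 'FOCOPS', 'P3O',
        'CPO', 'PCPO',
        'PPOSaute', 'TRPOSaute',
        'PPOSimmerPID', 'TRPOSimmerPID',
        'CPPOPID', 'TRPOPID',
        'PPOEarlyTerminated', 'TRPOEarlyTerminated',
    ],
    'env_id': [
        'SafetyHopper', 'SafetyWalker2d', 'SafetySwimmer',
        'SafetyAnt', 'SafetyHalfCheetah', 'SafetyHumanoid',
    ],
    'seed': [0, 5, 10, 15, 20],
    'logger_cfgs:use_wandb': [False],
    'logger_cfgs:use_tensorboard': [True],
}


def expand(grid):
    """Return one config dict per combination of grid values (hypothetical helper)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


runs = expand(grid)
num_pool = 5

# 20 algorithms x 6 environments x 5 seeds = 600 runs.
print(len(runs))             # 600
print(len(runs) % num_pool)  # 0, so the pool size divides the run count evenly

# Step budget per run with the settings quoted in the comments above.
print(20000 * 500)           # 10000000 (1e7 steps)
print(2048 * 500)            # 1024000 (roughly 1e6 steps)
```

Changing any of the lists changes the product, so it is worth re-checking divisibility whenever algorithms, environments, or seeds are added or removed.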

After that, you can run the following command to run the benchmark:

cd examples/benchmarks
python run_experiment_grid.py

You can also plot the results by running the following command:

cd examples
python analyze_experiment_results.py

For detailed usage of the OmniSafe statistics tool, please refer to this tutorial.

Logs are saved in examples/benchmarks/exp-x and can be monitored with tensorboard or wandb.

tensorboard --logdir examples/benchmarks/exp-x

After the experiment is finished, you can use the following command to generate the video of the trained agent:

cd examples
python evaluate_saved_policy.py

Before evaluating, set LOG_DIR in evaluate_saved_policy.py.

For example, if you trained PPOLag in SafetyHumanoid:

import os

import omnisafe

# os.scandir does not expand '~', so expand it explicitly.
LOG_DIR = os.path.expanduser('~/omnisafe/examples/runs/PPOLag-<SafetyHumanoid>/seed-000')
play = True
save_replay = True
if __name__ == '__main__':
    evaluator = omnisafe.Evaluator(play=play, save_replay=save_replay)
    # load every saved checkpoint (*.pt), then render and evaluate one episode with it.
    for item in os.scandir(os.path.join(LOG_DIR, 'torch_save')):
        if item.is_file() and item.name.split('.')[-1] == 'pt':
            evaluator.load_saved(
                save_dir=LOG_DIR, model_name=item.name, camera_name='track', width=256, height=256
            )
            evaluator.render(num_episodes=1)
            evaluator.evaluate(num_episodes=1)
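
The loop above picks up every file in torch_save whose extension is pt. A minimal stand-alone sketch of that selection step follows, using only the standard library. `find_checkpoints` is a hypothetical helper, not part of OmniSafe, and the file names in the usage example are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def find_checkpoints(log_dir):
    """Return the sorted names of all *.pt files in <log_dir>/torch_save."""
    save_dir = Path(log_dir).expanduser() / 'torch_save'
    return sorted(
        item.name for item in save_dir.iterdir()
        if item.is_file() and item.suffix == '.pt'
    )


# Usage with a throwaway directory; the file names below are made up.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp) / 'torch_save'
    save_dir.mkdir()
    for name in ('epoch-0.pt', 'epoch-500.pt', 'config.json'):
        (save_dir / name).touch()
    checkpoints = find_checkpoints(tmp)
    print(checkpoints)  # ['epoch-0.pt', 'epoch-500.pt']
```

Listing the checkpoints first makes it easy to evaluate a single model instead of every saved one.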

OmniSafe Benchmark

Classic Reinforcement Learning Algorithms

To validate OmniSafe's algorithm implementations, we compared the performance of its classic reinforcement learning algorithms (Policy Gradient, Natural Policy Gradient, TRPO, and PPO) with well-established open-source implementations, specifically Tianshou and Stable-Baselines3. The results are provided in Table 1.

	Policy Gradient			PPO
Environment	OmniSafe (Ours)	Tianshou	Stable-Baselines3	OmniSafe (Ours)	Tianshou	Stable-Baselines3
SafetyAntVelocity-v1	2769.45 ± 550.71	145.33 ± 127.55	- ±-	4295.96 ± 658.2	2607.48 ± 1415.78	1780.61 ± 780.65
SafetyHalfCheetahVelocity-v1	2625.44 ± 1079.04	707.56 ± 158.59	- ±-	3507.47 ± 1563.69	6299.27 ± 1692.38	5074.85 ± 2225.47
SafetyHopperVelocity-v1	1884.38 ± 825.13	343.88 ± 51.85	- ±-	2679.98 ± 921.96	1834.7 ± 862.06	838.96 ± 351.10
SafetyHumanoidVelocity-v1	647.52 ± 154.82	438.97 ± 123.68	- ±-	1106.09 ± 607.6	677.43 ± 189.96	762.73 ± 170.22
SafetySwimmerVelocity-v1	47.31 ± 16.19	27.12 ±7.47	- ±-	113.28 ± 20.22	37.93 ±8.68	273.86 ± 87.76
SafetyWalker2dVelocity-v1	1665.00 ± 930.18	373.63 ± 129.2	- ±-	3806.39 ± 1547.48	3748.26 ± 1832.83	3304.35 ± 706.13
	NaturalPG			TRPO
Environment	OmniSafe (Ours)	Tianshou	Stable-Baselines3	OmniSafe (Ours)	Tianshou	Stable-Baselines3
SafetyAntVelocity-v1	3793.70 ± 583.66	2062.45 ± 876.43	- ±-	4362.43 ± 640.54	2521.36 ± 1442.10	3233.58 ± 1437.16
SafetyHalfCheetahVelocity-v1	4096.77 ± 1223.70	3430.9 ± 239.38	- ±-	3313.31 ± 1048.78	4255.73 ± 1053.82	7185.06 ± 3650.82
SafetyHopperVelocity-v1	2590.54 ± 631.05	993.63 ± 489.42	- ±-	2698.19 ± 568.80	1346.94 ± 984.09	2467.10 ± 1160.25
SafetyHumanoidVelocity-v1	3838.67 ± 1654.79	810.76 ± 270.69	- ±-	1461.51 ± 602.23	749.42 ± 149.81	2828.18 ± 2256.38
SafetySwimmerVelocity-v1	116.33 ± 5.97	29.75 ±12.00	- ±-	105.08 ± 31.00	37.21 ±4.04	258.62 ± 124.91
SafetyWalker2dVelocity-v1	4054.62 ± 1266.76	3372.59 ± 1049.14	- ±-	4099.97 ± 409.05	3372.59 ± 961.74	4227.91 ± 760.93

Table 1: Performance of OmniSafe compared with published baselines in the Safety-Gymnasium MuJoCo Velocity environments. Results are reported as mean ± standard deviation over 10 evaluation iterations across multiple random seeds.
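
Each cell in Table 1 has the form mean ± standard deviation. The sketch below shows one way to compute such a summary from per-episode returns. The input values are placeholders rather than benchmark data, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import fmean, pstdev


def summarize(returns):
    """Format a list of episode returns as 'mean ± std' (population std, assumed)."""
    return f'{fmean(returns):.2f} ± {pstdev(returns):.2f}'


# Placeholder returns, not taken from any benchmark run.
placeholder_returns = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(summarize(placeholder_returns))  # 5.00 ± 2.00
```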

Safe Reinforcement Learning Algorithms

To demonstrate the reliability of the implemented algorithms, OmniSafe reports their performance in the Safety-Gymnasium environments. All results were obtained with cost_limit=25.00. The results are presented in Table 2, and the training curves are given in the following sections.

Performance Table

	Policy Gradient		Natural PG		TRPO		PPO
Environment	Reward	Cost	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	5292.29 ± 913.44	919.42 ± 158.61	5547.20 ± 807.89	895.56 ± 77.13	6026.79 ± 314.98	933.46 ± 41.28	5977.73 ± 885.65	958.13 ± 134.5
SafetyHalfCheetahVelocity-v1	5188.46 ± 1202.76	896.55 ± 184.7	5878.28 ± 2012.24	847.74 ± 249.02	6490.76 ± 2507.18	734.26 ± 321.88	6921.83 ± 1721.79	919.2 ±173.08
SafetyHopperVelocity-v1	3218.17 ± 672.88	881.76 ± 198.46	2613.95 ± 866.13	587.78 ± 220.97	2047.35 ± 447.33	448.12 ± 103.87	2337.11 ± 942.06	550.02 ± 237.70
SafetyHumanoidVelocity-v1	7001.78 ± 419.67	834.11 ± 212.43	8055.20 ± 641.67	946.40 ± 9.11	8681.24 ± 3934.08	718.42 ± 323.30	9115.93 ± 596.88	960.44 ± 7.06
SafetySwimmerVelocity-v1	77.05 ±33.44	107.1 ±60.58	120.19 ± 7.74	161.78 ± 17.51	124.91 ± 6.13	176.56 ± 15.95	119.77 ± 13.8	165.27 ± 20.15
SafetyWalker2dVelocity-v1	4832.34 ± 685.76	866.59 ± 93.47	5347.35 ± 436.86	914.74 ± 32.61	6096.67 ± 723.06	914.46 ± 27.85	6239.52 ± 879.99	902.68 ± 100.93
SafetyCarGoal1-v0	35.86 ±1.97	57.46 ±48.34	36.07 ±1.25	58.06 ±10.03	36.60 ±0.22	55.58 ±12.68	33.41 ±2.89	58.06 ±42.06
SafetyCarButton1-v0	19.76 ±10.15	353.26 ± 177.08	22.16 ±4.48	333.98 ± 67.49	21.98 ±2.06	343.22 ± 24.60	17.51 ±9.46	373.98 ± 156.64
SafetyCarGoal2-v0	29.43 ±4.62	179.2 ±84.86	30.26 ±0.38	209.62 ± 29.97	32.17 ±1.24	190.74 ± 21.05	29.88 ±4.55	194.16 ± 106.2
SafetyCarButton2-v0	18.06 ±10.53	349.82 ± 187.07	20.85 ±3.14	313.88 ± 58.20	20.51 ±3.34	316.42 ± 35.28	21.35 ±8.22	312.64 ± 138.4
SafetyPointGoal1-v0	26.19 ±3.44	201.22 ± 80.4	26.92 ±0.58	57.92 ±9.97	27.20 ±0.44	45.88 ±11.27	25.44 ±5.43	55.72 ±35.55
SafetyPointButton1-v0	29.98 ±5.24	141.74 ± 75.13	31.95 ±1.53	123.98 ± 32.05	30.61 ±0.40	134.38 ± 22.06	27.03 ±6.14	152.48 ± 80.39
SafetyPointGoal2-v0	25.18 ±3.62	204.96 ± 104.97	26.19 ±0.84	193.60 ± 18.54	25.61 ±0.89	202.26 ± 15.15	25.49 ±2.46	159.28 ± 87.13
SafetyPointButton2-v0	26.88 ±4.38	153.88 ± 65.54	28.45 ±1.49	160.40 ± 20.08	28.78 ±2.05	170.30 ± 30.59	25.91 ±6.15	166.6 ±111.21
	RCPO		TRPOLag		PPOLag		P3O
Environment	Reward	Cost	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	3139.52 ± 110.34	12.34 ±3.11	3041.89 ± 180.77	19.52 ±20.21	3261.87 ± 80.00	12.05 ±6.57	2636.62 ± 181.09	20.69 ±10.23
SafetyHalfCheetahVelocity-v1	2440.97 ± 451.88	9.02 ±9.34	2884.68 ± 77.47	9.04 ±11.83	2946.15 ± 306.35	3.44 ±4.77	2117.84 ± 313.55	27.6 ±8.36
SafetyHopperVelocity-v1	1428.58 ± 199.87	11.12 ±12.66	1391.79 ± 269.07	11.22 ±9.97	961.92 ± 752.87	13.96 ±19.33	1231.52 ± 465.35	16.33 ±11.38
SafetyHumanoidVelocity-v1	6286.51 ± 151.03	19.47 ±7.74	6551.30 ± 58.42	59.56 ±117.37	6624.46 ± 25.9	5.87 ±9.46	6342.47 ± 82.45	126.4 ±193.76
SafetySwimmerVelocity-v1	61.29 ±18.12	22.60 ±1.16	81.18 ±16.33	22.24 ±3.91	64.74 ±17.67	28.02 ±4.09	38.02 ±34.18	18.4 ±12.13
SafetyWalker2dVelocity-v1	3064.43 ± 218.83	3.02 ±1.48	3207.10 ± 7.88	14.98 ±9.27	2982.27 ± 681.55	13.49 ±14.55	2713.57 ± 313.2	20.51 ±14.09
SafetyCarGoal1-v0	18.71 ±2.72	23.10 ±12.57	27.04 ±1.82	26.80 ±5.64	13.27 ±9.26	21.72 ±32.06	-1.10 ±6.851	50.58 ±99.24
SafetyCarButton1-v0	-2.04 ±2.98	43.48 ±31.52	-0.38 ±0.85	37.54 ±31.72	0.33 ±1.96	55.5 ±89.64	-2.06 ±7.2	43.78 ±98.01
SafetyCarGoal2-v0	2.30 ±1.76	22.90 ±16.22	3.65 ±1.09	39.98 ±20.29	1.58 ±2.49	13.82 ±24.62	-0.07 ±1.62	43.86 ±99.58
SafetyCarButton2-v0	-1.35 ±2.41	42.02 ±31.77	-1.68 ±2.55	20.36 ±13.67	0.76 ±2.52	47.86 ±103.27	0.11 ±0.72	85.94 ±122.01
SafetyPointGoal1-v0	15.27 ±4.05	30.56 ±19.15	18.51 ±3.83	22.98 ±8.45	12.96 ±6.95	25.80 ±34.99	1.6 ±3.01	31.1 ±80.03
SafetyPointButton1-v0	3.65 ±4.47	26.30 ±9.22	6.93 ±1.84	31.16 ±20.58	4.60 ±4.73	20.8 ±35.78	-0.34 ±1.53	52.86 ±85.62
SafetyPointGoal2-v0	2.17 ±1.46	33.82 ±21.93	4.64 ±1.43	26.00 ±4.70	1.98 ±3.86	41.20 ±61.03	0.34 ±2.2	65.84 ±195.76
SafetyPointButton2-v0	7.18 ±1.93	45.02 ±25.28	5.43 ±3.44	25.10 ±8.98	0.93 ±3.69	33.72 ±58.75	0.33 ±2.44	28.5 ±49.79
	CUP		PCPO		FOCOPS		CPO
Environment	Reward	Cost	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	3215.79 ± 346.68	18.25 ±17.12	2257.07 ± 47.97	10.44 ±5.22	3184.48 ± 305.59	14.75 ±6.36	3098.54 ± 78.90	14.12 ±3.41
SafetyHalfCheetahVelocity-v1	2850.6 ± 244.65	4.27 ±4.46	1677.93 ± 217.31	19.06 ±15.26	2965.2 ± 290.43	2.37 ±3.5	2786.48 ± 173.45	4.70 ±6.72
SafetyHopperVelocity-v1	1716.08 ± 5.93	7.48 ±5.535	1551.22 ± 85.16	15.46 ±9.83	1437.75 ± 446.87	10.13 ±8.87	1713.71 ± 18.26	13.40 ±5.82
SafetyHumanoidVelocity-v1	6109.94 ± 497.56	24.69 ±20.54	5852.25 ± 78.01	0.24 ±0.48	6489.39 ± 35.1	13.86 ±39.33	6465.34 ± 79.87	0.18 ±0.36
SafetySwimmerVelocity-v1	63.83 ±46.45	21.95 ±11.04	54.42 ±38.65	17.34 ±1.57	53.87 ±17.9	29.75 ±7.33	65.30 ±43.25	18.22 ±8.01
SafetyWalker2dVelocity-v1	2466.95 ± 1114.13	6.63 ±8.25	1802.86 ± 714.04	18.82 ±5.57	3117.05 ± 53.60	8.78 ±12.38	2074.76 ± 962.45	21.90 ±9.41
SafetyCarGoal1-v0	6.14 ±6.97	36.12 ±89.56	21.56 ±2.87	38.42 ±8.36	15.23 ±10.76	31.66 ±93.51	25.52 ±2.65	43.32 ±14.35
SafetyCarButton1-v0	1.49 ±2.84	103.24 ± 123.12	0.36 ±0.85	40.52 ±21.25	0.21 ±2.27	31.78 ±47.03	0.82 ±1.60	37.86 ±27.41
SafetyCarGoal2-v0	1.78 ±4.03	95.4 ±129.64	1.62 ±0.56	48.12 ±31.19	2.09 ±4.33	31.56 ±58.93	3.56 ±0.92	32.66 ±3.31
SafetyCarButton2-v0	1.49 ±2.64	173.68 ± 163.77	0.66 ±0.42	49.72 ±36.50	1.14 ±3.18	46.78 ±57.47	0.17 ±1.19	48.56 ±29.34
SafetyPointGoal1-v0	14.42 ±6.74	19.02 ±20.08	18.57 ±1.71	22.98 ±6.56	14.97 ±9.01	33.72 ±42.24	20.46 ±1.38	28.84 ±7.76
SafetyPointButton1-v0	3.5 ±7.07	39.56 ±54.26	2.66 ±1.83	49.40 ±36.76	5.89 ±7.66	38.24 ±42.96	4.04 ±4.54	40.00 ±4.52
SafetyPointGoal2-v0	1.06 ±2.67	107.3 ±204.26	1.06 ±0.69	51.92 ±47.40	2.21 ±4.15	37.92 ±111.81	2.50 ±1.25	40.84 ±23.31
SafetyPointButton2-v0	2.88 ±3.65	54.24 ±71.07	1.05 ±1.27	41.14 ±12.35	2.43 ±3.33	17.92 ±26.1	5.09 ±1.83	48.92 ±17.79
	PPOSaute		TRPOSaute		PPOSimmerPID		TRPOSimmerPID
Environment	Reward	Cost	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	2978.74 ± 93.65	16.77 ±0.92	2507.65 ± 63.97	8.036 ±0.39	2944.84 ± 60.53	16.20 ±0.66	3018.95 ± 66.44	16.52 ±0.23
SafetyHalfCheetahVelocity-v1	2901.40 ± 25.49	16.20 ± 0.60	2521.80 ± 477.29	7.61 ±0.39	2922.17 ± 24.84	16.14 ±0.14	2737.79 ± 37.53	16.44 ±0.21
SafetyHopperVelocity-v1	1650.91 ± 152.65	17.87 ±1.33	1368.28 ± 576.08	10.38 ±4.38	1699.94 ± 24.25	17.04 ±0.41	1608.41 ± 88.23	16.30 ±0.30
SafetyHumanoidVelocity-v1	6401.00 ± 32.23	17.10 ±2.41	5759.44 ± 75.73	15.84 ±1.42	6401.85 ± 57.62	11.06 ±5.35	6411.32 ± 44.26	13.04 ±2.68
SafetySwimmerVelocity-v1	35.61 ±4.37	3.44 ±1.35	34.72 ±1.37	10.19 ±2.32	77.52 ±40.20	0.98 ±1.91	51.39 ±40.09	0.00 ±0.00
SafetyWalker2dVelocity-v1	2410.89 ± 241.22	18.88 ±2.38	2548.82 ± 891.65	13.21 ±6.09	3187.56 ± 32.66	17.10 ±0.49	3156.99 ± 30.93	17.14 ±0.54
SafetyCarGoal1-v0	7.12 ±5.41	21.68 ±29.11	16.67 ±10.57	23.58 ±26.39	8.45 ±7.16	18.98 ±25.63	15.08 ±13.41	23.22 ±19.80
SafetyCarButton1-v0	-1.72 ±0.89	51.88 ±28.18	-2.03 ±0.40	6.24 ±6.14	-0.57 ±0.63	49.14 ±37.77	-1.24 ±0.47	17.26 ±16.13
SafetyCarGoal2-v0	0.90 ±1.20	19.98 ±10.12	1.76 ±5.20	31.50 ±45.50	1.02 ±1.41	27.32 ±60.12	0.93 ±2.21	26.66 ±60.07
SafetyCarButton2-v0	-1.89 ±1.86	47.33 ±28.90	-2.60 ±0.40	74.57 ±84.95	-1.31 ±0.93	52.33 ±19.96	-0.99 ±0.63	20.40 ±12.77
SafetyPointGoal1-v0	7.06 ±5.85	20.04 ±21.91	16.18 ±9.55	29.94 ±26.68	8.30 ±6.03	25.32 ±31.91	11.64 ±8.46	30.00 ±27.67
SafetyPointButton1-v0	-1.47 ±0.98	22.60 ±13.91	-3.13 ±3.51	9.04 ±3.94	-1.97 ±1.41	12.80 ±7.84	-1.36 ±0.37	2.14 ±1.73
SafetyPointGoal2-v0	0.84 ±2.93	14.06 ±30.21	1.64 ±4.02	19.00 ±34.69	0.56 ±2.52	12.36 ±43.39	1.55 ±4.68	14.90 ±27.82
SafetyPointButton2-v0	-1.38 ±0.11	12.00 ±8.60	-2.56 ±0.67	17.27 ±10.01	-1.70 ±0.29	7.90 ±3.30	-1.66 ±0.99	6.70 ±4.74
	CPPOPID		TRPOPID		PPOEarlyTerminated		TRPOEarlyTerminated
Environment	Reward	Cost	Reward	Cost	Reward	Cost	Reward	Cost
SafetyAntVelocity-v1	3213.36 ± 146.78	14.30 ±7.39	3052.94 ± 139.67	15.22 ±3.68	2801.53 ± 19.66	0.23 ±0.09	3052.63 ± 58.41	0.40 ±0.23
SafetyHalfCheetahVelocity-v1	2837.89 ± 398.52	8.06 ±9.62	2796.75 ± 190.84	11.16 ±9.80	2447.25 ± 346.84	3.47 ±4.90	2555.70 ± 368.17	0.06 ±0.08
SafetyHopperVelocity-v1	1713.29 ± 10.21	8.96 ±4.28	1178.59 ± 646.71	18.76 ±8.93	1643.39 ± 2.58	0.77 ±0.26	1646.47 ± 49.95	0.42 ±0.84
SafetyHumanoidVelocity-v1	6579.26 ± 55.70	3.76 ±3.61	6407.95 ± 254.06	7.38 ±11.34	6321.45 ± 35.73	0.00 ±0.00	6332.14 ± 89.86	0.00 ±0.00
SafetySwimmerVelocity-v1	91.05 ±62.68	19.12 ±8.33	69.75 ±46.52	20.48 ±9.13	33.02 ±7.26	24.23 ±0.54	39.24 ±5.01	23.20 ±0.48
SafetyWalker2dVelocity-v1	2183.43 ± 1300.69	14.12 ±10.28	2707.75 ± 980.56	9.60 ±8.94	2195.57 ± 1046.29	7.63 ±10.44	2079.64 ± 1028.73	13.74 ±15.94
SafetyCarGoal1-v0	10.60 ±2.51	30.66 ±7.53	25.49 ±1.31	28.92 ±7.66	17.92 ±1.54	21.60 ±0.83	22.09 ±3.07	17.97 ±1.35
SafetyCarButton1-v0	-1.36 ±0.68	14.62 ±9.40	-0.31 ±0.49	15.24 ±17.01	4.47 ±1.12	25.00 ±0.00	4.34 ±0.72	25.00 ±0.00
SafetyCarGoal2-v0	0.13 ±1.11	23.50 ±1.22	1.77 ±1.20	17.43 ±12.13	6.59 ±0.58	25.00 ±0.00	7.12 ±4.06	23.37 ±1.35
SafetyCarButton2-v0	-1.59 ±0.70	39.97 ±26.91	-2.95 ±4.03	27.90 ±6.37	4.86 ±1.57	25.00 ±0.00	5.07 ±1.24	25.00 ±0.00
SafetyPointGoal1-v0	8.43 ±3.43	25.74 ±7.83	19.24 ±3.94	21.38 ±6.96	16.03 ±8.60	19.17 ±9.42	16.31 ±6.99	22.10 ±6.13
SafetyPointButton1-v0	1.18 ±1.02	29.42 ±12.10	6.40 ±1.43	27.90 ±13.27	7.48 ±8.47	24.27 ±3.95	9.52 ±7.86	25.00 ±0.00
SafetyPointGoal2-v0	-0.56 ±0.06	48.43 ±40.55	1.67 ±1.43	23.50 ±11.17	6.09 ±5.03	25.00 ±0.00	8.62 ±7.13	25.00 ±0.00
SafetyPointButton2-v0	0.42 ±0.63	28.87 ±11.27	1.00 ±1.00	30.00 ±9.50	6.94 ±4.47	25.00 ±0.00	8.35 ±10.44	25.00 ±0.00

Table 2: Performance of OmniSafe on-policy algorithms, in terms of both reward and cost, in the Safety-Gymnasium environments. All on-policy algorithms were evaluated after 1e7 training steps.
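
Since all results were obtained with cost_limit=25.00, a quick way to read Table 2 is to compare each mean cost against that limit. The sketch below does this for two cost cells copied from the SafetyAntVelocity-v1 row. `parse_cell` and `within_limit` are hypothetical helpers, and comparing only the mean while ignoring the spread is a simplifying assumption.

```python
COST_LIMIT = 25.0  # cost_limit used for all results in Table 2


def parse_cell(cell):
    """Split a 'mean ± std' table cell into two floats."""
    mean, std = cell.split('±')
    return float(mean), float(std)


def within_limit(cell, limit=COST_LIMIT):
    """Return True if the mean cost in the cell does not exceed the limit."""
    mean, _ = parse_cell(cell)
    return mean <= limit


# Cost cells for SafetyAntVelocity-v1, copied from Table 2.
costs = {'PPO': '958.13 ± 134.5', 'PPOLag': '12.05 ±6.57'}
for algo, cell in costs.items():
    print(algo, within_limit(cell))
# PPO False
# PPOLag True
```

The unconstrained PPO baseline exceeds the limit by a wide margin, while its Lagrangian variant stays below it on this environment.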

First Order Algorithms

1e6 Steps Velocity Results

Figure 1.1: Training curves of first order algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Velocity Results

Figure 1.2: Training curves of first order algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Navigation Results

Figure 1.3: Training curves of first order algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps. Panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.

Second Order Algorithms

1e6 Steps Velocity Results

Figure 2.1: Training curves of second order algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Velocity Results

Figure 2.2: Training curves of second order algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Navigation Results

Figure 2.3: Training curves of second order algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps. Panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.

Saute Algorithms

1e6 Steps Velocity Results

Figure 3.1: Training curves of Saute MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Velocity Results

Figure 3.2: Training curves of Saute MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Navigation Results

Figure 3.3: Training curves of Saute MDP algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps. Panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarCircle1-v0, SafetyCarCircle2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointCircle1-v0, SafetyPointCircle2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.

Simmer Algorithms

1e6 Steps Velocity Results

Figure 4.1: Training curves of Simmer MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Velocity Results

Figure 4.2: Training curves of Simmer MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Navigation Results

Figure 4.3: Training curves of Simmer MDP algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps. Panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.

PID-Lagrangian Algorithms

1e6 Steps Velocity Results

Figure 5.1: Training curves of PID-Lagrangian algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Velocity Results

Figure 5.2: Training curves of PID-Lagrangian algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Navigation Results

Figure 5.3: Training curves of PID-Lagrangian algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps. Panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.

Early Terminated MDP Algorithms

1e6 Steps Velocity Results

Figure 6.1: Training curves of early terminated MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Velocity Results

Figure 6.2: Training curves of early terminated MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps. Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.

1e7 Steps Navigation Results

Figure 6.3: Training curves of early terminated MDP algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps. Panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.