Off-Policy Algorithms

The OmniSafe Safety-Gymnasium Benchmark for off-policy algorithms evaluates the effectiveness of OmniSafe’s off-policy algorithms across multiple environments from the Safety-Gymnasium task suite. For each supported algorithm and environment, we offer the following:

  • Default hyperparameters used for the benchmark and scripts that enable result replication.

  • Performance comparison with other open-source implementations.

  • Graphs and raw data that can be utilized for research purposes.

  • Detailed logs obtained during training.

Supported algorithms are listed below:

  • DDPG

  • TD3

  • SAC

  • DDPGLag

  • TD3Lag

  • SACLag

  • DDPGPID

  • TD3PID

  • SACPID

Safety-Gymnasium

We highly recommend using Safety-Gymnasium to run the following experiments. To install it on a Linux machine, run:

pip install safety_gymnasium

Run the Benchmark

You can set the main function of examples/benchmarks/run_experiment_grid.py as:

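import warnings

import torch

# ExperimentGrid and train are provided by OmniSafe.
from omnisafe.common.experiment_grid import ExperimentGrid
from omnisafe.utils.exp_grid_tools import train

if __name__ == '__main__':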
    eg = ExperimentGrid(exp_name='Off-Policy-Benchmarks')

    # set up the algorithms.
    off_policy = ['DDPG', 'SAC', 'TD3', 'DDPGLag', 'TD3Lag', 'SACLag', 'DDPGPID', 'TD3PID', 'SACPID']
    eg.add('algo', off_policy)

    # you can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # you can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])

    # the default configs here are as follows:
    # eg.add('algo_cfgs:steps_per_epoch', [2000])
    # eg.add('train_cfgs:total_steps', [2000 * 500])
    # which can reproduce results of 1e6 steps.

    # if you want to reproduce results of 3e6 steps, use
    # eg.add('algo_cfgs:steps_per_epoch', [2000])
    # eg.add('train_cfgs:total_steps', [2000 * 1500])

    # set the device.
    available_gpus = list(range(torch.cuda.device_count()))
    gpu_id = [0, 1, 2, 3]
    # if you want to use CPU, please set gpu_id = None
    # gpu_id = None

    # fall back to CPU when any requested GPU ID is not present on this machine.
    if gpu_id and not set(gpu_id).issubset(available_gpus):
        warnings.warn('The GPU ID is not available, use CPU instead.', stacklevel=1)
        gpu_id = None

    # set up the environments.
    eg.add('env_id', [
        'SafetyHopperVelocity-v1',
        'SafetyWalker2dVelocity-v1',
        'SafetySwimmerVelocity-v1',
        'SafetyAntVelocity-v1',
        'SafetyHalfCheetahVelocity-v1',
        'SafetyHumanoidVelocity-v1',
        ])
    eg.add('seed', [0, 5, 10, 15, 20])
    eg.run(train, num_pool=5, gpu_id=gpu_id)
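To estimate the cost of a benchmark before launching it, it can help to count how many runs the grid expands to. The sketch below is a standalone illustration, not OmniSafe's implementation. It assumes the grid runs every combination of the values passed to eg.add, and it uses the environment IDs as they appear in the result tables. The helper expand_grid is a hypothetical name introduced only for this example.

```python
from itertools import product

# Values mirroring the benchmark grid (environment IDs as used in the result tables).
algos = ['DDPG', 'SAC', 'TD3', 'DDPGLag', 'TD3Lag', 'SACLag', 'DDPGPID', 'TD3PID', 'SACPID']
env_ids = [
    'SafetyHopperVelocity-v1',
    'SafetyWalker2dVelocity-v1',
    'SafetySwimmerVelocity-v1',
    'SafetyAntVelocity-v1',
    'SafetyHalfCheetahVelocity-v1',
    'SafetyHumanoidVelocity-v1',
]
seeds = [0, 5, 10, 15, 20]


def expand_grid(**axes):
    """Hypothetical helper: one dict per combination of the given axes."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*axes.values())]


variants = expand_grid(algo=algos, env_id=env_ids, seed=seeds)

# Training budgets from the commented defaults in the grid.
steps_per_epoch = 2000
total_steps_1e6 = steps_per_epoch * 500    # 1,000,000 steps
total_steps_3e6 = steps_per_epoch * 1500   # 3,000,000 steps

print(len(variants))                         # 9 algorithms * 6 environments * 5 seeds = 270
print(len(variants) * total_steps_1e6)       # total environment steps at the 1e6 budget
```

Under these assumptions the grid contains 270 training runs, which is worth keeping in mind when choosing num_pool and the number of GPUs.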

Once the main function is configured, launch the benchmark with:

cd examples/benchmarks
python run_experiment_grid.py

You can also plot the results by running the following command:

cd examples
python analyze_experiment_results.py

For detailed usage of the OmniSafe statistics tool, please refer to this tutorial.

Logs are saved in examples/benchmarks/exp-x and can be monitored with tensorboard or wandb.

tensorboard --logdir examples/benchmarks/exp-x

After the experiment is finished, you can use the following command to generate the video of the trained agent:

cd examples
python evaluate_saved_policy.py

Before evaluating, set LOG_DIR in evaluate_saved_policy.py to the directory of the run you want to evaluate.

For example, if you trained DDPG on SafetyHumanoid:

import os

import omnisafe

# Run directory produced by training; expanduser resolves the leading '~'.
LOG_DIR = os.path.expanduser('~/omnisafe/examples/runs/DDPG-<SafetyHumanoid>/seed-000')
play = True
save_replay = True
if __name__ == '__main__':
    evaluator = omnisafe.Evaluator(play=play, save_replay=save_replay)
    # Render and evaluate every saved checkpoint (*.pt) under torch_save.
    for item in os.scandir(os.path.join(LOG_DIR, 'torch_save')):
        if item.is_file() and item.name.split('.')[-1] == 'pt':
            evaluator.load_saved(
                save_dir=LOG_DIR, model_name=item.name, camera_name='track', width=256, height=256
            )
            evaluator.render(num_episodes=1)
            evaluator.evaluate(num_episodes=1)
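The loop above visits checkpoints in whatever order the file system returns them. If you prefer to evaluate them in training order, a small helper along the following lines may be useful. It is a sketch only: list_checkpoints is a hypothetical name, and the assumption that checkpoint file names contain an epoch number (for example epoch-100.pt) should be checked against your own torch_save directory.

```python
import re
from pathlib import Path


def list_checkpoints(log_dir):
    """Return the *.pt files under <log_dir>/torch_save, sorted by the first number in the name.

    Assumption: file names embed an epoch number, e.g. 'epoch-100.pt'.
    Files without any number sort first.
    """
    save_dir = Path(log_dir).expanduser() / 'torch_save'
    checkpoints = [p for p in save_dir.iterdir() if p.is_file() and p.suffix == '.pt']

    def epoch_of(path):
        match = re.search(r'(\d+)', path.stem)
        return int(match.group(1)) if match else -1

    return sorted(checkpoints, key=epoch_of)


# Usage (LOG_DIR as defined in evaluate_saved_policy.py):
# for ckpt in list_checkpoints(LOG_DIR):
#     print(ckpt.name)
```

Sorting numerically rather than alphabetically avoids placing epoch-100.pt before epoch-20.pt.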

OmniSafe Benchmark

Classic Reinforcement Learning Algorithms

To validate OmniSafe’s implementations, we compared the classic reinforcement learning algorithms DDPG, TD3, and SAC against two well-established open-source implementations, Tianshou and Stable-Baselines3. The results are reported in Table 1.

| Environment | DDPG: OmniSafe (Ours) | DDPG: Tianshou | DDPG: Stable-Baselines3 | TD3: OmniSafe (Ours) | TD3: Tianshou | TD3: Stable-Baselines3 | SAC: OmniSafe (Ours) | SAC: Tianshou | SAC: Stable-Baselines3 |
|---|---|---|---|---|---|---|---|---|---|
| SafetyAntVelocity-v1 | 860.86 ± 198.03 | 308.60 ± 318.60 | 2654.58 ± 1738.21 | 5246.86 ± 580.50 | 5379.55 ± 224.69 | 3079.45 ± 1456.81 | 5456.31 ± 156.04 | 6012.30 ± 102.64 | 2404.50 ± 1152.65 |
| SafetyHalfCheetahVelocity-v1 | 11377.10 ± 75.29 | 12493.55 ± 437.54 | 7796.63 ± 3541.64 | 11246.12 ± 488.62 | 10246.77 ± 908.39 | 8631.27 ± 2869.15 | 11488.86 ± 513.09 | 12083.89 ± 564.51 | 7767.74 ± 3159.07 |
| SafetyHopperVelocity-v1 | 1462.56 ± 591.14 | 2018.97 ± 1045.20 | 2214.06 ± 1219.57 | 3404.41 ± 82.57 | 2682.53 ± 1004.84 | 2542.67 ± 1253.33 | 3597.70 ± 32.23 | 3546.59 ± 76.00 | 2158.54 ± 1343.24 |
| SafetyHumanoidVelocity-v1 | 1537.39 ± 335.62 | 124.96 ± 61.68 | 2276.92 ± 2299.68 | 5798.01 ± 160.72 | 3838.06 ± 1832.90 | 3511.06 ± 2214.12 | 6039.77 ± 167.82 | 5424.55 ± 118.52 | 2713.60 ± 2256.89 |
| SafetySwimmerVelocity-v1 | 139.39 ± 11.74 | 138.98 ± 8.60 | 210.40 ± 148.01 | 98.39 ± 32.28 | 94.43 ± 9.63 | 247.09 ± 131.69 | 46.44 ± 1.23 | 44.34 ± 2.01 | 247.33 ± 122.02 |
| SafetyWalker2dVelocity-v1 | 1911.70 ± 395.97 | 543.23 ± 316.10 | 3917.46 ± 1077.38 | 3034.83 ± 1374.72 | 4267.05 ± 678.65 | 4087.94 ± 755.10 | 4419.29 ± 232.06 | 4619.34 ± 274.43 | 3906.78 ± 795.48 |

Table 1: Performance of OmniSafe compared with published baselines in Safety-Gymnasium environments. Results are reported as mean ± standard deviation over 10 evaluation iterations across multiple random seeds. Note that Stable-Baselines3 uses parameters tailored to each environment, whereas OmniSafe uses a single parameter set across all environments.
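For readers reproducing cells in this format from their own evaluation logs, the following minimal sketch shows one way to turn a list of episode returns into a "mean ± std" string. The returns below are made-up illustrative numbers, not benchmark data, and the use of the population standard deviation is an assumption. The text does not state which estimator was used for the table.

```python
from statistics import mean, pstdev


def summarize(returns):
    """Format episode returns as 'mean ± std' with two decimals.

    Assumption: population standard deviation (pstdev); swap in
    statistics.stdev if the sample estimator is preferred.
    """
    return f"{mean(returns):.2f} ± {pstdev(returns):.2f}"


# Hypothetical returns pooled from evaluation episodes across seeds.
episode_returns = [100.0, 110.0, 90.0, 105.0, 95.0]
print(summarize(episode_returns))  # 100.00 ± 7.07
```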

Safe Reinforcement Learning Algorithms

To demonstrate the reliability of the implemented algorithms, OmniSafe reports their performance in the Safety-Gymnasium environments. All results were obtained with cost_limit=25.00. They are presented in Table 2 and Figures 1, 2, and 3.

Performance Table

| Environment | DDPG Reward | DDPG Cost | TD3 Reward | TD3 Cost | SAC Reward | SAC Cost |
|---|---|---|---|---|---|---|
| SafetyAntVelocity-v1 | 860.86 ± 198.03 | 234.80 ± 40.63 | 5246.86 ± 580.50 | 912.90 ± 93.73 | 5456.31 ± 156.04 | 943.10 ± 47.51 |
| SafetyHalfCheetahVelocity-v1 | 11377.10 ± 75.29 | 980.93 ± 1.05 | 11246.12 ± 488.62 | 981.27 ± 0.31 | 11488.86 ± 513.09 | 981.93 ± 0.33 |
| SafetyHopperVelocity-v1 | 1462.56 ± 591.14 | 429.17 ± 220.05 | 3404.41 ± 82.57 | 973.80 ± 4.92 | 3537.70 ± 32.23 | 975.23 ± 2.39 |
| SafetyHumanoidVelocity-v1 | 1537.39 ± 335.62 | 48.79 ± 13.06 | 5798.01 ± 160.72 | 255.43 ± 437.13 | 6039.77 ± 167.82 | 41.42 ± 49.78 |
| SafetySwimmerVelocity-v1 | 139.39 ± 11.74 | 200.53 ± 43.28 | 98.39 ± 32.28 | 115.27 ± 44.90 | 46.44 ± 1.23 | 40.97 ± 0.47 |
| SafetyWalker2dVelocity-v1 | 1911.70 ± 395.97 | 318.10 ± 71.03 | 3034.83 ± 1374.72 | 606.47 ± 337.33 | 4419.29 ± 232.06 | 877.70 ± 8.95 |
| SafetyCarCircle1-v0 | 44.64 ± 2.15 | 371.93 ± 38.75 | 44.57 ± 2.71 | 383.37 ± 62.03 | 43.46 ± 4.39 | 406.87 ± 78.78 |
| SafetyCarGoal1-v0 | 36.99 ± 1.66 | 57.13 ± 38.40 | 36.26 ± 2.35 | 69.70 ± 52.18 | 35.71 ± 2.24 | 54.73 ± 46.74 |
| SafetyPointCircle1-v0 | 113.67 ± 1.33 | 421.53 ± 142.66 | 115.15 ± 2.24 | 391.07 ± 38.34 | 115.06 ± 2.04 | 403.43 ± 44.78 |
| SafetyPointGoal1-v0 | 25.55 ± 2.62 | 41.60 ± 37.17 | 27.28 ± 1.21 | 51.43 ± 33.05 | 27.04 ± 1.49 | 67.57 ± 32.13 |

| Environment | DDPGLag Reward | DDPGLag Cost | TD3Lag Reward | TD3Lag Cost | SACLag Reward | SACLag Cost |
|---|---|---|---|---|---|---|
| SafetyAntVelocity-v1 | 1271.48 ± 581.71 | 33.27 ± 13.34 | 1944.38 ± 759.20 | 63.27 ± 46.89 | 1897.32 ± 1213.74 | 5.73 ± 7.83 |
| SafetyHalfCheetahVelocity-v1 | 2743.06 ± 21.77 | 0.33 ± 0.12 | 2741.08 ± 49.13 | 10.47 ± 14.45 | 2833.72 ± 3.62 | 0.00 ± 0.00 |
| SafetyHopperVelocity-v1 | 1093.25 ± 81.55 | 15.00 ± 21.21 | 928.79 ± 389.48 | 40.67 ± 30.99 | 963.49 ± 291.64 | 20.23 ± 28.47 |
| SafetyHumanoidVelocity-v1 | 2059.96 ± 485.68 | 19.71 ± 4.05 | 5751.99 ± 157.28 | 10.71 ± 23.60 | 5940.04 ± 121.93 | 17.59 ± 6.24 |
| SafetySwimmerVelocity-v1 | 13.18 ± 20.31 | 28.27 ± 32.27 | 15.58 ± 16.97 | 13.27 ± 17.64 | 11.03 ± 11.17 | 22.70 ± 32.10 |
| SafetyWalker2dVelocity-v1 | 2238.92 ± 400.67 | 33.43 ± 20.08 | 2996.21 ± 74.40 | 22.50 ± 16.97 | 2676.47 ± 300.43 | 30.67 ± 32.30 |
| SafetyCarCircle1-v0 | 33.29 ± 6.55 | 20.67 ± 28.48 | 34.38 ± 1.55 | 2.25 ± 3.90 | 31.42 ± 11.67 | 22.33 ± 26.16 |
| SafetyCarGoal1-v0 | 22.80 ± 8.75 | 17.33 ± 21.40 | 7.31 ± 5.34 | 33.83 ± 31.03 | 10.83 ± 11.29 | 22.67 ± 28.91 |
| SafetyPointCircle1-v0 | 70.71 ± 13.61 | 22.00 ± 32.80 | 83.07 ± 3.49 | 7.83 ± 15.79 | 83.68 ± 3.32 | 12.83 ± 19.53 |
| SafetyPointGoal1-v0 | 17.17 ± 10.03 | 20.33 ± 31.59 | 25.27 ± 2.74 | 28.00 ± 15.75 | 21.45 ± 6.97 | 19.17 ± 9.72 |

| Environment | DDPGPID Reward | DDPGPID Cost | TD3PID Reward | TD3PID Cost | SACPID Reward | SACPID Cost |
|---|---|---|---|---|---|---|
| SafetyAntVelocity-v1 | 2078.27 ± 704.77 | 18.20 ± 7.21 | 2410.46 ± 217.00 | 44.50 ± 38.39 | 1940.55 ± 482.41 | 13.73 ± 7.24 |
| SafetyHalfCheetahVelocity-v1 | 2737.61 ± 45.93 | 36.10 ± 11.03 | 2695.64 ± 29.42 | 35.93 ± 14.03 | 2689.01 ± 15.46 | 21.43 ± 5.49 |
| SafetyHopperVelocity-v1 | 1034.42 ± 350.59 | 29.53 ± 34.54 | 1225.97 ± 224.71 | 46.87 ± 65.28 | 812.80 ± 381.86 | 92.23 ± 77.64 |
| SafetyHumanoidVelocity-v1 | 1082.36 ± 486.48 | 15.00 ± 19.51 | 6179.38 ± 105.70 | 5.60 ± 6.23 | 6107.36 ± 113.24 | 6.20 ± 10.14 |
| SafetySwimmerVelocity-v1 | 23.99 ± 7.76 | 30.70 ± 21.81 | 28.62 ± 8.48 | 22.47 ± 7.69 | 7.50 ± 10.42 | 7.77 ± 8.48 |
| SafetyWalker2dVelocity-v1 | 1378.75 ± 896.73 | 14.77 ± 13.02 | 2769.64 ± 67.23 | 6.53 ± 8.86 | 1251.87 ± 721.54 | 41.23 ± 73.33 |
| SafetyCarCircle1-v0 | 26.89 ± 11.18 | 31.83 ± 33.59 | 34.77 ± 3.24 | 47.00 ± 39.53 | 34.41 ± 7.19 | 5.00 ± 11.18 |
| SafetyCarGoal1-v0 | 19.35 ± 14.63 | 17.50 ± 21.31 | 27.28 ± 4.50 | 9.50 ± 12.15 | 16.21 ± 12.65 | 6.67 ± 14.91 |
| SafetyPointCircle1-v0 | 71.63 ± 8.39 | 0.00 ± 0.00 | 70.95 ± 6.00 | 0.00 ± 0.00 | 75.15 ± 6.65 | 4.50 ± 4.65 |
| SafetyPointGoal1-v0 | 19.85 ± 5.32 | 22.67 ± 13.73 | 18.76 ± 7.87 | 12.17 ± 9.39 | 15.87 ± 6.73 | 27.50 ± 15.25 |

Table 2: Performance of OmniSafe off-policy algorithms, evaluated with cost_limit=25.00. During experimentation, we observed that off-policy algorithms did not violate the safety constraint in SafetyHumanoidVelocity-v1 within 1e6 steps, which suggests that the agent had not yet fully learned to run. We therefore report the 3e6-step results for SafetyHumanoidVelocity-v1. In environments with strong stochasticity, namely SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, and SafetyPointGoal1-v0, off-policy methods need more training steps to estimate an accurate Q-function, so these four environments were also evaluated after 3e6 training steps. For all other environments, we report the evaluation results after 1e6 training steps.
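As a reading aid for Table 2, the sketch below compares mean costs against the cost limit. It uses the SACLag mean costs copied from the table. The helper name exceeds_limit is hypothetical, and comparing only the mean ignores the reported standard deviation, so treat the output as a rough screen rather than a statement about constraint satisfaction.

```python
COST_LIMIT = 25.00

# Mean episode cost of SACLag per environment, copied from Table 2.
saclag_mean_cost = {
    'SafetyAntVelocity-v1': 5.73,
    'SafetyHalfCheetahVelocity-v1': 0.00,
    'SafetyHopperVelocity-v1': 20.23,
    'SafetyHumanoidVelocity-v1': 17.59,
    'SafetySwimmerVelocity-v1': 22.70,
    'SafetyWalker2dVelocity-v1': 30.67,
    'SafetyCarCircle1-v0': 22.33,
    'SafetyCarGoal1-v0': 22.67,
    'SafetyPointCircle1-v0': 12.83,
    'SafetyPointGoal1-v0': 19.17,
}


def exceeds_limit(mean_costs, limit=COST_LIMIT):
    """Hypothetical helper: environments whose mean cost is above the limit."""
    return sorted(env for env, cost in mean_costs.items() if cost > limit)


print(exceeds_limit(saclag_mean_cost))  # ['SafetyWalker2dVelocity-v1']
```

The same dictionary-based check can be applied to any other column of the table by copying its cost values.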

Performance Curves

DDPG, TD3, and SAC

Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0

Figure 1: Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.

DDPGLag, TD3Lag, and SACLag

Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0

Figure 2: Training curves in Safety-Gymnasium environments, covering the Lagrangian-based safe reinforcement learning algorithms (DDPGLag, TD3Lag, and SACLag) listed in Table 2.

DDPGPID, TD3PID, and SACPID

Panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0

Figure 3: Training curves in Safety-Gymnasium environments, covering the PID-Lagrangian safe reinforcement learning algorithms (DDPGPID, TD3PID, and SACPID) listed in Table 2.