Offline Algorithms

OmniSafe’s Mujoco Velocity Benchmark evaluates the performance of OmniSafe’s offline algorithm implementations on SafetyPointCircle and SafetyCarCircle from the Safety-Gymnasium task suite. For each supported algorithm and environment, we provide:

Default hyperparameters used for the benchmark and scripts to reproduce the results.
A comparison of performance or code-level details with other open-source implementations or classic papers.
Graphs and raw data that can be used for research purposes.
Log details obtained during training.

Supported algorithms are listed below:

[ICML 2019] Batch-Constrained deep Q-learning (BCQ)
The Lagrange version of BCQ (BCQ-Lag)
[NeurIPS 2020] Critic Regularized Regression (CRR)
The Constrained version of CRR (C-CRR)
[ICLR 2022 (Spotlight)] COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation

Safety-Gymnasium

We highly recommend using safety-gymnasium to run the following experiments. To install it on a Linux machine, run:

pip install safety_gymnasium

Training agents used to generate data

omnisafe train --env-id SafetyAntVelocity-v1 --algo PPO
omnisafe train --env-id SafetyAntVelocity-v1 --algo PPOLag

Collect offline data

from omnisafe.common.offline.data_collector import OfflineDataCollector


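# Replace the agent paths below with the log directories of your own trained agents.
# Each entry is (agent path, model name, number of samples to collect).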
env_name = 'SafetyAntVelocity-v1'
size = 1_000_000
agents = [
    ('./runs/PPO', 'epoch-500', 500_000),
    ('./runs/PPOLag', 'epoch-500', 500_000),
]
save_dir = './data'

if __name__ == '__main__':
    col = OfflineDataCollector(size, env_name)
    for agent, model_name, num in agents:
        col.register_agent(agent, model_name, num)
    col.collect(save_dir)
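The `agents` list above draws half of the samples from each agent. For other mixing ratios, the per-agent counts can be computed with a small helper. The following is a minimal sketch, not part of OmniSafe: `split_samples`, `build_agents`, and `safe_fraction` are hypothetical names, and treating PPOLag as the "safe" agent and PPO as the "unsafe" one is an assumption. It is in the spirit of the beta parameter described under Table 1, although the exact definition used for the benchmark is not given on this page.

```python
def split_samples(total_size, safe_fraction):
    """Return (num_safe, num_unsafe) sample counts that sum to total_size."""
    if not 0.0 <= safe_fraction <= 1.0:
        raise ValueError('safe_fraction must lie in [0, 1]')
    num_safe = round(total_size * safe_fraction)
    return num_safe, total_size - num_safe


def build_agents(
    total_size,
    safe_fraction,
    unsafe_dir='./runs/PPO',    # assumption: PPO plays the role of the unsafe agent
    safe_dir='./runs/PPOLag',   # assumption: PPOLag plays the role of the safe agent
    model_name='epoch-500',
):
    """Build a list in the same (path, model name, count) layout as `agents` above."""
    num_safe, num_unsafe = split_samples(total_size, safe_fraction)
    return [
        (unsafe_dir, model_name, num_unsafe),
        (safe_dir, model_name, num_safe),
    ]


# Reproduces the 50/50 split used in the collection script.
agents = build_agents(1_000_000, 0.5)
```

The resulting list can be passed to the registration loop in the collection script unchanged.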

Run the Benchmark

You can set the main function of examples/benchmarks/run_experiment_grid.py as:

if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='offline-Benchmarks')

    # set up the algorithms.
    offline_policy = ['VAEBC', 'BCQ', 'BCQLag', 'CRR', 'CCRR', 'COptiDICE']

    eg.add('algo', offline_policy)

    # you can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # you can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])
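    # add the dataset path; dataset_path must be defined beforehand and
    # point to the offline data saved by the collector
    eg.add('train_cfgs:dataset', [dataset_path])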

    # set up the environment.
    eg.add('env_id', [
        'SafetyAntVelocity-v1',
        ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # the total number of experiments must be divisible by num_pool;
    # choose num_pool according to the resources of your machine
    eg.run(train, num_pool=5)

After that, you can run the following command to run the benchmark:

cd examples/benchmarks
python run_experiment_grid.py
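The comment in the grid setup says that the total number of experiments must be divisible by `num_pool`. A quick way to check this before launching is to count the combinations by hand. This is a minimal sketch that does not use OmniSafe itself: it assumes the grid runs one experiment per combination of the listed values, and `count_experiments` is a hypothetical helper. Parameters with a single value, such as the logger settings, do not change the count and are left out.

```python
from itertools import product

# Values that vary in the grid shown above.
grid = {
    'algo': ['VAEBC', 'BCQ', 'BCQLag', 'CRR', 'CCRR', 'COptiDICE'],
    'env_id': ['SafetyAntVelocity-v1'],
    'seed': [0, 5, 10, 15, 20],
}
num_pool = 5


def count_experiments(grid):
    """Count the combinations in the Cartesian product of all grid values."""
    return len(list(product(*grid.values())))


total = count_experiments(grid)
if total % num_pool != 0:
    raise ValueError(f'{total} experiments cannot be split evenly over {num_pool} workers')
print(f'{total} experiments, {total // num_pool} per worker')
```

For this grid the count is 6 algorithms × 1 environment × 5 seeds = 30, which divides evenly by `num_pool=5`.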

You can also plot the results by running the following command:

cd examples
python plot.py --log-dir ALGODIR

OmniSafe Benchmark

Performance Table

Environment	VAE-BC Reward	VAE-BC Cost	C-CRR Reward	C-CRR Cost	BCQLag Reward	BCQLag Cost	COptiDICE Reward	COptiDICE Cost
SafetyPointCircle1-v0(beta=0.25)	43.66 ± 0.90	109.86 ± 13.24	45.48 ± 0.87	127.30 ± 12.60	43.31 ± 0.76	113.39 ± 12.81	40.68 ± 0.93	67.11 ± 13.15
SafetyPointCircle1-v0(beta=0.50)	42.84 ± 1.36	62.34 ± 14.84	45.99 ± 1.36	97.20 ± 13.57	44.68 ± 1.97	95.06 ± 33.07	39.55 ± 1.39	53.87 ± 13.27
SafetyPointCircle1-v0(beta=0.75)	40.23 ± 0.75	41.25 ± 10.12	40.66 ± 0.88	49.90 ± 10.81	42.94 ± 1.04	85.37 ± 23.41	40.98 ± 0.89	70.40 ± 12.14
SafetyCarCircle1-v0(beta=0.25)	19.62 ± 0.28	150.54 ± 7.63	18.53 ± 0.45	122.63 ± 13.14	18.88 ± 0.61	125.44 ± 15.68	17.25 ± 0.37	90.86 ± 10.75
SafetyCarCircle1-v0(beta=0.50)	18.69 ± 0.33	125.97 ± 10.36	17.24 ± 0.43	89.47 ± 11.55	18.14 ± 0.96	108.07 ± 20.70	16.38 ± 0.43	70.54 ± 12.36
SafetyCarCircle1-v0(beta=0.75)	17.31 ± 0.33	85.53 ± 11.33	15.74 ± 0.42	48.38 ± 10.31	17.10 ± 0.84	77.54 ± 14.07	15.58 ± 0.37	49.42 ± 8.70

Table 1: Performance of OmniSafe offline algorithms, evaluated after 1e6 training steps with a cost limit of 25.00. We introduce a parameter beta, defined in terms of safe trajectories, that controls the trajectory distribution of the mixed dataset. To a certain extent, beta indicates the difficulty of the dataset: a smaller beta means fewer safe trajectories in the dataset, so less safety information is available for the algorithm to learn from.
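One way to read Table 1 is to compare each mean cost with the cost limit and to find the lowest-cost algorithm per setting. The sketch below copies only the mean costs from the table, with standard deviations omitted. The helper names `lowest_cost_algo` and `within_limit` are illustrative and not part of OmniSafe.

```python
COST_LIMIT = 25.0

# Mean costs transcribed from Table 1, keyed by (environment, beta).
mean_cost = {
    ('SafetyPointCircle1-v0', 0.25): {'VAE-BC': 109.86, 'C-CRR': 127.30, 'BCQLag': 113.39, 'COptiDICE': 67.11},
    ('SafetyPointCircle1-v0', 0.50): {'VAE-BC': 62.34, 'C-CRR': 97.20, 'BCQLag': 95.06, 'COptiDICE': 53.87},
    ('SafetyPointCircle1-v0', 0.75): {'VAE-BC': 41.25, 'C-CRR': 49.90, 'BCQLag': 85.37, 'COptiDICE': 70.40},
    ('SafetyCarCircle1-v0', 0.25): {'VAE-BC': 150.54, 'C-CRR': 122.63, 'BCQLag': 125.44, 'COptiDICE': 90.86},
    ('SafetyCarCircle1-v0', 0.50): {'VAE-BC': 125.97, 'C-CRR': 89.47, 'BCQLag': 108.07, 'COptiDICE': 70.54},
    ('SafetyCarCircle1-v0', 0.75): {'VAE-BC': 85.53, 'C-CRR': 48.38, 'BCQLag': 77.54, 'COptiDICE': 49.42},
}


def lowest_cost_algo(costs):
    """Return the algorithm name with the smallest mean cost."""
    return min(costs, key=costs.get)


def within_limit(costs, limit=COST_LIMIT):
    """Return the algorithms whose mean cost does not exceed the limit."""
    return [algo for algo, cost in costs.items() if cost <= limit]


lowest = {setting: lowest_cost_algo(costs) for setting, costs in mean_cost.items()}
within = {setting: within_limit(costs) for setting, costs in mean_cost.items()}

for (env, beta), algo in lowest.items():
    print(f'{env} (beta={beta:.2f}): lowest mean cost {algo}, within limit: {within[(env, beta)]}')
```

With the values as transcribed, no mean cost falls at or below the cost limit of 25.00, and COptiDICE has the lowest mean cost in four of the six settings.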