ROS Projects Archives - The Construct

Testing Different OpenAI RL Algorithms With ROS And Gazebo

This post was written by Miguel A. Rodriguez and Ricardo Tellez

In this post we are going to see how to test different reinforcement learning (RL) algorithms from the OpenAI framework in the same robot trying to solve the same task. We are going to use the openai_ros package, which allows to change algorithms very easily and hence compare performances.

For the whole process, we are going to use the ROS Development Studio (ROSDS), which you can use for free. Also you can reproduce the results in your own local computer following these instructions.

The task to solve

We are going to use for this post the typical example in reinforcement learning of the cartpole. On the cartpole task, a pole mounted on top of a cart has to be balanced by the cart by moving left or right.

This problem is defined in the OpenAI Gym in the following links:

Wiki for Cartpole-v0: https://github.com/openai/gym/wiki/CartPole-v0
Cartpole Env definition: https://github.com/openai/gym/blob/5404b39d06f72012f562ec41f60734bd4b5ceb4b/gym/envs/classic_control/cartpole.py

However, the OpenAi Gym just solves the problem on a 2D flat simulated environment with no real controllers for the cart and very limited physics simulation. That environment and its training looks something like the following video.

In this post, we are going to solve the same problem but using a 3D Gazebo simulation which includes full ROS controllers. The simulation includes physics and controllers behavior which makes the problem to solve even more complicated, but closer to reality.

Remember that the same thing we are doing in this post can be applied to ANY ROS BASED ROBOT.

The ROSject containing the simulation and pre-made code

We have created a ROSject containing the Gazebo simulation we are going to use, as well as some classes that interconnect the simulation to OpenAI. Those classes use the openai_ros package for easy definition of the RobotEnvironment (defines the connection of OpenAI to the simulated robot) and the TaskEnvironment (defines the task to be solved). This is already provided by the openai_ros package, so we’ll have to do nothing.

We provide all that work already done because in this post we want to concentrate only on comparing RL algorithms on the same robot and the same task.

You can get the code up and running on the ROSDS by getting this ROSject (click on the link to get the code). You can create a free account on ROSDS if you don’t have one. From the rest of tutorial we will assume you have the ROSject open in ROSDS (click on the link, then press Open ROSject on the ROSDS screen).

The Cartpole Simulation

You can launch the simulation inside the ROSDS going to the Simulations Drop Down Menu select Select Launch File and then select:

cartpole_description main.launch

You should get something like this:

That is the simulation we are going to use. Remember that you have to have it launched for the rest of the tutorial.

The algorithms we are going to test

We took several of the best algorithms posted by the users on the OpenAI Gym website to solve the 2D cartpole system:

n1try: https://gym.openai.com/evaluations/eval_EIcM1ZBnQW2LBaFN6FY65g/
mbalunovic: https://gym.openai.com/evaluations/eval_OeUSZwUcR2qSAqMmOE1UIw/
ruippeixotog: https://gym.openai.com/evaluations/eval_aCiCDmwhTCytFuxMpKoyvQ/

We adapted them for ROS and to use our Cartpole3D environment. For that we just had to do mainly 2 things:

Load the learning parameters from a yaml file instead of loading them from global variables. This allows us to change the values easily and fast. So we are going to create a yaml file for each algorithm that will be loaded before the algorithm is launched.
We have changed the OpenAI environment from the 2DCartpole provided by OpenAI (see video above) to the one provided by openai_ros package which interfaces with ROS and Gazebo.

Create the training code

First let’s create the ROS package that will contain everything

Create the ROS package for training

Let’s first create a ROS package where to put all the code we are going to develop. For that, open a shell (Tools->Shell) and type:

$ cd catkin_ws/src
$ catkin_create_pkg cartpole_tests rospy

Create the config files

Here you have an example of of the yaml file for the n1try. Its divided in two main parts:

Parameters for Reinforcement Learning algorithm: Values to needed by your learning algorithm, in this case would be n1try, but could be any.
Parameters for the RobotEnvironment and TaskEnvironment: These variables are used to tune the task and simulation during setup. These variables have already been set up for you to be optimum, and if you are asking where you would get them from the first place, you can just copy paste the ones in the git example for openai_ros, where you can find examples for all the supported simulations and tasks. https://bitbucket.org/theconstructcore/openai_examples_projects/src/master/

To create the file do the following:

$ roscd cartpole_tests
$ mkdir config
$ mkdir scripts
$ mkdir launch

Now open the IDE of the ROSDS (Tools->IDE), navigate to the config directory and create there the config file with name cartpole_n1try_params.yaml

Then paste the following content to it and save it:


cartpole_v0: #namespace

    #qlearn parameters
    
    alpha: 0.01 # Learning Rate
    alpha_decay: 0.01
    gamma: 1.0 # future rewards value 0 none 1 a lot
    epsilon: 1.0 # exploration, 0 none 1 a lot
    epsilon_decay: 0.995 # how we reduse the exploration
    epsilon_min: 0.01 # minimum value that epsilon can have
    batch_size: 64 # maximum size of the batches sampled from memory
    episodes_training: 1000
    n_win_ticks: 250 # If the mean of rewards is bigger than this and have passed min_episodes, the task is considered finished
    min_episodes: 100
    #max_env_steps: None
    monitor: True
    quiet: False
    

    # Cartpole3D environment variables
    control_type: "velocity"
    min_pole_angle: -0.7 #-23°
    max_pole_angle: 0.7 #23°
    max_base_velocity: 50
    max_base_pose_x: 1.0
    min_base_pose_x: -1.0
    
    n_observations: 4 # Number of lasers to consider in the observations
    n_actions: 2 # Number of actions used by algorithm and task
    
    # those parameters are very important. They are affecting the learning experience
    # They indicate how fast the control can be
    # If the running step is too large, then there will be a long time between 2 ctrl commans
    # If the pos_step is too large, then the changes in position will be very abrupt
    running_step: 0.04 # amount of time the control will be executed
    pos_step: 1.0     # increment in position/velocity/effort, depends on the control for each command
    init_pos: 0.0 # Position in which the base will start
    wait_time: 0.1 # Time to wait in the reset phases

Modify the learning algorithm for openai_ros

On the IDE create the following file on the scripts folder of your package: cartpole3D_n1try_algorithm_with_rosparams.py

#!/usr/bin/env python
import rospy

# Inspired by https://keon.io/deep-q-learning/
import random
import gym
import math
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up

class DQNRobotSolver():
    def __init__(self, environment_name, n_observations, n_actions, n_episodes=1000, n_win_ticks=195, min_episodes= 100, max_env_steps=None, gamma=1.0, epsilon=1.0, epsilon_min=0.01, epsilon_log_decay=0.995, alpha=0.01, alpha_decay=0.01, batch_size=64, monitor=False, quiet=False):
        self.memory = deque(maxlen=100000)
        self.env = gym.make(environment_name)
        if monitor: self.env = gym.wrappers.Monitor(self.env, '../data/cartpole-1', force=True)
        
        self.input_dim = n_observations
        self.n_actions = n_actions
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_log_decay
        self.alpha = alpha
        self.alpha_decay = alpha_decay
        self.n_episodes = n_episodes
        self.n_win_ticks = n_win_ticks
        self.min_episodes = min_episodes
        self.batch_size = batch_size
        self.quiet = quiet
        if max_env_steps is not None: self.env._max_episode_steps = max_env_steps

        # Init model
        self.model = Sequential()
        
        self.model.add(Dense(24, input_dim=self.input_dim, activation='tanh'))
        self.model.add(Dense(48, activation='tanh'))
        self.model.add(Dense(self.n_actions, activation='linear'))
        self.model.compile(loss='mse', optimizer=Adam(lr=self.alpha, decay=self.alpha_decay))
    

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def choose_action(self, state, epsilon):
        return self.env.action_space.sample() if (np.random.random() <= epsilon) else np.argmax(self.model.predict(state)) def get_epsilon(self, t): return max(self.epsilon_min, min(self.epsilon, 1.0 - math.log10((t + 1) * self.epsilon_decay))) def preprocess_state(self, state): return np.reshape(state, [1, self.input_dim]) def replay(self, batch_size): x_batch, y_batch = [], [] minibatch = random.sample( self.memory, min(len(self.memory), batch_size)) for state, action, reward, next_state, done in minibatch: y_target = self.model.predict(state) y_target[0][action] = reward if done else reward + self.gamma * np.max(self.model.predict(next_state)[0]) x_batch.append(state[0]) y_batch.append(y_target[0]) self.model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0) if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def run(self):
        
        rate = rospy.Rate(30)
        
        scores = deque(maxlen=100)

        for e in range(self.n_episodes):
            
            init_state = self.env.reset()
            
            state = self.preprocess_state(init_state)
            done = False
            i = 0
            while not done:
                # openai_ros doesnt support render for the moment
                #self.env.render()
                action = self.choose_action(state, self.get_epsilon(e))
                next_state, reward, done, _ = self.env.step(action)
                next_state = self.preprocess_state(next_state)
                self.remember(state, action, reward, next_state, done)
                state = next_state
                i += 1
                

            scores.append(i)
            mean_score = np.mean(scores)
            if mean_score >= self.n_win_ticks and e >= min_episodes:
                if not self.quiet: print('Ran {} episodes. Solved after {} trials'.format(e, e - min_episodes))
                return e - min_episodes
            if e % 1 == 0 and not self.quiet:
                print('[Episode {}] - Mean survival time over last {} episodes was {} ticks.'.format(e, min_episodes ,mean_score))

            self.replay(self.batch_size)
            

        if not self.quiet: print('Did not solve after {} episodes'.format(e))
        return e
        
if __name__ == '__main__':
    rospy.init_node('cartpole_n1try_algorithm', anonymous=True, log_level=rospy.FATAL)
    
    environment_name = 'CartPoleStayUp-v0'
    
    n_observations = rospy.get_param('/cartpole_v0/n_observations')
    n_actions = rospy.get_param('/cartpole_v0/n_actions')
    
    n_episodes = rospy.get_param('/cartpole_v0/episodes_training')
    n_win_ticks = rospy.get_param('/cartpole_v0/n_win_ticks')
    min_episodes = rospy.get_param('/cartpole_v0/min_episodes')
    max_env_steps = None
    gamma =  rospy.get_param('/cartpole_v0/gamma')
    epsilon = rospy.get_param('/cartpole_v0/epsilon')
    epsilon_min = rospy.get_param('/cartpole_v0/epsilon_min')
    epsilon_log_decay = rospy.get_param('/cartpole_v0/epsilon_decay')
    alpha = rospy.get_param('/cartpole_v0/alpha')
    alpha_decay = rospy.get_param('/cartpole_v0/alpha_decay')
    batch_size = rospy.get_param('/cartpole_v0/batch_size')
    monitor = rospy.get_param('/cartpole_v0/monitor')
    quiet = rospy.get_param('/cartpole_v0/quiet')
    
    
    agent = DQNRobotSolver(     environment_name,
                                n_observations,
                                n_actions,
                                n_episodes,
                                n_win_ticks,
                                min_episodes,
                                max_env_steps,
                                gamma,
                                epsilon,
                                epsilon_min,
                                epsilon_log_decay,
                                alpha,
                                alpha_decay,
                                batch_size,
                                monitor,
                                quiet)
    agent.run()

As you can see the main differences are the rosparam gets and the rosnode init. Also one element that has to be removed is the render method. Just because in Gazebo Render makes no sense, unless is for recording purposes, that will be supported in OpenAI_ROS RobotEnvironments in the future updates.

Create a launch file

And now you have to create a launch file thats loads those paramateres from the YAML file and starts up your script. Go to the launch directory and create the file start_training_n1try.launch

<?xml version="1.0" encoding="UTF-8"?>
<launch>
    <rosparam command="load" file="$(find cartpole_tests)/config/cartpole_n1try_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="cartpole_tests" name="cartpole3D_mbalunovic_algorithm" type="cartpole3D_n1try_algorithm_with_rosparams.py" output="screen"/>
</launch>

Adapt the algorithm to work with openai_ros TaskEnvironment

And this is the easiest part. You just have to change the name of the OpenAI Environment to load, and in this case import the openai_ros module. And that is, you change the 2D cartpole by the 3D gazebo realistic cartpole. That SIMPLE

This at the top in the import section:

# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up

And this line

self.env = gym.make('CartPole-v0')

To a different Task and its loaded through a variable of the class:


environment_name = 'CartPoleStayUp-v0'
agent = DQNRobotSolver(     environment_name,
                       ...
self.env = gym.make(environment_name)

And that’s it, you can now launch it with your original algorithm pretty much untouched.

Launch the training

On a shell type the following:

$ roslaunch cartpole_tests start_training_n1try.launch

After a few seconds you should see the cartpole start moving and trying to learn.

Results you may obtain

These are tests made with 1000 epsidoes maximum and it will stop the learning if:

We depleat our episodes ( more than 1000 )
We Get a Mean Episode Reward bigger than n_win_ticks and we have executed more than min_episodes number of episodes. In this case of n_win_ticks = 250 and min_episodes = 100. This is the reason why the 3D version stoped before the 1000 episodes had ended. Because the mean value of the episode rewards was bigger then 250. While the 2D had values not exceeding 200.

Note: to see the reward chart, got to Tools->OpenAI Reward Chart

How to change the learning algorithm

What if you want to change your algorithm for the same TaskEnvironment, like the CartPoleStayUp-v0? Well, with openai_ros is really fast. You just have to change the following:

Change the ROS Param YAML file loaded in the Launch file
Change the algorithm python script to be launched in the launch file to your new one.
Inside your new algorithm main script, place the same TaskEnvironment you had in the previous algorithm script.

For example lets say you need to change from the n1try to the mbalunovic or the ruippeixotog.

You will have to create new YAML files with the RL algorithm parameters you need, but with the same simulation parameters

For the mbalunovic algorithm

Create the config file cartpole_mbalunovic_params.yaml


cartpole_v0: #namespace

    #qlearn parameters
    state_size: 4 # number of elements in the state array from observations.
    n_actions: 2 # Number of actions used by algorithm and task
    learning_rate: 0.001
    #num_epochs: 50
    replay_memory_size: 1000
    target_update_freq: 100
    initial_random_action: 1
    gamma: 0.99 # future rewards value, 0 none 1 a lot
    epsilon_decay: 0.99 # how we reduse the exploration
    done_episode_reward: -200 # Reward given when episode finishes before interations depleated.
    batch_size: 32 # maximum size of the batches sampled from memory
    max_iterations: 1000 # maximum iterations that an episode can have
    episodes_training: 1000

    # Environment variables
    control_type: "velocity"
    min_pole_angle: -0.7 #-23°
    max_pole_angle: 0.7 #23°
    max_base_velocity: 50
    max_base_pose_x: 1.0
    min_base_pose_x: -1.0
    
    # those parameters are very important. They are affecting the learning experience
    # They indicate how fast the control can be
    # If the running step is too large, then there will be a long time between 2 ctrl commans
    # If the pos_step is too large, then the changes in position will be very abrupt
    running_step: 0.04 # amount of time the control will be executed
    pos_step: 1.0     # increment in position/velocity/effort, depends on the control for each command
    init_pos: 0.0 # Position in which the base will start
    wait_time: 0.1 # Time to wait in the reset phases

Create the launch file start_training_mbalunovic.launch

<?xml version="1.0" encoding="UTF-8"?>
<launch>
    <rosparam command="load" file="$(find cartpole_tests)/config/cartpole_mbalunovic_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="cartpole_tests" name="cartpole3D_mbalunovic_algorithm" type="cartpole3D_mbalunovic_algorithm_with_rosparams.py" output="screen"/>
</launch>

Create the script file cartpole3D_mbalunovic_algorithm_with_rosparams.py (modification from original learning algorithm)


#!/usr/bin/env python
import rospy

import gym
import keras
import numpy as np
import random

from gym import wrappers
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

from collections import deque

# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up



class ReplayBuffer():

  def __init__(self, max_size):
    self.max_size = max_size
    self.transitions = deque()

  def add(self, observation, action, reward, observation2):
    if len(self.transitions) > self.max_size:
      self.transitions.popleft()
    self.transitions.append((observation, action, reward, observation2))

  def sample(self, count):
    return random.sample(self.transitions, count)

  def size(self):
    return len(self.transitions)

def get_q(model, observation, state_size):
  np_obs = np.reshape(observation, [-1, state_size])
  return model.predict(np_obs)

def train(model, observations, targets, actions_dim, state_size):

  np_obs = np.reshape(observations, [-1, state_size])
  np_targets = np.reshape(targets, [-1, actions_dim])

  #model.fit(np_obs, np_targets, epochs=1, verbose=0)
  model.fit(np_obs, np_targets, nb_epoch=1, verbose=0)

def predict(model, observation, state_size):
  np_obs = np.reshape(observation, [-1, state_size])
  return model.predict(np_obs)

def get_model(state_size, learning_rate):
  model = Sequential()
  model.add(Dense(16, input_shape=(state_size, ), activation='relu'))
  model.add(Dense(16, input_shape=(state_size,), activation='relu'))
  model.add(Dense(2, activation='linear'))

  model.compile(
    optimizer=Adam(lr=learning_rate),
    loss='mse',
    metrics=[],
  )

  return model

def update_action(action_model, target_model, sample_transitions, actions_dim, state_size, gamma):
  random.shuffle(sample_transitions)
  batch_observations = []
  batch_targets = []

  for sample_transition in sample_transitions:
    old_observation, action, reward, observation = sample_transition

    targets = np.reshape(get_q(action_model, old_observation, state_size), actions_dim)
    targets[action] = reward
    if observation is not None:
      predictions = predict(target_model, observation, state_size)
      new_action = np.argmax(predictions)
      targets[action] += gamma * predictions[0, new_action]

    batch_observations.append(old_observation)
    batch_targets.append(targets)

  train(action_model, batch_observations, batch_targets, actions_dim, state_size)

def main():
  
  
  state_size = rospy.get_param('/cartpole_v0/state_size')
  action_size = rospy.get_param('/cartpole_v0/n_actions')
  gamma = rospy.get_param('/cartpole_v0/gamma')
  batch_size = rospy.get_param('/cartpole_v0/batch_size')
  target_update_freq = rospy.get_param('/cartpole_v0/target_update_freq')
  initial_random_action = rospy.get_param('/cartpole_v0/initial_random_action')
  replay_memory_size = rospy.get_param('/cartpole_v0/replay_memory_size')
  episodes_training = rospy.get_param('/cartpole_v0/episodes_training')
  max_iterations = rospy.get_param('/cartpole_v0/max_iterations')
  epsilon_decay = rospy.get_param('/cartpole_v0/max_iterations')
  learning_rate = rospy.get_param('/cartpole_v0/learning_rate')
  done_episode_reward = rospy.get_param('/cartpole_v0/done_episode_reward')
  
  steps_until_reset = target_update_freq
  random_action_probability = initial_random_action

  # Initialize replay memory D to capacity N
  replay = ReplayBuffer(replay_memory_size)

  # Initialize action-value model with random weights
  action_model = get_model(state_size, learning_rate)

  env = gym.make('CartPoleStayUp-v0')
  env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1', force=True)

  for episode in range(episodes_training):
    observation = env.reset()

    for iteration in range(max_iterations):
      random_action_probability *= epsilon_decay
      random_action_probability = max(random_action_probability, 0.1)
      old_observation = observation

      # We dont support render in openai_ros
      """
      if episode % 1 == 0:
        env.render()
      """
      
      if np.random.random() < random_action_probability: action = np.random.choice(range(action_size)) else: q_values = get_q(action_model, observation, state_size) action = np.argmax(q_values) observation, reward, done, info = env.step(action) if done: print 'Episode {}, iterations: {}'.format( episode, iteration ) # print action_model.get_weights() # print target_model.get_weights() #print 'Game finished after {} iterations'.format(iteration) reward = done_episode_reward replay.add(old_observation, action, reward, None) break replay.add(old_observation, action, reward, observation) if replay.size() >= batch_size:
        sample_transitions = replay.sample(batch_size)
        update_action(action_model, action_model, sample_transitions, action_size, state_size, gamma)
        steps_until_reset -= 1



if __name__ == "__main__":
    rospy.init_node('cartpole_mbalunovic_algorithm', anonymous=True, log_level=rospy.FATAL)
    main()

Results

$ roslaunch cartpole_tests start_training_mbalunovic.launch

For the ruippeixotog algorithm

Create the config file cartpole_ruippeixotog_params.yaml


cartpole_v0: #namespace

    #qlearn parameters
    state_size: 4 # number of elements in the state array from observations.
    n_actions: 2 # Number of actions used by algorithm and task
    gamma: 0.95 # future rewards value, 0 none 1 a lot
    epsilon: 1.0 # exploration, 0 none 1 a lot
    epsilon_decay: 0.995 # how we reduse the exploration
    epsilon_min: 0.1 # minimum value that epsilon can have
    batch_size: 32 # maximum size of the batches sampled from memory
    episodes_training: 1000
    episodes_running: 500

    #environment variables
    control_type: "velocity"
    min_pole_angle: -0.7 #-23°
    max_pole_angle: 0.7 #23°
    max_base_velocity: 50
    max_base_pose_x: 1.0
    min_base_pose_x: -1.0
    
    # those parameters are very important. They are affecting the learning experience
    # They indicate how fast the control can be
    # If the running step is too large, then there will be a long time between 2 ctrl commans
    # If the pos_step is too large, then the changes in position will be very abrupt
    running_step: 0.04 # amount of time the control will be executed
    pos_step: 1.0     # increment in position/velocity/effort, depends on the control for each command
    init_pos: 0.0 # Position in which the base will start
    wait_time: 0.1 # Time to wait in the reset phases

Create the launch file start_training_ruippeixotog.launch

<?xml version="1.0" encoding="UTF-8"?>
<launch>
    <rosparam command="load" file="$(find cartpole_tests)/config/cartpole_ruippeixotog_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="cartpole_tests" name="cartpole3D_ruippeixotog_algorithm" type="cartpole3D_ruippeixotog_algorithm.py" output="screen"/>
</launch>

Create the training script cartpole_ruippeixotog_algorithm.py

#!/usr/bin/env python
import rospy

import os

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

from gym_runner import GymRunner
from q_learning_agent import QLearningAgent

# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up

class CartPoleAgent(QLearningAgent):
    def __init__(self):
        
        state_size = rospy.get_param('/cartpole_v0/state_size')
        action_size = rospy.get_param('/cartpole_v0/n_actions')
        gamma = rospy.get_param('/cartpole_v0/gamma')
        epsilon = rospy.get_param('/cartpole_v0/epsilon')
        epsilon_decay = rospy.get_param('/cartpole_v0/epsilon_decay')
        epsilon_min = rospy.get_param('/cartpole_v0/epsilon_min')
        batch_size = rospy.get_param('/cartpole_v0/batch_size')
        
        
        
        QLearningAgent.__init__(self,
                                state_size=state_size,
                                action_size=action_size,
                                gamma=gamma,
                                epsilon=epsilon,
                                epsilon_decay=epsilon_decay,
                                epsilon_min=epsilon_min,
                                batch_size=batch_size)
        
        #super(CartPoleAgent, self).__init__(4, 2)

    def build_model(self):
        model = Sequential()
        model.add(Dense(12, activation='relu', input_dim=4))
        model.add(Dense(12, activation='relu'))
        model.add(Dense(2))
        model.compile(Adam(lr=0.001), 'mse')

        # load the weights of the model if reusing previous training session
        # model.load_weights("models/cartpole-v0.h5")
        return model


if __name__ == "__main__":
    rospy.init_node('cartpole3D_ruippeixotog_algorithm', anonymous=True, log_level=rospy.FATAL)
    
    episodes_training = rospy.get_param('/cartpole_v0/episodes_training')
    episodes_running = rospy.get_param('/cartpole_v0/episodes_running')
    max_timesteps = rospy.get_param('/cartpole_v0/max_timesteps', 10000)
    
    gym = GymRunner('CartPoleStayUp-v0', 'gymresults/cartpole-v0', max_timesteps)
    agent = CartPoleAgent()

    gym.train(agent, episodes_training, do_train=True, do_render=False, publish_reward=False)
    gym.run(agent, episodes_running, do_train=False, do_render=False, publish_reward=False)

    agent.model.save_weights("models/cartpole-v0.h5", overwrite=True)
    #gym.close_and_upload(os.environ['API_KEY'])

Results obtained

$ roslaunch cartpole_tests start_training_ruippeixotog.launch

Exercise for you!

Do the same process we did for the cartpole but for a Turtlebot robot that has to learn how to move around a round maze. Check the openai_ros package for the required RobotEnvironment and TaskEnvironment to use. You will see how easy it is to change what we have developed for the cartpole into a different robot and task!

Related Course

Using OpenAI with ROS

[ROS Projects] – Use OpenAI_ROS with Turtlebot2 Step by Step – PE

This is an extra short video updating what it has been done new on OpenAI_ROS package. You can find the link to the package wiki down below.
This is important for the people that saw the other two videos, because the structure of the package has changed and it might lead to some confusion and errors.

[irp posts=”10259″ name=”ROS Projects – Use OpenAI_ROS with Turtlebot2 Step by Step #Part 1″]

[irp posts=”10441″ name=”ROS Projects – Use OpenAI_ROS with TurtleBot2 Step-by-Step #Part 2″]

Step 0. Create a project in ROS Development Studio(ROSDS)

ROSDS helps you follow our tutorial in a fast pace without dealing without setting up an environment locally. If you haven’t had an account yet, you can create a free account here. Let’s create a new project and call it rrbot_control. You can easily clone the project with this link.

Step 1. Train Turtlebot 2 with OpenAI ROS

In the ROSject, we already clone the openai ros package for you. You can run the turtlebot 2 simulation from Simulations->Select launch file->main.launch. You should see the simulation is up and running.

Then you can run the training script with

cd catkin_ws
source devel/setup.bash
roslaunch turtle2_openai_ros_example start_training.launch

The robot starts moving and learning how to move in a maze.

Want to learn more?

If you are interested in using the openAI gym in ROS, please check our OpenAI Gym for Robotics 101 course.

Edit by: Tony Huang

We have also created a ROS Wiki fo the package with tutorials:
[https://goo.gl/Hu2FhY]
A page with all the robots supported by OpenAi_ROS package:
[https://goo.gl/xro8Z9]
Here you have the ROSJect with the updated code from the git and simulation, just click and run: [https://rds.theconstructsim.com/tc_projects/use_project_share_link/2f58476a-40e9-46e9-a25c-9588b5a78441]

[ROS Projects] Use OpenAI_ROS with WAM-V Step-by-Step

What you will learn?

In this video you will learn how to use openai_ros package to make learn the WAM-V sea robot of the RobotX Challenge to pass the first Navigation Control. The only thing you need to code is the AI algorithm if you want to use a different one to the one provided Q-Learn.

You can run the ROSject containing all the ROS code and simulation of the robot and sea by clicking on the next button:

And here the different gits used:

OpenAI_ROS package: [https://bitbucket.org/theconstructcore/openai_ros/src/master/]
Example Code: [https://bitbucket.org/theconstructcore/openai_examples_projects/src/tutorials/]
Simulation: [https://bitbucket.org/theconstructcore/vmrc/src/master/]

And a course on OpenAI in RobotIgniteAcademy: OpenAI Gym for ROS Based Robots 101

Link to RobotX PDF instructions for the challenge:
[https://robotx.org/images/files/RobotX_2018_Task_Summary.pdf]
[https://www.robotx.org/]

[ROS Projects] – Use OpenAI_ROS with TurtleBot2 Step-by-Step #Part 2

In this second part, you will continue by creating your own TaskEnvironment for a different TurtleBot simulation environment with a wall.
You will create this Task Env that allows the robot to learn how to reach a certain position in the map without running into the wall.

[ROS Projects] – Use OpenAI_ROS with Turtlebot2 Step by Step #Part 1

In this new video series, we are going to learn how to use the openai_ros package created by TheConstruct. You will learn to use it with all the robots that we give suport to. You will learn in this video divided into two parts, how to use openai_ros with Turtlebot2. You will learn how to setup everything to use our package and make it learn how to move around a maze with very little coding.

Here is the git of OpenAi_ros:
https://bitbucket.org/theconstructcore/openai_ros/src/master/

Here is the git to the code generated in this video and the next to come:
https://bitbucket.org/theconstructcore/openai_examples_projects/src/master/

Here a link to the OpenAI 101 course in RobotIgniteAcademy

Step 1. Create a project in ROS Development Studio(ROSDS)

ROSDS helps you follow our tutorial in a fast pace without dealing without setting up an environment locally. If you haven’t had an account yet, you can create a free account here.

Step 2. Clone the repository and create a package for the project

At first, we have to clone the repositories that we need for this project

cd ~/simulation_ws/src
git clone https://bitbucket.org/theconstructcore/open_ai_gym_construct.git
cd ~/catkin_ws/src
git clone https://bitbucket.org/theconstructcore/openai_ros.git

You can launch the simulation environment from simulations->Select launch file->gym_construct-main.launch

Inside the open_ai ROS package, we need to specify turtlebot2_maze.py as our task.

We’ll start by creating the folders for the project

cd ~/catkin_ws/src
mkdir openai_examples_projects/turtle2_open_ai_ros_example

Then create three folder config, launch, and scripts under the directory. Inside the launch folder, let’s create a script called start_training.launch with the following content

<launch>
    <!-- This version uses the openai_ros environments -->
    <rosparam command="load" file="$(find turtle2_openai_ros_example)/config/turtlebot2_openai_qlearn_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="turtle2_openai_ros_example" name="turtlebot2_maze" type="start_qlearning.py" output="screen"/>
</launch>

This launch file will load the parameters first for the algorithm, you can define the parameter in a file called turtlebot2_openai_qlearn_params.yaml in the config folder with the following content

turtlebot2: #namespace

    running_step: 0.04 # amount of time the control will be executed
    pos_step: 0.016     # increment in position for each command
    
    #qlearn parameters
    alpha: 0.1
    gamma: 0.7
    epsilon: 0.9
    epsilon_discount: 0.999
    nepisodes: 500
    nsteps: 10000
    number_splits: 10 #set to change the number of state splits for the continuous problem and also the number of env_variable splits

    running_step: 0.06 # Time for each step
    wait_time: 0.1 # Time to wait in the reset phases

    n_actions: 4 # We have 3 actions, Forwards,TurnLeft,TurnRight,Backwards
    n_observations: 6 # We have 6 different observations

    speed_step: 1.0 # Time to wait in the reset phases

    linear_forward_speed: 0.5 # Spwwed for ging fowards
    linear_turn_speed: 0.05 # Lienare speed when turning
    angular_speed: 0.3 # Angular speed when turning Left or Right
    init_linear_forward_speed: 0.0 # Initial linear speed in which we start each episode
    init_linear_turn_speed: 0.0 # Initial angular speed in shich we start each episode
    
    new_ranges: 5 # How many laser readings we jump in each observation reading, the bigger the less laser resolution
    min_range: 0.5 # Minimum meters below wich we consider we have crashed
    max_laser_value: 6 # Value considered Ok, no wall
    min_laser_value: 0 # Value considered there is an obstacle or crashed
    
    desired_pose:
      x: 5.0
      y: 0.0
      z: 0.0
    
    forwards_reward: 5 # Points Given to go forwards
    turn_reward: 1 # Points Given to turn as action
    end_episode_points: 200 # Points given when ending an episode

Then it will launch the node for qlearning, let’s create two files under the scripts folder. The first one called qlearn.py with the following content which is implemented by Victor Mayoral Vilches. We won’t go into detail here.

'''
Q-learning approach for different RL problems
as part of the basic series on reinforcement learning @
            							An introductory series to Reinforcement Learning (RL) with comprehensive step-by-step tutorials.							
        https://github.com/vmayoral/basic_reinforcement_learning
        369 forks.
        1,166 stars.
        3 open issues.
        
            Recent commits:
                                                                        Update tutorial1 to work with Python3Signed-off-by: Víctor Mayoral Vilches <v.mayoralv@gmail.com>, Víctor Mayoral Vilches                    
                                                            Minor changes to fix headlines, GitHub                    
                                                            Rendered TeX expressions in 50b3013c682b957e0ae851bad21f76d399c5059d, texify[bot]                    
                                                            Add a few additional bits to 11th's README, Victor Mayoral Vilches                    
                                                            Rendered TeX expressions in 2f6245c5a9825f37f06f3c84bb6784217648f4e6, texify[bot]                    
                            
        
    
 
Inspired by https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
 
        @author: Victor Mayoral Vilches <victor@erlerobotics.com>
'''
import random

class QLearn:
    def __init__(self, actions, epsilon, alpha, gamma):
        self.q = {}
        self.epsilon = epsilon  # exploration constant
        self.alpha = alpha      # discount constant
        self.gamma = gamma      # discount factor
        self.actions = actions

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)

    def learnQ(self, state, action, reward, value):
        '''
        Q-learning:
            Q(s, a) += alpha * (reward(s,a) + max(Q(s') - Q(s,a))            
        '''
        oldv = self.q.get((state, action), None)
        if oldv is None:
            self.q[(state, action)] = reward
        else:
            self.q[(state, action)] = oldv + self.alpha * (value - oldv)

    def chooseAction(self, state, return_q=False):
        q = [self.getQ(state, a) for a in self.actions]
        maxQ = max(q)

        if random.random() < self.epsilon:
            minQ = min(q); mag = max(abs(minQ), abs(maxQ))
            # add random values to all the actions, recalculate maxQ
            q = [q[i] + random.random() * mag - .5 * mag for i in range(len(self.actions))] 
            maxQ = max(q)

        count = q.count(maxQ)
        # In case there're several state-action max values 
        # we select a random one among them
        if count > 1:
            best = [i for i in range(len(self.actions)) if q[i] == maxQ]
            i = random.choice(best)
        else:
            i = q.index(maxQ)

        action = self.actions[i]        
        if return_q: # if they want it, give it!
            return action, q
        return action

    def learn(self, state1, action1, reward, state2):
        maxqnew = max([self.getQ(state2, a) for a in self.actions])
        self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)

The next one is the training script called start_qlearning.py with the following content

#!/usr/bin/env python

import gym
import numpy
import time
import qlearn
from gym import wrappers
# ROS packages required
import rospy
import rospkg
# import our training environment
from openai_ros.turtlebot2 import turtlebot2_maze


if __name__ == '__main__':

    rospy.init_node('turtlebot2_maze_qlearn', anonymous=True, log_level=rospy.WARN)

    # Create the Gym environment
    env = gym.make('TurtleBot2Maze-v0')
    rospy.loginfo("Gym environment done")

    # Set the logging system
    rospack = rospkg.RosPack()
    pkg_path = rospack.get_path('my_turtlebot2_maze')
    outdir = pkg_path + '/training_results'
    env = wrappers.Monitor(env, outdir, force=True)
    rospy.loginfo("Monitor Wrapper started")

    last_time_steps = numpy.ndarray(0)

    # Loads parameters from the ROS param server
    # Parameters are stored in a yaml file inside the config directory
    # They are loaded at runtime by the launch file
    Alpha = rospy.get_param("/turtlebot2/alpha")
    Epsilon = rospy.get_param("/turtlebot2/epsilon")
    Gamma = rospy.get_param("/turtlebot2/gamma")
    epsilon_discount = rospy.get_param("/turtlebot2/epsilon_discount")
    nepisodes = rospy.get_param("/turtlebot2/nepisodes")
    nsteps = rospy.get_param("/turtlebot2/nsteps")

    running_step = rospy.get_param("/turtlebot2/running_step")

    # Initialises the algorithm that we are going to use for learning
    qlearn = qlearn.QLearn(actions=range(env.action_space.n),
                           alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
    initial_epsilon = qlearn.epsilon

    start_time = time.time()
    highest_reward = 0

    # Starts the main training loop: the one about the episodes to do
    for x in range(nepisodes):
        rospy.logdebug("############### START EPISODE=>" + str(x))

        cumulated_reward = 0
        done = False
        if qlearn.epsilon > 0.05:
            qlearn.epsilon *= epsilon_discount

        # Initialize the environment and get first state of the robot
        observation = env.reset()
        state = ''.join(map(str, observation))

        # Show on screen the actual situation of the robot
        # env.render()
        # for each episode, we test the robot for nsteps
        for i in range(nsteps):
            rospy.logwarn("############### Start Step=>" + str(i))
            # Pick an action based on the current state
            action = qlearn.chooseAction(state)
            rospy.logwarn("Next action is:%d", action)
            # Execute the action in the environment and get feedback
            observation, reward, done, info = env.step(action)

            rospy.logwarn(str(observation) + " " + str(reward))
            cumulated_reward += reward
            if highest_reward < cumulated_reward:
                highest_reward = cumulated_reward

            nextState = ''.join(map(str, observation))

            # Make the algorithm learn based on the results
            rospy.logwarn("# state we were=>" + str(state))
            rospy.logwarn("# action that we took=>" + str(action))
            rospy.logwarn("# reward that action gave=>" + str(reward))
            rospy.logwarn("# episode cumulated_reward=>" + str(cumulated_reward))
            rospy.logwarn("# State in which we will start next step=>" + str(nextState))
            qlearn.learn(state, action, reward, nextState)

            if not (done):
                rospy.logwarn("NOT DONE")
                state = nextState
            else:
                rospy.logwarn("DONE")
                last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
                break
            rospy.logwarn("############### END Step=>" + str(i))
            #raw_input("Next Step...PRESS KEY")
            # rospy.sleep(2.0)
        m, s = divmod(int(time.time() - start_time), 60)
        h, m = divmod(m, 60)
        rospy.logerr(("EP: " + str(x + 1) + " - [alpha: " + str(round(qlearn.alpha, 2)) + " - gamma: " + str(
            round(qlearn.gamma, 2)) + " - epsilon: " + str(round(qlearn.epsilon, 2)) + "] - Reward: " + str(
            cumulated_reward) + "     Time: %d:%02d:%02d" % (h, m, s)))

    rospy.loginfo(("\n|" + str(nepisodes) + "|" + str(qlearn.alpha) + "|" + str(qlearn.gamma) + "|" + str(
        initial_epsilon) + "*" + str(epsilon_discount) + "|" + str(highest_reward) + "| PICTURE |"))

    l = last_time_steps.tolist()
    l.sort()

    # print("Parameters: a="+str)
    rospy.loginfo("Overall score: {:0.2f}".format(last_time_steps.mean()))
    rospy.loginfo("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))

    env.close()

Line 20 is the most important part of this script. It will register the training environment.

From line 52 to line 102 is the training process of the algorithm. For each timestep, the algorithm will decide the action to take based on the current state. After an action is taken. The observation is measured and the reward is calculated. The algorithm will keep learning to get the highest possible reward.

After that, you can compile the package and launch the training with the following command.

cd ~/catkin_ws
catkin_make
source devel/setup.bash
roslaunch turtle2_openai_ros_example start_training.launch

You should see the turtlebot start to moving and reset if it touches the wall.

In this example, we use the environment defined by openai_ros package, but what if I want to do some different tasks? e.g. change the actions the robot can take?

Let’s create our own environment called my_turtlebot2_maze.py under the script folder with the following content

import rospy
import numpy
from gym import spaces
from openai_ros import turtlebot2_env
from gym.envs.registration import register

timestep_limit_per_episode = 10000 # Can be any Value

register(
        id='MyTurtleBot2Maze-v0',
        entry_point='my_turtlebot2_maze:MyTurtleBot2MazeEnv',
        timestep_limit=timestep_limit_per_episode,
    )

class MyTurtleBot2MazeEnv(turtlebot2_env.TurtleBot2Env):
    def __init__(self):
        """
        This Task Env is designed for having the TurtleBot2 in some kind of maze.
        It will learn how to move around the maze without crashing.
        """
        
        # Only variable needed to be set here
        number_actions = rospy.get_param('/turtlebot2/n_actions')
        self.action_space = spaces.Discrete(number_actions)
        
        # We set the reward range, which is not compulsory but here we do it.
        self.reward_range = (-numpy.inf, numpy.inf)
        
        
        #number_observations = rospy.get_param('/turtlebot2/n_observations')
        """
        We set the Observation space for the 6 observations
        cube_observations = [
            round(current_disk_roll_vel, 0),
            round(y_distance, 1),
            round(roll, 1),
            round(pitch, 1),
            round(y_linear_speed,1),
            round(yaw, 1),
        ]
        """
        
        # Actions and Observations
        self.linear_forward_speed = rospy.get_param('/turtlebot2/linear_forward_speed')
        self.linear_turn_speed = rospy.get_param('/turtlebot2/linear_turn_speed')
        self.angular_speed = rospy.get_param('/turtlebot2/angular_speed')
        self.init_linear_forward_speed = rospy.get_param('/turtlebot2/init_linear_forward_speed')
        self.init_linear_turn_speed = rospy.get_param('/turtlebot2/init_linear_turn_speed')
        
        self.new_ranges = rospy.get_param('/turtlebot2/new_ranges')
        self.min_range = rospy.get_param('/turtlebot2/min_range')
        self.max_laser_value = rospy.get_param('/turtlebot2/max_laser_value')
        self.min_laser_value = rospy.get_param('/turtlebot2/min_laser_value')
        
        
        
        # We create two arrays based on the binary values that will be assigned
        # In the discretization method.
        laser_scan = self._check_laser_scan_ready()
        num_laser_readings = len(laser_scan.ranges)/self.new_ranges
        high = numpy.full((num_laser_readings), self.max_laser_value)
        low = numpy.full((num_laser_readings), self.min_laser_value)
        
        # We only use two integers
        self.observation_space = spaces.Box(low, high)
        
        rospy.logdebug("ACTION SPACES TYPE===>"+str(self.action_space))
        rospy.logdebug("OBSERVATION SPACES TYPE===>"+str(self.observation_space))
        
        # Rewards
        self.forwards_reward = rospy.get_param("/turtlebot2/forwards_reward")
        self.turn_reward = rospy.get_param("/turtlebot2/turn_reward")
        self.end_episode_points = rospy.get_param("/turtlebot2/end_episode_points")

        self.cumulated_steps = 0.0

        # Here we will add any init functions prior to starting the MyRobotEnv
        super(MyTurtleBot2MazeEnv, self).__init__()

    def _set_init_pose(self):
        """Sets the Robot in its init pose
        """
        self.move_base( self.init_linear_forward_speed,
                        self.init_linear_turn_speed,
                        epsilon=0.05,
                        update_rate=10)

        return True


    def _init_env_variables(self):
        """
        Inits variables needed to be initialised each time we reset at the start
        of an episode.
        :return:
        """
        # For Info Purposes
        self.cumulated_reward = 0.0
        # Set to false Done, because its calculated asyncronously
        self._episode_done = False


    def _set_action(self, action):
        """
        This set action will Set the linear and angular speed of the turtlebot2
        based on the action number given.
        :param action: The action integer that set s what movement to do next.
        """
        
        rospy.logdebug("Start Set Action ==>"+str(action))
        # We convert the actions to speed movements to send to the parent class CubeSingleDiskEnv
        if action == 0: #FORWARD
            linear_speed = self.linear_forward_speed
            angular_speed = 0.0
            self.last_action = "FORWARDS"
        elif action == 1: #LEFT
            linear_speed = self.linear_turn_speed
            angular_speed = self.angular_speed
            self.last_action = "TURN_LEFT"
        elif action == 2: #RIGHT
            linear_speed = self.linear_turn_speed
            angular_speed = -1*self.angular_speed
            self.last_action = "TURN_RIGHT"
        elif action == 3: #RIGHT
            linear_speed = -self.linear_turn_speed
            angular_speed = 0.0
            self.last_action = "BACKWARDS"

        
        # We tell TurtleBot2 the linear and angular speed to set to execute
        self.move_base(linear_speed, angular_speed, epsilon=0.05, update_rate=10)
        
        rospy.logdebug("END Set Action ==>"+str(action))

    def _get_obs(self):
        """
        Here we define what sensor data defines our robots observations
        To know which Variables we have acces to, we need to read the
        TurtleBot2Env API DOCS
        :return:
        """
        rospy.logdebug("Start Get Observation ==>")
        # We get the laser scan data
        laser_scan = self.get_laser_scan()
        
        discretized_observations = self.discretize_observation( laser_scan,
                                                                self.new_ranges
                                                                )

        rospy.logdebug("Observations==>"+str(discretized_observations))
        rospy.logdebug("END Get Observation ==>")
        return discretized_observations
        

    def _is_done(self, observations):
        
        if self._episode_done:
            rospy.logerr("TurtleBot2 is Too Close to wall==>")
        else:
            rospy.logerr("TurtleBot2 is Ok ==>")

        return self._episode_done

    def _compute_reward(self, observations, done):

        if not done:
            if self.last_action == "FORWARDS":
                reward = self.forwards_reward
            elif self.last_action == "BACKWARDS":
                reward = -1*self.forwards_reward
            else:
                reward = self.turn_reward
        else:
            reward = -1*self.end_episode_points


        rospy.logdebug("reward=" + str(reward))
        self.cumulated_reward += reward
        rospy.logdebug("Cumulated_reward=" + str(self.cumulated_reward))
        self.cumulated_steps += 1
        rospy.logdebug("Cumulated_steps=" + str(self.cumulated_steps))
        
        return reward


    # Internal TaskEnv Methods
    
    def discretize_observation(self,data,new_ranges):
        """
        Discards all the laser readings that are not multiple in index of new_ranges
        value.
        """
        self._episode_done = False
        
        discretized_ranges = []
        mod = len(data.ranges)/new_ranges
        
        rospy.logdebug("data=" + str(data))
        rospy.logwarn("new_ranges=" + str(new_ranges))
        rospy.logwarn("mod=" + str(mod))
        
        for i, item in enumerate(data.ranges):
            if (i%mod==0):
                if item == float ('Inf') or numpy.isinf(item):
                    discretized_ranges.append(self.max_laser_value)
                elif numpy.isnan(item):
                    discretized_ranges.append(self.min_laser_value)
                else:
                    discretized_ranges.append(int(item))
                    
                if (self.min_range > item > 0):
                    rospy.logerr("done Validation >>> item=" + str(item)+"< "+str(self.min_range))
                    self._episode_done = True
                else:
                    rospy.logwarn("NOT done Validation >>> item=" + str(item)+"< "+str(self.min_range))
                    

        return discretized_ranges

Please notice the ID defined in line 10 is the one we are going to use

In line 15, you can see that the environment is inherited from the turtlebot2_env from the openai_ros package. By doing that, you don’t need to define how to move the robot or get the sensor data since the openai_ros package already defined it for you.

Then let’s set up a new training script and launch file. We’ll call the new training script my_start_qlearning_maze.py with same same content as start_qlearning.py but with the following changes

...
from gym import wrappers
# ROS packages required
import rospy
import rospkg
# import our training environment
import my_turtlebot2_maze


if __name__ == '__main__':

    rospy.init_node('example_turtlebot2_maze_qlearn', anonymous=True, log_level=rospy.WARN)

    # Create our own Gym environment
    env = gym.make('MyTurtleBot2Maze-v0')
    rospy.loginfo("Gym environment done")
...

Here we directly import the environment from the previous script and use the ID it defined. The last step is to create the start_training_maze.launch file under the launch folder with the following content.

<launch>
    <!-- This version uses the openai_ros environments -->
    <rosparam command="load" file="$(find turtle2_openai_ros_example)/config/turtlebot2_openai_qlearn_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="turtle2_openai_ros_example" name="example_turtlebot2_maze_qlearn" type="my_start_qlearning_maze.py" output="screen"/>
</launch>

This launch file will launch the node which uses the last script for training.

You can define different action or reward as you want, but remember to also change the n_actions parameter in the turtlebot2_openai_qlearn_params.yaml because it is used by the openai gym.

Have fun!

If you want to learn more about OpenAI gym, please check our OpenAI Gym for Robotics 101 course. You’ll learn how to create environments and train robot for different task in detail.