Getting started

The aim of BabyBench is to train MIMo to achieve the target behaviors (self-touch and hand regard) without any external supervision, i.e. without extrinsic rewards. How do you think infants learn these skills autonomously?

Table of Contents

  1. After installation
  2. Basic usage
    1. Initialization
    2. Simulations
    3. Wrapper functions
    4. Configuration files
      1. Behavior
      2. Scene
      3. Sensory modules
      4. Actuation module
      5. Other options
  3. Starter code examples
    1. Recommended configurations
    2. Random policies
    3. Intrinsic motivations

After installation

If you successfully installed BabyBench and ran the installation test, you should find a short video in the folder results/test_installation/videos of MIMo in his base environment.

You can also run a short evaluation of the installation test:

python evaluation.py --config=examples/config_test_installation.yml --episodes=1 --duration=100

Note that the evaluation script runs a random policy by default. You can change the policy by modifying the action selection in evaluation.py.

We recommend taking some time to explore the folder structure (described here). Inside the results/test_installation folder, you will find two folders, videos and logs, and a scene.xml file (read the API page for details). The videos folder should contain the video of the installation test and the video of the evaluation (if you ran it). The logs folder should contain a log file called training.pkl generated during the installation test and a log file called evaluation_0.pkl generated during the evaluation. These are pickled files; you can open them with the pickle module in Python, for example:

import pickle
with open('results/test_installation/logs/training.pkl', 'rb') as f:
    training_log = pickle.load(f)
print(training_log)

This should only print the following:

[{'steps': 0}]

Once you start training, these logs will contain a dictionary with the information collected during training. To know more about the logs, you can read the API page.

Basic usage

Initialization

You can initialize the base BabyBench environment as a regular Gymnasium environment. Create a file called test_env.py with the following code:

import gymnasium as gym
import mimoEnv

env = gym.make("BabyBench")
obs, _ = env.reset()
print(obs)

And then run the following command on your console to initialize the environment and print an observation:

python test_env.py

This will print an observation of the environment, which is a long dictionary with the following keys: observation, touch, eye_left, eye_right, and vestibular. Details about the observations are provided in the API page.
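
For a quick sanity check, you can also inspect the shape of each entry in the observation dictionary. The snippet below is a minimal sketch that extends the test_env.py script above:

for key, value in obs.items():
    # Print the shape of each array-valued entry (or its type otherwise)
    shape = getattr(value, "shape", None)
    print(f"{key}: {shape if shape is not None else type(value)}")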

Simulations

BabyBench uses the standard Gymnasium API for simulations. You can add the following code to your script to run a full episode in the environment (note: this will run for 1000 steps by default).

done = False
while not done:
    # TODO: REPLACE WITH YOUR ACTION SELECTION MODEL
    # action = select_action(obs)
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
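
If you want to run several episodes in a row, reset the environment between them and close it when you are done. The following sketch simply repeats the loop above with random actions:

for episode in range(3):  # illustrative number of episodes
    obs, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # replace with your action selection
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
env.close()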

Wrapper functions

You are not expected to modify any of the BabyBench or MIMo code, but rather to write your own training loops using the provided API to interact with the environments. In particular, the BabyBench environments always return extrinsic rewards of 0, because the focus of the competition is to train the agents to perform the desired behaviors without any external supervision. To use intrinsic rewards, you can write an environment wrapper:

class Wrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)

    def compute_intrinsic_reward(self, obs):
        # TODO: REPLACE WITH YOUR INTRINSIC MOTIVATION HERE
        intrinsic_reward = 0
        return intrinsic_reward

    def step(self, action):
        obs, extrinsic_reward, terminated, truncated, info = self.env.step(action)
        intrinsic_reward = self.compute_intrinsic_reward(obs)
        total_reward = intrinsic_reward + extrinsic_reward # extrinsic reward is always 0  
        return obs, total_reward, terminated, truncated, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

Other ideas, e.g. to learn goal-based behaviors, should also rely on wrappers rather than modifying the environments directly. You can read the Gymnasium documentation about wrappers and follow this tutorial.
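
As a usage example, the wrapper above can be applied to a BabyBench environment like any other Gymnasium wrapper. The sketch below reuses the env created earlier; since the wrapper's intrinsic reward is still 0, the printed reward will also be 0:

wrapped_env = Wrapper(env)
obs, _ = wrapped_env.reset()
action = wrapped_env.action_space.sample()
obs, reward, terminated, truncated, info = wrapped_env.step(action)
print(reward)  # intrinsic reward only, since the extrinsic reward is always 0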

Configuration files

To facilitate the initialization of BabyBench, we provide a configuration file for each environment. These files control the environment parameters and are loaded as follows:

import yaml
import babybench.utils as bb_utils

with open('examples/config_test_installation.yml') as f:
    config = yaml.safe_load(f)

env = bb_utils.make_env(config)

Running this code creates the environment used for the installation test. You can find a list of all the configuration options in the API page. The main parameters you need to set are described next.
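
Because the configuration is loaded as a plain Python dictionary, you can also override individual options in code before creating the environment. The sketch below assumes the YAML keys match the parameter names described in the next subsections:

config['behavior'] = 'selftouch'  # assumed key name; see the Behavior subsection
config['scene'] = 'crib'          # assumed key name; see the Scene subsection
env = bb_utils.make_env(config)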

Behavior

The behavior parameter determines the target behavior that will be tracked in the environment. This parameter can be set to none, selftouch or handregard. Each behavior will also load a specific MIMo version: the none and selftouch behaviors use MIMo v1 (with mitten-like hands), and the handregard behavior uses MIMo v2 (with fingers).

Scene

The scene parameter determines the scene that will be used in the environment. Each scene has specific objects and introduces different constraints, such as fixing MIMo to a specific pose. This parameter can be set to base, crib or cubes. Examples of each scene are shown below. While both behaviors can be learned in any of the scenes, we recommend using the crib scene for self-touch and the cubes scene for hand regard.

Starting positions in the base, crib, and cubes scenes.

Sensory modules

MIMo has four sensory modules: proprioception, vestibular system, vision, and touch. They can be enabled or disabled by setting the corresponding parameters, e.g. vision_active, to True or False. Additionally, each sensory module has specific parameters that can be changed in the configuration files, such as the resolution of the eye cameras, set with vision_resolution.

Actuation module

The actuation_model parameter determines the actuation model that will be used to control MIMo. This parameter can be set to spring_damper, muscle, or positional. You can decide which body parts will be actuated by setting the corresponding parameters, e.g. act_hands, to True or False. Alternatively, you can decide if a body part should be locked in its original position by setting the corresponding parameter, e.g. lock_hands, to True or False.
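
As a sketch covering both the sensory and the actuation options, and again assuming the YAML keys match the parameter names described above:

config['vision_active'] = True              # enable the vision module
config['vision_resolution'] = 64            # resolution of the eye cameras
config['actuation_model'] = 'spring_damper'
config['act_hands'] = True                  # actuate the hands
config['lock_feet'] = True                  # assumed key name, following lock_hands
env = bb_utils.make_env(config)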

Other options

The other options that can be set in the configuration files are the saving folder (save_dir) and the parameters of the environment (max_episode_steps, frame_skip, render_size, save_logs_every), which are described in the API page.

Starter code examples

We provide some basic examples with starter code to help you get up and running with training MIMo using reinforcement learning. These examples are available in the examples folder of the starter kit.

Recommended configurations

Besides the aforementioned config_test_installation.yml file, we also provide a configuration file for each of the two behaviors. Using these files is not mandatory; they are recommendations that we think provide a good starting point for training MIMo.

Self-touch

The configuration file for self-touch is config_selftouch.yml. It uses the crib scene. All the proprioception settings are set to True, whereas the vestibular and vision settings are set to False. The touch observations are enabled everywhere in MIMo’s body except for his eyes. The actuation_model is set to spring_damper, and the actuators of MIMo’s hands, fingers, arms, legs, body, and head are enabled, but not his feet or eyes, which are locked.

A number of simplifications are available here. In particular, you can reduce the size of the action space by disabling some of the actuators, e.g. the body or legs. You can also reduce the size of the observation space by disabling some of the parameters of the proprioception module, e.g. torques and limits. Finally, you can simplify the touch observations by using a higher scale factor, which will reduce the number of touch sensors, e.g. setting touch_scale to 5.
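
The following sketch applies some of these simplifications by overriding the self-touch configuration; act_body and act_legs are assumed key names following the act_hands pattern above, while touch_scale is named in the text:

config['act_body'] = False   # assumed key name: remove body actuators from the action space
config['act_legs'] = False   # assumed key name: remove leg actuators from the action space
config['touch_scale'] = 5    # coarser touch sensing, i.e. fewer touch sensors
env = bb_utils.make_env(config)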

Hand regard

The configuration file for hand regard is config_handregard.yml. It uses the cubes scene. All the proprioception settings are set to True, whereas the vestibular and touch modules are disabled. The visual system is active, with a resolution of 64 pixels. The actuation_model is set to spring_damper, and the actuators of MIMo’s hands, arms, and head are enabled, whereas his fingers, feet, legs, body, and eyes are locked.

As in the self-touch configuration, you can simplify the experiments by reducing the size of the action and observation spaces. In this case, you can also use the crib scene instead of the cubes scene. This will put MIMo in a supine position with fewer objects in his field of view, which may make learning hand regard easier.

Random policies

In the examples folder of the starter kit, you can find two code examples that execute random policies for both self-touch and hand regard. These examples are only provided to show how you can use the API to implement your own policies. They can also serve as a baseline for both behaviors, i.e. how much of his own body MIMo touches, or how often he fixates his own hand, when moving randomly. Running the scripts random_selftouch.py or random_handregard.py will execute the random policy for a total of 10,000 timesteps by default. These scripts use Python’s argparse module to parse command line arguments. To run a script with a different number of timesteps, you can use the following command:

python examples/random_selftouch.py --train_for=100000

The argument is called --train_for for consistency with the other examples described next; there is no actual training for the random policy. After running the script, you can find the training log in the results/self_touch/logs folder (or the corresponding value of save_dir in the configuration file used). You can then run an evaluation of the training log to get an estimate of the performance of the random policy.
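
For example, assuming the evaluation script uses the same interface as in the installation test, the self-touch log can be evaluated with:

python evaluation.py --config=examples/config_selftouch.yml --episodes=1 --duration=100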

Intrinsic motivations

As mentioned above, we expect intrinsically motivated reinforcement learning to be a common approach in the competition. The simplest way to implement it is with an environment wrapper, as described previously. The examples folder of the starter kit contains three scripts that can serve as starting points in this direction.

Template

The intrinsic_motivation_wrapper.py script is a template for learning an intrinsically-motivated policy. It contains the wrapper mentioned earlier, but also describes a common training loop with the Proximal Policy Optimization (PPO) algorithm of the Stable Baselines-3 library. The relevant code snippet is shown below:

from stable_baselines3 import PPO

model = PPO("MultiInputPolicy", wrapped_env, verbose=1)
model.learn(total_timesteps=args.train_for)
model.save(os.path.join(config["save_dir"], "model"))

Running this code will create an instance of the PPO algorithm that can accommodate the observation and action spaces of the environment. The model will then be trained for a total of 10,000 timesteps by default, and saved in the save_dir folder of the configuration file.

Since the value of the intrinsic reward of this wrapper is always 0, running this script will not produce any interesting behaviors. However, you can use this script as a starting point for your own implementations of intrinsically-motivated reinforcement learning. The next two examples show how you can use the wrapper to learn a self-touch policy and a hand regard policy.
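
Once training finishes, the saved model can be reloaded and used for action selection. The following minimal sketch uses the standard Stable Baselines-3 API and the save path from the snippet above:

import os
from stable_baselines3 import PPO

model = PPO.load(os.path.join(config["save_dir"], "model"))
obs, _ = wrapped_env.reset()
# Predict an action from the current observation and step the environment
action, _ = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = wrapped_env.step(action)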

Self-touch

The intrinsic_selftouch_count.py script adapts the intrinsic_motivation_wrapper.py script with a very simple approach towards learning self-touch: trying to maximize the fraction of active touch sensors in MIMo’s body. The relevant code snippet is shown below:

def compute_intrinsic_reward(self, obs):
    # Count the fraction of active touch sensors in MIMo's body
    intrinsic_reward = np.sum(obs['touch'] > 1e-6) / len(obs['touch'])
    return intrinsic_reward

Do you think this is a good approach? If you can come up with a better intrinsic reward, you can easily modify the script and train MIMo to learn self-touch.

Hand regard

The intrinsic_handregard_saliency.py script also adapts the intrinsic_motivation_wrapper.py script, but aims to learn the hand regard behavior instead. We postulate that the hand can be a salient object in MIMo’s field of view, and thus MIMo might learn to fixate his hand in order to maximize this saliency. We define a simple saliency metric:

from scipy.ndimage import convolve

def simple_saliency(rgb_img):
    gray_img = bb_utils.to_grayscale(rgb_img)
    # Define a simple Laplacian kernel
    laplacian_kernel = np.array([[0, -1, 0],
                                  [-1, 4, -1],
                                  [0, -1, 0]])
    # Apply the kernel using convolution
    edges = convolve(gray_img, laplacian_kernel, mode='reflect')
    # Compute saliency as sqrt of sum of squared edge intensities (normalized)
    saliency = np.sqrt(np.sum(edges**2)) / (gray_img.shape[0] * gray_img.shape[1])
    return saliency

And then the intrinsic reward is computed as:

def compute_intrinsic_reward(self, obs):
    return simple_saliency(obs['eye_left']) + simple_saliency(obs['eye_right'])

Again, you can come up with better intrinsic rewards to explain the emergence of hand regard. We encourage you to try different approaches and make a submission to the competition.


Next page: API