Deep Reinforcement Learning Soccer Playing Robots DeepMind Experiment

Figure 1: AI-generated image of soccer playing robots [1].

Nowadays it is common to see news stories about how AI and robots are becoming ever more integrated into our daily lives, and about which jobs they might replace. But what about sports? Can robots learn to play soccer? That is one of the questions that researchers at Google DeepMind set out to answer in their recent paper entitled Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning [2]. In this blog post, we will deepen our understanding of the coordination and decision-making challenges involved in teaching robots to play soccer, and of the technologies used to overcome them. These challenges are not unique to soccer; they also arise in fields like industrial robotics. We will start with a general overview and then analyze the technical aspects in more depth. So, whether you’re an AI student, an enthusiast, or just someone interested in robotics, this blog is for you!


Google DeepMind takes a deep reinforcement learning (RL) approach to train a humanoid robot to play soccer: the robot learns a wide range of individual soccer skills and composes them into long-horizon strategic behavior. In contrast, companies like Boston Dynamics (known for their cool dancing-robot videos [3]) rely more on pre-programmed movement behaviors for their robots. While pre-programmed movements can be effective for specific tasks, they may not adapt to new situations or environments. Deep RL, on the other hand, allows robots to learn and adapt to new situations and environments through trial and error. In this approach the goal is specified by a reward function, which provides positive reinforcement or negative punishment depending on how well the robot's behavior matches the desired goal [4].
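To make the idea of a reward function concrete, here is a minimal sketch. The event names and values are illustrative inventions for this blog post, not DeepMind's actual reward design:

```python
# Hypothetical sketch of how a reward function turns a goal into a
# learning signal; events and magnitudes are illustrative only.
def reward(scored_goal: bool, conceded_goal: bool, fell_over: bool) -> float:
    r = 0.0
    if scored_goal:
        r += 1.0   # positive reinforcement for the desired outcome
    if conceded_goal:
        r -= 1.0   # punishment for the undesired outcome
    if fell_over:
        r -= 0.1   # small penalty to discourage unsafe behavior
    return r
```

During training, the agent adjusts its behavior to maximize the cumulative value of such a signal over time.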

In this case it is not as easy as telling the robot: “Hey! You have to score a goal”; instead, by carefully defining a suitable reward function, deep RL can be used to synthesize dynamic, agile, context-adaptive movement skills such as walking, running, turning around, kicking, and fall recovery. The resulting policy exhibits robust and dynamic movement skills and transitions between them in a smooth, stable, and efficient manner. The agents also developed a basic strategic understanding of the game, learning to anticipate ball movements and to block opponent shots. We will walk through the process from the first stages of the experiment, through training and simulation-to-real-world transfer, to the overall results and possible future directions.

The experiment was conducted in a simulated environment, using deep RL to train miniature humanoid robots with 20 controllable joints to play a simplified one-vs-one (1v1) soccer game using proprioception (the sense of the relative position of one's own articulated parts [5]) and game-state features as observations. The simulated environment is designed to mimic the real-world environment as closely as possible. The aim of the 1v1 soccer task is to score a goal while preventing the opponent from scoring. In simulation, the agent is rewarded for scoring when the center of the ball enters the goal. An episode terminates once a goal is scored or after 50 seconds; the opponent's and the ball's location and orientation are then randomly reset, using a technique called domain randomization. If you are interested in the robot hardware and software setup, consider reading the further details in the paper.
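As a rough sketch of this episode logic (the coordinate ranges here are made up for illustration, not the paper's actual pitch dimensions):

```python
import random

# Illustrative sketch (not the paper's code) of the 1v1 episode logic:
# an episode ends on a goal or after 50 simulated seconds, and the ball
# and opponent are then reset to randomized poses.
EPISODE_LIMIT_S = 50.0

def episode_done(goal_scored: bool, elapsed_s: float) -> bool:
    return goal_scored or elapsed_s >= EPISODE_LIMIT_S

def randomized_reset(rng: random.Random) -> dict:
    # Positions and heading are uniformly sampled; the ranges are invented.
    return {
        "ball_xy": (rng.uniform(-2.0, 2.0), rng.uniform(-1.5, 1.5)),
        "opponent_xy": (rng.uniform(-2.0, 2.0), rng.uniform(-1.5, 1.5)),
        "opponent_heading": rng.uniform(-3.14159, 3.14159),
    }
```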

Training the agents

Instead of simply letting the agents play against each other, the Google DeepMind team used several strategies to encourage learning, such as training individual skills in isolation and then composing those skills end-to-end in a self-play setting. Further challenges that had to be overcome in the training process include:

  • Synthesizing sophisticated and safe movement skills for a low-cost, miniature humanoid robot that can be composed into complex behavioral strategies in dynamic environments.
  • Developing a basic strategic understanding of the game and learning to anticipate ball movements and to block opponent shots.
  • Achieving good-quality transfer despite significant unmodeled effects and variations across robot instances by using a combination of high-frequency control, targeted dynamics randomization, and perturbations during training in simulation.
  • Learning safe and effective movements while still performing in a dynamic and agile way by making minor hardware modifications together with basic regularization of the behavior during training.

To overcome these challenges, the researchers drew on a range of deep RL techniques. First, the soccer game environment was modeled as a Partially Observable Markov Decision Process (POMDP), a mathematical framework for decision-making problems in which the state of the system is not fully observable. The state space of the environment includes the global location, orientation, joint angles, and velocities of the robot, as well as the location and velocity of the ball. At each time step, the robot observes features extracted from the state and outputs a 20-dimensional continuous action vector corresponding to the desired positions of its joints. The action is sampled from a stochastic policy conditioned on the observation-action history, which compensates for the partial observability of the environment. After filtering, the action is executed in the environment, and the robot receives a reward based on the new state. The reward function is a weighted sum of multiple reward components and depends on the stage of training. These interactions give rise to a trajectory over a horizon, from which a policy is learned.
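The control loop just described might be sketched like this. The policy function, history shape, and reward components below are illustrative assumptions, not the paper's implementation; only the 20-joint action dimension comes from the paper:

```python
import numpy as np

# Sketch of one control step under the POMDP described above: the policy
# maps an observation-action history to a distribution over a
# 20-dimensional joint-position target.
N_JOINTS = 20

def sample_action(history: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Stand-in for the learned stochastic policy: a deterministic mean
    # computed from the history, plus Gaussian exploration noise.
    mean = np.tanh(history[-N_JOINTS:])        # bounded joint targets
    return mean + 0.05 * rng.standard_normal(N_JOINTS)

def total_reward(components: dict, weights: dict) -> float:
    # The reward is a weighted sum of components; the weights depend on
    # the stage of training.
    return sum(weights[k] * components[k] for k in components)
```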

Training proceeded in two main stages, each built around its own policy (a policy being the strategy an agent follows in pursuit of its goals [6]): the teacher policy and the self-play policy.

Train-teacher policy:

Used to train individual skills in isolation, such as walking or kicking, using a reward function that encouraged the agent to perform the skill as quickly and efficiently as possible.

The training episodes end when the agent falls over, goes out of bounds, enters the goal penalty area, or the opponent scores. At the beginning of each episode, the agent, the opponent, and the ball are randomly placed on the pitch. Both players start in a default standing pose, and the opponent has an untrained policy, which makes it fall almost immediately and remain on the ground for the duration of each episode. Therefore, the agent learns to avoid the opponent at this stage, but no further complex opponent interactions occur.

The range of possible actions for each joint is limited to allow a sufficient range of motion while minimizing the risk of self-collisions. Two shaping reward terms are included to improve sim-to-real transfer and reduce robot breakages. Highly dynamic gaits and kicks often put excessive stress on the knee joints through impacts between the feet and the ground or the ball, which causes the gears to break. To mitigate this, the policies are regularized via a penalty term that minimizes the time integral of torque peaks. In addition, a reward term is added for keeping an upright pose within a threshold of 11.5°, which prevents the agent from leaning forward when walking, a habit that can cause the robot to lose balance and fall forward when transferred to a real robot. Incorporating these two reward components leads to robust sim-to-real policies that rarely break knee gears and perform almost as well at scoring goals and defending against the opponent.
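A minimal sketch of these two shaping terms follows. The torque threshold and the per-step penalty form are assumptions for illustration; only the 11.5° threshold comes from the paper:

```python
# Sketch of the two shaping reward terms described above.
UPRIGHT_THRESHOLD_DEG = 11.5

def torque_peak_penalty(torques: list, threshold: float = 5.0) -> float:
    # Penalize torque magnitudes above a threshold to reduce knee-gear
    # stress; summed over steps, this approximates the time integral of
    # torque peaks.
    return -sum(max(0.0, abs(t) - threshold) for t in torques)

def upright_reward(lean_deg: float) -> float:
    # Reward staying within 11.5 degrees of upright, so the policy does
    # not learn a forward-leaning gait that topples the real robot.
    return 1.0 if abs(lean_deg) <= UPRIGHT_THRESHOLD_DEG else 0.0
```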

Self-play policy:

Used to compose the individual skills end-to-end in a self-play setting. The agents played against each other in a simplified one-versus-one (1v1) soccer game, where they learned to anticipate ball movements and to block opponent shots. The agents were rewarded for scoring goals and penalized for conceding goals. The setup for this stage is similar to the first stage, where the agent was trained by a teacher policy. However, in this stage the episodes only end when either the agent or the opponent scores a goal. If the agent is on the ground, out of bounds, or in the goal penalty area, it receives a fixed penalty per timestep, and all positive reward components are ignored. For example, if the agent is on the ground when a goal is scored, it receives a zero for the scoring reward component. At the beginning of each episode, the agent is initialized in one of three positions with equal probability: lying on the ground on its front, lying on its back, or in a default standing pose. This stage resulted in a single agent that can perform a range of soccer skills and can compete against stronger opponents.
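The reward masking described here could be sketched as follows; the penalty value and component names are assumptions, not the paper's actual figures:

```python
# Sketch of the self-play reward logic: a fixed per-timestep penalty,
# with all positive reward components masked, whenever the agent is on
# the ground, out of bounds, or in the goal penalty area.
INVALID_STATE_PENALTY = -1.0

def self_play_reward(positive_components: dict, invalid_state: bool) -> float:
    if invalid_state:
        # All positive components (e.g. the scoring reward) are ignored.
        return INVALID_STATE_PENALTY
    return sum(positive_components.values())
```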

So, in a nutshell, the main difference between the two policies is that the teacher policy is a predefined policy that guides the learning process of the agent, while the self-play policy is learned through interactions with other agents. The teacher policy is used in the first stage of training to teach the agent basic skills such as walking, turning, and kicking. In contrast, the self-play policy is learned in the second stage of training, where the agent competes against increasingly stronger opponents and learns a range of soccer skills such as scoring and defending. The self-play policy emerges from the interactions between the agent and the opponents and is optimized to maximize the agent’s chances of winning the game.

Figure 2: Teacher and Self-Play stages [2]

Reinforcement learning with pure self-play can lead to unstable or cyclic behavior, and overfitting can make the resulting policy exploitable. The researchers therefore trained the robot against a mixture of previous opponents to achieve stability and robustness. They trained against this mixture in one continuous training epoch, which improves efficiency but carries an increased risk of getting stuck in a local optimum.

The researchers built on established methods for multi-agent training, such as Fictitious Play, which converges to a Nash equilibrium for two-player, zero-sum games and has been generalized to extensive-form games (games in game theory that model strategic interactions between players over time, such as chess or poker) using reinforcement learning. Alternatively, Vinyals et al. achieved stability and robustness by playing against a league of opponents. The approach in this paper is an efficient implementation of these ideas, since the agent is trained against a mixture of previous opponents in one continuous training epoch rather than over successive generations.

Self-play was also used, which provides a natural auto-curriculum in multi-agent RL. This can be important since finding the best response to a set of strong opponents from scratch can be challenging for RL in some domains. For example, in soccer, strong opponents could dominate play, and a learner might get very little experience of ball interaction. The researchers’ method effectively features a similar auto-curriculum property, since they trained in one continuous epoch in which the opponents are initially weak and subsequently automatically calibrated to the strength of the current agent as earlier agent checkpoints are successively added to the opponent pool.
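This growing opponent pool can be sketched as follows; representing checkpoints as plain ids and sampling them uniformly are illustrative simplifications:

```python
import random

# Sketch of the auto-curriculum opponent pool: earlier policy checkpoints
# are added as training proceeds, and each episode's opponent is sampled
# from this growing mixture. Early in training the pool contains only
# weak policies, so the curriculum starts easy and hardens over time.
class OpponentPool:
    def __init__(self) -> None:
        self.checkpoints: list = []

    def add_checkpoint(self, policy_id: str) -> None:
        self.checkpoints.append(policy_id)

    def sample_opponent(self, rng: random.Random) -> str:
        # Uniform sampling over all past checkpoints.
        return rng.choice(self.checkpoints)
```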

Further techniques used during training to improve robustness are explained in the following section.

Simulation to real world transfer

The agents were transferred from simulation to real-world robots zero-shot (i.e., the trained policies were deployed on the real robots without any fine-tuning or modification). You might now be wondering how they succeeded: by combining several deep RL techniques, such as:

  • System identification: identifying the parameters of the robot’s dynamics model and tuning them to match the real robot’s behavior as closely as possible.
  • Domain randomization: randomly varying the physical properties of the
    simulated environment, such as the friction coefficients and mass of the
    objects, to expose the agent to a wide range of conditions. This helped
    the agent learn to adapt to different environments and generalize
    better to the real world.
  • Perturbations during training: adding random disturbances to the robot’s
    joints during training to simulate the effects of external forces and
    perturbations that the robot might experience.
  • Shaping reward terms: adding penalty terms to the reward function to
    discourage the agent from taking actions that could damage the robot,
    such as falling or colliding with obstacles.
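As an illustration of the second and third items, a randomization step applied before each training episode might look like this; the parameter names, ranges, and force limits are invented for the sketch, not taken from the paper:

```python
import random

# Sketch of domain randomization: before each training episode, physical
# parameters of the simulator are resampled so the policy cannot overfit
# to one exact set of dynamics.
def randomize_dynamics(rng: random.Random) -> dict:
    return {
        "floor_friction": rng.uniform(0.5, 1.2),
        "ball_mass_kg": rng.uniform(0.04, 0.06),
        "motor_gain_scale": rng.uniform(0.9, 1.1),
        "control_delay_ms": rng.uniform(0.0, 20.0),
    }

def random_push(rng: random.Random, max_force_n: float = 5.0) -> tuple:
    # A random horizontal perturbation applied to the torso during
    # training, returned as (magnitude in newtons, direction in radians).
    return (rng.uniform(0.0, max_force_n), rng.uniform(0.0, 6.28318))
```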


The agent demonstrated object interaction skills such as ball control and shooting, kicking a moving ball, and blocking shots. Additionally, the agent displayed strategic behaviors such as defending by consistently placing itself between the attacking opponent and its own goal and protecting the ball with its body. During play, the agents transitioned between all these skills in a fluid way.

The success of the agents can be summarized by the fact that they were able to perform well in the real world, walking 156% faster, taking 63% less time to get up, and kicking 24% faster than a scripted baseline (pre-programmed behaviors that are typically hand-crafted by human experts and do not involve any learning or adaptation), while still achieving the longer-term objectives efficiently.

The authors also conducted a quantitative analysis of the agents’ performance, comparing it to a scripted baseline and a human player. The results showed that the agents outperformed the scripted baseline in terms of speed and agility, and were able to compete with the human player in terms of scoring goals and blocking shots.

Where to go next

Based on the results of this research, the approach used in this study can be applied to other domains and tasks beyond soccer, such as locomotion, manipulation, and navigation. The authors also suggest that it can be extended to more complex multi-agent scenarios, such as team sports or collaborative tasks.

The paper is important for the future of robotics because it demonstrates the potential of deep RL to synthesize sophisticated and safe movement skills for robots, helping them learn complex behaviors and strategies in dynamic environments. This can be useful in various applications, such as manufacturing, healthcare, and search-and-rescue operations. It also highlights the importance of combining simulation and real-world training to achieve good-quality transfer, which can help reduce the cost and time required for robot development.

Gaspar Garcia

I am an Artificial Intelligence bachelor student at Johannes Kepler University in Linz, Austria. I’m originally from Mexico. My passion for technology started when I was a kid and was fascinated by robotics and automation. During high school, which I completed with a focus on mechatronics, I learned the basics of mechanics, electricity, and electronics. Now I’m pursuing my AI degree with the goal of working on projects, participating in AI events, and applying my skills in the robotics industry.


[1] DreamStudio, (accessed Jun. 9, 2023).
[2] T. Haarnoja et al., “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,”, Apr. 23, 2023, (accessed Jun. 9, 2023).
[3] “Do you love me?,” YouTube, (accessed Jun. 9, 2023).
[4] P. Kormushev, S. Calinon, and D. G. Caldwell, “Reinforcement learning in robotics: Applications and real-world challenges,” Robotics, (accessed Jun. 9, 2023).
[5] eeNews Europe, “Soft robotics proprioception: Let the machine sort it out,” eeNews Europe, (accessed Jun. 9, 2023).
[6] G. De Luca, “What is a policy in reinforcement learning?,” Baeldung on Computer Science, (accessed Jun. 9, 2023).