Training intelligent tennis adversaries using self-play with ML-Agents
Author: Bakhtiyar Ospanov
Thesis Supervisor: M. Fatih Demirci

Outline:
- Introduction
- Related Work - Machine Learning & Unity
- Methodology: environment design, training configuration
- Experiments and Results
- Conclusion

Introduction
Problem definition: training intelligent agents in a game of tennis:
- Environment design in Unity
- Reinforcement learning for training
Setup: a two-player game in which agents control rackets to hit a ball over the net.
Goal: each agent must hit the ball so that the opponent cannot make a valid return.

Introduction
Motivation:
- Machine learning is disrupting industries
- Game engines can serve as simulation environments and ML research platforms
- Explicitly programmed algorithms stall the industry
- Playing against an intelligent, creative, and context-aware opponent is more engaging

Related Work
Machine learning: supervised learning, unsupervised learning, and reinforcement learning
Deep reinforcement learning: an agent with dynamic behavior
Keywords: agent, environment, observation, action space, reward, policy
On-policy vs. off-policy learning
Similar works:
- Learning strategies in table tennis using inverse reinforcement learning [1]
- Robotic table tennis with model-free reinforcement learning [2] - uses PPO but reaches at most a 34% successful return rate
- RL in simulation: solving a Rubik's Cube with a robot hand [3], outperforming humans in Doom [4], and more

Methodology
Physical environment:
- Collision tracking
- Physic materials: bounciness, static & dynamic friction
- Mass, angular drag, velocity
- Default gravity; time scale x10 (for training); physics updates every 0.02 s

Figure 1. Environment setup: (a) Tennis Area prefab: 1 - agents, 2 - ball, 3 - net, 4 - court area, 5 - borders; (b) Training setup: 18 instances of the tennis area
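As an illustration of the physics settings above (default gravity, 0.02 s fixed updates), the fixed-timestep simulation can be sketched as a minimal landing-point estimate. This is a hedged Python sketch, not the thesis code: the function name `estimate_landing` and the semi-implicit Euler stepping are assumptions chosen to mirror Unity's defaults.

```python
# Illustrative sketch (not the thesis code): estimate where a ball lands
# under fixed-timestep physics. Gravity (-9.81 m/s^2) and the 0.02 s step
# match the Unity defaults cited above; all names here are hypothetical.

GRAVITY = -9.81     # default Unity gravity, m/s^2
FIXED_DT = 0.02     # physics update interval, s

def estimate_landing(pos, vel, dt=FIXED_DT):
    """Step a ball forward until it reaches court level (y = 0)."""
    x, y, z = pos
    vx, vy, vz = vel
    while y > 0.0:
        vy += GRAVITY * dt                       # velocity first (semi-implicit Euler)
        x, y, z = x + vx * dt, y + vy * dt, z + vz * dt
    return (x, z)                                # landing point on the court plane
```

For example, a ball dropped from 1 m with horizontal velocity (2, 0, 3) m/s lands with its x and z displacements in an exact 2:3 ratio, since both grow linearly with the number of fixed steps.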
Methodology
Learning environment:
- The Academy class communicates with the Python API
- The simulation consists of episodes:
  - OnEpisodeBegin(): resetting and randomization
  - CollectObservations(): feature vector
  - OnActionReceived(): decision-making
  - OnEpisodeEnd()

Figure 2. Training setup

Methodology
Training configuration:
Observation vector:
- position of an agent
- velocity of an agent
- rotation of an agent
- position of a ball
- velocity of a ball
- estimatedSectorId
Network settings:
- hidden_units: 256
- num_layers: 2
Hyperparameters (27 in total):
- batch_size = 2048
- buffer_size = 20480
- learning_rate = 0.0002 (constant)
- beta = 0.003
- num_epoch = 4
Self-play setup:
- team_change: 100000
- window: 10
- play_against_latest_model_ratio: 0.5

Methodology
Curriculum learning:
- A large search space leads to a sparse reward signal
- Progressive increase of task difficulty
The 'estimatedSectorId':
- Landing sector prediction within a 9x9 grid

Figure 3. The proposed measures to decrease the search space: (a) Limiting walls along court sidelines; (b) Landing sector prediction

Experiments and Results
The statistics during training were recorded by the ML-Agents Toolkit and tracked via TensorBoard.

Figure 4. Tracking of environment metrics: (a) Sample cumulative reward over a test run; (b) Episode lengths for the various configuration runs

Experiments and Results

Figure 5. Tracking of environment metrics: (a) Entropy values for the various configuration runs; (b) Value losses for the various configuration runs; (c) Elo ratings for the various configuration runs; (d) Elo rating for the Curriculum Learning run

Experiments and Results
A survey: 13 participants
- Assess the ability to distinguish between expert-user and ML-agent behavior
- Evaluate behavior performance
Questions: Demo Video #1, Video #2

Experiments and Results

Figure 6. Survey results: (a) Video clip differentiation results; (b) Behavior resemblance evaluation; (c) Expert user performance evaluation; (d) ML agent performance evaluation
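The 'estimatedSectorId' observation above encodes the ball's predicted landing sector on a 9x9 grid. A minimal sketch of such a mapping, assuming standard tennis-court dimensions and corner-origin indexing (both assumptions for illustration; this is not the thesis code):

```python
# Illustrative sketch (not the thesis code): map a predicted landing point
# to a sector id on a 9x9 grid, in the spirit of the 'estimatedSectorId'
# observation. Court dimensions and indexing order are assumptions.

GRID = 9
COURT_LENGTH = 23.77   # standard tennis court length, m (assumed scale)
COURT_WIDTH = 10.97    # standard doubles court width, m (assumed scale)

def sector_id(x, z):
    """Return a sector index in [0, 80] for a point on the court.

    x runs along the court width, z along its length; the origin is
    taken at one corner of the court (an assumption for this sketch).
    """
    col = min(int(x / COURT_WIDTH * GRID), GRID - 1)
    row = min(int(z / COURT_LENGTH * GRID), GRID - 1)
    return row * GRID + col
```

A single integer per landing prediction keeps the observation vector compact while still telling the agent roughly where the ball will come down.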
Demo
The agent is able to move along all three axes and rotate around the z-axis.

Live Demo
Test application:
- The game is built for the WebGL platform
- The neural network is processed by the Unity Inference Engine
- A 5-set match in a game of tennis
- Available at https://bakhtiyar-ospanov.github.io/MLAT/index.html

Conclusion
The well-established environment facilitated the collection of accurate and versatile observations during training.
Two satisfactory models were produced as the output of this study:
- Production-ready but limited in movement
- Unrestricted in movement but unstable
Contribution:
- A physically and visually rich environment
- A training setup
- A trained model

Conclusion
Limitations:
- A small number of questionnaire respondents
- Unstable performance in the fully unlocked mode due to:
  - complex interactions
  - a large search space
Future research directions:
- Expand the pool of survey participants
- Introduce imitation learning: Generative Adversarial Imitation Learning, Behavior Cloning
- Add intrinsic rewards: curiosity reward, Random Network Distillation

Thank you!
Contact info: bakhtiyar.ospanov@nu.edu.kz, www.nu.edu.kz

Reference list
[1] Muelling, K., Boularias, A., Mohler, B., Schölkopf, B., & Peters, J. (2014). Learning strategies in table tennis using inverse reinforcement learning. Biological Cybernetics, 108(5), 603-619.
[2] Gao, W., Graesser, L., Choromanski, K., Song, X., Lazic, N., Sanketi, P., ... & Jaitly, N. (2020). Robotic table tennis with model-free reinforcement learning. arXiv preprint arXiv:2003.14398.
[3] Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al. (2019). Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113.
[4] Lample, G., & Chaplot, D. S. Playing FPS games with deep reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.