Training intelligent tennis adversaries using self-play with ML-Agents
Author: Bakhtiyar Ospanov
Thesis Supervisor: M. Fatih Demirci

Outline:
- Introduction
- Related Work - Machine Learning & Unity
- Methodology: environment design, training configuration
- Experiments and Results
- Conclusion

Introduction
Problem definition: training intelligent agents in a game of tennis:
- Environment design in Unity
- Reinforcement learning for training
Setup: a two-player game in which agents control rackets to hit a ball over the net.
Goal: each agent must hit the ball so that the opponent cannot make a valid return.

Introduction
Motivation:
- Machine learning is disrupting industries
- Game engines can serve as simulation environments and ML research platforms
- Explicitly programmed algorithms stall the industry
- Playing against an intelligent, creative, and context-aware opponent is more engaging

Related Work
Machine learning: supervised learning, unsupervised learning, and reinforcement learning
Deep reinforcement learning: an agent with dynamic behavior
Keywords: agent, environment, observation, action space, reward, policy
On-policy vs. off-policy learning
Similar works:
- Learning strategies in table tennis using inverse reinforcement learning [1]
- Robotic table tennis with model-free reinforcement learning [2] - uses PPO but reaches at most a 34% successful return rate
- RL in simulation: solving a Rubik's Cube with a robot hand [3], outperforming humans in Doom [4], and more

Methodology
Physical environment:
- Collision tracking
- Physic materials: bounciness, static & dynamic friction
- Mass, angular drag, velocity
- Default gravity; time scale x10 (for training); physics updates every 0.02 s

Figure 1. Environment setup: (a) Tennis Area prefab: 1 - agents, 2 - ball, 3 - net, 4 - court area, 5 - borders; (b) Training setup: 18 instances of the tennis area
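As an illustration of the physics settings above (default gravity, 0.02 s fixed updates), the fixed-timestep simulation can be sketched as a minimal landing-point estimate. This is a hedged Python sketch, not the thesis code: the function name `estimate_landing` and the semi-implicit Euler stepping are assumptions chosen to mirror Unity's defaults.

```python
# Illustrative sketch (not the thesis code): estimate where a ball lands
# under fixed-timestep physics. Gravity (-9.81 m/s^2) and the 0.02 s step
# match the Unity defaults cited above; all names here are hypothetical.

GRAVITY = -9.81     # default Unity gravity, m/s^2
FIXED_DT = 0.02     # physics update interval, s

def estimate_landing(pos, vel, dt=FIXED_DT):
    """Step a ball forward until it reaches court level (y = 0)."""
    x, y, z = pos
    vx, vy, vz = vel
    while y > 0.0:
        vy += GRAVITY * dt                       # velocity first (semi-implicit Euler)
        x, y, z = x + vx * dt, y + vy * dt, z + vz * dt
    return (x, z)                                # landing point on the court plane
```

For example, a ball dropped from 1 m with horizontal velocity (2, 0, 3) m/s lands with its x and z displacements in an exact 2:3 ratio, since both grow linearly with the number of fixed steps.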
Methodology
Learning environment:
- The Academy class communicates with the Python API
- The simulation consists of episodes:
  - OnEpisodeBegin(): resetting and randomization
  - CollectObservations(): feature vector
  - OnActionReceived(): decision-making
  - OnEpisodeEnd()

Figure 2. Training setup

Methodology
Training configuration:
Observation vector:
- position of an agent
- velocity of an agent
- rotation of an agent
- position of a ball
- velocity of a ball
- estimatedSectorId
Network settings:
- hidden_units: 256
- num_layers: 2
Hyperparameters (27 in total):
- batch_size = 2048
- buffer_size = 20480
- learning_rate = 0.0002 (constant)
- beta = 0.003
- num_epoch = 4
Self-play setup:
- team_change: 100000
- window: 10
- play_against_latest_model_ratio: 0.5

Methodology
Curriculum learning:
- A large search space leads to a sparse reward signal
- Progressive increase of task difficulty
The 'estimatedSectorId':
- Landing sector prediction within a 9x9 grid

Figure 3. The proposed measures to decrease the search space: (a) Limiting walls along court sidelines; (b) Landing sector prediction

Experiments and Results
The statistics during training were recorded by the ML-Agents Toolkit and tracked via TensorBoard.

Figure 4. Tracking of environment metrics: (a) Sample cumulative reward over a test run; (b) Episode lengths for the various configuration runs

Experiments and Results

Figure 5. Tracking of environment metrics: (a) Entropy values for the various configuration runs; (b) Value losses for the various configuration runs; (c) Elo ratings for the various configuration runs; (d) Elo rating for the Curriculum Learning run

Experiments and Results
A survey: 13 participants
- Assess the ability to distinguish between expert-user and ML-agent behavior
- Evaluate behavior performance
Questions: Demo Video #1, Video #2

Experiments and Results

Figure 6. Survey results: (a) Video clip differentiation results; (b) Behavior resemblance evaluation; (c) Expert user performance evaluation; (d) ML agent performance evaluation
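The 'estimatedSectorId' observation above encodes the ball's predicted landing sector on a 9x9 grid. A minimal sketch of such a mapping, assuming standard tennis-court dimensions and corner-origin indexing (both assumptions for illustration; this is not the thesis code):

```python
# Illustrative sketch (not the thesis code): map a predicted landing point
# to a sector id on a 9x9 grid, in the spirit of the 'estimatedSectorId'
# observation. Court dimensions and indexing order are assumptions.

GRID = 9
COURT_LENGTH = 23.77   # standard tennis court length, m (assumed scale)
COURT_WIDTH = 10.97    # standard doubles court width, m (assumed scale)

def sector_id(x, z):
    """Return a sector index in [0, 80] for a point on the court.

    x runs along the court width, z along its length; the origin is
    taken at one corner of the court (an assumption for this sketch).
    """
    col = min(int(x / COURT_WIDTH * GRID), GRID - 1)
    row = min(int(z / COURT_LENGTH * GRID), GRID - 1)
    return row * GRID + col
```

A single integer per landing prediction keeps the observation vector compact while still telling the agent roughly where the ball will come down.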
Demo
The agent is able to move along all three axes and rotate around the z-axis.

Live Demo
Test application:
- The game is built for the WebGL platform
- The neural network is processed by the Unity Inference Engine
- A 5-set match in a game of tennis
- Available at https://bakhtiyar-ospanov.github.io/MLAT/index.html

Conclusion
The well-established environment facilitated the collection of accurate and versatile observations during training.
Two satisfactory models were produced as the output of this study:
- Production-ready but limited in movement
- Unrestricted in movement but unstable
Contribution:
- A physically and visually rich environment
- A training setup
- A trained model

Conclusion
Limitations:
- A small number of questionnaire respondents
- Unstable performance in the fully unlocked mode due to:
  - complex interactions
  - a large search space
Future research directions:
- Expand the pool of survey participants
- Introduce imitation learning: Generative Adversarial Imitation Learning, Behavior Cloning
- Add intrinsic rewards: curiosity reward, Random Network Distillation

Thank you!
Contact info: bakhtiyar.ospanov@nu.edu.kz, www.nu.edu.kz

Reference list
[1] Muelling, K., Boularias, A., Mohler, B., Schölkopf, B., & Peters, J. (2014). Learning strategies in table tennis using inverse reinforcement learning. Biological Cybernetics, 108(5), 603-619.
[2] Gao, W., Graesser, L., Choromanski, K., Song, X., Lazic, N., Sanketi, P., ... & Jaitly, N. (2020). Robotic table tennis with model-free reinforcement learning. arXiv preprint arXiv:2003.14398.
[3] Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al. (2019). Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113.
[4] Lample, G., & Chaplot, D. S. Playing FPS games with deep reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.