View the example code

For my next project I want to introduce a new challenge for my reinforcement learning agents. Being the only agent in a fully predictable universe may be a bit boring. It's way more exciting if there are other agents who can aid or obstruct your moves.

To do this I need a simple framework for agent/environment interactions when multiple agents share one environment.

The framework has the following requirements:

  • Multiple agents can interact in the same environment
  • Agents can take multiple turns / steps in a row
  • Every agent gets their own observations and rewards
  • Bonus: human players vs. machine learning agents

Once I have a good framework, it will be much easier to train agents to play interesting multiplayer games.

Single-agent environment

In my LunarLander post I used OpenAI's definition of an environment. This environment is built around a single agent getting observations, performing actions and getting rewards; the classic Agent-Environment loop.

The classic Agent-Environment loop

In code their environment definition looks like this:

class Env:

    def __init__(self):
        """
        Initialize the environment
        """

    def step(self, action):
        """
        Perform the given action in the environment and return a tuple
        containing
        * observation: what does the agent observe? For example,
          which cards are currently visible to the agent.
        * reward: amount of reward received after performing the previous action.
          The goal is always to increase the total reward received.
        * done: true if the game is finished, false otherwise.
        * info: debugging information.
        """

    def reset(self):
        """
        Resets the environment so the agent can start a new episode.
        This returns an observation the agent can use to select its first move.
        """

    def render(self, mode='human'):
        """
        Render the environment, so the programmer can see what the agent is doing.
        """

The agent determines everything. The environment is like an object the agent can interact with. A game starts when the environment is reset. This returns an observation the agent can use to decide what its first move is going to be. The agent selects an action and sends it to the environment by calling step(action). The environment responds by returning a new observation and a reward.
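
As a rough sketch, a single episode then looks like this. The agent object and its choose_action method are illustrative assumptions, not part of OpenAI's definition:

# Minimal sketch of the single-agent loop; `agent` and its
# choose_action method are assumptions for illustration.
observation = env.reset()
done = False
while not done:
    action = agent.choose_action(observation)
    observation, reward, done, info = env.step(action)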

So why can't we just use this environment definition for multi-agent games?

In theory it would be possible to create multi-agent games using this definition, but it would be tricky. You could have multiple agents interacting with one environment, each thinking it is the only one interacting with it. But then the environment needs to know which agent called the step or reset function, so it can respond to that particular agent. Also, the response to the step function could be delayed because another agent is still choosing its move.
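
A hypothetical sketch of what such a workaround could look like. The agent_id parameter and all of the bookkeeping are my own assumptions, not part of OpenAI's definition:

import threading

class SharedEnv:

    def __init__(self):
        self.turn = 0
        self.condition = threading.Condition()

    def step(self, agent_id, action):
        # Every call has to identify the calling agent, and may block
        # until the environment decides it is this agent's turn.
        with self.condition:
            self.condition.wait_for(lambda: self.turn == agent_id)
            # ... apply the action and compute this agent's observation,
            # reward and done flag here ...
            self.turn = 1 - self.turn
            self.condition.notify_all()
        return None, 0, False, {}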

To solve these issues I am going to define a framework more suited to multi-agent games.

Multi-agent environment

For the multi-agent environment I am going to take away a bit of the power of the agents. They don't get to decide everything anymore, because they have to share the environment with others. Instead the environment itself is going to lead the game.

Multi-agent dynamics

The environment definition is going to be pretty simple. The play method starts a game and runs it until it is finished. Inside the play method, the environment calls each participating agent to ask which move it wants to make. The environment decides whose turn it is, and it is possible that an agent gets multiple successive turns. When the game is finished the environment informs the agents and gives them the final rewards.

Agents that are passed to the environment can be of any kind, even human players. The only requirement is that they can respond to action or done requests.

class Env:

    def __init__(self):
        """
        Initialize the environment
        """

    def play(self, agents, options):
        """
        Play a single game with a number of agents.

        When an agent has to decide on an action, the environment calls the
        agent.action(observation, allowed_actions, previous_reward) method.

        When the game is over for an agent the environment calls
        agent.done(reward).

        Options contains game-specific settings.
        """

The agent definition is going to look as follows. The action function is a request for an action in a specific situation. The done function tells the agent that the game is finished.

When an environment asks an agent what it wants to do by calling the action method, the agent gets an observation, a list of allowed actions and a previous reward. The observation can be specific to this agent. For example, in a card game each agent can see only its own hand, not the hand of the other player. Allowed actions is a list of actions the agent can choose from. Previous reward tells the agent whether it got a reward or a penalty for its previous action.

The done function tells the agent the game is finished. A game can be finished for just one of the agents while the others keep playing. The environment tells the agent what the final reward for its last action was. The agent can use this to wrap up after the game.

class Agent:

    def __init__(self):
        """
        Initialize the agent
        """

    def action(self, observation, allowed_actions, previous_reward):
        """
        This method is called when the agent has to decide on an action.
        The chosen action is returned.
        """

    def done(self, previous_reward):
        """
        This method is called when a game is over for this agent. It
        includes the final reward for this agent.
        """

Simple example: rock, paper, scissors

To test the framework, let's create a very simple game implementation for "rock, paper, scissors".

First let's implement the environment. Both players move simultaneously and have 3 possible moves: rock, paper or scissors. Each agent responds with the option it chooses.

class RockPaperScissorsEnv(Env):
    allowed_actions = [1, 2, 3]

    action_map = {
        1: "Rock",
        2: "Paper",
        3: "Scissors"
    }

    def __init__(self):
        super().__init__()

    def play(self, agents, options=None):
        winner = None
        # Keep playing rounds until one of the agents wins; a draw
        # (winner is None) means both agents are asked again.
        while winner is None:
            action_0 = agents[0].action([], self.allowed_actions, 0)
            print(f"agent 0 picked action {self.action_map[action_0]}")
            action_1 = agents[1].action([], self.allowed_actions, 0)
            print(f"agent 1 picked action {self.action_map[action_1]}")
            winner = RockPaperScissorsEnv.determine_winner(action_0, action_1)
            if winner == 0:
                print("agent 0 won the game!")
                agents[0].done(1)
                agents[1].done(-1)
            elif winner == 1:
                print("agent 1 won the game!")
                agents[0].done(-1)
                agents[1].done(1)

    @staticmethod
    def determine_winner(action_0, action_1):
        # Return 0 if agent 0 wins, 1 if agent 1 wins, or None on a draw.
        if action_0 == action_1:
            return None
        if action_0 == 1:
            if action_1 == 2:
                return 1
            elif action_1 == 3:
                return 0
        elif action_0 == 2:
            if action_1 == 1:
                return 0
            elif action_1 == 3:
                return 1
        elif action_0 == 3:
            if action_1 == 1:
                return 1
            elif action_1 == 2:
                return 0
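
As a side note: because the three actions beat each other in a cycle, the same logic could also be written more compactly with modular arithmetic. A sketch of an equivalent drop-in replacement for the static method:

    @staticmethod
    def determine_winner(action_0, action_1):
        # (action_1 - action_0) % 3 is 0 on a draw, 1 when agent 1's
        # move beats agent 0's, and 2 when agent 0's move wins.
        diff = (action_1 - action_0) % 3
        if diff == 0:
            return None
        return 1 if diff == 1 else 0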

A very simple agent could just pick a random move.

import random

class RockPaperScissorsAgent(Agent):

    def __init__(self):
        super().__init__()

    def action(self, observation, allowed_actions, previous_reward):
        return random.choice(allowed_actions)

    def done(self, previous_reward):
        pass
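
Because agents only need to answer action and done calls, a human player (the bonus requirement) fits the same interface. A minimal sketch that reads a move from standard input; the class name and prompt are my own:

class HumanRockPaperScissorsAgent(Agent):

    def action(self, observation, allowed_actions, previous_reward):
        # Keep prompting until the human enters one of the allowed actions.
        while True:
            try:
                choice = int(input("Pick a move (1=Rock, 2=Paper, 3=Scissors): "))
            except ValueError:
                continue
            if choice in allowed_actions:
                return choice

    def done(self, previous_reward):
        print(f"Game over, your final reward was {previous_reward}")

Such an agent can be dropped into the same play call in place of one of the random agents.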

A game is then played as follows:

agent0 = RockPaperScissorsAgent()
agent1 = RockPaperScissorsAgent()
environment = RockPaperScissorsEnv()
environment.play([agent0, agent1])

Here is an example game. The environment asks both agents to choose an action, and both choose rock. Because this results in a draw, the environment asks both agents again. Now agent 0 chooses rock again, but agent 1 chooses paper. Agent 1 wins the game and receives a reward of 1; agent 0 loses and gets a penalty of -1.

agent 0 picked action Rock
agent 1 picked action Rock
agent 0 picked action Rock
agent 1 picked action Paper
agent 1 won the game!

What is next?

Now that I have a simple framework for creating multi-agent games, the next step is to implement a more interesting game and use a reinforcement learning algorithm to train multiple agents to play it.