🔟LangChain: Multi-Agent Simulated Environment with PettingZoo

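One way to define a multi-agent simulation is to build the agent-environment loop around an externally defined environment. This section implements that kind of interaction loop with several agents at once. We use the PettingZoo library, which provides ready-made multi-agent environments.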

About Petting Zoo

https://pettingzoo.farama.org

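PettingZoo is a simple, Pythonic interface for representing general multi-agent reinforcement learning (MARL) problems. It includes a wide variety of reference environments, helpful utilities, and tools for creating custom environments. You interact with its environments through an interface similar to Gymnasium's.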

About Gymnasium

https://gymnasium.farama.org

Gymnasium is a maintained fork of OpenAI's Gym library. Its interface is simple and Pythonic, it can represent general reinforcement learning problems, and it ships a compatibility wrapper for older Gym environments.

Multi-Agent Simulated Environment: Petting Zoo

Setup Environments

import os
from dotenv import load_dotenv

# Load the .env file first so that OPENAI_API_KEY is available to os.getenv
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

#%pip install pettingzoo pygame rlcard

import collections
import inspect

import tenacity
from langchain.output_parsers import RegexParser
from langchain.schema import (
    HumanMessage,
    SystemMessage,
)
from langchain_openai import ChatOpenAI

`GymnasiumAgent`

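Here we reproduce the same GymnasiumAgent that was defined in the Gymnasium example. If the model still has not produced a valid action after the allowed number of retries, the agent takes a random action instead.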

class GymnasiumAgent:
    @classmethod
    def get_docs(cls, env):
        return env.unwrapped.__doc__

    def __init__(self, model, env):
        self.model = model
        self.env = env
        self.docs = self.get_docs(env)

        self.instructions = """
Your goal is to maximize your return, i.e. the sum of the rewards you receive.
I will give you an observation, reward, termination flag, truncation flag, and the return so far, formatted as:

Observation: <observation>
Reward: <reward>
Termination: <termination>
Truncation: <truncation>
Return: <sum_of_rewards>

You will respond with an action, formatted as:

Action: <action>

where you replace <action> with your actual action.
Do nothing else but return the action.
"""
        self.action_parser = RegexParser(
            regex=r"Action: (.*)", output_keys=["action"], default_output_key="action"
        )

        self.message_history = []
        self.ret = 0

    def random_action(self):
        action = self.env.action_space.sample()
        return action

    def reset(self):
        self.message_history = [
            SystemMessage(content=self.docs),
            SystemMessage(content=self.instructions),
        ]

    def observe(self, obs, rew=0, term=False, trunc=False, info=None):
        self.ret += rew

        obs_message = f"""
Observation: {obs}
Reward: {rew}
Termination: {term}
Truncation: {trunc}
Return: {self.ret}
        """
        self.message_history.append(HumanMessage(content=obs_message))
        return obs_message

    def _act(self):
        act_message = self.model.invoke(self.message_history)
        self.message_history.append(act_message)
        action = int(self.action_parser.parse(act_message.content)["action"])
        return action

    def act(self):
        try:
            for attempt in tenacity.Retrying(
                stop=tenacity.stop_after_attempt(2),
                wait=tenacity.wait_none(),  # No waiting time between retries
                retry=tenacity.retry_if_exception_type(ValueError),
                before_sleep=lambda retry_state: print(
                    f"ValueError occurred: {retry_state.outcome.exception()}, retrying..."
                ),
            ):
                with attempt:
                    action = self._act()
        except tenacity.RetryError:
            action = self.random_action()
        return action
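
The retry-then-fallback behavior of act can be sketched without an LLM or tenacity. This is a minimal stdlib-only illustration, not the class above: parse_action and act_with_fallback are hypothetical helpers, and the list of canned replies stands in for successive model responses.

```python
import random
import re

# Same pattern the agent passes to RegexParser
ACTION_PATTERN = re.compile(r"Action: (.*)")


def parse_action(reply: str) -> int:
    """Return the integer following 'Action: ', or raise ValueError."""
    match = ACTION_PATTERN.search(reply)
    if match is None:
        raise ValueError(f"no action found in {reply!r}")
    # int() raises ValueError for non-numeric text such as "rock"
    return int(match.group(1))


def act_with_fallback(replies, n_actions, max_attempts=2, rng=random):
    """Parse up to max_attempts replies, then fall back to a random action."""
    for _, reply in zip(range(max_attempts), replies):
        try:
            return parse_action(reply)
        except ValueError as err:
            print(f"ValueError occurred: {err}, retrying...")
    # Every attempt failed: sample uniformly from the discrete action set
    return rng.randrange(n_actions)


# The first canned reply is unparseable, the second one is valid
print(act_with_fallback(["I choose rock.", "Action: 1"], n_actions=3))
```

Raising ValueError for both a missing match and a non-numeric action keeps a single exception type for the retry logic to watch, which mirrors the retry_if_exception_type(ValueError) setting in the class.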

Main loop

def main(agents, env):
    env.reset()

    for name, agent in agents.items():
        agent.reset()

    for agent_name in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        obs_message = agents[agent_name].observe(
            observation, reward, termination, truncation, info
        )
        print(obs_message)
        if termination or truncation:
            action = None
        else:
            action = agents[agent_name].act()
        print(f"Action: {action}")
        env.step(action)
    env.close()
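
To see the control flow of this loop in isolation, here is a small sketch with a toy turn-based environment. ToyTurnEnv is a hypothetical stand-in that only mimics the shape of the agent_iter / last / step calls used above. It is not PettingZoo's implementation, and its observations (the turn counter) are invented for illustration.

```python
class ToyTurnEnv:
    """Hypothetical turn-based environment with an AEC-like call pattern."""

    def __init__(self, agents, max_turns):
        self.possible_agents = list(agents)
        self.max_turns = max_turns

    def reset(self):
        self.agents = list(self.possible_agents)
        self.turn = 0
        self.log = []

    def agent_iter(self):
        # Keep yielding the agent whose turn it is until none remain
        while self.agents:
            yield self.agents[0]

    def last(self):
        # Observation is the turn counter; the episode truncates at max_turns
        truncated = self.turn >= self.max_turns
        return self.turn, 0, False, truncated, {}

    def step(self, action):
        agent = self.agents.pop(0)
        self.log.append((agent, action))
        if action is not None:
            # A live agent goes to the back of the queue for its next turn
            self.agents.append(agent)
            self.turn += 1
        # Stepping with None leaves the finished agent out of the rotation


def run(env, policy):
    env.reset()
    for name in env.agent_iter():
        obs, reward, term, trunc, info = env.last()
        # A finished agent must still be stepped once, with action None
        action = None if (term or trunc) else policy(name, obs)
        env.step(action)
    return env.log


env = ToyTurnEnv(["player_0", "player_1"], max_turns=2)
for entry in run(env, policy=lambda name, obs: obs):
    print(entry)
```

The final None step for each agent is what ends the iteration, which is why the main loop above passes action = None once termination or truncation is reported.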

`PettingZooAgent`

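PettingZooAgent extends GymnasiumAgent to the multi-agent setting. The main differences are:

PettingZooAgent takes a name argument so that it can be identified among multiple agents.
The get_docs function is implemented differently because the PettingZoo repository is structured differently from the Gymnasium repository.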

class PettingZooAgent(GymnasiumAgent):
    @classmethod
    def get_docs(cls, env):
        return inspect.getmodule(env.unwrapped).__doc__

    def __init__(self, name, model, env):
        super().__init__(model, env)
        self.name = name

    def random_action(self):
        action = self.env.action_space(self.name).sample()
        return action
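
The difference between the two get_docs implementations comes down to reading a class docstring versus a module docstring. The sketch below illustrates that distinction with the standard library's fractions module as a stand-in, since it needs no environment. The choice of fractions.Fraction is purely for illustration.

```python
import fractions
import inspect

value = fractions.Fraction(1, 2)

# What a GymnasiumAgent-style lookup reads: the docstring of the object's class
class_doc = type(value).__doc__

# What a PettingZooAgent-style lookup reads: the docstring of the module
# in which the object's class is defined
module_doc = inspect.getmodule(value).__doc__

print(class_doc.strip().splitlines()[0])
print(module_doc.strip().splitlines()[0])
```

inspect.getmodule resolves the module through the object's __module__ attribute, so it works on instances as well as on classes and functions.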

Rock, Paper, Scissors

We can now use PettingZooAgent to run a multi-agent rock, paper, scissors simulation.

from pettingzoo.classic import rps_v2

env = rps_v2.env(
    max_cycles=3, 
    render_mode="human"
)
agents = {
    name: PettingZooAgent(name=name, model=ChatOpenAI(model='gpt-4o', temperature=1), env=env)
    for name in env.possible_agents
}
main(agents, env)

Observation: 3
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 0

Observation: 3
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 0

Observation: 0
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 1

Observation: 0
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 1

Observation: 1
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 2

Observation: 1
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 2

Observation: 2
Reward: 0
Termination: False
Truncation: True
Return: 0
        
Action: None

Observation: 2
Reward: 0
Termination: False
Truncation: True
Return: 0
        
Action: None
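
In the trace above both agents pick the same move every round, so every reward is 0. The sketch below reproduces that payoff logic under the assumption that rps_v2 encodes moves as 0 = rock, 1 = paper, 2 = scissors, with 3 as the "no move yet" observation seen in the first round. rps_rewards is a hypothetical helper, not part of PettingZoo.

```python
# Assumed rps_v2 encoding; 3 means the opponent has not moved yet
MOVES = {0: "rock", 1: "paper", 2: "scissors"}


def rps_rewards(a, b):
    """Return (reward_a, reward_b) for one round: +1 win, -1 loss, 0 tie."""
    if a == b:
        return 0, 0
    # With this encoding each move beats the one numbered just below it (mod 3)
    return (1, -1) if (a - b) % 3 == 1 else (-1, 1)


# Both agents chose the same move in each round of the trace
for move in (0, 1, 2):
    print(MOVES[move], rps_rewards(move, move))
```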

`ActionMaskAgent`

Some PettingZoo environments provide an action_mask that tells the agent which actions are valid.

ActionMaskAgent subclasses PettingZooAgent and uses the information in action_mask to select actions.

class ActionMaskAgent(PettingZooAgent):
    def __init__(self, name, model, env):
        super().__init__(name, model, env)
        self.obs_buffer = collections.deque(maxlen=1)

    def random_action(self):
        obs = self.obs_buffer[-1]
        action = self.env.action_space(self.name).sample(obs["action_mask"])
        return action

    def reset(self):
        self.message_history = [
            SystemMessage(content=self.docs),
            SystemMessage(content=self.instructions),
        ]

    def observe(self, obs, rew=0, term=False, trunc=False, info=None):
        self.obs_buffer.append(obs)
        return super().observe(obs, rew, term, trunc, info)

    def _act(self):
        valid_action_instruction = "Generate a valid action given by the indices of the `action_mask` that are not 0, according to the action formatting rules."
        self.message_history.append(HumanMessage(content=valid_action_instruction))
        return super()._act()
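
Note that the class above asks the model for a valid action but does not check the parsed action against the mask. A possible extension, sketched here with plain lists and hypothetical helper names (valid_actions, check_action, sample_valid), is to raise ValueError on a masked-out action so the existing retry logic can catch it. This is an assumption about how one might harden the agent, not part of the original code.

```python
import random


def valid_actions(mask):
    """Indices whose mask entry is non-zero."""
    return [i for i, allowed in enumerate(mask) if allowed]


def check_action(action, mask):
    """Return the action if the mask allows it, otherwise raise ValueError."""
    if not 0 <= action < len(mask) or not mask[action]:
        raise ValueError(
            f"action {action} is not allowed; valid actions: {valid_actions(mask)}"
        )
    return action


def sample_valid(mask, rng=random):
    """Uniformly sample one allowed action (the mask must allow at least one)."""
    return rng.choice(valid_actions(mask))


mask = [1, 1, 0, 1, 1]  # action 2 is masked out
print(valid_actions(mask))
print(check_action(3, mask))
print(sample_valid(mask))
```

Because a terminated agent receives an all-zero mask, sample_valid should only be called while the agent is still allowed to act, which matches how the main loop skips act after termination or truncation.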

Tic-Tac-Toe

The following is an example of a tic-tac-toe game played with ActionMaskAgent.

from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env(render_mode="human")
agents = {
    name: ActionMaskAgent(name=name, model=ChatOpenAI(model='gpt-4o', temperature=0.2), env=env)
    for name in env.possible_agents
}
main(agents, env)

Observation: {'observation': array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 0

Observation: {'observation': array([[[0, 1],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 1

Observation: {'observation': array([[[1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 2

Observation: {'observation': array([[[0, 1],
        [1, 0],
        [0, 1]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 3

Observation: {'observation': array([[[1, 0],
        [0, 1],
        [1, 0]],

       [[0, 1],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 4

Observation: {'observation': array([[[0, 1],
        [1, 0],
        [0, 1]],

       [[1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 5

Observation: {'observation': array([[[1, 0],
        [0, 1],
        [1, 0]],

       [[0, 1],
        [1, 0],
        [0, 1]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 6

Observation: {'observation': array([[[0, 1],
        [1, 0],
        [0, 1]],

       [[1, 0],
        [0, 1],
        [1, 0]],

       [[0, 1],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int8)}
Reward: -1
Termination: True
Truncation: False
Return: -1
        
Action: None

Observation: {'observation': array([[[1, 0],
        [0, 1],
        [1, 0]],

       [[0, 1],
        [1, 0],
        [0, 1]],

       [[1, 0],
        [0, 0],
        [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int8)}
Reward: 1
Termination: True
Truncation: False
Return: 1
        
Action: None
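
The nested arrays in this trace are easier to follow once decoded into cell indices. Reading the trace above, the last axis appears to hold two channels (the observing agent's marks, then the opponent's), and the flat action index matches 3 * i + j for the entry at position [i][j]. The sketch below encodes that reading; decode_board is a hypothetical helper and the layout is inferred from the printed output, so treat it as an assumption.

```python
def decode_board(planes):
    """Split a 3x3x2 observation into flat cell indices for each side."""
    mine, theirs = [], []
    for i, line in enumerate(planes):
        for j, (own, opp) in enumerate(line):
            if own:
                mine.append(3 * i + j)
            elif opp:
                theirs.append(3 * i + j)
    return mine, theirs


# Fifth observation from the trace above, written as nested lists
obs = [
    [[1, 0], [0, 1], [1, 0]],
    [[0, 1], [0, 0], [0, 0]],
    [[0, 0], [0, 0], [0, 0]],
]
print(decode_board(obs))
```

For this observation the helper returns cells 0 and 2 for the observing agent and cells 1 and 3 for its opponent, which lines up with the four actions (0, 1, 2, 3) taken before it.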

Texas Hold'em No Limit

The following is an example of a Texas Hold'em No Limit game played with ActionMaskAgent.

from pettingzoo.classic import texas_holdem_no_limit_v6

env = texas_holdem_no_limit_v6.env(num_players=4, render_mode="human")
agents = {
    name: ActionMaskAgent(name=name, model=ChatOpenAI(model='gpt-4o', temperature=0.2), env=env)
    for name in env.possible_agents
}
main(agents, env)

Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
       0., 0., 2.], dtype=float32), 'action_mask': array([1, 1, 0, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 0

Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 2.], dtype=float32), 'action_mask': array([1, 1, 0, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 0

Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 2.], dtype=float32), 'action_mask': array([1, 1, 0, 1, 1], dtype=int8)}
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 0

Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 2.], dtype=float32), 'action_mask': array([0, 0, 0, 0, 0], dtype=int8)}
Reward: -1
Termination: True
Truncation: False
Return: -1
        
Action: None

Observation: {'observation': array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 2., 2.], dtype=float32), 'action_mask': array([0, 0, 0, 0, 0], dtype=int8)}
Reward: 1
Termination: True
Truncation: False
Return: 1
        
Action: None

Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
       0., 0., 2.], dtype=float32), 'action_mask': array([0, 0, 0, 0, 0], dtype=int8)}
Reward: 0
Termination: True
Truncation: False
Return: 0
        
Action: None

Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 2.], dtype=float32), 'action_mask': array([0, 0, 0, 0, 0], dtype=int8)}
Reward: 0
Termination: True
Truncation: False
Return: 0
        
Action: None
