Theory of Mind Guided Strategy Adaptation for Zero-Shot Coordination

Explainable & Ethical AI
Published: arXiv: 2602.12458v1
Authors

Andrew Ni, Simon Stepputtis, Stefanos Nikolaidis, Michael Lewis, Katia P. Sycara, Woojun Kim

Abstract

A central challenge in multi-agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero-shot fashion. Prior work in zero-shot coordination often follows a two-stage process, first generating a diverse training pool of partner agents, and then training a best-response agent to collaborate effectively with the entire training pool. While many previous works have achieved strong performance by devising better ways to diversify the partner agent pool, there has been less emphasis on how to leverage this pool to build an adaptive agent. One limitation is that the best-response agent may converge to a static, generalist policy that performs reasonably well across diverse teammates, rather than learning a more adaptive, specialist policy that can better adapt to teammates and achieve higher synergy. To address this, we propose an adaptive ensemble agent that uses Theory-of-Mind-based best-response selection to first infer its teammate's intentions and then select the most suitable policy from a policy ensemble. We conduct experiments in the Overcooked environment to evaluate zero-shot coordination performance under both fully and partially observable settings. The empirical results demonstrate the superiority of our method over a single best-response baseline.

Paper Summary

Problem
Effective coordination with previously unseen partners is a significant challenge in multi-agent systems. In zero-shot coordination, agents must work together without any additional learning or communication. However, agents trained in self-play tend to overfit to shared conventions and struggle to infer a new partner's intent, leading to coordination failures.
Key Innovation
This research proposes a new approach called Theory-of-Mind-based Best Response Selection (TBS) to enhance zero-shot coordination. TBS combines behavioral clustering with Theory-of-Mind-guided policy selection: it infers a partner's behavioral intent from observed actions and selects the most compatible best-response policy from an ensemble in real time, enabling robust adaptation to diverse partner strategies.
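The selection loop described above can be sketched as a Bayesian belief update over clustered partner types, followed by picking the best-response policy for the most likely type. This is a minimal illustrative sketch, not the authors' implementation; the class and method names (`TBSSelector`, `observe`, `select_policy`) and the use of a simple Bayes rule are assumptions for exposition.

```python
import numpy as np

class TBSSelector:
    """Hypothetical sketch of ToM-guided best-response selection:
    maintain a belief over partner types and act with the best
    response to the most probable type."""

    def __init__(self, partner_models, best_responses):
        # partner_models[k](obs) -> action-probability list for type k
        # best_responses[k] -> policy trained against partner type k
        self.partner_models = partner_models
        self.best_responses = best_responses
        n = len(partner_models)
        self.belief = np.full(n, 1.0 / n)  # uniform prior over types

    def observe(self, obs, partner_action):
        # Bayes update: P(type | action) ∝ P(action | type) * P(type)
        likelihoods = np.array(
            [m(obs)[partner_action] for m in self.partner_models]
        )
        posterior = likelihoods * self.belief
        total = posterior.sum()
        if total > 0:  # guard against all-zero likelihoods
            self.belief = posterior / total

    def select_policy(self):
        # Choose the best response for the most likely partner type
        return self.best_responses[int(np.argmax(self.belief))]
```

For example, with two clustered partner types whose action distributions differ, a few observed actions are enough for the belief to concentrate on the matching type, at which point the agent switches to the corresponding specialist policy.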
Practical Impact
The TBS framework has the potential to improve coordination performance in various multi-agent systems, such as robotics, autonomous vehicles, and smart homes. By enabling agents to adapt to unseen strategies, TBS can improve the efficiency and effectiveness of these systems. Additionally, TBS can be applied in real-world settings where agents need to collaborate with novel partners without any prior knowledge or communication.
Analogy / Intuitive Explanation
Imagine you're playing a game with a new teammate. You need to figure out their playing style and adapt your strategy to work together effectively. TBS is like having a "team psychologist" that infers your teammate's intentions and selects the best strategy for you to follow. This way, you can improve your coordination and achieve better results together.
Paper Information
Categories: cs.MA
Published Date:
arXiv ID: 2602.12458v1
