Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence

Published: arXiv:2604.09495v1
Authors

Amit Sinha, Matthieu Geist, Aditya Mahajan

Abstract

Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are policies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.

Paper Summary

Problem
Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) model decision-making problems in which multiple agents must coordinate toward a shared goal, even though each agent receives only its own asymmetric, partial observations of the environment. Solving such problems optimally is NEXP-complete, and optimal policies depend on an agent's entire history of observations and actions. The main problem addressed in this paper is solving Dec-POMDPs efficiently when agents additionally have limited memory and must make decisions based on incomplete information.
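A limited-memory agent in this setting can be pictured as a finite-state controller: a small set of agent (memory) states, a rule for updating the memory state from each new observation, and a rule mapping the memory state to an action. The sketch below is illustrative only; the table sizes, random initialization, and function names are assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 2 observations, 2 actions, 3 agent (memory) states.
n_obs, n_actions, n_agent_states = 2, 2, 3

# An agent-state based policy is a finite-state controller:
# a memory-update rule and an action rule, both defined on the
# small set of agent states rather than on the full history.
memory_update = rng.integers(n_agent_states, size=(n_agent_states, n_obs))
action_rule = rng.integers(n_actions, size=n_agent_states)

def act(agent_state, observation):
    """One step of an agent-state based policy: choose an action from
    the current memory state, then update memory from the observation."""
    action = action_rule[agent_state]
    next_state = memory_update[agent_state, observation]
    return action, next_state

# Run the controller along a short observation sequence.
state = 0
for obs in [0, 1, 1, 0]:
    action, state = act(state, obs)
```

However long the observation history grows, the controller only ever distinguishes `n_agent_states` internal situations, which is exactly the memory constraint the paper studies.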
Key Innovation
The key innovation of this work is an algorithm called Risk-Seeking Conservative Policy Iteration (RS-CPI), which combines a modified objective that incentivizes risk-seeking with conservative policy iteration updates to find policies that operate under finite memory constraints. RS-CPI is an iterated best response style algorithm: each agent in turn improves its own policy while the others are held fixed, which guarantees monotonic improvement and convergence to a local optimum in runtime polynomial in the Dec-POMDP model size. The risk-seeking objective is what helps the iterations escape toward better local optima.
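To make the two ingredients concrete, here is a minimal sketch of a generic conservative policy iteration step and a generic risk-seeking (exponential-utility) objective. This is not the paper's exact update rule; the mixing coefficient `alpha`, the risk parameter `beta`, and the function names are assumptions for illustration.

```python
import numpy as np

def cpi_update(pi, pi_greedy, alpha=0.1):
    """Conservative policy iteration step: move a small step toward the
    greedy (best-response) policy rather than jumping to it outright.
    This damped update is what underlies monotone-improvement guarantees
    in classical conservative policy iteration."""
    return (1.0 - alpha) * pi + alpha * pi_greedy

def risk_seeking_objective(returns, beta=1.0):
    """Exponential-utility objective over sampled returns. With beta > 0
    it overweights high-return outcomes, i.e. a risk-seeking criterion;
    as beta -> 0 it recovers the ordinary mean return."""
    returns = np.asarray(returns, dtype=float)
    return np.log(np.mean(np.exp(beta * returns))) / beta

# One agent's best-response step over two actions.
pi = np.array([0.5, 0.5])          # current action distribution
pi_greedy = np.array([1.0, 0.0])   # best response with others fixed
pi_new = cpi_update(pi, pi_greedy, alpha=0.2)
```

Iterating such damped best responses agent by agent, with a risk-seeking criterion in place of the plain expected return, is the general shape of the RS-CPI scheme the summary describes.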
Practical Impact
This research provides a novel way of incorporating memory constraints on agents in Dec-POMDP problems. The RS-CPI algorithm can be applied to real-world applications such as autonomous drone fleets, network load balancing, and multi-robot coordination, where agents must make decisions with limited information and limited memory. By finding efficient, near-optimal solutions under these constraints, the approach can improve the performance of such systems without requiring agents to store their full observation history.
Analogy / Intuitive Explanation
Imagine you're navigating a maze with a group of friends, each with a limited view of the maze. You all need to work together to find the exit, but you can only see a small part of the maze at a time. The Dec-POMDP problem is like this maze, where agents have limited information and must coordinate their actions to achieve a shared goal. The RS-CPI algorithm is like a map that helps you navigate the maze efficiently, by combining risk-seeking and conservative policy iteration to find the best path to the exit.
Paper Information
Categories: cs.MA
arXiv ID: 2604.09495v1