Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

Published: arXiv:2601.23229v1
Authors

Ali Asadi, Krishnendu Chatterjee, Ehsan Goharshady, Mehrdad Karrabi, Alipasha Montaseri, Carlo Pagano

Abstract

Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form a fundamental and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial-time and strongly polynomial-time algorithms is a fundamental problem for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for an arbitrary discount factor, and the seminal work of Ye established strongly polynomial time for a fixed discount factor. The generalization of such results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a constant (fixed) discount factor, resolving an important algorithmic question.

Paper Summary

Problem
Markov decision processes (MDPs) are a fundamental model in decision making, but they assume that the transition function is known. In reality, this assumption is not always justified: MDPs are typically constructed from data, so the transition function is only estimated, with uncertainty. This has motivated the study of robust Markov decision processes (RMDPs), which weaken the assumption and require only knowledge of an uncertainty set containing the true transition function. Solving an RMDP means maximizing the worst-case expected payoff, i.e., the payoff under the least favorable transition function in the uncertainty set.
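Concretely, under the standard formulation (the paper's exact notation may differ), the robust value of a state $s$ in a discounted RMDP with uncertainty set $\mathcal{P}$ and discount factor $\gamma \in (0, 1)$ is

$$V^*(s) \;=\; \max_{\pi}\, \min_{P \in \mathcal{P}}\, \mathbb{E}^{\pi, P}\!\left[\,\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\middle|\, s_0 = s\right],$$

where the maximum ranges over policies and the minimum over transition functions in the uncertainty set. In the $(s, a)$-rectangular $L_\infty$ case, $\mathcal{P}$ constrains each transition distribution $P(\cdot \mid s, a)$ independently to an $L_\infty$ ball around a nominal estimate, intersected with the probability simplex.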
Key Innovation
The paper presents a novel potential function that tracks the effects of changing the optimal policy within the uncertainty set. This function, $f_\rho(s, s', s'')$, estimates how much the value of a policy can be improved by shifting some probability mass from one state to another. The paper also proves bounds relating policy values to this potential function. Moreover, it proves a novel combinatorial result (Lemma 8) on the number of most significant bits in the binary representations of unitary signed subset sums of a finite set of real numbers. This result is a key component of the paper's approach.
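The precise definition of $f_\rho$ and the accompanying bounds are given in the paper. As a rough illustration of the probability-mass-shifting idea only, the sketch below (plain Python with illustrative names such as worst_case_distribution that are not from the paper) computes an adversary's worst-case transition distribution for a single $(s, a)$ pair under an $L_\infty$ ball of radius $\varepsilon$: probability mass is moved away from high-value successor states and onto low-value ones, subject to the per-state $L_\infty$ bound and the simplex constraint.

import numpy as np

def worst_case_distribution(p_hat, values, eps):
    """Adversarial transition for one (s, a) pair under an L_inf ball.

    Solves  min  q . values
            s.t. ||q - p_hat||_inf <= eps,  q >= 0,  sum(q) = 1,
    i.e., shifts probability mass (up to eps per successor state) away
    from high-value successors and onto low-value ones.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    values = np.asarray(values, dtype=float)

    # Start from the largest allowed decrease everywhere ...
    q = np.clip(p_hat - eps, 0.0, 1.0)
    deficit = 1.0 - q.sum()          # mass that still has to be handed out

    # ... then return the missing mass to the lowest-value successors first.
    for i in np.argsort(values):
        room = min(p_hat[i] + eps, 1.0) - q[i]   # how much state i may still receive
        give = min(room, deficit)
        q[i] += give
        deficit -= give
        if deficit <= 1e-12:
            break
    return q

This inner step is a small linear program over a box intersected with the simplex, which is why a greedy "fill the cheapest successors first" pass solves it exactly.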
Practical Impact
The paper's main contribution is to resolve a fundamental algorithmic open problem for discounted $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets: it shows that a robust policy iteration algorithm terminates in strongly polynomial time when the discount factor is fixed. This result has important implications for decision making in uncertain environments. By providing a strongly polynomial-time algorithm for solving these RMDPs, the paper opens up new possibilities for applying robust decision-making techniques in real-world applications.
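As a hedged sketch of how such a robust policy iteration can be organized (this is not the paper's pseudocode; it reuses worst_case_distribution and numpy from the previous sketch, and it approximates the robust policy-evaluation step by iterating the robust Bellman operator to a tolerance, whereas the paper's analysis concerns the exact procedure):

def robust_policy_iteration(p_hat, rewards, eps, gamma, tol=1e-8):
    """Robust policy iteration for an (s, a)-rectangular L_inf RMDP.

    p_hat   : nominal transitions, shape (S, A, S)
    rewards : immediate rewards,   shape (S, A)
    eps     : L_inf radius per (s, a) pair, shape (S, A)
    gamma   : fixed discount factor in (0, 1)
    """
    S, A, _ = p_hat.shape
    policy = np.zeros(S, dtype=int)

    while True:
        # Robust policy evaluation: fixed point of the robust Bellman
        # operator for the current policy (iterated here to a tolerance
        # as a stand-in for an exact evaluation step).
        V = np.zeros(S)
        while True:
            V_new = np.array([
                rewards[s, policy[s]] + gamma * worst_case_distribution(
                    p_hat[s, policy[s]], V, eps[s, policy[s]]) @ V
                for s in range(S)
            ])
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break

        # Robust policy improvement: greedy with respect to worst-case Q-values.
        Q = np.array([[
            rewards[s, a] + gamma * worst_case_distribution(
                p_hat[s, a], V, eps[s, a]) @ V
            for a in range(A)] for s in range(S)
        ])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

Each improvement step is greedy with respect to worst-case Q-values; the paper's result is that such a robust policy iteration terminates in strongly polynomial time when the discount factor $\gamma$ is fixed.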
Analogy / Intuitive Explanation
Imagine you are driving a car and are unsure which route to take because traffic conditions are uncertain. A Markov decision process would pick the best route based on the expected traffic patterns. In reality, however, the traffic patterns themselves are uncertain, and we may need to plan for the worst case. A robust Markov decision process picks the route that performs best even under the worst-case traffic pattern consistent with the uncertainty. The paper's algorithm solves this kind of problem efficiently, in strongly polynomial time, which matters for decision making in uncertain environments.
Paper Information
Categories: cs.AI, cs.CC
Published Date:
arXiv ID: 2601.23229v1
