Partially Observable Markov Decision Processes (POMDPs) describe at a very high level the possible states of the world and the (probabilistic) transitions between them, given an action an agent performs. What separates them from simple Markov Decision Processes (MDPs) is that the agent also, after each action, gets an observation of the world, also probabilistically dependent on the state and action that was just executed, and does not know the resulting state other than what can be inferred from the observation. Solving a POMDP, then, requires analyzing all possible action sequences from all possible starting states over all possible observation and state outcomes to generate a policy that can recommend the optimal action in every circumstance. Optimality is defined as the highest expected accumulated reward from the current time to the end of time. Most methods look at a limited time horizon for computational reasons. Solving a POMDP can be reduced to solving an MDP, however, the dimensionality of the state space of that MDP (now called the belief space, as it represents a probability distribution over the state space) is the number of states in the POMDP. For continuous-state POMDPs, this is an infinite dimensional belief space!
Because of the observations, POMDPs can represent a very useful set of tasks for robotic planning. However, the entire history of observations can influence the optimal action policy. Therefore, the POMDP suffers not only from the curse of dimensionality, but also the curse of history. Many techniques have been proposed to solve these POMDP problems, however, the PSPACE-hard nature of the problem continues to be a challenge. Our work does not focus on new solvers, but instead on improving abstractions. Our initial research in this area has shown that the branching factor of the belief space, caused by the number of actions available to the agent, is a significant computational factor. We have proposed an iterative refinement technique, called Automated Model Approximation (AMA). Although we decrease the number of actions in a way so as to not reduce the number of accessible states, there is a huge computational speedup. Further, by analysis of the optimal policy in this simplified action space, a near-optimal action space can be constructed. Solving this new POMDP model can yield a near-optimal policy for the original space, even though much of the action space has been abstracted. Furthermore, the speed of computing the two simpler POMDP policies is generally significantly less than solving a POMDP with the true action space. AMA was validated in 8 simulated environments, solving a robotic navigation task under uncertainty in initial position and position sensing.
Cartoon representation of the AMA operation. A more complex 8-way connected action model is reduced to a 4-way connected model, with some areas benefiting from additional actions where analysis of the 4-way policy indicates they are useful.