Tour de Heuristics: MAA*
May-2014
Multiagent A* is a heuristic that takes the commonly used A* algorithm and applies it to Dec-POMDP’s. Let’s investigate how it works.
The Algorithm
def estimated_state_value(belief, action): """ The cornerstone of the A* algorithm is to have an optimistic estimator. Because an MDP assumes more information, it will always have at least as much value as a POMDP solution or a Dec-POMDP solution. For this, I have chosen mdp, as it is solveable in polynomial time. """ estimated_value = 0 for state, probability in belief: estimated_value += probability \ * calculate_mdp(state, action) return estimated_value def select_top_policy(policies): """ Select the policy currently valued highest This will be a mix of the actual policy value based on the current policy tree, and the estimated value of the policy based on an optimistic estimator. """ # Policies is most likely implemented as a tree # ... return top_policy def top_policy( candidate, best_policy, best_value): candidate_value = calculate_value(candidate) if candidate_value > best_value: return (candidate, candidate_value) else: return (best_policy, best_value) def multi_agent_astar(max_layers=10,initial_belief): best_policy_value = float("-inf") best_policy = Nil open_policies = actions_at_belief(initial_belief) while len(open_policies) > 0: candidate = select_top_policy(open_policies) expanded_policies = expand_child_nodes(candidate) (complete_policies, remaining_policies) = split_on_depth_of( expanded_policies, max_layers) for policy in complete_policies: (best_policy, best_policy_value) = top_policy( policy, best_policy, best_policy_value) open_policies.insert(remaining_policies) clean_out_policies_worse_than( open_policies, best_policy_value) return best_policy |
Analysis
The algorithm described above follows the traditional approach seen in any A*-inspired algorithm. It is optimal given an optimistic estimator, and it requires a defined start state. It is a top-down algorithm.
The part of this algorithm that makes it multi-agent is really just the implementation of the selector and the node expansion. This leverages a well known and extremely algorithmic tool.