Tour de Heuristics: MAA*

Multiagent A* (MAA*) is a heuristic search algorithm that applies the commonly used A* algorithm to Dec-POMDPs. Let’s investigate how it works.

The Algorithm

def estimated_state_value(belief, action):
    """
    The cornerstone of the A* algorithm
    is to have an optimistic estimator. Because
    the underlying MDP assumes more information,
    its optimal value is always at least as high
    as that of a POMDP or Dec-POMDP solution.

    I have chosen the MDP estimate here because
    it is solvable in polynomial time.
    """
    # belief is assumed to be an iterable of
    # (state, probability) pairs.
    estimated_value = 0.0
    for state, probability in belief:
        estimated_value += probability \
            * calculate_mdp(state, action)
    return estimated_value
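 
# calculate_mdp() is left undefined above. A minimal sketch of one way it
# could work (the `model` interface below is my own assumption, not part of
# the original post): solve the underlying fully observable MDP by value
# iteration and return the Q-value of the joint action in the given state.
# The post calls it as calculate_mdp(state, action); a real implementation
# would likely close over the Dec-POMDP model instead of taking it as an
# argument.
def calculate_mdp(state, action, model, gamma=0.95, sweeps=100):
    # Assumed interface:
    #   model.states          -- list of states
    #   model.joint_actions   -- list of joint actions
    #   model.reward(s, a)    -- immediate reward
    #   model.transition(s, a)-- dict {next_state: probability}
    values = {s: 0.0 for s in model.states}
    for _ in range(sweeps):
        values = {
            s: max(
                model.reward(s, a) + gamma * sum(
                    p * values[s2]
                    for s2, p in model.transition(s, a).items())
                for a in model.joint_actions
            )
            for s in model.states
        }
    # Q_MDP(state, action): this never undervalues the Dec-POMDP,
    # which is exactly what the estimator above relies on.
    return model.reward(state, action) + gamma * sum(
        p * values[s2]
        for s2, p in model.transition(state, action).items())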
 
def select_top_policy(open_policies):
    """
    Select the open policy currently valued highest.
    This value is a mix of the actual value of the
    partial policy tree built so far and the estimated
    value of its remainder, based on an optimistic
    estimator.
    """
    # The open set is most likely implemented as a
    # tree or priority queue keyed on that value.
    # ...
    return selected_policy
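 
# One concrete way select_top_policy() could pick its node (my sketch; the
# actual_value / estimated_remaining_value attributes are assumed names,
# not from the post): take the open node with the largest upper bound.
def select_top_policy_sketch(open_policies):
    # Upper bound = value already accrued by the partial policy
    # plus the optimistic MDP-based estimate for the part not yet built.
    return max(
        open_policies,
        key=lambda node: node.actual_value + node.estimated_remaining_value,
    )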
 
def top_policy(
    candidate,
    best_policy,
    best_value):
    """
    Keep whichever of the two complete policies
    is worth more.
    """
    candidate_value = calculate_value(candidate)
    if candidate_value > best_value:
        return (candidate, candidate_value)
    else:
        return (best_policy, best_value)
 
 
def multi_agent_astar(initial_belief, max_layers=10):
    best_policy_value = float("-inf")
    best_policy = None
 
    open_policies = actions_at_belief(initial_belief)
    while len(open_policies) > 0:
        candidate = select_top_policy(open_policies)
        expanded_policies = expand_child_nodes(candidate)
        (complete_policies, remaining_policies) = split_on_depth_of(
            expanded_policies, max_layers)
        for policy in complete_policies:
            (best_policy, best_policy_value) = top_policy(
                policy,
                best_policy,
                best_policy_value)
 
        open_policies.extend(remaining_policies)
        clean_out_policies_worse_than(
            open_policies,
            best_policy_value)
    return best_policy
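 
The pruning step is where the optimistic estimator pays off. clean_out_policies_worse_than is not defined in the post; a minimal in-place sketch (the upper_bound() method on a node is my own naming, assumed to return the node's partial value plus its optimistic estimate) might look like this:

def clean_out_policies_worse_than(open_policies, best_policy_value):
    """
    Drop every open node that can no longer beat the best
    complete policy found so far. Because the estimate never
    undervalues a node, this pruning cannot discard the
    optimal policy.
    """
    open_policies[:] = [
        node for node in open_policies
        if node.upper_bound() > best_policy_value
    ]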

Analysis

The algorithm described above follows the traditional approach seen in any A*-inspired algorithm. It is optimal as long as the estimator is optimistic, meaning it never undervalues a node, and it requires a defined start state, which here is the initial belief. It is a top-down algorithm.

The part of this algorithm that makes it multi-agent is really just the implementation of the selector and the node expansion; the rest simply leverages a well-known and well-understood algorithmic tool.
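 
To make that concrete, here is a rough sketch of what the expansion step could look like. All of the names here (the agent_actions and observations arguments, and the extend_with method on a node) are my own, chosen for illustration rather than taken from the post; in a real implementation those extra arguments would likely come from the Dec-POMDP model. Expanding a node means appending one more layer to the joint policy, choosing an action for every agent and for every observation that agent might see next, and the children are the cross-product of those choices, which is exactly why the branching factor, and with it MAA*'s cost, explodes as the number of agents grows.

from itertools import product

def expand_child_nodes(candidate, agent_actions, observations):
    # agent_actions[i]: the actions available to agent i
    # observations[i]:  the observations agent i can receive
    # For each agent, enumerate every mapping observation -> action ...
    per_agent_choices = [
        [dict(zip(observations[i], choice))
         for choice in product(agent_actions[i],
                               repeat=len(observations[i]))]
        for i in range(len(agent_actions))
    ]
    # ... and every combination of per-agent mappings is one child node.
    # extend_with() is an assumed method that returns a new node whose
    # joint policy is the candidate's plus this extra layer.
    return [
        candidate.extend_with(joint_assignment)
        for joint_assignment in product(*per_agent_choices)
    ]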
