The Bellman Equation
Jul-2014
Back in college, I learned about a tool called the “Bellman Equation”. It’s very nice because computing expected utility turns into a local calculation at each node: you only need to know your neighbors’ values from the previous pass. It’s parallelizable.
(Do every node in parallel, sync, repeat, until convergence).
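For a fixed policy, the update I mean is the standard one-step backup. The symbols here are my own notation, not from the original: $\pi$ is the policy, $R$ the reward, $P$ the transition probabilities, and $\gamma$ the decay factor.

$$U_{t+1}(s) = R\big(s, \pi(s)\big) + \gamma \sum_{s'} P\big(s' \mid s, \pi(s)\big)\, U_t(s')$$

Each node only needs its own reward, its outgoing transition probabilities, and its neighbors’ values from the previous sweep.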
The only gotcha with using the Bellman equation is that you have to know what state you are in, so it’s only appropriate for problems that are fully observable.
Anyways, it’s easy, and it’s a great way to determine expected utility. If you want to know the expected utility out to a specific horizon, just run the update that many times.
Code example
So here we want to see the expected utility given:
- An action/node policy (this is just a map of “do action x when in node y”)
- A graph that describes states and contains transition probabilities (represented as “p” below) based on the action chosen in a given state
- Two optional parameters: the previously calculated utility values and the decay applied to future utility
```python
def bellman_values(policy, graph, util = {}, decay = 0.0):
    new_util = {}
    for node in graph.nodes:
        # Start with the immediate reward for the action the policy picks here.
        new_util[node] = node.reward(policy[node])
        # Add the decayed, probability-weighted utility of each neighbor,
        # using the values from the previous sweep (0.0 if we haven't seen it yet).
        for neighbor, p in graph.transitions(node, policy[node]):
            new_util[node] += p * util.get(neighbor, 0.0) * decay
    return new_util
```
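To make the interface concrete, here’s a minimal toy setup that function could run against. The `Node` and `Graph` classes are my own stand-ins (the original doesn’t define them); all the function actually needs is `graph.nodes`, `node.reward(action)`, and `graph.transitions(node, action)` yielding `(neighbor, probability)` pairs.

```python
class Node:
    """Hypothetical stand-in: a state with a per-action reward."""
    def __init__(self, name, rewards):
        self.name = name
        self.rewards = rewards            # action -> immediate reward

    def reward(self, action):
        return self.rewards[action]


class Graph:
    """Hypothetical stand-in: states plus action-conditioned transitions."""
    def __init__(self, nodes, transitions):
        self.nodes = nodes
        self._transitions = transitions   # (node, action) -> [(neighbor, p), ...]

    def transitions(self, node, action):
        return self._transitions[(node, action)]


# Two states: "work" pays off now, "rest" only leads back to work sometimes.
work = Node("work", {"go": 1.0})
rest = Node("rest", {"go": 0.0})
graph = Graph(
    nodes=[work, rest],
    transitions={
        (work, "go"): [(rest, 1.0)],
        (rest, "go"): [(work, 0.5), (rest, 0.5)],
    },
)
policy = {work: "go", rest: "go"}

util = bellman_values(policy, graph, decay=0.9)           # first sweep
util = bellman_values(policy, graph, util, decay=0.9)     # second sweep uses the first
```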
If you can use this update, it will converge eventually (as long as the decay factor is below 1), so you can either check for convergence based on the biggest per-node change, or run it T times and get the expected utility out to horizon T.
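Here’s a sketch of both stopping rules wrapped around the function above; the driver name and parameters are mine, not from the original.

```python
def expected_utility(policy, graph, decay=0.9, horizon=None, tolerance=1e-6):
    """Repeat the Bellman sweep a fixed number of times (horizon T),
    or, if no horizon is given, until the biggest per-node change
    drops below `tolerance`."""
    util = {}
    step = 0
    while horizon is None or step < horizon:
        new_util = bellman_values(policy, graph, util, decay)
        biggest_change = max(abs(new_util[n] - util.get(n, 0.0)) for n in new_util)
        util = new_util
        step += 1
        if horizon is None and biggest_change < tolerance:
            break
    return util
```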