Partially observable upper confidence bound

Monte-Carlo tree search, first implemented in POMCP [silver_monte-carlo_2010] implementation where the components are:

  • root node construct: create_root_node_with_child_for_all_actions()
    • simply initiates a action node for each action

  • stop condition: has_simulated_n_times()
    • simulate exactly n times

  • leaf selection select_leaf_by_max_scores()
    • pick nodes according to their upper-confidence score

    • observation generation: simulator for observation generator

  • leaf expansion expand_node_with_all_actions():
    • expand a new node with references for all provided actions

  • leaf evaluation rollout():
    • rollout of some policy given a simulator (default is random)

  • back propagation backprop_running_q():
    • running Q-average

  • action selector max_q_action_selector():
    • picks the action with max q value

We provide a relatively easy way of constructing PO-UCT for your usage through

silver_monte-carlo_2010

Silver, David, and Joel Veness. “Monte-Carlo planning in large POMDPs.“ Advances in neural information processing systems. 2010.