Partially observable upper confidence bound¶
Monte-Carlo tree search, first implemented in POMCP [silver_monte-carlo_2010] implementation where the components are:
- root node construct:
create_root_node_with_child_for_all_actions() simply initiates a action node for each action
- root node construct:
- stop condition:
has_simulated_n_times() simulate exactly
ntimes
- stop condition:
- leaf selection
select_leaf_by_max_scores() pick nodes according to their upper-confidence score
observation generation: simulator for observation generator
- leaf selection
- leaf expansion
expand_node_with_all_actions(): expand a new node with references for all provided actions
- leaf expansion
- leaf evaluation
rollout(): rollout of some policy given a simulator (default is random)
- leaf evaluation
- back propagation
backprop_running_q(): running Q-average
- back propagation
- action selector
max_q_action_selector(): picks the action with max q value
- action selector
We provide a relatively easy way of constructing PO-UCT for your usage through
- silver_monte-carlo_2010
Silver, David, and Joel Veness. “Monte-Carlo planning in large POMDPs.“ Advances in neural information processing systems. 2010.