Thompson sampling

In artificial intelligence, Thompson sampling,[1] named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

Description

Consider a set of contexts \mathcal{X}, a set of actions \mathcal{A}, and rewards in \mathbb{R}. In each round, the player obtains a context x \in \mathcal{X}, plays an action a \in \mathcal{A} and receives a reward r \in \mathbb{R} drawn from a distribution that depends on the context and the action played. The player's aim is to choose actions that maximize the cumulative reward.

The elements of Thompson sampling are as follows:

  1. a likelihood function P(r|\theta,a,x);
  2. a set \Theta of parameters \theta of the distribution of r;
  3. a prior distribution P(\theta) on these parameters;
  4. past observation triplets \mathcal{D} = \{(x, a, r)\};
  5. a posterior distribution P(\theta|\mathcal{D}) \propto P(\mathcal{D}|\theta)P(\theta), where P(\mathcal{D}|\theta) is the likelihood function.
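
As an illustration of element 5, the following minimal sketch computes a posterior by direct application of Bayes' rule. It rests on assumptions that are not part of the description above: a single action with binary rewards, a Bernoulli likelihood, and a parameter set \Theta restricted to a finite grid. The function name grid_posterior is hypothetical.

```python
def grid_posterior(thetas, prior, rewards):
    """Posterior over a finite grid of parameters (illustrative assumption).

    thetas  -- candidate success probabilities, each strictly between 0 and 1
    prior   -- prior probability P(theta) of each candidate
    rewards -- observed 0/1 rewards for a single action
    """
    unnormalized = []
    for theta, p in zip(thetas, prior):
        # Bernoulli likelihood P(D | theta) of the observed rewards.
        likelihood = 1.0
        for r in rewards:
            likelihood *= theta if r == 1 else 1.0 - theta
        unnormalized.append(likelihood * p)
    # Normalize so that the posterior sums to one.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With candidates 0.25 and 0.75, a uniform prior and two observed successes, the posterior shifts towards 0.75, as expected from P(\theta|\mathcal{D}) \propto P(\mathcal{D}|\theta)P(\theta).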

Thompson sampling consists in playing each action a^\ast \in \mathcal{A} with the probability that it maximizes the expected reward under the posterior, i.e. with probability

\int \mathbb{I}[\mathbb{E}(r|a^\ast,x,\theta) = \max_{a'} \mathbb{E}(r|a',x,\theta)] P(\theta|\mathcal{D}) \, d\theta,

where \mathbb{I} is the indicator function.
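
The integral can be approximated by Monte Carlo: draw many parameters from the posterior and count how often each action is the maximizer. The sketch below assumes a context-free Bernoulli bandit in which each action has an independent Beta posterior, so that the expected reward of an action equals its sampled parameter. Neither assumption comes from the text above, and the function name is hypothetical.

```python
import random


def optimal_action_probabilities(posteriors, draws=10000, seed=0):
    """Estimate, for each action, the posterior probability that it is optimal.

    posteriors -- list of (alpha, beta) pairs, one Beta posterior per action
                  (an illustrative assumption)
    """
    rng = random.Random(seed)
    counts = [0] * len(posteriors)
    for _ in range(draws):
        # One joint draw of theta from the posterior.
        theta = [rng.betavariate(a, b) for a, b in posteriors]
        # The indicator is 1 for the action with the highest expected reward.
        best = max(range(len(theta)), key=theta.__getitem__)
        counts[best] += 1
    return [c / draws for c in counts]
```

Two actions with identical posteriors each receive a probability close to 1/2, while an action whose posterior concentrates on high rewards receives most of the probability mass.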

In practice, the rule is implemented by sampling, in each round, a parameter \theta^\ast from the posterior P(\theta|\mathcal{D}), and choosing the action a^\ast that maximizes \mathbb{E}(r|\theta^\ast,a,x) over a \in \mathcal{A}, i.e. the expected reward given the sampled parameter, the action and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round and then acts optimally according to them.
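
The sampling procedure can be sketched for a simple special case. The code below assumes a context-free bandit with binary rewards and an independent uniform Beta(1, 1) prior on the success probability of each action. These modelling choices, and the names thompson_step and run_bandit, are illustrative assumptions rather than part of the general formulation.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample one belief per action and act greedily with respect to it."""
    # With a Beta(1, 1) prior (assumed), the posterior of each action is
    # Beta(1 + successes, 1 + failures).
    sampled = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(sampled)), key=sampled.__getitem__)


def run_bandit(true_means, rounds, seed=0):
    """Simulate Thompson sampling; return how often each action was played."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        arm = thompson_step(successes, failures, rng)
        # Simulated environment: Bernoulli reward with the arm's true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Calling run_bandit([0.1, 0.5, 0.9], 2000) simulates 2000 rounds. As the posteriors sharpen, the action with the highest true mean should be played increasingly often, while the others are still tried occasionally.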

History

Thompson sampling was originally described in a 1933 article by Thompson[1] but was largely ignored by the artificial intelligence community. It was subsequently rediscovered numerous times independently in the context of reinforcement learning.[2][3][4][5][6][7] A first proof of convergence for the bandit case was given in 1997.[2] The first application to Markov decision processes followed in 2000.[4] A related approach (see Bayesian control rule) was published in 2010.[3] In 2010 it was also shown that Thompson sampling is instantaneously self-correcting.[7] Asymptotic convergence results for contextual bandits were published in 2011.[5] Thompson sampling has also been applied to A/B testing in website design and online advertising.[8] In 2012, Thompson sampling formed the basis for accelerated learning in decentralized decision making.[9]

Properties

Convergence and Optimality

Convergence has been proven for the bandit case,[2] and asymptotic convergence results are available for contextual bandits.[5]

Relationship to other approaches

Probability matching

Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, the observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.
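
A minimal sketch of this strategy follows. It assumes that the base rates are estimated from label counts in the training set and that each prediction is drawn independently. The function name is hypothetical.

```python
import random
from collections import Counter


def probability_matching_predict(training_labels, n_predictions, seed=0):
    """Predict labels at random, in proportion to their training base rates."""
    rng = random.Random(seed)
    counts = Counter(training_labels)
    labels = list(counts)
    # Label counts serve as unnormalized sampling weights.
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_predictions)
```

With a training set of 60 positive and 40 negative examples, roughly 60% of the predictions come out as "positive". A strategy that always predicted the majority class would be correct more often on such data, which is the usual contrast drawn with probability matching.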

Bayesian control rule

A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as the Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations.[3] In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let a_1, a_2, \ldots, a_T be the actions issued by an agent up to time T, and let o_1, o_2, \ldots, o_T be the observations gathered by the agent up to time T. Then, the agent issues the action a_{T+1} with probability:[3]

P(a_{T+1}|\hat{a}_{1:T}, o_{1:T}),

where the "hat"-notation \hat{a}_t denotes the fact that a_t is a causal intervention (see Causality), and not an ordinary observation. If the agent holds beliefs \theta \in \Theta over its behaviours, then the Bayesian control rule becomes

P(a_{T+1}|\hat{a}_{1:T}, o_{1:T}) = \int_{\Theta} P(a_{T+1}|\theta, \hat{a}_{1:T}, o_{1:T}) P(\theta|\hat{a}_{1:T}, o_{1:T}) \, d\theta,

where P(\theta|\hat{a}_{1:T}, o_{1:T}) is the posterior distribution over the parameter \theta given actions a_{1:T} and observations o_{1:T}.

In practice, the Bayesian control rule amounts to sampling, in each time step, a parameter \theta^\ast from the posterior distribution P(\theta|\hat{a}_{1:T}, o_{1:T}), and then sampling the action a^\ast_{T+1} from the action distribution P(a_{T+1}|\theta^\ast,\hat{a}_{1:T},o_{1:T}). The posterior distribution is computed using Bayes' rule by considering only the (causal) likelihoods of the observations o_1, o_2, \ldots, o_T and ignoring the (causal) likelihoods of the actions a_1, a_2, \ldots, a_T.
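
The two sampling steps can be sketched for a deliberately small case. The code assumes a finite set of hypotheses, each a tuple of Bernoulli observation probabilities (one per action, strictly between 0 and 1), and it assumes that the behaviour attached to each hypothesis is to play the action with the highest probability under that hypothesis. These choices and all function names are illustrative assumptions and are not taken from the cited work.

```python
import random


def bcr_update(posterior, hypotheses, action, observation):
    """Bayes update that uses only the likelihood of the observation.

    The action is treated as an intervention, so it contributes no
    likelihood term of its own.
    """
    weights = []
    for w, theta in zip(posterior, hypotheses):
        p = theta[action]  # P(o = 1 | theta, action)
        weights.append(w * (p if observation == 1 else 1.0 - p))
    total = sum(weights)
    return [w / total for w in weights]


def bcr_act(posterior, hypotheses, rng):
    """Sample a hypothesis from the posterior, then act according to it."""
    theta = rng.choices(hypotheses, weights=posterior, k=1)[0]
    # Assumed behaviour P(a | theta): play the best action under theta.
    return max(range(len(theta)), key=theta.__getitem__)


def run_bcr(true_means, hypotheses, steps, seed=0):
    """Interact with a simulated Bernoulli environment; return the posterior."""
    rng = random.Random(seed)
    posterior = [1.0 / len(hypotheses)] * len(hypotheses)
    for _ in range(steps):
        action = bcr_act(posterior, hypotheses, rng)
        observation = 1 if rng.random() < true_means[action] else 0
        posterior = bcr_update(posterior, hypotheses, action, observation)
    return posterior
```

Under these assumptions the procedure coincides with Thompson sampling over a finite parameter set: the posterior concentrates on the hypothesis that best predicts the observations, and the agent increasingly adopts that hypothesis's behaviour.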


References

  1. W. R. Thompson. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933.
  2. J. Wyatt. Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh. March 1997.
  3. P. A. Ortega and D. A. Braun. "A Minimum Relative Entropy Principle for Learning and Acting", Journal of Artificial Intelligence Research, 38, pages 475–511, 2010.
  4. M. J. A. Strens. "A Bayesian Framework for Reinforcement Learning", Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, June 29–July 2, 2000, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.1701
  5. B. C. May, N. Korda, A. Lee, and D. S. Leslie. "Optimistic Bayesian sampling in contextual-bandit problems". Technical report, Statistics Group, Department of Mathematics, University of Bristol, 2011.
  6. O. Chapelle and L. Li. "An Empirical Evaluation of Thompson Sampling". NIPS, 2011.
  7. O.-C. Granmo. "Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton", International Journal of Intelligent Computing and Cybernetics, 3(2), pages 207–234, 2010.
  8. I. Clarke. "Proportionate A/B testing", September 22nd, 2011, http://blog.locut.us/2011/09/22/proportionate-ab-testing/
  9. O.-C. Granmo and S. Glimsdal. "Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game". Applied Intelligence, 2012. doi:10.1007/s10489-012-0346-z.
This article is issued from Wikipedia - version of the Monday, April 11, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.