Gittins index

The Gittins index is a measure of the reward that can be achieved by a random process bearing a termination state and evolving from its present state onward, under the option of terminating the said process at every later stage with the accrual of the probabilistic expected reward from that stage up to the attainment of its termination state. It is a real scalar value associated to the state of a stochastic process with a reward function and a probability of termination.

Terminology

To illustrate the theory we can take two examples from a developing sector, such as from electricity generating technologies: wind power and wave power. If we are presented with the two technologies when they are both proposed as ideas we cannot say which will be better in the long run as we have no data, as yet, to base our judgments on.[1] It would be easy to say that wave power would be too problematic to develop as it seems easier to put up lots of wind turbines than to make the long floating generators, tow them out to sea and lay the cables necessary.

If we were to make a judgment call at that early time in development we could be condemning one technology to being put on the shelf and the other would be developed and put into operation. If we develop both technologies we would be able to make a judgment call on each by comparing the progress of each technology at a set time interval such as every three months. The decisions we make about investment in the next stage would be based on those results.[1]

In a paper in 1979 called Bandit Processes and Dynamic Allocation Indices John C. Gittins suggests a solution for problems such as this. He takes the two basic functions of a "Scheduling Problem" and a "Multi-armed bandit" problem[2] and shows how these problems can be solved using Dynamic allocation indices. He first takes the "Scheduling Problem" and reduces it to a machine which has to perform jobs and has a set time period, every hour or day for example, to finish each job in. The machine is given a reward value, based on finishing or not within the time period, and a probability value of whether it will finish or not for each job is calculated. The problem is "to decide which job to process next at each stage so as to maximize the total expected reward."[1] He then moves on to the "Multi–armed bandit problem" where each pull on a "one armed bandit" lever is allocated a reward function for a successful pull, and a zero reward for an unsuccessful pull. The sequence of successes forms a Bernoulli process and has an unknown probability of success. There are multiple "bandits" and the distribution of successful pulls is calculated and different for each machine. Gittins states that the problem here is "to decide which arm to pull next at each stage so as to maximize the total expected reward from an infinite sequence of pulls."[1]

Gittins says that "Both the problems described above involve a sequence of decisions, each of which is based on more information than its predecessors, and these both problems may be tackled by dynamic allocation indices."[2]

Definition

In applied mathematics, the "Gittins index" is a real scalar value associated to the state of a stochastic process with a reward function and with a probability of termination. It is a measure of the reward that can be achieved by the process evolving from that state on, under the probability that it will be terminated in future. The "index policy" induced by the Gittins index, consisting of choosing at any time the stochastic process with the currently highest Gittins index, is the solution of some stopping problems such as the one of dynamic allocation, where a decision-maker has to maximize the total reward by distributing a limited amount of effort to a number of competing projects, each returning a stochastic reward. If the projects are independent from each other and only one project at a time may evolve, the problem is called multi-armed bandit and the Gittins index policy is optimal. If multiple projects can evolve, the problem is called Restless bandit and the Gittins index policy is a known good heuristic but no optimal solution exists in general. In fact, in general this problem is NP-complete and it is generally accepted that no feasible solution can be found.

History

Questions about the optimal stopping policies in the context of clinical trials have been open from the 1940s and in the 1960s a few authors analyzed simple models leading to optimal index policies,[3] but it was only in the 1970s that Gittins and his collaborators demonstrated in a markovian framework that the optimal solution of the general case is an index policy whose "dynamic allocation index" is computable in principle for every state of each project as a function of the single project's dynamics.[2][4]

Soon after the seminal paper of Gittins, Peter Whittle[5] demonstrated that the index emerges as a lagrangian multiplier from a dynamic programming formulation of the problem called retirement process and conjectured that the same index would be a good heuristic in a more general setup named Restless bandit. The question of how to actually calculate the index for Markov chains was first addressed by Varaiya and his collaborators[6] with an algorithm that computes the indexes from the largest first down to the smallest and by Chen and Katehakis [7] who showed that standard LP could be used to calculate the index of a state without requiring its calculation for all states with higher index values. LCM Kallenberg [8] provided a parametric LP implementation to compute the indices for all states of a Markov chain. Further, Katehakis and Veinott[9] demonstrated that the index is the expected reward of a Markov decision process constructed over the Markov chain and known as Restart in State and can be calculated exactly by solving that problem with the policy iteration algorithm, or approximately with the value iteration algorithm. This approach also has the advantage of calculating the index for one specific state without having to calculate all the greater indexes and it is valid under more general state space conditions. A faster algorithm for the calculation of all indices has been obtained in 2004 by Sonin[10] as a consequence of his elimination algorithm for the optimal stopping of a Markov chain. In this algorithm the termination probability of the process may depend on the current state rather than being a fixed factor. A faster algorithm has been proposed in 2007 by Niño-Mora [11] by exploiting the structure of a parametric simplex to reduce the computational effort of the pivot steps and thereby achieving the same complexity as the Gaussian elimination algorithm. Cowan, W. and Katehakis (2014),[12] provide a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under frameworks in which, either the discount factors may be non-uniform and vary over time, or the periods of activation of each bandit may be not be fixed or uniform, subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices.

Mathematical definition

Dynamic allocation index

The classical definition by Gittins et al. is:

\nu(i)=\sup_{\tau>0}\frac{
       \left\langle\sum_{t=0}^{\tau-1}\beta^t R[Z(t)]\right\rangle_{Z(0)=i}}{
       \left\langle\sum_{t=0}^{\tau-1}\beta^t        \right\rangle_{Z(0)=i}}

where Z(\cdot) is a stochastic process, R(i) is the utility (also called reward) associated to the discrete state i, \beta<1 is the probability that the stochastic process does not terminate, and \langle\cdot\rangle_c is the conditional expectation operator given c:

\langle X\rangle_c \doteq \sum_{x\in\chi}x P\{X=x|c\}

with \chi being the range of X.

Retirement process formulation

The dynamic programming formulation in terms of retirement process, given by Whittle, is:

w(i)=\inf\{k:v(i,k)=k\}

where v(i,k) is the value function

v(i,k)=\sup_{\tau>0} \left\langle \sum_{t=0}^{\tau-1} \beta^t R[Z(t)] + \beta^t k \right\rangle_{Z(0) = i}

with the same notation as above. It holds that

\nu(i)=(1-\beta)w(i).

Restart-in-state formulation

If Z(\cdot) is a Markov chain with rewards, the interpretation of Katehakis and Veinott (1987) associates to every state the action of restarting from one arbitrary state i, thereby constructing a Markov decision process M_i.

The Gittins Index of that state i is the highest total reward which can be achieved on M_i if one can always choose to continue or restart from that state i.

h(i)=\sup_\pi \left\langle \sum_{t=0}^{\tau-1} \beta^t R[Z^\pi(t)] \right\rangle_{Z(0) = i}

where \pi indicates a policy over M_i. It holds that

h(i)=w(i).

Generalized index

If the probability of termination \beta(i) depends on the state i, a generalization introduced by Sonin (2008) defines the Gittins index \alpha(i) as the maximum discounted total reward per chance of termination.

\alpha(i)=\sup_{\tau>0} \frac{R^\tau(i)}{Q^\tau(i)}

where

R^\tau(i)=\left\langle \sum_{t=0}^{\tau-1} R[Z(t)] \right\rangle_{Z(0) = i}
Q^\tau(i)=\left\langle \sum_{t=0}^{\tau-1} \prod_{j=0}^t \beta[Z(j)] \right\rangle_{Z(0) = i}

If \beta^t is replaced by \prod_{j=0}^t \beta[Z(j)] in the definitions of \nu(i), w(i) and h(i), then it holds that

\alpha(i)=h(i)=w(i)
\alpha(i)\neq k \nu(i), \forall k

this observation leads Sonin to conclude that \alpha(i) and not \nu(i) is the "true meaning" of the Gittins index.

Notes

  1. 1 2 3 4 Cowan, Robin (July 1991). "Tortoises and Hares: Choice among technologies of unknown merit" 101 (407): 801–814.
  2. 1 2 3 J. C. Gittins, Bandit Processes and Dynamic Allocation Indices, Journal of the Royal Statistical Society, Series B, Vol. 41, No. 2. (1979), pp. 148177.
  3. Mitten, L. (1960). "An Analytic Solution to the Least Cost Testing Sequence Problem." J. of Industrial Eng., 11, 1, 17.
  4. J. C. Gittins, D. M. Jones, A Dynamic Allocation Index for the Discounted Multiarmed Bandit Problem, Biometrika, Vol 66, No. 3. (1979), pp. 561565.
  5. Whittle, Peter (1980). "Multi-armed Bandits and the Gittins Index". Journal of the Royal Statistical Society, Series B 42 (2): 143–149.
  6. Varaiya, P., Walrand J., Buyukkoc C. (1985). "Extensions of the Multiarmed Bandit Problem." IEEE Trans. Autom. Control, AC-30, 426439
  7. Chen, Y.R. and Katehakis, M.N. (1986). "Linear programming for finite state multi-armed bandit problems", Math. Oper. Res., 11(1), 262268
  8. Kallenberg, L.C.M.(1986). "A Note on MN Katehakis' and Y.-R. Chen's Computation of the Gittins Index", Math. Oper. Res., 11(1), 262268
  9. Katehakis, M., Veinott, A. (1987). "The multi-armed bandit problem: decomposition and computation." Math. Oper. Res., 12(2), 262268
  10. Sonin, I. (2008). "A generalized Gittins index for a Markov chain and its recursive calculation." Statistics and Probability Letters, 78, 15261533.
  11. Niño-Mora, J. (2007). "A (2/3)^n Fast-Pivoting Algorithm for the Gittins Index and Optimal Stopping of a Markov Chain." INFORMS Journal of Computing, 19(4), 596606.
  12. Cowan, W. and M.N. Katehakis (2014). "Multi-armed Bandits under General Depreciation and Commitment". Probability in the Engineering and Informational Sciences. doi:10.1017/S0269964814000217

References

External links

This article is issued from Wikipedia - version of the Wednesday, March 23, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.