A Linear Response Bandit Problem

Research output: Contribution to journal › Article › peer-review

Abstract

We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable “forced sampling” strategy. It is shown that the regret grows logarithmically in the time horizon n, and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like √n.
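To make the policy structure described above concrete, here is a minimal Python sketch of a forced-sampling plus myopic least-squares rule of the kind the abstract outlines. The geometric forced-sampling schedule, the problem dimensions, and all identifiers are illustrative assumptions, not the paper's exact construction or rates.

```python
import numpy as np

# Sketch (not the authors' exact policy): two arms with mean reward
# x . beta_i for unknown beta_i; play myopically on least-squares
# estimates except at a sparse set of predetermined forced-sampling times.
rng = np.random.default_rng(0)
d, n = 3, 5000
beta = [rng.normal(size=d), rng.normal(size=d)]  # unknown arm parameters

# Forced-sampling schedule: pull each arm at geometrically spaced times
# (an illustrative choice; the paper's schedule may differ).
forced = {0: set(), 1: set()}
t = 2
while t <= n:
    forced[0].add(t)
    forced[1].add(t + 1)
    t *= 2

X_hist = {0: [], 1: []}
y_hist = {0: [], 1: []}

def ols(arm):
    """Least-squares estimate of beta for one arm (None until enough data)."""
    X, y = np.array(X_hist[arm]), np.array(y_hist[arm])
    if len(y) < d:
        return None
    return np.linalg.lstsq(X, y, rcond=None)[0]

regret = 0.0
for t in range(1, n + 1):
    x = rng.normal(size=d)                # observed covariate vector
    if t in forced[0]:
        a = 0                             # forced sample of arm 0
    elif t in forced[1]:
        a = 1                             # forced sample of arm 1
    else:
        b0, b1 = ols(0), ols(1)
        if b0 is None or b1 is None:
            a = t % 2                     # too little data: alternate
        else:
            a = int(x @ b1 > x @ b0)      # myopic (greedy) action
    y = x @ beta[a] + rng.normal()        # noisy linear response
    X_hist[a].append(x)
    y_hist[a].append(y)
    regret += max(x @ beta[0], x @ beta[1]) - x @ beta[a]

print(f"cumulative regret after {n} rounds: {regret:.1f}")
```

The forced samples guard against the myopic rule locking onto a poor estimate of one arm; because the optimal arm here depends on the realized covariate x, some sampling of the momentarily inferior arm is unavoidable, which is the mechanism behind the √n inferior-sampling rate noted in the abstract.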
Original language: English
Pages (from-to): 230-261
Number of pages: 32
Journal: Stochastic Systems
Volume: 3
Issue number: 1
DOIs
State: Published - 2013
