Document worth reading: “Preference-based Online Learning with Dueling Bandits: A Survey”

In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available; instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The goal of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we give an overview of the problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
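To make the feedback model concrete, here is a minimal Python sketch (not from the survey itself) of the dueling-bandit interaction: the learner selects pairs of arms, observes only a noisy binary comparison rather than a numerical reward, and a naive explore-then-commit strategy estimates an empirical Copeland winner. The preference matrix `P`, the number of duels `n`, and the strategy are all illustrative assumptions.

```python
import random

# Hypothetical 3-arm preference matrix: P[i][j] is the probability
# that arm i wins a single noisy duel against arm j (P[i][j] + P[j][i] = 1).
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]
K = len(P)

def duel(i, j):
    """Simulate one qualitative pairwise comparison; True if arm i beats arm j."""
    return random.random() < P[i][j]

# Naive explore-then-commit learner: duel every pair a fixed number of
# times, estimate pairwise win rates, then commit to the empirical
# Copeland winner (the arm that beats the most other arms).
n = 500  # duels per pair (assumed exploration budget)
beats = [[0] * K for _ in range(K)]
for i in range(K):
    for j in range(i + 1, K):
        for _ in range(n):
            if duel(i, j):
                beats[i][j] += 1
            else:
                beats[j][i] += 1

copeland = [sum(beats[i][j] > n / 2 for j in range(K) if j != i) for i in range(K)]
best = max(range(K), key=lambda i: copeland[i])
print("Empirical Copeland winner: arm", best)
```

This sketch only illustrates the feedback structure; the methods surveyed in the paper differ precisely in how they trade off exploration against exploitation and in what they assume about the matrix of pairwise preference probabilities.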