Deriving User- and Content-specific Rewards for Contextual Bandits

Abstract

Bandit algorithms have gained increased attention in recommender systems, as they provide effective and scalable recommendations. These algorithms use reward functions, usually based on a numeric variable such as click-through rates, as the basis for optimization. On a popular music streaming service, a contextual bandit algorithm decides which content to recommend to users, where the reward function is a binarization of a numeric variable that defines success based on a static threshold of user streaming time: 1 if the user streamed for at least 30 seconds and 0 otherwise. We explore alternative methods to provide a more informed reward function, based on the assumption that the streaming time distribution heavily depends on the type of user and the type of content being streamed. To automatically extract user and content groups from streaming data, we employ “co-clustering”, an unsupervised learning technique that simultaneously extracts clusters of rows and columns from a co-occurrence matrix. The streaming distributions within the co-clusters are then used to define rewards specific to each co-cluster. Our proposed co-cluster-based reward functions lead to an improvement of over 25% in expected stream rate compared to the standard binarized rewards.
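The pipeline the abstract describes — co-cluster a user-by-content co-occurrence matrix, then replace the global 30-second cutoff with a threshold derived from each co-cluster's streaming-time distribution — can be sketched as follows. This is a hypothetical illustration on synthetic data, not the paper's implementation; it assumes scikit-learn's `SpectralCoclustering` as the co-clustering method and uses the within-co-cluster median as an example threshold:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Synthetic user x content play-count matrix with a planted 2x2 block
# structure (40 users, 30 content items); +1 keeps all entries positive.
rng = np.random.default_rng(0)
counts = np.block([
    [rng.poisson(8, (20, 15)), rng.poisson(1, (20, 15))],
    [rng.poisson(1, (20, 15)), rng.poisson(8, (20, 15))],
]) + 1

# Simultaneously cluster rows (users) and columns (content items).
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(counts)
user_groups = model.row_labels_        # cluster id per user
content_groups = model.column_labels_  # cluster id per content item

# Synthetic streaming times in seconds for each (user, item) pair.
stream_time = rng.exponential(60.0, size=counts.shape)

# Co-cluster-specific threshold: here, the median streaming time within
# each (user group, content group) block replaces the static 30s cutoff.
thresholds = np.zeros((2, 2))
for u in range(2):
    for c in range(2):
        block = stream_time[np.ix_(user_groups == u, content_groups == c)]
        thresholds[u, c] = np.median(block)

def reward(user: int, item: int, seconds: float) -> int:
    """Binarized reward using the co-cluster-specific threshold."""
    return int(seconds >= thresholds[user_groups[user], content_groups[item]])
```

The same binarized-reward interface is preserved, so a contextual bandit can consume these rewards unchanged; only the threshold behind the binarization becomes user- and content-dependent.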

Related

April 2021 | AISTATS

Collaborative Classification from Noisy Labels

Lucas Maystre, Nagarjuna Kumarappan, Judith Bütepage, Mounia Lalmas

March 2021 | WSDM

Shifting Consumption towards Diverse Content on Music Streaming Platforms

Christian Hansen, Rishabh Mehrotra, Casper Hansen, Brian Brost, Lucas Maystre, Mounia Lalmas

October 2020 | CIKM

Query Understanding for Surfacing Under-served Music Content

Federico Tomasi, Rishabh Mehrotra, Aasish Pappu, Judith Bütepage, Brian Brost, Hugo Galvão, Mounia Lalmas