Deriving User- and Content-specific Rewards for Contextual Bandits

Abstract

Bandit algorithms have gained increased attention in recommender systems, as they provide effective and scalable recommendations. These algorithms use reward functions, usually based on a numeric variable such as click-through rate, as the basis for optimization. On a popular music streaming service, a contextual bandit algorithm is used to decide which content to recommend to users, where the reward function is a binarization of a numeric variable that defines success based on a static threshold of user streaming time: 1 if the user streamed for at least 30 seconds and 0 otherwise. We explore alternative methods to provide a more informed reward function, based on the assumption that the streaming time distribution depends heavily on the type of user and the type of content being streamed. To automatically extract user and content groups from streaming data, we employ “co-clustering”, an unsupervised learning technique that simultaneously extracts clusters of rows and columns from a co-occurrence matrix. The streaming distributions within the co-clusters are then used to define rewards specific to each co-cluster. Our proposed co-clustering-based reward functions lead to an improvement of over 25% in expected stream rate, compared to the standard binarized rewards.
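As a rough illustration of the contrast drawn in the abstract, the sketch below places the static 30-second reward next to a reward whose threshold depends on the (user cluster, content cluster) pair. The abstract does not say how the per-co-cluster streaming distributions are turned into rewards, so the use of the median as the threshold is purely an assumption made for this example. The function names, the event tuple layout, and the precomputed cluster assignments are likewise hypothetical; in practice the assignments would come from a co-clustering step that is not shown here.

```python
from collections import defaultdict
from statistics import median

# Static threshold stated in the abstract: success means >= 30 seconds streamed.
STATIC_THRESHOLD_S = 30.0


def static_reward(stream_seconds):
    """Standard binarized reward: 1 if streamed at least 30 seconds, else 0."""
    return 1 if stream_seconds >= STATIC_THRESHOLD_S else 0


def fit_cocluster_thresholds(events, user_cluster, content_cluster):
    """Compute one threshold per co-cluster from observed streaming times.

    events: iterable of (user_id, content_id, stream_seconds) tuples.
    user_cluster / content_cluster: dicts mapping ids to cluster labels
    (assumed to be produced beforehand by a co-clustering algorithm).

    ASSUMPTION: the threshold is the median streaming time within the
    co-cluster. The source does not specify this choice.
    """
    times_by_cell = defaultdict(list)
    for user_id, content_id, seconds in events:
        cell = (user_cluster[user_id], content_cluster[content_id])
        times_by_cell[cell].append(seconds)
    return {cell: median(times) for cell, times in times_by_cell.items()}


def cocluster_reward(user_id, content_id, stream_seconds,
                     thresholds, user_cluster, content_cluster):
    """Binarized reward using the threshold of the matching co-cluster.

    Falls back to the static 30-second threshold for co-clusters that had
    no observations when the thresholds were fitted.
    """
    cell = (user_cluster[user_id], content_cluster[content_id])
    threshold = thresholds.get(cell, STATIC_THRESHOLD_S)
    return 1 if stream_seconds >= threshold else 0


# Toy data, invented for illustration only.
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
content_cluster = {"c1": 0, "c2": 1}
events = [
    ("u1", "c1", 20.0), ("u2", "c1", 40.0), ("u1", "c1", 60.0),
    ("u3", "c2", 5.0), ("u3", "c2", 15.0),
]
thresholds = fit_cocluster_thresholds(events, user_cluster, content_cluster)
```

With this toy data, a 12-second stream by `u3` on `c2` counts as a success under the co-cluster reward, because streams in that co-cluster are typically short, while the static reward scores it 0. A 35-second stream by `u1` on `c1` goes the other way. Other summaries of the per-co-cluster distribution, such as a different quantile or a non-binary score, would fit the same structure.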

Related

September 2022 | RecSys

Identifying New Podcasts with High General Appeal Using a Pure Exploration Infinitely-Armed Bandit Strategy

Maryam Aziz, Jesse Anderton, Kevin Jamieson, Alice Wang, Hugues Bouchard, Javed Aslam

July 2022 | SIGIR

What Makes a Good Podcast Summary?

Rezvaneh Rezapour, Sravana Reddy, Ann Clifton, Rosie Jones