Federico Tomasi, Rishabh Mehrotra, Aasish Pappu, Judith Bütepage, Brian Brost, Hugo Galvão, Mounia Lalmas
Deriving User- and Content-specific Rewards for Contextual Bandits
Bandit algorithms have gained increased attention in recommender systems, as they provide effective and scalable recommendations. These algorithms use reward functions, usually based on a numeric variable such as click-through rates, as the basis for optimization. On a popular music streaming service, a contextual bandit algorithm is used to decide which content to recommend to users, where the reward function is a binarization of a numeric variable that defines success based on a static threshold of user streaming time: 1 if the user streamed for at least 30 seconds and 0 otherwise. We explore alternative methods to provide a more informed reward function, based on the assumption that the streaming time distribution heavily depends on the type of user and the type of content being streamed. To automatically extract user and content groups from streaming data, we employ “co-clustering”, an unsupervised learning technique that simultaneously extracts clusters of rows and columns from a co-occurrence matrix. The streaming distributions within the co-clusters are then used to define rewards specific to each co-cluster. Our proposed co-clustering-based reward functions lead to an improvement of over 25% in expected stream rate, compared to the standard binarized rewards.
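The general approach described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: it assumes a toy user-by-content matrix of streaming seconds, uses scikit-learn's SpectralCoclustering to obtain user and content groups, and replaces the global 30-second cutoff with a per-(user group, content group) threshold (here, the block median, a hypothetical choice).

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)

# Toy co-occurrence matrix: rows = users, columns = content items,
# entries = streaming seconds (synthetic data, for illustration only).
X = rng.exponential(scale=30.0, size=(40, 30)) + 1.0

# Co-cluster rows and columns simultaneously.
model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(X)
row_labels = model.row_labels_    # user-group id per user
col_labels = model.column_labels_  # content-group id per item

# One threshold per (user group, content group) block, e.g. the block
# median streaming time, falling back to the global median if a block
# happens to be empty.
global_med = float(np.median(X))
thresholds = {}
for r in range(3):
    for c in range(3):
        block = X[np.ix_(row_labels == r, col_labels == c)]
        thresholds[(r, c)] = float(np.median(block)) if block.size else global_med

def reward(user, item, seconds):
    """Binarized reward using the co-cluster-specific threshold
    instead of a fixed 30-second cutoff."""
    key = (row_labels[user], col_labels[item])
    return 1 if seconds >= thresholds[key] else 0
```

The bandit then optimizes this co-cluster-aware reward exactly as it would the static one; only the success threshold changes per block.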
Casper Hansen, Christian Hansen, Lucas Maystre, Rishabh Mehrotra, Brian Brost, Federico Tomasi, Mounia Lalmas
Inferring the Causal Impact of New Track Releases on Music Recommendation Platforms through Counterfactual Predictions
Rishabh Mehrotra, Prasanta Bhattacharya, Mounia Lalmas