Towards a Fair Marketplace: Trade-off between Relevance, Fairness & Satisfaction in RecSys

Two-sided marketplaces act as intermediaries that facilitate economic interaction between two sets of agents; for example, consumers and suppliers, or users and advertisers. Such platforms have customers not only on the consumer side (e.g. users), but also on the supplier side (e.g. retailers, artists). While traditional recommender systems focus specifically on increasing consumer satisfaction by providing relevant content to consumers, two-sided marketplaces need to additionally optimize for supplier preferences.

Motivated by this challenge, our research focuses on the relationship between the objectives of consumers and suppliers in the case of music streaming services. We consider the balance between the relevance of recommendations to the consumer (i.e. listener) and the fairness of representation of suppliers (i.e. artists), and measure the impact of both on consumer satisfaction. Here we’ll describe a conceptual and computational framework that uses counterfactual estimation techniques to understand and evaluate different recommendation policies, specifically in the context of a two-sided marketplace.

Superstar Economics

The attention given to suppliers as a result of predicted relevance is often vastly unequal. A relatively small group of “superstar” suppliers receive a large portion of attention in many marketplaces, while the majority of suppliers in the long tail receive very little. For example, Figure 1 shows the relative exposure given to artist playlists in a music app across the popularity spectrum.

Figure 1: Exposure of artist playlists on a music app. A small number of artists receive the highest relevance score for most users.

Prior user familiarity and exposure outside the online marketplace (e.g. through advertising, word of mouth) are likely the main reasons for such disparities. In addition to pre-existing familiarity, we identify a second contributing factor to this attention disparity: the recommendation strategies powering two-sided marketplaces. Recommendation systems inherently tend to follow the pattern of "superstar economics": rankings have a top and a tail end, not just for popularity, but also for relevance, as is evident in Figure 1.

Key Definitions

We define key concepts of user relevance, supplier fairness and user satisfaction, which are used throughout the post.

Relevance: we operationalize the notion of relevance for the user and identify a recommendation as relevant if it closely resembles the user's interest profile.

Satisfaction: we consider satisfaction from the consumer perspective and define it as a subjective measure of the utility of recommendations. In practice, we measure satisfaction as the number of tracks the user listens to in a recommended set.

Fairness: we operationalize the concept of fairness for the supplier (i.e. artists), using the popularity of the supplier and the concept of group fairness: a set of tracks is “fair” if it contains tracks from artists that belong to different popularity groups. We compute the group fairness measure (ψ) of a set of tracks (s) as:

where K is the number of popularity bins considered, P_i is the set of artists belonging to popularity bin i, s is the recommended set, and A(s) is the collection of artists in the set s, with a_j denoting the artist of the j-th track in the set. ψ(s) rewards sets that are diverse in terms of the artist popularity bins represented and that are therefore, under the current definition, fair to the different popularity bins of suppliers. We note that our definition of group fairness is only one of many possible fairness definitions, and the framework presented in this work is amenable to other interpretations and definitions of fairness.
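The formula for ψ(s) is not reproduced in this text. As a rough illustration only, the sketch below assumes one form that is consistent with the symbols defined above: for each popularity bin, count the tracks in s whose artist falls in that bin, take the square root of that count, and sum over the K bins. The square root is an assumption (any concave function would reward spreading tracks across bins), and the names `group_fairness`, `artist_bin` and the toy artists are hypothetical.

```python
import math
from collections import Counter


def group_fairness(track_artists, artist_bin, num_bins):
    """Assumed form: psi(s) = sum over bins i of sqrt(#tracks in s whose artist is in P_i).

    track_artists: artist of each track in the set s (one entry per track).
    artist_bin:    hypothetical mapping from artist to popularity bin index 0..num_bins-1.
    num_bins:      K, the number of popularity bins.
    """
    counts = Counter(artist_bin[artist] for artist in track_artists)
    return sum(math.sqrt(counts.get(i, 0)) for i in range(num_bins))


# Toy data: four artists, one per popularity bin, plus three more in bin 0.
artist_bin = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 0, "f": 0, "g": 0}

spread_set = ["a", "b", "c", "d"]        # one track from each bin
concentrated_set = ["a", "e", "f", "g"]  # all tracks from the most popular bin

spread_score = group_fairness(spread_set, artist_bin, num_bins=4)
concentrated_score = group_fairness(concentrated_set, artist_bin, num_bins=4)
```

Under this assumed form, a set that draws one track from each of four bins scores higher than a set of the same size drawn from a single bin, which matches the stated intent that ψ(s) rewards diversity across popularity bins.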

Optimizing for User Relevance

To better understand the need for considering relevance and fairness of sets (i.e. playlists), we begin by providing a descriptive summary of their scores on a random collection of playlists.

Figure 2: Interplay between relevance scores and fairness scores of sets (playlists). Left: The distribution of normalized relevance scores vs. normalized fairness estimates for all recommended sets. Very few sets have both high relevance and high fairness. Right: The distribution of fairness estimates for all recommended sets, compared with the fairness estimates for relevant sets. The relevant sets have lower fairness scores than recommended sets overall.

We observe that very few recommended sets have both high relevance and high fairness (Figure 2, left). Thus, optimizing for relevance without explicitly considering fairness could adversely affect supplier fairness. As Figure 2 (right) shows, the mean fairness across all sets is almost twice the mean fairness of the top relevant sets, which motivates jointly considering relevance and fairness when recommending content.

Balance between Relevance & Fairness

To accommodate the needs of both consumers and suppliers in a two-sided marketplace, recommender systems need to strike a balance between the relevance of recommended content to consumers and fairness in the opportunity different suppliers have to be surfaced. We present a number of recommendation policies with which the system can trade off relevance and fairness.

While policies I-V present different optimization objectives that jointly consider relevance and fairness estimates in order to decide which playlists to surface to users, they fail to capture user-level heterogeneity. We conjecture that users vary in their sensitivity to fair content: some users are only interested in a particular group of suppliers, while others are more flexible about the distribution of suppliers exposed in the recommended content. Such user-level disposition toward fairness motivates a user-affinity-aware recommendation policy, which computes a user’s affinity towards different types of content (especially fair sets) and uses that affinity to adapt what is recommended. The adaptive policies (policies VI and VII) capitalize on this insight.

Findings

Estimating user satisfaction is a non-trivial problem, with most solutions based on running a controlled experiment (i.e. an A/B test). However, these experiments usually require non-trivial engineering resources and affect live users, which can ultimately change the user experience. We therefore advocate the use of counterfactual estimation techniques for unbiased offline evaluation of user satisfaction in a recommendation setting.
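The post does not say which counterfactual estimator was used. As a minimal sketch of the general idea, the example below uses inverse propensity scoring (IPS), a standard off-policy estimator: each logged reward is reweighted by the ratio of the probability that the new policy would have shown the same set to the probability with which the logging policy actually showed it. The log format, the function names and the toy numbers are all assumptions made for illustration.

```python
def ips_estimate(logs, target_prob):
    """Inverse propensity scoring estimate of a target policy's mean reward.

    logs:        iterable of (context, action, reward, logging_prob) tuples,
                 where logging_prob is the probability with which the logging
                 policy chose `action` in `context` (must be > 0).
    target_prob: function (context, action) -> probability that the target
                 policy would choose `action` in `context`.
    """
    logs = list(logs)
    if not logs:
        raise ValueError("need at least one logged interaction")
    total = 0.0
    for context, action, reward, logging_prob in logs:
        if logging_prob <= 0.0:
            raise ValueError("logging probabilities must be positive")
        importance_weight = target_prob(context, action) / logging_prob
        total += importance_weight * reward
    return total / len(logs)


# Toy log: a uniformly random logging policy over two playlists; the reward is
# the number of tracks listened to (the satisfaction measure used in this post).
toy_logs = [
    ("user_1", "playlist_a", 3.0, 0.5),
    ("user_1", "playlist_b", 1.0, 0.5),
]


def always_playlist_a(context, action):
    """Hypothetical target policy that always recommends playlist_a."""
    return 1.0 if action == "playlist_a" else 0.0


estimated_satisfaction = ips_estimate(toy_logs, always_playlist_a)
```

The estimate is only unbiased when the logging policy gives non-zero probability to every set the target policy might recommend, which is why logs collected with some randomization are typically needed.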

Figure 3: Satisfaction estimates for the interpolated policy (left), probabilistic policy (middle) and guaranteed relevance policy (right). As relevance is weighted more heavily than fairness, we observe a consistent increase in user satisfaction.

Comparing Policies

Our findings confirm our initial hypothesis: recommending content weighted for relevance to the listener has a positive impact on consumer satisfaction, while recommending content weighted for fairness to the supplier does not achieve the same result. For the recommendation policies, the parameter β lets us control the importance given to relevance vs. fairness estimates (as shown in the policy summary figure above). We observe a relative decline of 35% in satisfaction on moving from a focus on relevance to a focus on fairness. As we vary β from 0 to 1 to differentially weight the relevance and fairness estimates for each set, we observe a gradual improvement in the user satisfaction metric. Further, we observe a sharp increase in user satisfaction when the importance given to relevance reaches 0.7 and beyond. We do not observe significant improvements in user satisfaction beyond β = 0.8, which suggests that recommender systems could give 20% importance to the fairness of sets without a severe impact on user satisfaction.
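To make the role of β concrete, here is a minimal sketch of an interpolated policy. It assumes the policy scores each candidate set as a linear blend, β times relevance plus (1 − β) times fairness, and surfaces the highest-scoring set; the exact objective is not spelled out in this text, so the blend, the function names and the toy scores are assumptions.

```python
def interpolated_score(relevance, fairness, beta):
    """Assumed blend: beta weights relevance, (1 - beta) weights fairness."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * relevance + (1.0 - beta) * fairness


def recommend_interpolated(candidates, beta):
    """Return the candidate set with the highest blended score."""
    return max(
        candidates,
        key=lambda c: interpolated_score(c["relevance"], c["fairness"], beta),
    )


# Hypothetical candidate sets with normalized relevance and fairness scores.
candidate_sets = [
    {"id": "relevant_set", "relevance": 0.9, "fairness": 0.2},
    {"id": "fair_set", "relevance": 0.4, "fairness": 0.9},
]

pick_relevance_only = recommend_interpolated(candidate_sets, beta=1.0)
pick_fairness_only = recommend_interpolated(candidate_sets, beta=0.0)
pick_mostly_relevance = recommend_interpolated(candidate_sets, beta=0.8)
```

With β = 1 the policy reduces to ranking by relevance alone, and with β = 0 to ranking by fairness alone; intermediate values such as β = 0.8 correspond to giving 20% of the weight to fairness.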

Impact of User Affinity

The policies evaluated so far have not incorporated user-level traits, specifically a user’s affinity for fair content. Adaptive policies take this affinity into account: they recommend fair content to users who have a positive affinity towards fair content, and recommend only relevant content to users who do not prefer fair content.
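The two adaptive variants can be sketched as follows, under stated assumptions. The "extreme case" variant is assumed to switch outright between ranking by fairness and ranking by relevance depending on the sign of the user's affinity; the "normalized affinity" variant is assumed to use an affinity rescaled to [0, 1] as the per-user weight on fairness. How affinity is actually computed is not described here, so it is simply passed in as a number, and all names are hypothetical.

```python
def adaptive_extreme(candidates, affinity):
    """Assumed extreme-case policy: fair sets for users with positive affinity
    towards fair content, relevant sets for everyone else."""
    key = "fairness" if affinity > 0 else "relevance"
    return max(candidates, key=lambda c: c[key])


def adaptive_normalized(candidates, normalized_affinity):
    """Assumed normalized-affinity policy: the user's affinity, rescaled to
    [0, 1], is the weight placed on fairness for that user."""
    if not 0.0 <= normalized_affinity <= 1.0:
        raise ValueError("normalized_affinity must lie in [0, 1]")
    w = normalized_affinity
    return max(
        candidates,
        key=lambda c: (1.0 - w) * c["relevance"] + w * c["fairness"],
    )


# Hypothetical candidate sets with normalized relevance and fairness scores.
example_sets = [
    {"id": "relevant_set", "relevance": 0.9, "fairness": 0.2},
    {"id": "fair_set", "relevance": 0.4, "fairness": 0.9},
]

for_fairness_lover = adaptive_extreme(example_sets, affinity=0.3)
for_relevance_lover = adaptive_extreme(example_sets, affinity=-0.1)
for_mild_affinity = adaptive_normalized(example_sets, normalized_affinity=0.2)
```

The design difference is that the extreme-case variant makes a binary per-user decision, while the normalized variant lets the strength of the user's affinity set the trade-off continuously.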

Table 1: User satisfaction estimates for a subset of recommendation policies considered.

We observe that the adaptive policies perform better than the best-performing configuration of both the interpolated policy and the probabilistic policy. This suggests that users who have a higher affinity towards fair content do indeed have higher satisfaction when presented with fair content. Of the two adaptive variants proposed, the policy based on the normalized affinity score performs better than the simple extreme-case policy. These results highlight that personalizing the recommendation policy and adapting it to user-level affinity is better than globally balancing relevance and fairness.

Cost vs. Benefit Analysis

We next consider a detailed analysis of the percentage gain in fairness, relevance and satisfaction observed for the different policies. This enables us to select cases wherein the loss observed in fairness and relevance is minimized while maximizing the gains in satisfaction.
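The percentage gains and losses discussed here can be computed as a relative change against a reference policy. The text does not state which policy serves as the reference, so the sketch below treats the baseline as an assumption; the metric values are made-up numbers used only to show the arithmetic.

```python
def percent_change(policy_value, baseline_value):
    """Relative change of a metric versus a baseline, in percent.

    Positive values are gains, negative values are losses.
    """
    if baseline_value == 0:
        raise ValueError("baseline value must be non-zero")
    return 100.0 * (policy_value - baseline_value) / baseline_value


# Hypothetical metric values for a baseline policy and a candidate policy.
baseline_metrics = {"fairness": 0.50, "relevance": 0.80, "satisfaction": 4.0}
candidate_metrics = {"fairness": 0.40, "relevance": 0.76, "satisfaction": 4.4}

changes = {
    name: percent_change(candidate_metrics[name], baseline_metrics[name])
    for name in baseline_metrics
}
```

Comparing the three resulting percentages side by side is what allows a policy to be judged on how much fairness and relevance it gives up for a given gain in satisfaction.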

We observe that while the interpolated policy incurs only small losses in relevance and satisfaction, it severely impacts fairness, with losses ranging from 42% to 64%. The guaranteed relevance policy, on the other hand, achieves a positive gain in user satisfaction while suffering significant losses in fairness estimates. Adapting to a user’s affinity towards fair content gives us the best trade-off between fairness and relevance without negatively impacting satisfaction: we observe substantially lower losses in fairness (15-17%), together with gains of 9-21% in satisfaction estimates.

Overall, our findings indicate that adaptive policies provide the best middle ground: they do not severely impact relevance, and they positively impact fairness and satisfaction.

Conclusions

From our research, we can infer that appropriate definitions of fairness and satisfaction in different two-sided marketplaces would enable system designers to better understand the interplay between the different factors. The proposed policies enable system designers to jointly optimize for fairness and relevance, which can ultimately achieve substantial improvement in supplier fairness without a decline in user satisfaction. Our findings provide insight that can be useful in guiding future research on sophisticated algorithms for joint optimization of user relevance, satisfaction and fairness.