Image Score: Learning and Evaluating Human Preferences for Mercari Search

Chingis Oinar; Miao Cao; Shanshan Fu

TL;DR

On Mercari search result pages, image quality heavily influences which items users engage with when relevance is similar. The authors present Image Score, a cost-efficient pipeline that uses LLM-generated image aesthetics labels to supervise a CLIP-based ranking model, which is deployed in Elasticsearch with Triton-backed inference. Offline experiments on a proprietary dataset show Image Score (with Focal Loss) achieving higher OPA and CA than baselines, while online A/B tests on the web platform yield a ~7% increase in Average Transaction per User, validating practical impact for search optimization. The work also discusses limitations, notably AI-generated images scoring highly and the need for continuous monitoring and platform-specific tuning to sustain gains.

Abstract

Mercari is the largest C2C e-commerce marketplace in Japan, with more than 20 million monthly active users. Because search is the fundamental way to discover desired items, we have always had a substantial amount of data with implicit feedback. Although we actively use that data to provide the best service for our users, the relationship between implicit feedback and tasks such as image quality assessment is not trivial. Many traditional lines of research in Machine Learning (ML) are similarly motivated by the insatiable appetite of Deep Learning (DL) models for well-labelled training data. Weak supervision leverages higher-level and/or noisier supervision over unlabeled data, and Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We present how we leverage Chain-of-Thought (CoT) to enable an LLM to produce image aesthetics labels that correlate well with human behavior in e-commerce settings. Leveraging LLMs is more cost-effective than explicit human judgment and significantly improves the explainability of deep image quality evaluation, which is highly important for customer journey optimization at Mercari. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings, which is very convenient for proof-of-concept testing. We show that our LLM-produced labels correlate with user behavior on Mercari. Finally, we show results from an online experiment, where we achieved significant growth in sales on the web platform.

Paper Structure (33 sections, 5 equations, 10 figures, 2 tables)

Introduction
Related Work
Weak Supervision via LLMs
Image Quality and Aesthetics Evaluation
Image Quality Evaluation in Online Marketplaces
Image Score: Learning and Evaluating Human Preferences for Mercari Search
Data Collection and Label Generation
Data Collection
Annotation with LLM
Dataset Analysis
Proposed Model
Preference Learning
Deployment
Image Score Component
Offline Prediction
...and 18 more sections

Figures (10)

Figure 1: The search result page user interface in the Mercari app. The figure shows how we display items to users, in a grid layout. The users only see images and prices at first. We study the importance of image quality when looking at similarly relevant and similarly priced items.
Figure 2: The data collection and processing pipeline. Price filtering and position windowing are applied to SERPs to get similar item pairs.
Figure 3: The prompt for image aesthetic evaluation and image batch examples. The images with green borders are clicked items and others are not clicked. $en\_query$ denotes a SERP query translated into English.
Figure 4: The score distributions for clicked items and not clicked items. In the chart on the right-hand side, the colors blue and red represent items that have been clicked and not clicked respectively.
Figure 5: Most common adjectives from LLM analysis that are unique to clicked and not clicked items on the left and right side, respectively. The green bar represents positive adjectives, the red bar represents negative adjectives, and the yellow bar represents neutral adjectives.
...and 5 more figures
