Function-based Labels for Complementary Recommendation: Definition, Annotation, and LLM-as-a-Judge
Chihiro Yamasaki, Kai Sugahara, Yuma Nagi, Kazushi Okamoto
TL;DR
This work tackles the ambiguity of complementary item relationships by introducing Function-Based Labels (FBLs), which anchor relationships in item functions rather than in historical co-purchase data. A human-annotated dataset of 2,759 FBL-labeled item pairs shows that FBLs cover diverse relations with moderate inter-annotator agreement and enables reliable evaluation of learning methods. Machine learning models achieve macro-F1 scores of around 0.82 when predicting three-class labels from FBLs, while LLMs (e.g., GPT-4o-mini) using a fine-grained 9-class FBL taxonomy reach macro-F1 of up to 0.849 and high consistency (≈0.989), indicating that LLMs can effectively emulate human judgments under this framework. The study argues that FBLs provide a robust, interpretable basis for automated labeling and improved complementary recommendations, and outlines future work on dataset expansion, multilinguality, and cost-efficient labeling via active learning.
Abstract
Complementary recommendations enhance the user experience by suggesting items that are frequently purchased together while serving different functions from the query item. Inferring or evaluating whether two items have a complementary relationship requires complementary relationship labels; however, defining these labels is challenging because of the inherent ambiguity of such relationships. Complementary labels based on users' historical behavior logs attempt to capture these relationships, but often produce inconsistent and unreliable results. Recent efforts have introduced large language models (LLMs) to infer these relationships. However, these approaches provide only a binary classification, without a nuanced understanding of complementary relationships. In this study, we address these challenges by introducing Function-Based Labels (FBLs), a novel definition of complementary relationships that is independent of user purchase logs and of the opaque decision processes of LLMs. We constructed a human-annotated FBL dataset comprising 2,759 item pairs and demonstrated that it covered possible item relationships and minimized ambiguity. We then evaluated whether machine learning (ML) methods using the annotated FBLs could accurately infer labels for unseen item pairs, and whether LLM-generated complementary labels align with human perception. Our results demonstrate that, even with limited data, ML models such as logistic regression and SVM achieve high macro-F1 scores (approximately 0.82). Furthermore, LLMs such as gpt-4o-mini demonstrated high consistency (0.989) and classification accuracy (0.849) under the detailed definition of FBLs, indicating their potential as effective annotators that mimic human judgment. Overall, our study presents FBLs as a clear definition of complementary relationships, enabling more accurate inference and automated labeling for complementary recommendations.
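For readers unfamiliar with the macro-F1 metric reported above, the following is a minimal sketch of how it can be computed for multi-class labels. It is not the authors' evaluation code; the function name `macro_f1`, the class names, and the toy label lists are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # predicted class p, but it was wrong
            fn[t] += 1      # true class t was missed
    f1_scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical three-class labels (names are assumptions, not from the paper)
human = ["complementary", "substitute", "unrelated",
         "complementary", "unrelated", "substitute"]
model = ["complementary", "substitute", "complementary",
         "complementary", "unrelated", "unrelated"]

score = macro_f1(human, model)
print(f"macro-F1 = {score:.3f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, a rare relationship class counts as much as a common one, which is why it suits label sets with uneven class sizes.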
