Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval

Qing Wang; Chong-Wah Ngo; Ee-Peng Lim

TL;DR

This work tackles bias in food image-to-recipe retrieval with a causality-guided framework that treats ingredients $Ing$ as confounders between image $I$ and recipe $R$. It derives a backdoor-adjusted similarity $P(S|do(I),R)$ and implements a neural debiasing module that estimates $P(ing|I)$ to form a debiased image embedding $\tilde{e}_I = e_I + \mathbb{E}_{[ing|I]}[e_{ing}]$, yielding a final similarity $e_R \cdot \tilde{e}_I$. The module is a plug-in component for state-of-the-art models on Recipe1M. It uses a Transformer-based multi-label ingredient classifier with a 500-ingredient dictionary and is optimized with a combination of a bi-directional triplet loss and an asymmetric classification loss. Empirically, the method achieves near-oracle retrieval performance and state-of-the-art results, including strong zero-shot and scalability performance, while incurring minimal computational overhead. The work demonstrates that causal intervention can substantially improve cross-modal retrieval by mitigating spurious correlations due to dish ingredients and presentation.
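The debiased similarity above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names are hypothetical, and the expectation $\mathbb{E}_{[ing|I]}[e_{ing}]$ is assumed here to be a sum of ingredient embeddings weighted by independent sigmoid outputs of a multi-label classifier; the paper may normalize or weight the terms differently.

```python
import math


def sigmoid(x):
    """Logistic function mapping a classifier logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def debiased_image_embedding(e_img, ing_logits, ing_embeddings):
    """Return e_img + sum_k P(ing_k | I) * e_ing_k.

    Assumption: P(ing_k | I) is the sigmoid of the k-th multi-label logit,
    and the expectation is an unnormalized probability-weighted sum.
    """
    probs = [sigmoid(z) for z in ing_logits]
    expected = [0.0] * len(e_img)
    for p, e_ing in zip(probs, ing_embeddings):
        for d, value in enumerate(e_ing):
            expected[d] += p * value
    return [a + b for a, b in zip(e_img, expected)]


def similarity(e_rec, e_img_debiased):
    """Dot product e_R . e~_I used as the final retrieval score."""
    return sum(r * i for r, i in zip(e_rec, e_img_debiased))


# Toy example: 2-dimensional embeddings and a 2-ingredient dictionary.
e_img = [1.0, 0.0]
ing_logits = [0.0, 0.0]               # sigmoid(0) = 0.5 for each ingredient
ing_embeddings = [[2.0, 0.0], [0.0, 4.0]]
e_rec = [1.0, 1.0]

e_tilde = debiased_image_embedding(e_img, ing_logits, ing_embeddings)
score = similarity(e_rec, e_tilde)
```

In this toy case the image embedding alone would score 1.0 against the recipe, while the ingredient term raises the score to 4.0, which shows how predicted ingredients can shift the similarity judgment.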

Abstract

This paper addresses the challenges of learning representations for recipes and food images in cross-modal retrieval. Because the relationship between a recipe and its cooked dish is one of cause and effect, treating a recipe as a text source that describes the visual appearance of a dish, as existing approaches do, introduces bias that misleads image-recipe similarity judgment. Specifically, a food image may not capture every detail of a recipe equally, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture the dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model this bias in cross-modal representation learning using causal theory. The causal view of the problem suggests that ingredients are one source of confounding and that a simple backdoor adjustment can alleviate the bias. Through causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term that removes the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset is MedR=1 across test set sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module for debiasing, which is essentially a multi-label ingredient classifier. New state-of-the-art retrieval performance is reported on the Recipe1M dataset.
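The abstract reports results as MedR, the median rank of the ground-truth recipe over all image queries, where lower is better and 1 is the best possible value. The sketch below shows one way to compute it from a similarity matrix. The function name `median_rank` is hypothetical, and two conventions are assumed: the correct recipe for image `i` sits at column `i`, and only strictly higher scores push the rank down.

```python
from statistics import median


def median_rank(sim):
    """Median rank (MedR) of the ground-truth recipe over all image queries.

    sim[i][j] is the similarity between image i and recipe j.
    Assumption: the matching recipe for image i is recipe i, and rank 1
    means no other recipe scored strictly higher than the true one.
    """
    ranks = []
    for i, row in enumerate(sim):
        true_score = row[i]
        rank = 1 + sum(1 for s in row if s > true_score)
        ranks.append(rank)
    return median(ranks)


# Toy 3x3 example: images 0 and 1 rank their recipe first,
# image 2 ranks its recipe second (recipe 0 scores higher).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.5],
]
medr = median_rank(sim)
```

Here the per-query ranks are 1, 1, and 2, so the median rank is 1 even though one query is not retrieved perfectly.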
