Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval

Qing Wang, Chong-Wah Ngo, Ee-Peng Lim

TL;DR

This work tackles bias in cross-modal food image-to-recipe retrieval with a causality-guided framework that treats ingredients $Ing$ as confounders between image $I$ and recipe $R$. It derives a backdoor-adjusted similarity $P(S|do(I),R)$ and implements a neural debiasing module that estimates $P(ing|I)$ to form a debiased image embedding $\tilde{e}_I = e_I + \mathbb{E}_{[ing|I]}[e_{ing}]$, yielding a final similarity $e_R \cdot \tilde{e}_I$. The module plugs into SoTA models on Recipe1M; it uses a Transformer-based multi-label ingredient classifier with a 500-ingredient dictionary and is optimized with a combination of a bi-directional triplet loss and an asymmetric classification loss. Empirically, the method achieves near-oracle retrieval performance and state-of-the-art results, including strong zero-shot and scalability performance, while incurring minimal computational overhead. The results indicate that causal intervention can substantially improve cross-modal retrieval by mitigating spurious correlations that arise from dish ingredients and presentation.
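
For reference, the textbook form of the backdoor adjustment is sketched below for generic variables: a treatment $X$, an outcome $Y$, and a confounder $Z$ that satisfies the backdoor criterion. This is the general identity, not the paper's own derivation. Reading $X$ as the image $I$, $Y$ as the similarity $S$, and $Z$ as the ingredients $Ing$ is an assumption made here for illustration, and the paper's exact expression for $P(S|do(I),R)$ may differ in what it conditions on.

$$P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)$$

Intervening on $X$ replaces the observational weighting $P(Z = z \mid X)$ with the marginal $P(Z = z)$, which is what removes the influence of the confounder on the choice of $X$.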

Abstract

This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. Because the relationship between a recipe and its cooked dish is one of cause and effect, treating a recipe as a text source that describes the visual appearance of a dish, as existing approaches do, introduces bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not capture every detail in a recipe equally, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture the dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests that ingredients are one source of confounding and that a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term that removes the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset is MedR=1 across test set sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module for debiasing, which is essentially a multi-label ingredient classifier. New state-of-the-art retrieval performance is reported on the Recipe1M dataset.
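
The abstract reports results as MedR (median rank of the ground-truth recipe), and the figures use Recall@K. The sketch below shows one common way to compute both from a similarity matrix. It assumes that query image `i` is paired with recipe `i` and that ties are resolved in favor of the ground truth. The function name `retrieval_metrics` and the toy matrix are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute MedR and Recall@K for image-to-recipe retrieval.

    sim[i, j] is the similarity between query image i and recipe j.
    Assumption: the ground-truth recipe for image i sits at index i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the ground-truth recipe for each query, as a column vector.
    gt_scores = np.diag(sim)[:, None]
    # Rank = 1 + number of recipes scored strictly higher than the ground truth.
    ranks = 1 + (sim > gt_scores).sum(axis=1)
    medr = float(np.median(ranks))
    recall = {k: float((ranks <= k).mean()) for k in ks}
    return medr, recall

# Toy 3x3 example (made-up numbers): query 1 ranks its recipe second.
demo = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.2, 0.7],
])
medr, recall = retrieval_metrics(demo, ks=(1, 2))
print(medr, recall)
```

Under this definition, MedR=1 means that at least half of the queries retrieve their ground-truth recipe at the top position.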

Paper Structure

This paper contains 18 sections, 6 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Left: A causal graph depicting how cross-modal similarity is affected by the spurious correlation in learning image and recipe representations due to the confounder $Ing$. Right: Debiasing by backdoor adjustment, which cuts off the incoming edges to the image.
  • Figure 2: There are three components in our framework: the image embedding $e_I$ generated by the image encoder, the recipe embedding $e_R$ generated by the recipe encoder, and the debiased image embedding $\widetilde{e}_I$. The triplet loss $L_{triplet}$ is applied to $\widetilde{e}_I$ and $e_R$. The proposed retrieval debiasing module is illustrated on the right. We use a Transformer decoder for ingredient classification, which takes the image embedding $e_I$ as key and value, and each ingredient label embedding as a query. We apply the sigmoid function to the output embedding from the last layer of the Transformer decoder to obtain the ingredient prediction probabilities $P_{ing}$. By multiplying the probabilities with the ingredient embeddings in the dictionary $D_{ing}$, we get the expectation of the ingredient embedding $\mathbb{E}_{[ing|I]}[D_{ing}]$. The debiased image embedding $\widetilde{e}_I$ is obtained by adding $e_I$ and $\mathbb{E}_{[ing|I]}[D_{ing}]$. We train the Transformer decoder using the asymmetric loss (Ridnik et al., ICCV 2021).
  • Figure 3: Recall@1 for image-to-recipe retrieval on the 50K test set when varying the accuracy of ingredient prediction. The solid lines are oracle runs. The dotted lines show the performance of SoTA methods without debiasing.
  • Figure 4: Two examples providing insights into the debiasing mechanism: query image (a), predicted ingredients (b), and retrieved recipes (c)-(e). The ground-truth recipes are boxed in blue. The correctly predicted ingredients are marked in red. H-T places the ground-truth recipes beyond rank 100. Best viewed in color.
  • Figure 5: Causal graph with both ingredients and cooking actions as the confounder sources.
  • ...and 7 more figures
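
The debiasing step described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Transformer decoder is replaced by a vector of precomputed logits (`ing_logits`), the probability-weighted sum is left unnormalized because the caption mentions no normalization, and the function names and toy shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def debias_image_embedding(e_img, ing_logits, ing_dict):
    """Add the expected ingredient embedding to the image embedding.

    e_img:      (d,) image embedding e_I
    ing_logits: (n_ing,) classifier logits, one per dictionary ingredient
                (assumed stand-in for the Transformer decoder output)
    ing_dict:   (n_ing, d) ingredient embedding dictionary D_ing
    """
    # Independent per-ingredient probabilities (multi-label, so no softmax).
    p_ing = sigmoid(ing_logits)
    # Probability-weighted sum of the dictionary rows.
    expected_ing = p_ing @ ing_dict
    return e_img + expected_ing

def similarity(e_recipe, e_img_debiased):
    """Dot-product similarity between recipe and debiased image embeddings."""
    return float(np.dot(e_recipe, e_img_debiased))

# Toy usage with random values; 500 matches the dictionary size in the TL;DR,
# while the embedding width d=8 is arbitrary.
rng = np.random.default_rng(0)
d, n_ing = 8, 500
e_img = rng.normal(size=d)
e_recipe = rng.normal(size=d)
ing_dict = rng.normal(size=(n_ing, d))
ing_logits = rng.normal(size=n_ing)

e_img_debiased = debias_image_embedding(e_img, ing_logits, ing_dict)
print(similarity(e_recipe, e_img_debiased))
```

Because the module only adds a vector to $e_I$, it can sit on top of an existing image encoder without changing the recipe encoder, which is consistent with the plug-in framing in the abstract.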