Efficient Prompt Tuning for Hierarchical Ingredient Recognition

Yinxuan Gui, Bin Zhu, Jingjing Chen, Chong-Wah Ngo

TL;DR

Fine-grained ingredient recognition is hindered by diverse visual appearances and varying label granularity. The authors present an efficient prompt-tuning framework that adapts pretrained visual-language models to ingredient recognition without full finetuning, and they introduce a three-level ingredient hierarchy to enable coarse-to-fine evaluation. Training follows a two-stage cross-hierarchy scheme: stage 1 trains prompts separately for each hierarchy level, and stage 2 jointly optimizes a weighted combination of the stage-1 losses, $L_{\text{stage2}} = \lambda_1 L_{\text{stage1}}^{1} + \lambda_2 L_{\text{stage1}}^{2} + \lambda_3 L_{\text{stage1}}^{3}$, updating only prompt parameters. Experiments on VireoFood172 show improved hierarchical recognition with substantially fewer trainable parameters than full finetuning, and zero-shot analyses indicate that embedding hierarchical priors enhances generalization across ingredient granularity.
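The stage-2 objective above is a weighted sum of the three per-level losses. The following is a minimal sketch of that combination in plain Python, not the authors' implementation: the summary does not state which per-level loss is used, so binary cross-entropy over multi-label targets is assumed here as a stand-in, and the function names `bce_with_logits` and `stage2_loss` as well as the example weights are hypothetical.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels.

    Assumption: used only as a stand-in for the per-level stage-1 loss.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def stage2_loss(level_losses, lambdas):
    """Weighted sum lambda_1*L^1 + lambda_2*L^2 + lambda_3*L^3."""
    if len(level_losses) != len(lambdas):
        raise ValueError("need exactly one weight per hierarchy level")
    return sum(lam * loss for lam, loss in zip(lambdas, level_losses))


# Toy logits and multi-hot targets for the three hierarchy levels.
losses = [
    bce_with_logits([2.0, -1.0, 0.5], [1.0, 0.0, 1.0]),  # level 1
    bce_with_logits([1.5, -0.5], [1.0, 0.0]),            # level 2
    bce_with_logits([3.0], [1.0]),                       # level 3
]
lambdas = [0.5, 0.3, 0.2]  # hypothetical weights, not taken from the paper
total = stage2_loss(losses, lambdas)
```

In an actual prompt-tuning setup, only the prompt parameters would receive gradients from this combined loss, while the pretrained backbone stays frozen.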

Abstract

Fine-grained ingredient recognition presents a significant challenge because ingredients vary widely in appearance depending on how they are cut and cooked. While existing approaches have shown promising results, they still incur high training costs and focus solely on fine-grained ingredient recognition. In this paper, we address these limitations by introducing an efficient prompt-tuning framework that adapts pretrained visual-language models (VLMs), such as CLIP, to the ingredient recognition task without requiring full model finetuning. Additionally, we introduce three-level ingredient hierarchies to enhance both training performance and evaluation robustness. Specifically, we propose a hierarchical ingredient recognition task, designed to evaluate model performance across different hierarchical levels (e.g., chicken chunks, chicken, meat), capturing recognition capabilities from coarse- to fine-grained categories. Our method leverages hierarchical labels, training prompt-tuned models with both fine-grained and corresponding coarse-grained labels. Experimental results on the VireoFood172 dataset demonstrate the effectiveness of prompt-tuning with hierarchical labels, achieving superior performance. Moreover, the hierarchical ingredient recognition task provides valuable insights into the model's ability to generalize across different levels of ingredient granularity.
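To illustrate how fine-grained labels can yield the corresponding coarser labels, here is a small sketch of a three-level lookup. Only the chain "chicken chunks, chicken, meat" comes from the abstract; every other entry, the name `HIERARCHY`, and the function `expand_labels` are hypothetical and chosen purely for illustration.

```python
# Maps a fine-grained label to its (mid-level, coarse-level) ancestors.
HIERARCHY = {
    "chicken chunks": ("chicken", "meat"),      # example from the abstract
    "shredded chicken": ("chicken", "meat"),    # hypothetical entry
    "tomato slices": ("tomato", "vegetable"),   # hypothetical entry
}


def expand_labels(fine_labels):
    """Return (fine, mid, coarse) label sets for one image."""
    fine = set(fine_labels)
    mid = {HIERARCHY[label][0] for label in fine}
    coarse = {HIERARCHY[label][1] for label in fine}
    return fine, mid, coarse


fine, mid, coarse = expand_labels(["chicken chunks", "shredded chicken"])
```

Because several fine-grained labels can share an ancestor, the coarser label sets are typically smaller than the fine-grained set, which is what makes evaluation at each level informative on its own.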

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Examples of our proposed three-level ingredient hierarchy.
  • Figure 2: Overview of our proposed two-stage cross-hierarchy training method. In the first stage, we train three prompt-tuning models, one per hierarchy level, separately. In the second stage, their losses are combined into $L_{\text{stage2}}$ to train the three models jointly, leveraging the ingredient hierarchy.
  • Figure 3: Qualitative examples of hierarchical ingredient recognition on VireoFood172. False negatives removed by our method are marked in red with underlines. Additionally, true positives complemented by our method are marked in blue with underlines.
  • Figure 4: F1 score (%) of hierarchical ingredient recognition at different levels based on zero-shot evaluation of LLaVA.