RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models

Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yugang Jiang

TL;DR

The paper tackles the lack of a unified, nutrition-rich dataset for food-related vision-language tasks and the task-conflict challenges in fine-tuning large multi-modal models. It introduces Uni-Food, a 100k-image dataset with categories, ingredients, recipes, and ingredient-level nutrition, and RoDE, a Linear Rectified Mixture of Diverse Experts that uses heterogeneous LoRA modules and a linear rectified router to enable sparse, efficient multi-task learning. Empirical results show RoDE achieving state-of-the-art performance across ingredient recognition, recipe generation, and nutrition estimation on Uni-Food and other benchmarks, with strong memory-efficiency and training-speed characteristics. The work demonstrates the value of fine-grained skill modules and sparse routing for robust, nutrition-aware food LMMs in real-world applications.

Abstract

Large Multi-modal Models (LMMs) have significantly advanced a variety of vision-language tasks. The scalability and availability of high-quality training data play a pivotal role in the success of LMMs. In the realm of food, while comprehensive food datasets such as Recipe1M offer an abundance of ingredient and recipe information, they often fall short of providing ample data for nutritional analysis. The Recipe1M+ dataset, despite offering a subset for nutritional evaluation, is limited in the scale and accuracy of nutrition information. To bridge this gap, we introduce Uni-Food, a unified food dataset that comprises over 100,000 images with various food labels, including categories, ingredients, recipes, and ingredient-level nutritional information. Uni-Food is designed to provide a more holistic approach to food data analysis, thereby enhancing the performance and capabilities of LMMs in this domain. To mitigate the conflicts arising from multi-task supervision during fine-tuning of LMMs, we introduce a novel Linear Rectified Mixture of Diverse Experts (RoDE) approach. RoDE utilizes a diverse array of experts to address tasks of varying complexity, thereby facilitating the coordination of trainable parameters, i.e., it allocates more parameters to more complex tasks and fewer parameters to simpler tasks. RoDE applies linear rectification to the router's output, thereby enhancing the efficiency of sparse task allocation. These design choices make RoDE GPU-memory efficient and easy to optimize. Our experimental results validate the effectiveness of our proposed approach in addressing the inherent challenges of food-related multitasking.
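
The routing idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes that the "diverse experts" are LoRA adapters of different ranks and that "linear rectification" means a ReLU applied to the router's linear output, so that experts with a non-positive score receive a gate of exactly zero and are skipped. The names `LoRAExpert`, `RoDELayerSketch`, and the chosen ranks are hypothetical.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v):
    """Linear rectification: negative scores become exactly zero."""
    return [max(0.0, x) for x in v]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """One low-rank adapter: delta(x) = B (A x), with rank r (assumed form)."""

    def __init__(self, dim_in, dim_out, rank, rng):
        self.A = rand_matrix(rank, dim_in, rng)   # down-projection
        self.B = rand_matrix(dim_out, rank, rng)  # up-projection
        self.rank = rank

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))

    def num_params(self):
        return self.rank * (len(self.A[0]) + len(self.B))


class RoDELayerSketch:
    """Frozen projection plus ReLU-gated LoRA experts of differing ranks."""

    def __init__(self, dim_in, dim_out, ranks, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(dim_out, dim_in, rng)  # stands in for a frozen weight
        self.experts = [LoRAExpert(dim_in, dim_out, r, rng) for r in ranks]
        self.router = rand_matrix(len(ranks), dim_in, rng, scale=1.0)

    def gates(self, x):
        # Rectified router: no softmax and no top-k, sparsity comes from the ReLU.
        return relu(matvec(self.router, x))

    def forward(self, x):
        out = matvec(self.W, x)
        for gate, expert in zip(self.gates(x), self.experts):
            if gate == 0.0:
                continue  # inactive expert contributes nothing and is not evaluated
            out = [o + gate * d for o, d in zip(out, expert(x))]
        return out


layer = RoDELayerSketch(dim_in=4, dim_out=3, ranks=[2, 4, 8])
x = [1.0, -0.5, 0.25, 2.0]
y = layer.forward(x)
```

In this sketch, higher-rank experts carry more trainable parameters, which mirrors the stated goal of giving more capacity to more complex tasks, while the rectified gates let the layer fall back to the frozen projection alone when no expert is activated.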

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: RoDE's Emphasis on Food-Related VQA Tasks. RoDE primarily targets multi-task learning specific to food, i.e., food classification, ingredient recognition, and nutrition estimation.
  • Figure 2: Composition of Labels in the Uni-Food Dataset. The annotation consists of category, ingredient list, recipe instructions, and nutrition information.
  • Figure 3: The nutritional statistics (per 100 g) and ingredient statistics for Uni-Food.
  • Figure 4: Uni-Food Dataset Statistics. The category distribution of the dataset.
  • Figure 5: Illustration of our proposed Linear Rectified Mixture of Diverse Experts (RoDE) approach. We use LLaVA (Liu et al., 2023) as the foundational LMM. The RoDE module is incorporated into both the query projection layer and the value projection layer of each transformer block.
  • ...and 6 more figures