CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models

Dongyu Yao, Keling Yao, Junhong Zhou, Yinghao Zhang

TL;DR

CaLoRAify tackles the challenge of estimating calories from a single food image by combining a vision-language backbone with LoRA fine-tuning and retrieval-augmented generation. It introduces CalData, a 330K image-text dataset enriched with nutrition facts, and a training pipeline that first identifies ingredients and quantities before grounding nutrition data through a USDA-based RAG system to produce calorie estimates. The work contributes a domain-specific dataset, a ViT-LLaMA-2 based VLM framework, and an end-to-end inference flow that supports interactive dialogue while avoiding heavy multi-view requirements. The results show improved accuracy over baselines and demonstrate the practical value of grounding in external nutrition knowledge for dietary management, with open-source release for reproducibility and community impact.

Abstract

Obesity is a leading cause of preventable chronic diseases worldwide. Traditional calorie estimation tools often rely on specific data formats or complex pipelines, which limits their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled at understanding real-world contexts and enabling conversational interaction, making them well suited to downstream tasks such as ingredient analysis. However, applying VLMs to calorie estimation requires domain-specific data and alignment strategies. To this end, we curated CalData, a dataset of 330K image-text pairs tailored for ingredient recognition and calorie estimation, which combines a large-scale recipe dataset with detailed nutritional instructions for robust vision-language training. Built upon this dataset, we present CaLoRAify, a novel VLM framework that aligns ingredient recognition and calorie estimation by training on visual-text pairs. During inference, users need only a single monocular food image to estimate calories, while retaining the flexibility of agent-based conversational interaction. With Low-Rank Adaptation (LoRA) and Retrieval-Augmented Generation (RAG) techniques, our system enhances the performance of foundational VLMs in the vertical domain of calorie estimation. Our code and data are fully open-sourced at https://github.com/KennyYao2001/16824-CaLORAify.
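To make the LoRA idea mentioned in the abstract concrete, here is a minimal pure-Python sketch of a low-rank adapted linear layer. It is an illustration of the general technique, not the authors' implementation: the class name `LoRALinear`, the helper `matvec`, the rank, the scaling factor, and the initialization scale are all assumptions chosen for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration class; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pre-trained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Number of parameters in A and B (the only ones that are trained)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x6 weight: 24 frozen parameters versus 20 trainable ones at rank 2.
W = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(W, rank=2, alpha=4)
x = [1.0, 0.0, -1.0, 2.0, 0.5, 0.0]
print(layer.forward(x) == matvec(W, x))
print(layer.trainable_params())
```

At realistic layer sizes the saving is far larger than in this toy case, because the trainable count grows with rank times (input width + output width) rather than with their product.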

Paper Structure

This paper contains 14 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: A user interface example of CaLoRAify
  • Figure 2: The workflow follows that of MiniGPT-v2 (Chen et al., 2023). The pre-trained Vision Transformer (ViT) first processes the input dish image and extracts tokenized visual representations that capture key features of the dish. Guided by the [vqa] identifier, the LLaMA-2 module then formulates a structured question, such as “What ingredients and quantities are required for this recipe?”, to direct subsequent tasks. This query is sent to the Retrieval-Augmented Generation (RAG) module, which retrieves relevant information, including ingredients and their nutritional values, from an external database. Finally, LLaMA-2 integrates the retrieved text with the visual features to generate comprehensive outputs, such as ingredient quantities and calorie estimates, presented in an interpretable format.
  • Figure 3: Qualitative results of the model output
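
The retrieval and aggregation step described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: token-overlap (Jaccard) matching stands in for whatever retriever the authors use, the function names are hypothetical, and the kcal values in the table are round placeholder numbers, not real USDA entries.

```python
def tokenize(text):
    """Lowercase a name and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").split())


def retrieve(query, table):
    """Return the table key with the highest Jaccard token overlap with the query.

    Stand-in for a real retriever (assumption, not the paper's method).
    """
    q = tokenize(query)

    def score(name):
        t = tokenize(name)
        return len(q & t) / len(q | t)

    return max(table, key=score)


def estimate_calories(ingredients, table):
    """Sum grams / 100 * kcal_per_100g over the retrieved match of each ingredient."""
    total = 0.0
    breakdown = []
    for name, grams in ingredients:
        match = retrieve(name, table)
        kcal = table[match] * grams / 100.0
        breakdown.append((name, match, kcal))
        total += kcal
    return total, breakdown


# Placeholder kcal-per-100g values for illustration only (NOT real USDA data).
NUTRITION_TABLE = {
    "rice, white, cooked": 100.0,
    "chicken breast, grilled": 200.0,
    "olive oil": 900.0,
}

# Hypothetical model output: (ingredient name, estimated grams).
predicted = [
    ("cooked white rice", 150),
    ("grilled chicken breast", 120),
    ("olive oil", 10),
]

total, breakdown = estimate_calories(predicted, NUTRITION_TABLE)
for name, match, kcal in breakdown:
    print(f"{name} -> {match}: {kcal:.0f} kcal")
print(f"total: {total:.0f} kcal")
```

Grounding the per-ingredient values in a lookup table, rather than asking the language model to recall them, keeps the arithmetic auditable: each line of the breakdown shows which database entry was matched and how much it contributed.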