Multi-Modal Fusion of In-Situ Video Data and Process Parameters for Online Forecasting of Cookie Drying Readiness

Shichen Li, Chenhui Shao

TL;DR

The paper tackles online forecasting of cookie-drying readiness by fusing in-situ video with process parameters in an end-to-end encoder–decoder framework. It introduces modality-specific encoders and a transformer-based decoder to extract robust cross-modal representations, evaluated under leave-one-group-out cross-validation on a sugar cookie dataset. The approach achieves a mean time-to-ready error of about 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%, while remaining efficient enough for industrial settings. This work demonstrates the practicality of multi-modal fusion for real-time process monitoring and suggests the approach can transfer to other modality configurations.
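The leave-one-group-out protocol mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function name `leave_one_group_out` and the group labels (`run_a`, `run_b`, `run_c`) are hypothetical, and what constitutes a "group" in the paper's dataset is an assumption here (for example, one drying run or one experimental condition). The idea is only that every sample from one group is held out for testing while the model is trained on the rest, repeated once per group.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one split per distinct group."""
    # Collect the sample indices that belong to each group label.
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    # Hold out each group in turn; everything else becomes training data.
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = sorted(
            i for label, idxs in by_group.items() if label != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical labels: one entry per sample, naming the group it came from.
groups = ["run_a", "run_a", "run_b", "run_c", "run_b", "run_c"]
splits = list(leave_one_group_out(groups))

for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by group rather than by individual sample keeps frames from the same group out of both training and test sets at once, which would otherwise make the reported error look better than it would be on an unseen group.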

Abstract

Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. Additionally, our model balances prediction accuracy, model size, and computational efficiency, making it well-suited for heterogeneous industrial datasets. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.

Paper Structure

This paper contains 18 sections, 13 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Schematic of the multi-modal real-time forecasting of cookie drying readiness framework.
  • Figure 2: Schematic setup of the cookie drying experiment.
  • Figure 3: Video data preparation process.
  • Figure 4: Examples of cookies at different time-to-ready moments.
  • Figure 5: Overall architecture of multi-modal fusion model for online forecasting.
  • ...and 7 more figures