Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications

Md. Adnanul Islam, Wasimul Karim, Md Mahbub Alam, Subhey Sadi Rahman, Md. Abdur Rahman, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Kheng Cher Yeo, Deepika Mathur, Sami Azam

TL;DR

The paper proposes the Multimodal Weight Predictor (MWP), a framework that estimates waste weight by combining RGB images with physics-informed metadata (object dimensions, camera distance, and camera height). It also incorporates a physically grounded explanation module that provides clear, human-readable explanations for each prediction.

Abstract

Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities and the visible size of an object changes with camera distance. To address this problem, we propose the Multimodal Weight Predictor (MWP), a framework that estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata pairs collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range, from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion, which allows visual and physical cues to guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error (MSLE). On the test set, the proposed method achieves a Mean Absolute Error (MAE) of 88.06 kg, a Mean Absolute Percentage Error (MAPE) of 6.39%, and an $R^2$ coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range, with 2.38 kg MAE and 3.1% MAPE, while maintaining reliable performance for heavy waste in the 1000-2000 kg range, with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module that uses Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
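
The loss and the reported metrics can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the common `log(1 + y)` form of Mean Squared Logarithmic Error, and the function names and the two sample weights are hypothetical, chosen only to show why a logarithmic loss suits a target that spans 3.5 to 3,450 kg.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error, assuming the usual log(1 + y) form."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error, in the same unit as the targets (kg here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (targets must be nonzero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical example: a light and a heavy item, both over-predicted by 10%.
y_true = np.array([10.0, 1000.0])
y_pred = np.array([11.0, 1100.0])

# Per-sample squared error vs. squared log error.
sq_err = (y_true - y_pred) ** 2
sq_log_err = (np.log1p(y_true) - np.log1p(y_pred)) ** 2

print("squared errors:    ", sq_err)
print("squared log errors:", sq_log_err)
print("MAE (kg):", mae(y_true, y_pred))
print("MAPE (%):", mape(y_true, y_pred))
print("MSLE:", msle(y_true, y_pred))
print("R^2:", r2(y_true, y_pred))
```

In this toy case the heavy item's squared error is 10,000 times the light item's, so a plain squared-error loss would be dominated by heavy objects. The two squared log errors are of similar size, because the logarithm turns equal relative errors into roughly equal penalties.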

Paper Structure (34 sections, 11 equations, 9 figures, 8 tables, 2 algorithms)

This paper contains 34 sections, 11 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Representative samples from the Waste-Weight-10K dataset. The figure illustrates the diversity in object shapes, textures, and lighting conditions captured across different commercial and industrial (C&I) environments.
  • Figure 2: Distribution of Waste Categories. The dataset includes many types of metal scrap collected from C&I environments.
  • Figure 3: Weight Distribution of the Dataset. The chart on the left shows the raw weights, while the chart on the right shows the weights after a log transformation, which simplifies the training process for the model.
  • Figure 4: Scatter plot of Volume ($m^3$) versus Weight (kg). The distribution highlights the complexity of the problem: small dense objects can be heavier than large voluminous ones, necessitating a multimodal approach.
  • Figure 5: Overview of the proposed Multimodal Weight Predictor (MWP) framework. It integrates visual features with metadata through a Mutual Attention Fusion mechanism.
  • ...and 4 more figures