MDiFF: Exploiting Multimodal Score-based Diffusion Models for New Fashion Product Performance Forecasting

Andrea Avogaro, Luigi Capogrosso, Franco Fummi, Marco Cristani

TL;DR

The paper addresses New Fashion Product Performance Forecasting (NFPPF) for unreleased items, where historical sales data are unavailable and domain shifts hinder traditional forecasts. It proposes MDiFF, a two-stage pipeline that first uses a multimodal score-based diffusion model conditioned on product imagery and release timing to generate multiple future sales signals, followed by a lightweight MLP that refines these signals into a final forecast. Evaluated on the VISUELLE dataset, MDiFF achieves state-of-the-art performance without relying on Google Trends or textual descriptions, demonstrating robustness to out-of-distribution items. The approach has practical implications for reducing overproduction and environmental impact in fast fashion by improving forecast accuracy for new products through diffusion-based uncertainty modeling and efficient refinement.

Abstract

The fast fashion industry has a significant environmental impact due to overproduction and unsold inventory. Accurately predicting sales volumes for unreleased products could significantly improve efficiency and resource utilization. However, predicting performance for entirely new items is challenging because historical data are lacking and trends change rapidly, and existing deterministic models often struggle with domain shifts when they encounter items outside the training data distribution. Recently proposed diffusion models address this issue through a continuous-time diffusion process. This allows us to simulate how new items are adopted, reducing the impact of the domain shift that deterministic models face. Accordingly, in this paper we propose MDiFF: a novel two-step pipeline based on multimodal diffusion models for New Fashion Product Performance Forecasting (NFPPF). First, we use a score-based diffusion model to predict multiple future sales for different clothes over time. Then, we refine these multiple predictions with a lightweight Multi-layer Perceptron (MLP) to get the final forecast. MDiFF leverages the strengths of both architectures, yielding a state-of-the-art forecasting system for the fast-fashion industry in terms of both accuracy and efficiency. The code can be found at https://github.com/intelligolabs/MDiFF.
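
The two-step data flow described in the abstract (draw several candidate sales trajectories, then refine them into one forecast) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `sample_candidates` is a random stand-in and not a diffusion model, the MLP weights are untrained, flattening the candidates into a single input vector is only one plausible way to feed them to the MLP, and the values 50 and 6 are borrowed from the Figure 3 caption. The authors' actual implementation is in the linked repository.

```python
import random

HORIZON = 6        # forecast weeks (assumed from the Figure 3 caption)
NUM_SAMPLES = 50   # candidate trajectories per item (assumed from the Figure 3 caption)


def sample_candidates(rng, num_samples=NUM_SAMPLES, horizon=HORIZON):
    """Hypothetical stand-in for stage 1. NOT a diffusion model:
    it only draws noisy, decaying sales curves to show the output shape."""
    base = [100.0 * (0.8 ** week) for week in range(horizon)]
    return [
        [max(0.0, b + rng.gauss(0.0, 10.0)) for b in base]
        for _ in range(num_samples)
    ]


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Create untrained two-layer MLP weights (illustration only)."""
    def layer(n_in, n_out):
        return {
            "w": [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
                  for _ in range(n_out)],
            "b": [0.0] * n_out,
        }
    return [layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)]


def linear(layer, x):
    """Affine map: one dot product plus bias per output unit."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(layer["w"], layer["b"])
    ]


def refine(mlp, candidates):
    """Stage 2 sketch: flatten all candidates, apply a ReLU hidden
    layer, then a linear output layer with one value per week."""
    x = [v for trajectory in candidates for v in trajectory]
    hidden = [max(0.0, v) for v in linear(mlp[0], x)]
    return linear(mlp[1], hidden)


rng = random.Random(0)
candidates = sample_candidates(rng)
mlp = init_mlp(rng, in_dim=NUM_SAMPLES * HORIZON, hidden_dim=16, out_dim=HORIZON)
forecast = refine(mlp, candidates)
```

In this sketch `candidates` holds 50 trajectories of 6 weekly values and `forecast` holds one value per week. Because the weights are untrained, the numbers themselves carry no meaning; only the shapes and the order of the two stages are illustrated.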

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: MDiFF: a two-stage pipeline for NFPPF. Starting from multiple signals of a single fashion product, we build a multimodal score-based diffusion model to generate an initial prediction of the sales, which can handle items whose features fall outside the training distribution. Then, we refine the diffusion output using a lightweight MLP to obtain the final prediction.
  • Figure 2: An overview of our multimodal score-based diffusion model. The basic diffusion block (grey square) is taken from TS-Diff [kollovieh2024predict] and modified so that it is injected with the output of the transformer decoder layer, a module that produces an embedding representing the two input modalities related to the item. Each block has two outputs: one for the subsequent block and another for a skip connection. The sum of all skip connections forms the model's final output. The primary component of each block is typically an S4 block [gu2021efficiently], chosen by the authors of [kollovieh2024predict] for its efficiency on time series and structured data. The input to MDiFF is noisy data, and the output is the denoised sample.
  • Figure 3: Sample outputs of the multimodal score-based diffusion model. The red region represents the output distribution of the diffusion model for a given sample, obtained by computing the weekly quantiles across the 50 outputs. The Prediction line is the output of the refinement MLP, i.e., the final prediction. The forecasting period covers 6 weeks from the release date. The y-axis shows the number of units of a specific garment sold across the chain's shops.
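
The weekly quantile band mentioned in the Figure 3 caption can be computed along the following lines. This is a sketch under stated assumptions: the caption does not say which quantiles are plotted, so the 5% and 95% levels are hypothetical, the helper names `quantile` and `weekly_band` are illustrative, and the 50 trajectories are synthetic placeholders rather than real diffusion outputs.

```python
import random


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def weekly_band(samples, lower_q=0.05, upper_q=0.95):
    """Return one (lower, median, upper) tuple per week, computed
    across all sampled trajectories. Quantile levels are assumptions."""
    horizon = len(samples[0])
    band = []
    for week in range(horizon):
        column = [trajectory[week] for trajectory in samples]
        band.append((
            quantile(column, lower_q),
            quantile(column, 0.5),
            quantile(column, upper_q),
        ))
    return band


# Synthetic placeholder data: 50 trajectories over 6 weeks.
rng = random.Random(0)
samples = [
    [max(0.0, 100.0 * 0.8 ** week + rng.gauss(0.0, 10.0)) for week in range(6)]
    for _ in range(50)
]
band = weekly_band(samples)
```

Each entry of `band` gives the lower bound, median, and upper bound for one week. Plotting the lower and upper values as a filled region would produce a band similar in spirit to the red area the caption describes.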