MLLM-Fabric: Multimodal Large Language Model-Driven Robotic Framework for Fabric Sorting and Selection

Liman Wang, Hanyang Zhong, Tianyuan Wang, Shan Luo, Jihong Zhu

TL;DR

MLLM-Fabric addresses the challenge of fabric selection by reframing it as property-specific pairwise ranking using a multimodal large language model. The framework fuses RGB vision, GelSight visuotactile data, and force signals, and trains with supervised preferences plus explanation-guided distillation to produce interpretable, abstention-aware decisions. A real-world dataset of 220 fabrics with co-registered RGB, GelSight, and pressure data supports reproducible benchmarking, and Fabric-Llama-90B demonstrates superior attribute ranking and selection reliability compared with baselines. The work advances robotic material understanding by linking perceptual cues to functional properties and decision-making, with implications for automated textile manufacturing and smart retail.

Abstract

Choosing appropriate fabrics is critical for meeting functional and quality demands in robotic textile manufacturing, apparel production, and smart retail. We propose MLLM-Fabric, a robotic framework leveraging multimodal large language models (MLLMs) for fabric sorting and selection. Built on a multimodal robotic platform, the system is trained through supervised fine-tuning and explanation-guided distillation to rank fabric properties. We also release a dataset of 220 diverse fabrics, each with RGB images and synchronized visuotactile and pressure data. Experiments show that our Fabric-Llama-90B consistently outperforms pretrained vision-language baselines in both attribute ranking and selection reliability. Code and dataset are publicly available at https://github.com/limanwang/MLLM-Fabric.

Paper Structure

This paper contains 29 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The robotic system uses multimodal perception and reasoning to select fabrics based on user needs, choosing pure cotton seersucker (the white one on the hanger) as the best fabric for summer clothing. GelSight images are sequential frames with corresponding pressure sensor data. The figure shows a pressure trend graph for one fabric as an example.
  • Figure 2: Fabric selection robotic system workflow. The system follows a four-stage pipeline. First, a Visual Perception Module captures fabric IDs, RGB images, and 3D positions. Second, a Tactile Perception Module records GelSight image sequences and pressure values. Third, an MLLM Reasoning Module processes multimodal data to compare fabrics. Finally, a Task Execution Module selects and retrieves the goal fabric using stored IDs and positions. Fabric data is archived for efficient decision-making.
  • Figure 3: (A) Overview of the tactile sensing setup, including the GelSight Mini sensor and a planar pressure sensor mounted on a hard surface. (B) The layered structure during pressing: (a) gel layer, (b) fabric sample, (c) hard surface, and (d) pressure sensor. (C) Synchronization of tactile image sequences from the GelSight sensor with corresponding pressure measurements at a frequency of 25 Hz. (D) A selection of ten representative fabric samples from a dataset containing 220 unique fabrics.
  • Figure 4: Multimodal Reasoning and Knowledge Distillation Pipeline. The upper section illustrates Multimodal Explanation-Guided Knowledge Distillation, where a teacher model generates post hoc explanations that are subsequently distilled into a student model. The lower section depicts the fabric property comparison and selection process, where the fine-tuned model performs ad hoc reasoning to compare attributes and issue fabric selection commands.
  • Figure 5: Ablation study of input modalities for pairwise property comparison using GPT-4o (top) and Llama3.2-Vision-90B (bottom). Bars show attribute-wise accuracy (%), dashed lines indicate mean accuracy, and dotted lines show prediction skewness (SK; lower is better). Error bars denote standard error over three runs. “Single GelSight” = single-frame GelSight; “GelSight Seq” = GelSight sequential frames.
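
The captions above describe fabric selection as a series of property-specific pairwise comparisons, with the model allowed to abstain. The sketch below illustrates one way such pairwise verdicts could be turned into a ranking for a single property. It is not the authors' method: the win-counting aggregation, the function name `rank_from_pairwise`, the verdict labels (`"first"`, `"second"`, `"abstain"`), and the example fabrics and verdicts are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_from_pairwise(judgments):
    """Rank fabrics by number of pairwise wins (hypothetical aggregation rule).

    Each judgment is (first, second, verdict), where verdict is "first",
    "second", or anything else (e.g. "abstain"), which counts for neither.
    """
    wins = defaultdict(int)
    for first, second, verdict in judgments:
        # Register both fabrics so items with zero wins still appear.
        wins[first] += 0
        wins[second] += 0
        if verdict == "first":
            wins[first] += 1
        elif verdict == "second":
            wins[second] += 1
        # Abstentions contribute no win to either fabric.
    # Sort by wins (descending), then by name for a deterministic order.
    return sorted(wins, key=lambda name: (-wins[name], name))


# Made-up verdicts for one property (say, breathability); not real results.
judgments = [
    ("seersucker", "denim", "first"),
    ("seersucker", "wool", "first"),
    ("denim", "wool", "abstain"),
    ("denim", "fleece", "first"),
]
ranking = rank_from_pairwise(judgments)
print(ranking)
```

Counting wins is the simplest aggregation; a real system might instead weight verdicts by confidence or fit a preference model, and would need a policy for pairs on which the model abstains.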