Vision-Language-Action Models for Selective Robotic Disassembly: A Case Study on Critical Component Extraction from Desktops

Chang Liu, Sibo Tian, Sara Behdad, Xiao Liang, Minghui Zheng

TL;DR

This study probes the viability of end-to-end vision-language-action (VLA) models for selective robotic disassembly of end-of-life (EoL) desktops, focusing on RAM removal and CPU bracket unlocking. A teleoperation-based dataset was created, and two VLA models (OpenVLA and OpenVLA-OFT) were fine-tuned on it to perform the tasks. The results show that the fine-tuned VLA models can learn the early, high-level disassembly steps but struggle with precise, dexterous subtasks; a simple hybrid that pairs the VLA with a rule-based controller was able to complete the full operation. The findings highlight current limitations of VLA models in contact-rich manipulation and suggest directions for improvement: data diversity, multi-view perception, tactile feedback, and reinforcement learning integration.

Abstract

Automating the disassembly of critical components from end-of-life (EoL) desktops, including high-value items such as RAM modules and CPUs and sensitive parts such as hard disk drives, remains challenging due to the inherent variability and uncertainty of these products. Moreover, their disassembly requires sequential, precise, and dexterous operations, further increasing the complexity of automation. Current robotic disassembly processes are typically divided into several stages: perception, sequence planning, task planning, motion planning, and manipulation. Each stage requires explicit modeling, which limits generalization to unfamiliar scenarios. Recent developments in vision-language-action (VLA) models offer an end-to-end approach to general robotic manipulation tasks. Although VLAs have demonstrated promising performance on simple tasks, the feasibility of applying such models to complex disassembly remains largely unexplored. In this paper, we collected a customized dataset for robotic RAM and CPU disassembly and used it to fine-tune two well-established VLA approaches, OpenVLA and OpenVLA-OFT, as a case study. We divided the whole disassembly task into several small steps, and our preliminary experimental results indicate that the fine-tuned VLA models can faithfully complete multiple early steps but struggle with certain critical subtasks, leading to task failure. However, we observed that a simple hybrid strategy that combines VLA with a rule-based controller can successfully perform the entire disassembly operation. These findings highlight the current limitations of VLA models in handling the dexterity and precision required for robotic EoL product disassembly. By offering a detailed analysis of the observed results, this study provides insights that may inform future research to address current challenges and advance end-to-end robotic automated disassembly.
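
The hybrid strategy mentioned in the abstract can be pictured as a per-step dispatcher: a learned policy handles some steps, and a scripted controller takes over for others. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The step names, the split between learned and scripted steps, the 7-element action format, and both stub controllers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical action format: 6-DoF end-effector delta plus a gripper command.
Action = Sequence[float]
Controller = Callable[[str, int], Action]


@dataclass
class Step:
    name: str             # language instruction for this step (illustrative)
    use_rule_based: bool  # True if the scripted controller handles this step


def stub_vla_policy(instruction: str, t: int) -> Action:
    # Placeholder for a fine-tuned VLA forward pass (image + instruction -> action).
    return [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


def stub_rule_controller(instruction: str, t: int) -> Action:
    # Placeholder scripted motion: a small fixed move along -z with the gripper closed.
    return [0.0, 0.0, -0.01, 0.0, 0.0, 0.0, 0.0]


def run_hybrid(
    steps: List[Step],
    vla_policy: Controller,
    rule_controller: Controller,
    max_actions_per_step: int = 3,
) -> List[Tuple[str, str, Tuple[float, ...]]]:
    """Dispatch each step to one controller and log (step, source, action)."""
    log = []
    for step in steps:
        controller = rule_controller if step.use_rule_based else vla_policy
        source = "rule" if step.use_rule_based else "vla"
        for t in range(max_actions_per_step):
            action = controller(step.name, t)
            log.append((step.name, source, tuple(action)))
    return log


if __name__ == "__main__":
    # Illustrative step list; which steps need the scripted fallback is an assumption.
    demo_steps = [
        Step("move above the RAM module", use_rule_based=False),
        Step("press the RAM latch", use_rule_based=True),
    ]
    for entry in run_hybrid(demo_steps, stub_vla_policy, stub_rule_controller):
        print(entry)
```

On a real robot, each step would end on a success check or timeout rather than after a fixed number of actions, and the stubs would be replaced by the fine-tuned model and the scripted motion primitives.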

Paper Structure

This paper contains 7 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Manipulation task comparison. In most traditional VLA applications, the environment is clean and structured, and the target object is visually distinct. However, in robotic disassembly, the target is often small, difficult to isolate visually, and demands a much higher level of precision.
  • Figure 2: Comparison between traditional multi-stage disassembly approaches and end-to-end vision–language–action methods.
  • Figure 3: The teleoperation setup for imitation data collection.
  • Figure 4: The data processing steps for fine-tuning models and the utilization of VLA models in experimental tasks.
  • Figure 5: Training loss of two VLA models on two disassembly tasks.
  • ...and 1 more figure