IRIS: Learning-Driven Task-Specific Cinema Robot Arm for Visuomotor Motion Control

Qilong Cheng, Matthew Mackay, Ali Bereyhi

TL;DR

This paper presents the Intelligent Robotic Imaging System (IRIS), a task-specific 6-DOF manipulator designed for autonomous, learning-driven cinematic motion control. IRIS combines a lightweight, fully 3D-printed hardware design with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers.

Abstract

Robotic camera systems enable dynamic, repeatable motion beyond human capabilities, yet their adoption remains limited by the high cost and operational complexity of industrial-grade platforms. We present the Intelligent Robotic Imaging System (IRIS), a task-specific 6-DOF manipulator designed for autonomous, learning-driven cinematic motion control. IRIS integrates a lightweight, fully 3D-printed hardware design with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). The system learns object-aware and perceptually smooth camera trajectories directly from human demonstrations, eliminating the need for explicit geometric programming. The complete platform costs under $1,000 USD, supports a 1.5 kg payload, and achieves approximately 1 mm repeatability. Real-world experiments demonstrate accurate trajectory tracking, reliable autonomous execution, and generalization across diverse cinematic motions.

Paper Structure

This paper contains 27 sections, 7 equations, 8 figures, 3 tables, 2 algorithms.

Figures (8)

  • Figure 1: Tabletop deployment of the IRIS prototype performing a real-world demonstration. An end-effector-mounted camera captures a target object (cup) using visuomotor control.
  • Figure 2: Overview of the IRIS system pipeline. Cinema task objectives guide task-specific hardware design. Training data are collected exclusively from real-world human demonstrations, while classical planner trajectories generated in simulation are used for analysis and comparison. A ROS-based low-level control stack executes inverse-dynamics control, and a goal-conditioned imitation learning policy (ACT) is trained on human data and deployed on the physical robot, enabling smooth, obstacle-aware cinematic motion via sim-to-real transfer.
  • Figure 3: IRIS hardware overview: lightweight task-specific architecture with relocated actuation and a differential wrist.
  • Figure 4: Sim-to-real execution of planner-generated trajectories. A classical potential-field planner generates collision-free reference paths in simulation (left), which are then executed on the physical IRIS robot (right) via a ROS-based control stack for validation and comparison.
  • Figure 5: IRIS policy architecture. During training (left), the model conditions on observation history, a goal image, and a CVAE-encoded latent style token $z$ derived from the ground-truth future trajectory. Visual inputs pass through a shared ResNet-18 and Spatial Softmax to preserve spatial coordinates, then fuse with proprioception into temporal tokens. A transformer decoder predicts the 15-step joint trajectory $\hat{q}_{t+1:t+H}$. At inference (right), the CVAE branch is replaced with $z=0$ for deterministic execution.
  • ...and 3 more figures
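
The Figure 5 caption says visual features pass through a Spatial Softmax "to preserve spatial coordinates". As a rough illustration of that idea, the sketch below computes the softmax-weighted expected (x, y) location of a single 2D feature map. It follows the commonly used formulation of spatial softmax and is not taken from the IRIS code; the function name, the `temperature` parameter, and the choice of coordinates normalized to [-1, 1] are assumptions made for this example.

```python
import math


def spatial_softmax(feature_map, temperature=1.0):
    """Return the expected (x, y) location of one 2D feature map.

    Hypothetical helper for illustration only. Coordinates are normalized
    to [-1, 1]; x runs along columns and y along rows.
    """
    height = len(feature_map)
    width = len(feature_map[0])

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [
        [math.exp((value - peak) / temperature) for value in row]
        for row in feature_map
    ]
    total = sum(sum(row) for row in weights)

    def coord(index, size):
        # Map index 0..size-1 onto [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    # Softmax-weighted average of the pixel coordinates.
    x = sum(
        w * coord(c, width) for row in weights for c, w in enumerate(row)
    ) / total
    y = sum(
        w * coord(r, height) for r, row in enumerate(weights) for w in row
    ) / total
    return x, y


if __name__ == "__main__":
    # A single strong activation in the top-right corner of a 3x3 map.
    activation = [
        [0.0, 0.0, 10.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
    print(spatial_softmax(activation, temperature=0.1))
```

Under these assumptions, each feature channel is reduced to one 2D keypoint rather than a pooled scalar, so the location of an activation is kept. In a policy like the one the caption describes, the keypoints from all channels would then be concatenated and fused with proprioception; that step is not shown here.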