FlexiFly: Interfacing the Physical World with Foundation Models Empowered by Reconfigurable Drone Systems

Minghui Zhao, Junxi Xia, Kaiyuan Hou, Yanchen Liu, Stephen Xia, Xiaofan Jiang

TL;DR

FlexiFly addresses the gap between foundation models and physical-world interaction by coupling a novel ARCK-Means segmentation approach with a modular, reconfigurable drone platform. ARCK-Means produces compact, rectilinear segments from SAM outputs to improve object grounding, while the drone platform autonomously attaches task-relevant sensors/actuators and reconfigures mid-mission to zoom in on areas of interest. In real smart-home deployments, this combination substantially enhances task success (up to 85% improvement) over camera-only baselines, demonstrating a practical path for FM/LLM-enabled agents to sense, reason, and actuate in physical spaces. The work reduces deployment overhead, enables new applications, and points toward scalable intelligent environments where humans and robotic systems coexist with flexible sensing capabilities.

Abstract

Foundation models (FMs) have shown immense human-like capabilities for generating digital media. However, foundation models that can freely sense, interact with, and actuate the physical domain are far from being realized. This is because 1) fully covering and analyzing large spaces requires dense sensor deployments, while 2) events are often localized to small areas, making it difficult for FMs to pinpoint the areas of interest relevant to the current task. We propose FlexiFly, a platform that enables FMs to "zoom in" and analyze relevant areas with higher granularity to better understand the physical environment and carry out tasks. FlexiFly accomplishes this by introducing 1) a novel image segmentation technique that aids in identifying relevant locations and 2) a modular and reconfigurable sensing and actuation drone platform that FMs can actuate to "zoom in" with relevant sensors and actuators. We demonstrate through real smart home deployments that FlexiFly enables FMs and LLMs to complete diverse tasks up to 85% more successfully. FlexiFly is a critical step towards FMs and LLMs that can naturally interface with the physical world.

Paper Structure (26 sections, 11 figures, 5 tables)

Figures (11)

  • Figure 1: FlexiFly enables FMs to "zoom in" to areas of interest with reconfigurable drones to better interface with physical environments.
  • Figure 2: System architecture of intelligent assistant with FlexiFly.
  • Figure 3: Preliminary study: (a) task completion rate of standard FMs leveraging a VLM (camera) and dense sensor networks, compared to FlexiFly; (b) example showing that the VLM could not detect the phone in plain sight unless "zoomed in".
  • Figure 4: Segmentation and clustering to break down scenes into smaller, more manageable pieces for LLaVA and DINO. (a) Object masks after applying the Segment Anything Model (SAM); extracted frames after clustering the object masks with (b) K-Means, (c) hierarchical clustering, and (d) ARCK-Means. For ARCK-Means, we constrain the aspect ratio of extracted frames to be between 0.67 and 1.5.
  • Figure 5: Different clustering methods and thresholding evaluated for segmentation. Prompt 1: Describe the image in detail. Prompt 2: Is there a {object name} in the image?
  • ...and 6 more figures
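
The Figure 4 caption states that ARCK-Means constrains extracted frames to an aspect ratio between 0.67 and 1.5, but this page does not describe how the algorithm does so. The sketch below is therefore not the authors' method. It is only one plausible way to illustrate the idea: object boxes are grouped into rectilinear frames, and any frame outside the allowed ratio is split with a 2-means step on the box centers. Every name here (`extract_frames`, `two_means`, `pad_to_ratio`) is hypothetical, the input is assumed to be positive-area boxes in `(x0, y0, x1, y1)` form, and padding a single out-of-range box is an assumption made so the recursion has a fallback.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed positive area


def union(boxes: List[Box]) -> Box:
    """Smallest axis-aligned frame that contains every box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def aspect_ratio(box: Box) -> float:
    """Width divided by height."""
    return (box[2] - box[0]) / (box[3] - box[1])


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def two_means(boxes: List[Box], iters: int = 10):
    """Plain 2-means on box centers, seeded at the extremes of the longer axis."""
    pts = [center(b) for b in boxes]
    u = union(boxes)
    axis = 0 if (u[2] - u[0]) >= (u[3] - u[1]) else 1
    cents = [min(pts, key=lambda p: p[axis]), max(pts, key=lambda p: p[axis])]
    groups: Tuple[List[Box], List[Box]] = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for b, p in zip(boxes, pts):
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in cents]
            groups[0 if d[0] <= d[1] else 1].append(b)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller falls back to padding
        new_cents = []
        for g in groups:
            cs = [center(b) for b in g]
            new_cents.append((sum(c[0] for c in cs) / len(cs),
                              sum(c[1] for c in cs) / len(cs)))
        if new_cents == cents:
            break  # assignments are stable
        cents = new_cents
    return groups


def pad_to_ratio(box: Box, lo: float, hi: float) -> Box:
    """Grow the short side symmetrically until the ratio reaches the nearest bound."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if w / h > hi:  # too wide: increase the height
        new_h, cy = w / hi, (y0 + y1) / 2
        return (x0, cy - new_h / 2, x1, cy + new_h / 2)
    if w / h < lo:  # too tall: increase the width
        new_w, cx = h * lo, (x0 + x1) / 2
        return (cx - new_w / 2, y0, cx + new_w / 2, y1)
    return box


def extract_frames(boxes: List[Box], lo: float = 0.67, hi: float = 1.5) -> List[Box]:
    """Recursively split groups of boxes until each frame's ratio lies in [lo, hi]."""
    frame = union(boxes)
    if lo <= aspect_ratio(frame) <= hi:
        return [frame]
    if len(boxes) > 1:
        a, b = two_means(boxes)
        if a and b:
            return extract_frames(a, lo, hi) + extract_frames(b, lo, hi)
    return [pad_to_ratio(frame, lo, hi)]
```

For example, two overlapping boxes on the left of a scene and one far to the right would first form a very wide frame, which the sketch splits into two roughly square frames. A real pipeline would also need to clip padded frames to the image bounds, which is omitted here.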