AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems

Rajat Bhattacharjya, Sing-Yao Wu, Hyunwoo Oh, Chaewon Nam, Suyeon Koo, Mohsen Imani, Elaheh Bozorgzadeh, Nikil Dutt

TL;DR

This work addresses the challenge of delivering semantically rich, queryable perception for disaster-response UAVs without overburdening onboard resources or relying on unreliable networks. It introduces AVERY, a cognitive-inspired adaptive split computing framework with a dual-stream VLM architecture: a fast Context stream for real-time awareness and a high-fidelity Insight stream for deep analysis, orchestrated by a lightweight, self-aware on-board controller. By splitting the VLM early (split@1), compressing the activations, and using LUT-guided adaptation between the HighAccuracy, Balanced, and HighThroughput modes, AVERY achieves substantial energy savings (≈93.98% vs. full-edge execution) while staying within ≈0.75% of HighAccuracy-mode accuracy under fluctuating network conditions. The approach enables real-time, open-vocabulary reasoning for disaster scenarios (validated on Flood-ReasonSeg with LISA-7B), offering practical, scalable, and robust VLM-enabled perception for resource-constrained UAVs.

Abstract

Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that on-board CNNs cannot provide. While Vision-Language Models (VLMs) offer this semantic reasoning, their high resource demands make on-device deployment infeasible, and naive cloud offloading fails under the low-bandwidth networks common in disaster zones. We present AVERY, a framework that enables VLM deployment through adaptive split computing. We advance the split computing paradigm beyond traditional depth-wise partitioning by introducing a functional, cognitive-inspired dual-stream split that separates the VLM into a high-frequency, low-resolution "context stream" for real-time awareness and a low-frequency, high-fidelity "insight stream" for deep analysis. A lightweight, self-aware on-board controller manages this architecture, monitoring network conditions and operator intent to dynamically select from pre-trained compression models, navigating the fundamental accuracy-throughput trade-off. Evaluated using the VLM LISA-7B across an edge-cloud scenario under fluctuating network conditions, AVERY consistently outperforms static configurations, achieving 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption compared to full-edge execution, thereby enhancing mission efficiency and enabling real-time, queryable intelligence on resource-constrained platforms in dynamic environments.

Paper Structure

This paper contains 19 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: The Motivation and AVERY Paradigm. (a) Conventional CNNs cannot support the distinct, prompt-based reasoning needed for multi-level disaster response, such as broad Context queries for triage and specific Insight queries for investigation. (b) While a full on-device VLM can process these queries, its prohibitive energy cost makes it infeasible for UAV deployment. (c) Naive cloud offloading fails under the unreliable, low-bandwidth networks found in disaster zones. (d) AVERY resolves these conflicts with a dual-stream (context and insight) split computing architecture, intelligently transmitting either lightweight context features or compressed insight activations to enable efficient, multi-level intelligence.
  • Figure 2: The AVERY Architecture. The system splits the VLM between on-board (UAV) and remote server processing. On-board, a dual-vision pipeline processes a captured image to generate features for two streams: a high-fidelity Insight Stream (in bright yellow) from the SAM Vision Backbone + CLIP Encoder and a lightweight Context Stream (in purple) from only the CLIP Encoder. The on-board Split Controller selects between these streams and, for the Insight Stream, applies an appropriate compression ratio from a predefined LUT based on operator intent and network conditions. It then packetizes this data for transmission. On the remote server, the VLM combines the SAM features and CLIP features with an operator's prompt for reasoning in the Multi-Modal LLM, and a final Decoder generates the precise segmentation mask.
  • Figure 3: The AVERY Split (across model depth) and Compression Mechanism. We insert a trainable bottleneck [matsubara2022bottlefit] (an encoder-decoder pair) after the first ViT block ('split@1') to compress the large SAM activation tensor (10.49 MB). Each pre-trained bottleneck model provides a different accuracy-throughput trade-off, forming the operational tiers for runtime controller selection.
  • Figure 4: Runtime results of AVERY's Insight Stream in "Prioritize Accuracy" Mode (20 minutes). (a) Bandwidth variation over time; (b) Runtime Tier Switching between High Accuracy and Balanced Modes; (c) Accuracy Comparison for both original and fine-tuned models; (d) Throughput comparison among different tiers and AVERY.
  • Figure 5: Trade-off Analysis. Average accuracy vs. average throughput for the different operational tiers of the Insight Stream, with AVERY in "Prioritize Accuracy" mode. Results are for the original model.
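
The LUT-guided tier selection described in the Figure 2 and Figure 3 captions can be illustrated with a minimal sketch. Only the 10.49 MB activation size and the tier names come from the text above. The compression ratios, accuracy values, latency budget, and all class and function names are hypothetical placeholders, and the sketch covers only Insight Stream tier selection, not the Context/Insight stream choice or the authors' actual controller.

```python
from dataclasses import dataclass

# Uncompressed split@1 SAM activation size reported in the Figure 3 caption.
SAM_ACTIVATION_MB = 10.49


@dataclass(frozen=True)
class Tier:
    name: str
    compression_ratio: float  # hypothetical: uncompressed size / compressed size
    accuracy: float           # hypothetical offline-profiled accuracy score

    @property
    def payload_mb(self) -> float:
        """Size of the compressed activation sent over the link, in MB."""
        return SAM_ACTIVATION_MB / self.compression_ratio


# Hypothetical lookup table: one entry per pre-trained bottleneck model.
LUT = (
    Tier("HighAccuracy", compression_ratio=4.0, accuracy=0.80),
    Tier("Balanced", compression_ratio=16.0, accuracy=0.78),
    Tier("HighThroughput", compression_ratio=64.0, accuracy=0.72),
)


def transmit_seconds(tier: Tier, bandwidth_mbps: float) -> float:
    """Time to send one compressed activation (1 MB taken as 8 megabits)."""
    return tier.payload_mb * 8.0 / bandwidth_mbps


def select_tier(bandwidth_mbps: float, max_latency_s: float,
                prioritize: str = "accuracy") -> Tier:
    """Pick a tier whose transmit time fits the latency budget.

    Falls back to the smallest payload when no tier fits the budget.
    """
    feasible = [t for t in LUT
                if transmit_seconds(t, bandwidth_mbps) <= max_latency_s]
    if not feasible:
        return min(LUT, key=lambda t: t.payload_mb)
    if prioritize == "accuracy":
        return max(feasible, key=lambda t: t.accuracy)
    return min(feasible, key=lambda t: t.payload_mb)


if __name__ == "__main__":
    for bw in (20.0, 5.0, 0.5):
        tier = select_tier(bw, max_latency_s=2.0)
        print(f"{bw:5.1f} Mbps -> {tier.name}")
```

Under these placeholder numbers, a drop in measured bandwidth moves the selection from the least-compressed tier toward more aggressive compression, which is the kind of tier switching shown in Figure 4(b).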