AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems
Rajat Bhattacharjya, Sing-Yao Wu, Hyunwoo Oh, Chaewon Nam, Suyeon Koo, Mohsen Imani, Elaheh Bozorgzadeh, Nikil Dutt
TL;DR
This work addresses the challenge of delivering semantically rich, queryable perception for disaster-response UAVs without overburdening on-board resources or relying on unreliable networks. It introduces AVERY, a cognitive-inspired adaptive split computing framework with a dual-stream VLM architecture: a fast Context stream for real-time awareness and a high-fidelity Insight stream for deep analysis, orchestrated by a lightweight, self-aware on-board controller. By splitting the VLM early (split@1), compressing activations, and using LUT-guided adaptation among the HighAccuracy, Balanced, and HighThroughput modes, AVERY achieves substantial energy savings (≈93.98% vs. full-edge execution) while staying within ≈0.75% of HighAccuracy-mode accuracy under fluctuating network conditions. The approach enables real-time, open-vocabulary reasoning for disaster scenarios (validated on Flood-ReasonSeg with LISA-7B), offering practical, scalable, and robust VLM-enabled perception for resource-constrained UAVs.
Abstract
Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that on-board CNNs cannot provide. While Vision-Language Models (VLMs) offer this semantic reasoning, their high resource demands make on-device deployment infeasible, and naive cloud offloading fails under the low-bandwidth networks common in disaster zones. We present AVERY, a framework that enables VLM deployment through adaptive split computing. We advance the split computing paradigm beyond traditional depth-wise partitioning by introducing a functional, cognitive-inspired dual-stream split that separates the VLM into a high-frequency, low-resolution "context stream" for real-time awareness and a low-frequency, high-fidelity "insight stream" for deep analysis. A lightweight, self-aware on-board controller manages this architecture, monitoring network conditions and operator intent to dynamically select from pre-trained compression models, navigating the fundamental accuracy-throughput trade-off. Evaluated using the VLM LISA-7B across an edge-cloud scenario under fluctuating network conditions, AVERY consistently outperforms static configurations, achieving 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption compared to full-edge execution, thereby enhancing mission efficiency and enabling real-time, queryable intelligence on resource-constrained platforms in dynamic environments.
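The controller's mode selection can be illustrated with a minimal sketch, assuming a lookup table keyed on estimated link bandwidth. Only the three mode names come from the text above. The bandwidth thresholds, the `prefer_accuracy` flag standing in for operator intent, and the names `LutEntry`, `LUT`, and `select_mode` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LutEntry:
    min_bandwidth_mbps: float  # hypothetical lower bound for choosing this mode
    mode: str                  # one of the three mode names used in the text


# Hypothetical LUT, ordered from the most to the least bandwidth-hungry mode.
# The thresholds are illustrative placeholders, not measured values.
LUT = (
    LutEntry(10.0, "HighAccuracy"),
    LutEntry(4.0, "Balanced"),
    LutEntry(0.0, "HighThroughput"),
)


def select_mode(bandwidth_mbps: float, prefer_accuracy: bool = True) -> str:
    """Return the compression mode for the current bandwidth estimate.

    prefer_accuracy is an assumed stand-in for operator intent: when False,
    the sketch always falls back to the highest-throughput mode.
    """
    if bandwidth_mbps < 0:
        raise ValueError("bandwidth must be non-negative")
    if not prefer_accuracy:
        return "HighThroughput"
    # Walk the table top-down and take the first mode the link can sustain.
    for entry in LUT:
        if bandwidth_mbps >= entry.min_bandwidth_mbps:
            return entry.mode
    return LUT[-1].mode
```

Under these assumed thresholds, `select_mode(5.0)` returns `"Balanced"`. A table lookup keeps the on-board decision cost constant, which is in keeping with the lightweight controller the abstract describes.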
