TinyVQA: Compact Multimodal Deep Neural Network for Visual Question Answering on Resource-Constrained Devices

Hasib-Al Rashid; Argho Sarkar; Aryya Gangopadhyay; Maryam Rahnemoonfar; Tinoosh Mohsenin

TL;DR

This work targets multimodal Visual Question Answering (VQA) on resource-constrained edge devices for disaster-response scenarios. It introduces TinyVQA, a memory-aware compact multimodal DNN that is trained with knowledge distilled from a supervised-attention VQA model and then compressed with 8-bit quantization. On the FloodNet dataset, TinyVQA reaches 79.5% accuracy with a 339 KB footprint. Deployed on a Crazyflie 2.0 drone, it runs with 56 ms latency at about 0.7 W. These results show that on-device VQA can provide real-time, energy-efficient situational awareness in environments with limited connectivity.

Abstract

Traditional machine learning models often require powerful hardware, making them unsuitable for deployment on resource-limited devices. Tiny Machine Learning (tinyML) has emerged as a promising approach for running machine learning models on these devices, but integrating multiple data modalities into tinyML models remains a challenge due to increased complexity, latency, and power consumption. This paper proposes TinyVQA, a novel multimodal deep neural network for visual question answering tasks that can be deployed on resource-constrained tinyML hardware. TinyVQA leverages a supervised attention-based model to learn how to answer questions about images using both vision and language modalities. Knowledge distilled from the supervised attention-based VQA model trains the memory-aware compact TinyVQA model, and a low bit-width quantization technique is employed to further compress the model for deployment on tinyML devices. The TinyVQA model was evaluated on the FloodNet dataset, which is used for post-disaster damage assessment. The compact model achieved an accuracy of 79.5%, demonstrating the effectiveness of TinyVQA for real-world applications. Additionally, the model was deployed on a Crazyflie 2.0 drone equipped with an AI deck and a GAP8 microprocessor. On the tiny drone, the TinyVQA model achieved a latency of 56 ms and consumed 693 mW of power, showcasing its suitability for resource-constrained embedded systems.
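
The abstract names two compression steps: training the compact model from distilled knowledge, then low bit-width quantization. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the loss form (hard-label cross-entropy plus a temperature-softened KL term), the `alpha` and `temperature` parameters, and the symmetric per-tensor 8-bit scheme are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    """Assumed loss: alpha * CE(student, label)
    + (1 - alpha) * T^2 * KL(teacher || student) at temperature T."""
    # Hard-label term: cross-entropy of the student against the true answer.
    cross_entropy = -math.log(softmax(student_logits)[label])
    # Soft-label term: match the teacher's softened answer distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


def quantize_int8(weights):
    """Assumed scheme: symmetric per-tensor quantization to [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float weights."""
    return [q * scale for q in quantized]
```

In this sketch the student is penalized both for missing the ground-truth answer and for diverging from the teacher's softened predictions, and each quantized weight is recovered to within half a quantization step.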

Paper Structure (13 sections, 4 equations, 6 figures, 2 tables)

Introduction
TinyVQA Model Architecture
Baseline VQA Model Design
Memory-Aware Compact VQA Model Design
TinyVQA Evaluation
Dataset Description
Evaluation Results and Analysis
TinyVQA Deployment on Resource Constrained Hardware
Hardware Architecture
Hardware Implementation Details
Deployment Results and Analysis
Conclusion
ACKNOWLEDGMENT

Figures (6)

Figure 1: (a) High-level overview of the proposed TinyVQA system. A rescuer can acquire useful information about the affected area by asking questions when a drone coupled with a VQA system captures images of the hurricane-stricken area from a high altitude. (b) Flow diagram of the proposed TinyVQA system, which is the sequential combination of the steps shown in the diagram.
Figure 2: (a) Overview of our proposed TinyVQA model, where the baseline VQA model uses VGG-16 and a one-layer LSTM to obtain the image feature matrix and question feature, respectively. We then apply MFB pooling to obtain a fine-grained multimodal representation. A softmax function is applied to that joint representation to estimate attention weights from the images for given questions. Finally, we calculate two loss functions: one minimizes the distance between the visual mask and the estimated visual attention weight, and the other minimizes the loss between the ground-truth answer and the predicted answer from the VQA classifier. The memory-aware compact VQA model is designed with 3 CNN layers and 1 LSTM layer for image and text feature extraction, respectively. Knowledge distilled from the baseline model is used to obtain the final result. (b) Detailed structure of the MFB Fusion block.
Figure 3: Accuracy and model size for the baseline VQA model and TinyVQA on the FloodNet dataset. The baseline model achieved 81% accuracy with a 479 MB model size, whereas the final TinyVQA model achieved 79.5% accuracy with a 339 KB model size.
Figure 4: Visual attention derived from the TinyVQA model for given questions. The yellowish tone in the image denotes higher attention weight. Attention learned with visual supervision (the last column) emphasizes the relevant image portions (buildings and roads in this case) to address the questions for the top and bottom images, respectively.
Figure 5: (a) Detailed block diagram of the Crazyflie AI-deck powered by the GAP8 microprocessor. (b) Memory hierarchy of the GAP8 microprocessor. The GAP8 has 100 KB of L1 memory (80 KB shared in the compute engine + 20 KB for the low-power MCU), 512 KB of L2 memory, and 8 MB of DRAM. (c) TinyVQA flow. Left: the DMA manages L2-L1 communication using double-buffering. Right: the cluster executes PULP-NN on a tile stored in one of the L1 buffers.
...and 1 more figures
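
The figures quoted in the captions and abstract can be combined into a few back-of-the-envelope checks. The sketch below derives the compression ratio, compares the model size against the GAP8 memory levels, and estimates energy per inference. The variable names are hypothetical, the conversion 1 MB = 1024 KB is an assumption, and the memory comparison considers model size only (activations, code, and buffers also occupy memory, which the captions do not quantify).

```python
# Values taken from the Figure 3 and Figure 5 captions and the abstract.
BASELINE_SIZE_KB = 479 * 1024  # 479 MB, assuming 1 MB = 1024 KB
TINYVQA_SIZE_KB = 339
GAP8_L1_KB = 100               # 80 KB cluster-shared + 20 KB for the MCU
GAP8_L2_KB = 512
LATENCY_S = 0.056              # 56 ms per inference
POWER_W = 0.693                # 693 mW during inference

# How much smaller the compact model is than the baseline.
compression_ratio = BASELINE_SIZE_KB / TINYVQA_SIZE_KB

# Size-only comparison against each on-chip memory level.
fits_in_l2 = TINYVQA_SIZE_KB <= GAP8_L2_KB
fits_in_l1 = TINYVQA_SIZE_KB <= GAP8_L1_KB

# Energy per inference = power * time, converted to millijoules.
energy_per_inference_mj = POWER_W * LATENCY_S * 1000

print(f"compression ratio: about {compression_ratio:.0f}x")
print(f"fits in L2: {fits_in_l2}, fits in L1: {fits_in_l1}")
print(f"energy per inference: about {energy_per_inference_mj:.1f} mJ")
```

Under these assumptions the model is over a thousand times smaller than the baseline and is smaller than L2 but larger than L1, which is consistent with the tiled, double-buffered L2-L1 transfers described in the Figure 5 caption.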
