DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics

Yoonsung Kim; Changhun Oh; Jinwoo Hwang; Wonung Kim; Seongryong Oh; Yubin Lee; Hardik Sharma; Amir Yazdanbakhsh; Jongse Park

TL;DR

DaCapo addresses the resource bottleneck of continuous learning for video analytics on battery‑powered autonomous systems by co‑designing a spatially partitionable, precision‑flexible accelerator with a spatiotemporal resource allocation algorithm. Its two sub‑accelerators (T‑SA and B‑SA) and MX‑based Dot‑Product Engine allow inference, labeling, and retraining to run concurrently, while runtime drift detection shifts emphasis toward labeling when needed to maintain accuracy. On the BDD100K driving dataset, DaCapo achieves 6.5% and 5.5% higher accuracy than the GPU‑based systems Ekya and EOMU, respectively, while consuming 254× less power. The work shows that tightly integrating hardware and algorithm design is a practical way to deploy continuously learning video analytics on autonomous systems.

Abstract

Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges due to their limited computational resources and battery power. To tackle these challenges, continuous learning exploits a lightweight "student" model at deployment (inference), leverages a larger "teacher" model for labeling sampled data (labeling), and continuously retrains the student model to adapt to changing scenarios (retraining). This paper highlights the limitations in state-of-the-art continuous learning systems: (1) they focus on computations for retraining, while overlooking the compute needs for inference and labeling, (2) they rely on power-hungry GPUs, unsuitable for battery-operated autonomous systems, and (3) they are located on a remote centralized server, intended for multi-tenant scenarios, again unsuitable for autonomous systems due to privacy, network availability, and latency concerns. We propose a hardware-algorithm co-designed solution for continuous learning, DaCapo, that enables autonomous systems to perform concurrent executions of inference, labeling, and training in a performant and energy-efficient manner. DaCapo comprises (1) a spatially-partitionable and precision-flexible accelerator enabling parallel execution of kernels on sub-accelerators at their respective precisions, and (2) a spatiotemporal resource allocation algorithm that strategically navigates the resource-accuracy tradeoff space, facilitating optimal decisions for resource allocation to achieve maximal accuracy. Our evaluation shows that DaCapo achieves 6.5% and 5.5% higher accuracy than state-of-the-art GPU-based continuous learning systems, Ekya and EOMU, respectively, while consuming 254x less power.
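
The three kernels named in the abstract (inference, labeling, retraining) can be illustrated with a toy loop. The sketch below is not DaCapo's implementation: the `Student` class, the `teacher_label` function, the one-dimensional threshold "model", the drifting `boundary`, and the `sample_rate` parameter are all hypothetical stand-ins chosen only to show how a student adapts when a teacher labels a sampled subset of incoming data.

```python
import random


def teacher_label(x, boundary):
    """Hypothetical teacher: assumed to label the current data correctly."""
    return 1 if x >= boundary else 0


class Student:
    """Hypothetical lightweight student: a single learned threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def infer(self, x):
        return 1 if x >= self.threshold else 0

    def retrain(self, labeled):
        # Place the threshold midway between the largest sample labeled 0
        # and the smallest sample labeled 1; skip if a class is missing.
        zeros = [x for x, y in labeled if y == 0]
        ones = [x for x, y in labeled if y == 1]
        if zeros and ones:
            self.threshold = (max(zeros) + min(ones)) / 2


def run(num_windows=4, frames_per_window=200, sample_rate=0.2, seed=0):
    rng = random.Random(seed)
    student = Student()
    accuracies = []
    for w in range(num_windows):
        boundary = 0.5 + 0.1 * w  # assumed data drift between windows
        correct = 0
        buffer = []
        for _ in range(frames_per_window):
            x = rng.random()
            # (1) Inference: the student handles every frame.
            pred = student.infer(x)
            correct += int(pred == teacher_label(x, boundary))
            # (2) Labeling: the teacher labels only a sampled subset.
            if rng.random() < sample_rate:
                buffer.append((x, teacher_label(x, boundary)))
        accuracies.append(correct / frames_per_window)
        # (3) Retraining: the student adapts using the teacher's labels.
        student.retrain(buffer)
    return student, accuracies


if __name__ == "__main__":
    final_student, per_window_accuracy = run()
    print(final_student.threshold, per_window_accuracy)
```

In this sketch the three steps run one after another on a single thread, and accuracy is scored against the teacher's labels as a proxy for ground truth. The abstract's point is that on a real system all three kernels compete for the same constrained hardware, which is what motivates running them concurrently on partitioned sub-accelerators.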

Paper Structure (27 sections, 12 figures, 4 tables, 1 algorithm)

Introduction
Background
Video Analytics in Autonomous Systems
Continuously Learning for Video Analytics at Edge
Continuous Learning Systems for Video Analytics
Challenges and Opportunities
Unveiling the Dilemma in Continuous Learning
Workload Characterization of the Three Kernels
Opportunities from Low-Precision Arithmetics
Overview of DaCapo's System Workflow
DaCapo Accelerator Architecture
Spatially-Partitionable Architecture
Precision-Flexible Dot-Product Engine
Precision-Conversion Unit
Spatiotemporal Resource Allocation Algorithm
...and 12 more sections

Figures (12)

Figure 1: Overview of continuously learning video analytics on autonomous systems. To address privacy, networking cost, and latency concerns, autonomous systems exclusively use constrained computing resources to concurrently execute the three continuous learning kernels -- (1) inference, (2) retraining, and (3) labeling -- which presents a performance challenge.
Figure 2: Accuracy comparisons of Ekya versus student and teacher models. Student and teacher models are non-continuous learning cases. The experiments are conducted on RTX 3090 and Jetson Orin. The accuracy gap between the two GPUs is attributed to inevitable frame drops due to a lack of computing resources.
Figure 3: MAC operation breakdown of the three kernels and total FLOPS for the entire experiment runs.
Figure 4: Workflow of a DaCapo-based continuously learning video analytics system.
Figure 5: Overall architecture of DaCapo.
...and 7 more figures
