Query Time Optimized Deep Learning Based Video Inference System
Mingren Shen, Shuoxuan Dong, Xiuyuan He
TL;DR
This work addresses the latency and cost challenges of video inference by extending the Focus framework with feature-map reuse across a cheap ingest CNN and a more expensive query CNN. By sharing intermediate representations from common network blocks (e.g., the first 4 or 7 blocks of ResNet-50/152) and retraining the remaining layers, the approach achieves substantial query-time savings while controlling accuracy loss. Empirical results show 9.6% to 17.5% latency reductions depending on the reuse depth and retraining, with accuracy preserved in deeper cuts. The study demonstrates a practical, scalable path to low-cost, low-latency video queries and provides a foundation for broader exploration of cross-model layer sharing in CNN-based video inference.
Abstract
This project report describes how we tune Focus [1], a video inference system that provides low cost and low latency through two phases. We reduce query time by saving the intermediate-layer outputs of the neural network, a strategy that trades extra storage space for time. We demonstrate the scheme with prototype systems, in which it saves 20% of the time. The code repository is available at https://github.com/iphyer/CS744 FocousIngestOpt.
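The space-for-time idea can be illustrated with a minimal sketch. It is not the code from the repository: the blocks below are toy stand-ins for CNN blocks, and every name (`make_block`, `FeatureCache`, `ingest`, `query_with_reuse`, `query_from_scratch`) is hypothetical. The sketch assumes the ingest and query models share their first four blocks, caches the output of those blocks at ingest time, and runs only the remaining query blocks at query time. The retraining of the remaining layers is not modeled.

```python
from typing import Callable, Dict, List

Block = Callable[[List[float]], List[float]]

# Counts block evaluations as a rough proxy for compute cost.
counter: Dict[str, int] = {"calls": 0}


def make_block(scale: float) -> Block:
    """Toy stand-in for a CNN block: scale each value, then apply ReLU."""
    def block(x: List[float]) -> List[float]:
        counter["calls"] += 1
        return [max(0.0, scale * v) for v in x]
    return block


def run_blocks(blocks: List[Block], x: List[float]) -> List[float]:
    """Apply the blocks in order, feeding each output to the next block."""
    for b in blocks:
        x = b(x)
    return x


class FeatureCache:
    """Stores the shared-block output per frame (extra space, less query time)."""

    def __init__(self) -> None:
        self._store: Dict[int, List[float]] = {}

    def put(self, frame_id: int, features: List[float]) -> None:
        self._store[frame_id] = list(features)

    def get(self, frame_id: int) -> List[float]:
        return self._store[frame_id]


# Assumption: both models share the first 4 blocks; the tails differ.
shared_blocks = [make_block(0.5) for _ in range(4)]
ingest_tail = [make_block(1.0)]
query_tail = [make_block(2.0) for _ in range(3)]


def ingest(frame_id: int, frame: List[float], cache: FeatureCache) -> List[float]:
    """Ingest time: run the shared blocks once and save their output."""
    features = run_blocks(shared_blocks, frame)
    cache.put(frame_id, features)
    return run_blocks(ingest_tail, features)


def query_with_reuse(frame_id: int, cache: FeatureCache) -> List[float]:
    """Query time: skip the shared blocks by loading the saved feature map."""
    return run_blocks(query_tail, cache.get(frame_id))


def query_from_scratch(frame: List[float]) -> List[float]:
    """Baseline: the query model recomputes every block from the raw frame."""
    return run_blocks(shared_blocks + query_tail, frame)
```

In this toy setup, calling `ingest` once per frame and then `query_with_reuse` evaluates only the three query-tail blocks, whereas `query_from_scratch` evaluates all seven, and both return the same result. The saving in a real system depends on the cost of the skipped blocks relative to the cost of loading the stored feature maps.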
