Lumos: Empowering Multimodal LLMs with Scene Text Recognition
Ashish Shenoy, Yichao Lu, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Abhay Harpale, Vikas Bhardwaj, Di Xu, Shicong Zhao, Longfang Zhao, Ankit Ramchandani, Xin Luna Dong, Anuj Kumar
TL;DR
Lumos tackles the challenge of enabling text understanding in multimodal QA by combining on-device scene-text recognition with a cloud-based multimodal LLM. The system introduces a four-stage on-device STR pipeline (ROI detection, text detection, text recognition, reading-order reconstruction) and a hybrid architecture that parallelizes STR with image transfer to minimize latency, using ROI-based cropping and hardware acceleration to meet tight device constraints. Empirical results show that on-device STR substantially boosts end-to-end QA accuracy (from roughly 52% to around 78–79%), while maintaining low latency and modest power usage; the approach also achieves competitive STR quality (WER) with a small device footprint (~8 Mb) and significant efficiency gains when using hardware accelerators. These findings demonstrate the practicality of device-assisted text understanding for real-world MM-LLM applications and point to future work in further end-to-end optimization and extended multimodal capabilities.
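The hybrid flow summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name here (`detect_roi`, `detect_text`, `recognize_text`, `reconstruct_reading_order`, `upload_image`, `prepare_llm_inputs`) is a hypothetical stand-in, the stage bodies return canned data instead of running models, and the reading-order step is a naive top-to-bottom, left-to-right sort assumed purely for illustration. The only points it shows are the four-stage ordering and that STR can run concurrently with the image transfer.

```python
from concurrent.futures import ThreadPoolExecutor

# All functions below are hypothetical stand-ins, not the Lumos API.

def detect_roi(image):
    # Stage 1: crop to the region of interest (here: pass-through).
    return image

def detect_text(roi):
    # Stage 2: locate word regions; canned (x, y) boxes for the sketch.
    return [{"box": (0, 10), "text": "world"}, {"box": (0, 0), "text": "hello"}]

def recognize_text(regions):
    # Stage 3: recognize the text in each region (already canned above).
    return [(r["box"], r["text"]) for r in regions]

def reconstruct_reading_order(words):
    # Stage 4: naive ordering by (y, x); an assumption for illustration only.
    ordered = sorted(words, key=lambda w: (w[0][1], w[0][0]))
    return " ".join(text for _, text in ordered)

def run_str(image):
    # Chain the four on-device stages.
    return reconstruct_reading_order(recognize_text(detect_text(detect_roi(image))))

def upload_image(image):
    # Stand-in for transferring the image to the cloud service.
    return f"image-ref-{len(image)}"

def prepare_llm_inputs(image, question):
    # Run on-device STR and the image transfer concurrently, then combine
    # both results into the input for the cloud-side MM-LLM.
    with ThreadPoolExecutor(max_workers=2) as pool:
        str_future = pool.submit(run_str, image)
        upload_future = pool.submit(upload_image, image)
        scene_text = str_future.result()
        image_ref = upload_future.result()
    return {"question": question, "image_ref": image_ref, "scene_text": scene_text}
```

In this arrangement the latency seen by the user is bounded by the slower of the two tasks rather than by their sum, which is the motivation for overlapping them.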
Abstract
We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images, the output of which is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges and discuss the system architecture, design choices, and modeling techniques employed to overcome them. We also provide a comprehensive evaluation of each component, showcasing high quality and efficiency.
