Lumos : Empowering Multimodal LLMs with Scene Text Recognition

Ashish Shenoy; Yichao Lu; Srihari Jayakumar; Debojeet Chatterjee; Mohsen Moslehpour; Pierce Chuang; Abhay Harpale; Vikas Bhardwaj; Di Xu; Shicong Zhao; Longfang Zhao; Ankit Ramchandani; Xin Luna Dong; Anuj Kumar

TL;DR

Lumos tackles the challenge of enabling text understanding in multimodal QA by combining on-device scene-text recognition with a cloud-based multimodal LLM. The system introduces a four-stage on-device STR pipeline (ROI detection, text detection, text recognition, reading-order reconstruction) and a hybrid architecture that parallelizes STR with image transfer to minimize latency, using ROI-based cropping and hardware acceleration to meet tight device constraints. Empirical results show that on-device STR substantially boosts end-to-end QA accuracy (from roughly 52% to around 78–79%), while maintaining low latency and modest power usage; the approach also achieves competitive STR quality (WER) with a small device footprint (~8 Mb) and significant efficiency gains when using hardware accelerators. These findings demonstrate the practicality of device-assisted text understanding for real-world MM-LLM applications and point to future work in further end-to-end optimization and extended multimodal capabilities.
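The latency idea in this summary, running on-device STR while the image is still being transferred to the cloud, can be sketched as two concurrent tasks. This is a minimal illustration, not the authors' implementation: the function names (`run_str`, `upload_image`, `handle_query`), the sleep-based timings, and the returned payload are all hypothetical placeholders.

```python
import asyncio

# The four stage names come from the summary above; everything else is assumed.
STR_STAGES = ["roi_detection", "text_detection", "text_recognition", "reading_order"]


async def run_str(image: bytes) -> str:
    """Hypothetical stand-in for the on-device STR pipeline."""
    for _stage in STR_STAGES:
        await asyncio.sleep(0.01)  # placeholder for one model's inference time
    return "recognized text"


async def upload_image(image: bytes) -> int:
    """Hypothetical stand-in for transferring the image to the cloud."""
    await asyncio.sleep(0.03)  # placeholder for network transfer time
    return len(image)


async def handle_query(image: bytes, question: str) -> dict:
    # Start both tasks together so STR latency overlaps with the transfer
    # instead of being added to it.
    scene_text, uploaded = await asyncio.gather(run_str(image), upload_image(image))
    return {"question": question, "scene_text": scene_text, "image_bytes": uploaded}


def answer_request(image: bytes, question: str) -> dict:
    return asyncio.run(handle_query(image, question))
```

With this structure the wall-clock cost is roughly the longer of the two tasks rather than their sum, which is the motivation for the hybrid design described in the summary.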

Abstract

We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images; its output is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we examine those challenges and discuss the system architecture, design choices, and modeling techniques used to overcome them. We also provide a comprehensive evaluation of each component, showcasing high quality and efficiency.
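To make "augmenting the input to the MM-LLM" concrete, the sketch below folds recognized text into a text prompt alongside the user's question. The prompt wording and the `build_prompt` helper are assumptions made for illustration; the abstract does not specify the format Lumos uses.

```python
def build_prompt(question: str, paragraphs: list[str]) -> str:
    """Combine STR output with the user's question (hypothetical prompt format)."""
    # Drop empty paragraphs and keep the reading order produced by STR.
    scene_text = "\n".join(p.strip() for p in paragraphs if p.strip())
    if not scene_text:
        # No text was recognized, so the question is passed through alone.
        return f"Question: {question}"
    return f"Text recognized in the image:\n{scene_text}\n\nQuestion: {question}"
```

For example, `build_prompt("Which exit is this?", ["EXIT 12", "Airport"])` yields a prompt whose first block lists the two recognized paragraphs and whose last line carries the question.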

Paper Structure (16 sections, 9 figures, 7 tables)

Introduction
Previous work
Overall Architecture
Scene-Text Recognition
ROI Detection
Text Detection
Text Recognition
Reading Order Reconstruction
On-Device Export
Experimental Results
Experiment Setup
End-to-End Quality
STR quality
STR Efficiency
Conclusion
...and 1 more section

Figures (9)

Figure 1: Text based use cases that Lumos supports.
Figure 2: Lumos Quality metrics
Figure 3: Overall architecture of Lumos. The width of the on-device blocks roughly represents runtime latency. The arrow width roughly represents the size of the payload being transferred. Blue blocks indicate models using hardware acceleration.
Figure 4: On-device STR component flow of Lumos.
Figure 5: Left: Word bounding boxes. Right: Paragraphs from our Reading Order Reconstruction component.
...and 4 more figures
