A comprehensive study of on-device NLP applications -- VQA, automated Form filling, Smart Replies for Linguistic Codeswitching

Naman Goyal

TL;DR

This work proposes three new on-device experiences in two categories: screen understanding (visual question answering and automated form filling) and smart replies for multilingual speakers who code-switch. It defines tasks and solutions for each, to bridge the gap between the latest research and its real-world impact in on-device applications.

Abstract

Recent improvements in large language models open the door to new experiences for on-device applications that were not possible before. In this work, we propose three such experiences in two categories. The first category is screen understanding, i.e., understanding what is on the user's screen, which powers (1) visual question answering and (2) automated form filling based on the previous screen. The second category extends smart replies to support multilingual speakers who code-switch. Code-switching occurs when a speaker alternates between two or more languages. To the best of our knowledge, this is the first work to propose these tasks and a solution to each of them, bridging the gap between the latest research and its real-world impact in on-device applications.

Paper Structure (44 sections, 3 equations, 22 figures, 3 tables)

Introduction
Screen Understanding
Introduction
Related Work
Tasks
Visual Question Answering for on screen context
Data and Challenge
Training Pipeline
Question Types
title
phone number
email
url
address
'DateTime'
...and 29 more sections

Figures (22)

Figure 1: Families of Document AI model based on information
Figure 2: VQA task Problem statement: Build a system which can answer a natural language query from a given app view (screenshot + text). E.g. Question (input): When is the daily show? Answer (output): 7:45pm and 8:45pm
Figure 3: Label generation step 1 via extracting predefined data types (aka values)
Figure 4: Label generation step 2: find the nearest parent text element and frame questions based on it. Here, two addresses are extracted, yielding two questions in the training data. Question 1: What is the Fremont address? Answer: 5355 Mowry Ave, Fremont, CA 94538 Question 2: What is the Sunnyvale address? Answer: 976 East El Camino Real, Sunnyvale, CA 94087
Figure 5: Training pipeline: (1) Start with the pretraining task of LayoutLMv3. (2) Add a question answering (QA) head on top of LayoutLMv3. (3) Initialize training of the QA head on the DocVQA dataset. (4) Fine-tune on the weakly labeled internal apps dataset using incremental learning.
...and 17 more figures
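
The two-step label generation described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It is not the paper's implementation. The regular expressions, the function names `extract_values` and `generate_weak_labels`, and the choice of "nearest parent" as the closest preceding text element in reading order are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified patterns for a few of the predefined data types.
# A real extractor would need far more robust rules.
PATTERNS = {
    "phone number": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]\d{4}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "url": re.compile(r"https?://\S+"),
}


def extract_values(text):
    """Step 1: return (data_type, value) pairs found in one text element."""
    return [
        (dtype, match.group(0))
        for dtype, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def generate_weak_labels(elements):
    """Step 2: pair each value with its nearest parent text and frame a question.

    `elements` is a list of on-screen text strings in reading order.
    Assumption: the nearest parent is the closest preceding element
    that contains no extracted value.
    """
    labels = []
    parent = None
    for text in elements:
        found = extract_values(text)
        if not found:
            # Plain text: remember it as the candidate parent for later values.
            parent = text.strip()
            continue
        for dtype, value in found:
            prefix = f"{parent.lower()} " if parent else ""
            labels.append((f"what is the {prefix}{dtype}?", value))
    return labels


if __name__ == "__main__":
    screen = ["Fremont", "510-555-0100", "Sunnyvale", "408-555-0199"]
    for question, answer in generate_weak_labels(screen):
        print(question, "->", answer)
```

Each resulting (question, answer) pair would serve as one weakly labeled training example, in the same spirit as the two address questions shown for Figure 4.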
