Federated Document Visual Question Answering: A Pilot Study

Khanh Nguyen; Dimosthenis Karatzas

TL;DR

FeDocVQA tackles privacy-sensitive DocVQA by federated training across $K$ clients with private datasets $D_k$, minimizing the global objective $f(\theta) = \sum_{k=1}^{K} p_k F_k(\theta)$, where $F_k$ is the local objective of client $k$ and $p_k = n_k / \sum_j n_j$ weights each client by its number of samples $n_k$. To address data heterogeneity and privacy, the authors combine three DocVQA datasets into a realistic non‑IID FL setting, employ a T5‑based multimodal backbone with Layout‑Induced Vision‑Text Embedding, and introduce Federated Self‑Pretraining (FSP) with Text Modeling, Layout Modeling, and Text‑Layout Modeling objectives, plus adaptive server optimization via FedAdam. Empirical results across $K \in \{3,10,30\}$ and varying client participation show that FSP improves over FedAvg by up to about $3$ points and that FedAdam generally outperforms FedAvgM in heterogeneous settings, with performance approaching that of centralized training. Overall, the work shows that privacy-preserving federated training can harness heterogeneous private documents to yield DocVQA models that generalize across domains without centralized data collection, enabling scalable collaboration across institutions.
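The weighting $p_k = n_k / \sum_j n_j$ in the objective above is the same weighting a FedAvg server applies when it averages client models. The following is a minimal sketch of that aggregation step on plain Python lists, not the authors' implementation; the function names `client_weights` and `fedavg_aggregate` and the toy numbers are assumptions made for illustration.

```python
def client_weights(sizes):
    """Return p_k = n_k / sum_j n_j for a list of client dataset sizes."""
    total = sum(sizes)
    return [n / total for n in sizes]


def fedavg_aggregate(client_params, sizes):
    """Average client parameter vectors, weighting client k by p_k.

    client_params: list of K parameter vectors (lists of floats, same length).
    sizes: list of K dataset sizes n_k (hypothetical values in the demo below).
    """
    weights = client_weights(sizes)
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]


# Toy example with K = 3 clients of unequal size (illustrative numbers only).
sizes = [100, 300, 600]
client_params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = fedavg_aggregate(client_params, sizes)
print(theta)
```

Because the weights are proportional to dataset size, the largest client dominates the average; in a non‑IID setting this is one reason the aggregated model can drift toward the domain of the biggest data silo.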

Abstract

An important handicap of document analysis research is that documents tend to be copyrighted or contain private information, which prohibits their open publication and the creation of centralised, large-scale document datasets. Instead, documents are scattered in private data silos, making extensive training over heterogeneous data a tedious task. In this work, we explore the use of a federated learning (FL) scheme as a way to train a shared model on decentralised private document data. We focus on the problem of Document VQA, a task particularly suited to this approach, as the type of reasoning capabilities required from the model can be quite different in diverse domains. Enabling training over heterogeneous document datasets can thus substantially enrich DocVQA models. We assemble existing DocVQA datasets from diverse domains to reflect the data heterogeneity in real-world applications. We explore the self-pretraining technique in this multi-modal setting, where the same data is used for both pretraining and finetuning, making it relevant for privacy preservation. We further propose combining self-pretraining with a Federated DocVQA training method using centralized adaptive optimization that outperforms the FedAvg baseline. With extensive experiments, we also present a multi-faceted analysis on training DocVQA models with FL, which provides insights for future research on this task. We show that our pretraining strategies can effectively learn and scale up under federated training with diverse DocVQA datasets, and that tuning hyperparameters is essential for practical document tasks under federation.
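The "centralized adaptive optimization" mentioned in the abstract refers to applying an adaptive optimizer on the server to the averaged client update, with FedAdam being the variant named in the TL;DR. The sketch below shows one server step in the commonly described FedAdam form (first and second moment estimates of the aggregated update, then a scaled step). It is an illustration under assumptions: the function name `fedadam_server_step` and the default values of `lr`, `beta1`, `beta2`, and `tau` are hypothetical and are not taken from the paper.

```python
import math


def fedadam_server_step(theta, delta, m, v,
                        lr=0.01, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server-side FedAdam-style update (sketch, hypothetical defaults).

    theta: current global parameters (list of floats).
    delta: aggregated client update, e.g. the weighted mean of
           (client parameters - theta) over participating clients.
    m, v:  running first and second moment estimates of delta.
    tau:   small constant controlling the degree of adaptivity.
    """
    new_m = [beta1 * mi + (1 - beta1) * di for mi, di in zip(m, delta)]
    new_v = [beta2 * vi + (1 - beta2) * di * di for vi, di in zip(v, delta)]
    new_theta = [
        ti + lr * mi / (math.sqrt(vi) + tau)
        for ti, mi, vi in zip(theta, new_m, new_v)
    ]
    return new_theta, new_m, new_v


# Toy round: a single parameter and an aggregated update of 1.0.
theta, m, v = [0.0], [0.0], [0.0]
theta, m, v = fedadam_server_step(theta, [1.0], m, v)
print(theta, m, v)
```

Plain FedAvg corresponds to replacing the scaled step with `theta + delta`; the adaptive version normalizes each coordinate by its recent update magnitude, which is the property usually cited as helpful when client updates are heterogeneous.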

Paper Structure (18 sections, 1 equation, 6 figures, 7 tables, 1 algorithm)


Introduction
Related Work
Document Visual Question Answering
Cross-silo Federated Learning
Self-Pretraining
Methodology
Problem Statement
Federated Learning
Federated Self-Pretraining (FSP)
Experimental Setup
Dataset Selection
Data Partition
Model
Results
Baselines
...and 3 more sections

Figures (6)

Figure 1: The general scheme of Federated Document Visual Question Answering. The fundamental concept involves collaborative training of a DocVQA model under the coordination of the server. The training takes place locally at each client and only model updates are communicated between client and server, ensuring that private data is never shared, thus preserving privacy.
Figure 2: The overall architecture of the proposed model in our experiments.
Figure 3: FedAvg training progress with Adam optimizer as CLIENTOPT. This plot presents the validation loss (left) and metric (right) curves for $K=3$ while varying $C$. We fit a quadratic regression model for better visualization.
Figure 4: Breakdown of per-dataset metrics on validation set over FedAvg training, with $K=3$ and $C=0.35$.
Figure 5: Validation/Test performance vs. communication rounds $T$. (Left) validation metric of FedAvg baseline while varying T in finetuning, (Right) test metric of FSP+FedAvg while varying T in pretraining.
...and 1 more figure
