GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation

Ayan Banerjee, Sanket Biswas, Josep Lladós, Umapada Pal

TL;DR

GraphKD introduces a graph-based knowledge distillation framework to efficiently transfer knowledge from large teachers to lightweight students for document object detection. By constructing structured RoI-based graphs with nodes representing instances and edges encoding relations, and by applying adaptive text-bias mitigation and a graph distillation loss that combines node and edge imitation, GraphKD enables heterogeneous distillation (e.g., ViT to CNN) and achieves competitive performance with far fewer parameters. Extensive ablations show the importance of edge structure and non-text node distillation, while comparative studies demonstrate superiority over several KD baselines. The work advances edge-deployable document understanding by preserving structural insights through graph topology, though cross-architecture distillation with transformers remains a challenging direction for future work.

Abstract

Object detection in documents is a key step in automating the identification of structural elements in a digital or scanned document, through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource-constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. We design a structured graph with nodes containing proposal-level features and edges representing the relationships between the different proposal regions. To reduce text bias, an adaptive node sampling strategy is designed to prune the weight distribution and put more weight on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss, effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: https://github.com/ayanban011/GraphKD.
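
To make the idea of distilling a graph concrete, the following is a minimal sketch of a loss that combines node imitation with edge imitation. It is not the authors' implementation: the function names, the use of Euclidean distance as the edge relation, the mean-squared-error form of both terms, and the `node_weights` and `edge_weight` parameters are all assumptions chosen for illustration. It also assumes teacher and student node features already share the same dimension, which in a heterogeneous setting would typically require an extra projection layer.

```python
import numpy as np


def pairwise_edges(nodes):
    """Hypothetical edge relation: Euclidean distance between every pair of nodes."""
    diff = nodes[:, None, :] - nodes[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def graph_distill_loss(teacher_nodes, student_nodes, node_weights, edge_weight=1.0):
    """Weighted node imitation plus edge imitation (illustrative form only).

    teacher_nodes, student_nodes: arrays of shape (num_nodes, feature_dim).
    node_weights: shape (num_nodes,); larger values could be assigned to
    non-text nodes to counteract text bias (an assumption, not the paper's rule).
    """
    teacher_nodes = np.asarray(teacher_nodes, dtype=float)
    student_nodes = np.asarray(student_nodes, dtype=float)
    node_weights = np.asarray(node_weights, dtype=float)

    # Node term: per-node mean squared error, averaged with the given weights.
    per_node = ((teacher_nodes - student_nodes) ** 2).mean(axis=1)
    node_loss = (node_weights * per_node).sum() / node_weights.sum()

    # Edge term: mean squared error between the two pairwise-distance matrices.
    edge_gap = pairwise_edges(teacher_nodes) - pairwise_edges(student_nodes)
    edge_loss = (edge_gap ** 2).mean()

    return float(node_loss + edge_weight * edge_loss)
```

In this sketch the node term captures local, per-proposal agreement, while the edge term captures the global arrangement of proposals relative to one another; a student whose features are merely shifted by a constant incurs node loss but no edge loss.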

Paper Structure (23 sections, 2 equations, 5 figures, 10 tables, 2 algorithms)

Figures (5)

  • Figure 1: GraphKD creates a graph from the RoI-pooled features of both the teacher and student networks and uses a graph distillation loss for knowledge transfer. Finally, a Graph Convolution Network is used to predict the object classes.
  • Figure 1: Understanding the global relationships of the class instances through UMAP.
  • Figure 2: Structured graph creation: we first extract the RoI-pooled features and classify them as "Text" or "Non-text" based on their covariance. We then initialize nodes in the identified RoI regions and define the adjacency edges. Lastly, we iteratively merge the text nodes with an adaptive sample mining strategy to reduce text bias.
  • Figure 2: Graph creation without node indexing: here one node represents all instances of each class, as the nodes are built in the feature embedding space.
  • Figure 3: Qualitative analysis of various distilled networks on the PRIMA dataset (left: predicted; right: ground truth).