ViRED: Prediction of Visual Relations in Engineering Drawings

Chao Gu, Ke Lin, Yiyang Luo, Jiahui Hou, Xiang-Yang Li

TL;DR

ViRED is a vision-based relation detection model that identifies the associations between tables and circuits in electrical engineering drawings.

Abstract

To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model consists of three main parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED in PyTorch and conduct a series of experiments to validate its efficacy. The experimental results indicate that, on the engineering drawing dataset, our approach attains an accuracy of 96% on the relation prediction task, a substantial improvement over existing methods. The results also show that ViRED performs inference quickly even when a single engineering drawing contains numerous objects.

Paper Structure (22 sections, 8 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An example of an electrical engineering drawing. There are 6 circuits and 6 tables in this drawing. The drawing has been modified to prevent copyright infringement.
  • Figure 2: Overview of the general pipeline of ViRED. (a) Engineering drawings are processed through the Vision Encoder, Object Encoder, Relation Decoder, and Relation Prediction Model. (b) The Object Encoder converts the instance masks and types into mask and type embeddings, which are then aggregated to form the object tokens. (c) The Relation Decoder takes the object tokens as inputs and integrates them with the image features from the Vision Encoder through a cross-attention mechanism. Residual connections between layers are omitted for simplicity. (d) During pretraining, the model encodes the document images and position masks and, after decoding through the Relation Decoder, predicts the image classification at the position where the mask is located.
  • Figure 3: The implementation of the relation prediction model. The semi-transparent tokens represent the filtered parts, which do not participate in the relation prediction computation.
  • Figure 4: Qualitative results of our model. Captions X.1 and X.2 denote the original electrical engineering drawings and the model's prediction results, respectively.
  • Figure 5: FLOPs of different models with respect to different numbers of objects.
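
The data flow described in the Figure 2 caption (object tokens built from mask and type embeddings, then combined with image features through cross-attention) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation, which the abstract says is written in PyTorch. Everything specific here is an assumption made for illustration: the aggregation is taken to be element-wise addition, the attention is single-head with no learned projections, and all vectors, names, and dimensions are hypothetical toy values.

```python
import math


def add_vectors(a, b):
    """Element-wise sum; stands in for the unspecified 'aggregation' step (assumption)."""
    return [x + y for x, y in zip(a, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(object_tokens, image_features):
    """Single-head cross-attention without learned projections (simplification).

    Each object token acts as a query; the image feature vectors act as
    both keys and values. Returns the attended tokens and attention weights.
    """
    dim = len(image_features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for query in object_tokens:
        # Scaled dot-product score between this query and every image feature.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in image_features]
        weights = softmax(scores)
        # Weighted sum of the image features (used as values).
        attended = [sum(w * value[d] for w, value in zip(weights, image_features))
                    for d in range(dim)]
        outputs.append(attended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy inputs: two objects (one circuit, one table), 4-dim embeddings.
mask_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
type_embeddings = {"circuit": [0.0, 0.0, 1.0, 0.0], "table": [0.0, 0.0, 0.0, 1.0]}
object_types = ["circuit", "table"]

# Object token = mask embedding + type embedding (assumed aggregation).
object_tokens = [add_vectors(mask, type_embeddings[kind])
                 for mask, kind in zip(mask_embeddings, object_types)]

# Three hypothetical image feature vectors standing in for Vision Encoder output.
image_features = [[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.5, 0.5, 0.5, 0.5]]

updated_tokens, attention = cross_attention(object_tokens, image_features)
```

In this toy setup each object token attends most strongly to the image feature it overlaps with. A real decoder would add learned query, key, and value projections, multiple heads, and the residual connections that the caption notes are left out of the figure.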