Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection

Jiahao Wang; Mingxuan Li; Haichen Luo; Jinguo Zhu; Aijun Yang; Mingzhe Rong; Xiaohua Wang

TL;DR

This paper introduces Power-LLaVA, the first large language and vision assistant designed to offer professional and reliable inspection services for power transmission lines by engaging in dialogues with humans. It also constructs a large-scale, high-quality dataset specialized for the inspection task.

Abstract

The inspection of power transmission lines has made notable progress in the past few years, primarily due to the integration of deep learning technology. However, current inspection approaches still struggle with generalization and intelligence, which restricts their broader applicability. In this paper, we introduce Power-LLaVA, the first large language and vision assistant designed to offer professional and reliable inspection services for power transmission lines by engaging in dialogues with humans. We also construct a large-scale, high-quality dataset specialized for the inspection task. By employing a two-stage training strategy on the constructed dataset, Power-LLaVA achieves strong performance at a comparatively low training cost. Extensive experiments further demonstrate the capabilities of Power-LLaVA in power transmission line inspection. Code will be released.

Paper Structure (13 sections, 4 equations, 5 figures, 2 tables)

Introduction
Related Works
Method
Model Architecture
Dataset Construction
Training Objective
Training Strategy
Experiments
Evaluation Benchmark
Setup
Main Results
Ablation Studies
Conclusion

Figures (5)

Figure 1: Comparison of LLaVA, GPT-4V, and Power-LLaVA. Power-LLaVA gives the most comprehensive and specialized response to a power transmission line inspection query.
Figure 2: Overview of our model. First, the vision encoder processes the input image and extracts its features as visual embeddings. These embeddings are then aligned with the word embeddings of the LLM via the projection module. Finally, the LLM processes the visual embeddings derived from the image and the word embeddings derived from the text in a unified manner and generates the text response.
Figure 3: Construction pipeline of our proposed dataset. For each image obtained from real-world power transmission line scenarios, we annotate four captions using state-of-the-art Vision-Language (VL) models and object detection labels using detection models. Building upon the captions, object detection labels, and templates provided by human annotators, ChatGPT is employed to generate a specialized, high-quality dataset for instruction tuning.
Figure 4: Example of the visual understanding and reasoning capability of Power-LLaVA in comparison to other models. Power-LLaVA interprets images and instructions with a high level of professionalism. It is also capable of multi-round dialogue and complex reasoning.
Figure 5: Performance of Power-LLaVA and LLaVA as the scale of the instruction fine-tuning dataset varies.
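
The data flow described in the Figure 2 caption (vision encoder, projection module, LLM) can be illustrated with a minimal sketch. It is not the authors' implementation: the text here does not specify the form of the projection module, so a single linear map is assumed, and the names `project` and `build_llm_input` are hypothetical. The sketch only shows how visual embeddings might be mapped into the word-embedding dimension and placed in one sequence with the text embeddings before reaching the LLM.

```python
def project(visual_embeddings, weight):
    """Map each visual embedding (length d_v) to the word-embedding size d_t.

    Assumption: the projection is one linear map with a d_v x d_t weight matrix.
    """
    columns = list(zip(*weight))  # d_t columns, each of length d_v
    return [
        [sum(v * w for v, w in zip(vec, col)) for col in columns]
        for vec in visual_embeddings
    ]


def build_llm_input(visual_embeddings, word_embeddings, weight):
    """Concatenate projected visual embeddings with word embeddings.

    The resulting sequence is what a language model would consume
    "in a unified manner"; the LLM itself is outside this sketch.
    """
    return project(visual_embeddings, weight) + word_embeddings


# Toy example: 2 image patches (d_v = 3), 2 text tokens (d_t = 2).
weight = [[1, 0], [0, 1], [1, 1]]
visual = [[1, 2, 3], [0, 1, 0]]
words = [[7, 8], [9, 10]]
sequence = build_llm_input(visual, words, weight)
```

In this toy example `sequence` holds four embeddings of length 2: the two projected patches followed by the two text tokens. Placing the image tokens before the text tokens is itself an assumption made for illustration.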
