IFShip: Interpretable Fine-grained Ship Classification with Domain Knowledge-Enhanced Vision-Language Models

Mingning Guo, Mengwei Wu, Yuxiang Shen, Haifeng Li, Chao Tao

TL;DR

The paper tackles the interpretability gap in remote-sensing fine-grained ship classification (FGSC) by introducing domain knowledge-enhanced Chain-of-Thought (CoT) prompts and TITANIC-FGS, a task-specific instruction-following dataset constructed with them. IFShip, a vision-language model (VLM) obtained by fine-tuning LLaVA with LoRA on TITANIC-FGS, improves classification accuracy and makes its reasoning process transparent; an FGSC visual chatbot built on IFShip works in two stages and explains its decisions in natural language. On TITANIC-FGS and FGSCR42, IFShip outperforms state-of-the-art FGSC methods and general VLMs in per-category accuracy, and its outputs on caption and VQA tasks contain substantially fewer hallucinations. The results highlight the value of domain knowledge in instruction tuning: it improves feature discrimination under varying imaging conditions and yields a practical, interpretable tool for fine-grained remote-sensing classification, with potential applicability to other domains.

Abstract

End-to-end interpretation currently dominates the remote sensing fine-grained ship classification (RS-FGSC) task. However, the inference process remains uninterpretable, leading to criticisms of these models as "black box" systems. To address this issue, we propose a domain knowledge-enhanced Chain-of-Thought (CoT) prompt generation mechanism, which is used to semi-automatically construct a task-specific instruction-following dataset, TITANIC-FGS. By training on TITANIC-FGS, we adapt general-domain vision-language models (VLMs) to the FGSC task, resulting in a model named IFShip. Building upon IFShip, we develop an FGSC visual chatbot that redefines the FGSC problem as a step-by-step reasoning task and conveys the reasoning process in natural language. Experimental results show that IFShip outperforms state-of-the-art FGSC algorithms in both interpretability and classification accuracy. Furthermore, compared to VLMs such as LLaVA and MiniGPT-4, IFShip demonstrates superior performance on the FGSC task. It provides an accurate chain of reasoning when fine-grained ship types are recognizable to the human eye and offers interpretable explanations when they are not. Our dataset is publicly available at: https://github.com/lostwolves/IFShip.

Paper Structure (16 sections, 3 equations, 10 figures, 10 tables)

This paper contains 16 sections, 3 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Comparison between traditional end-to-end interpretation and large vision-language model based interpretation.
  • Figure 2: Comparison of Instruction Generation Methods: Prompt-Template Filling vs. Domain Knowledge-Enhanced CoT Prompts.
  • Figure 3: Feature information for 16 fine-grained ship categories in the FGSC domain knowledge base.
  • Figure 4: Examples of descriptions of discriminative features for different ship targets.
  • Figure 5: Principles of CoT instruction generation.
  • ...and 5 more figures