Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding

Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, Khoa Luu

TL;DR

This paper tackles the lack of insect-specific knowledge in vision-language models by introducing Insect-LLaVA, a multimodal conversational model trained on a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data. It also develops a dedicated Insect Foundation Model that emphasizes micro-feature learning through Patch-wise Relevant Attention and a Description Consistency loss, enabling fine-grained insect understanding for tasks like classification, detection, and VQA. The approach yields state-of-the-art results on insect benchmarks, demonstrated through comprehensive Insect-VQA and zero-shot evaluations, and is supported by a scalable data-and-model release to bolster precision agriculture research. Overall, the work provides a pathway to robust, domain-specific vision-language systems for entomology and agricultural applications, anchored by a rich, hierarchical insect dataset and targeted micro-feature learning techniques.

Abstract

Multimodal conversational generative AI has shown impressive capabilities in a variety of vision and language understanding tasks by learning from massive text-image data. However, current conversational models still lack knowledge about insects, since they are typically trained on general-domain vision-language data. Meanwhile, understanding insects is a fundamental problem in precision agriculture and helps promote sustainable agricultural development. Therefore, this paper proposes a novel multimodal conversational model, Insect-LLaVA, to promote visual understanding in the insect domain. In particular, we first introduce a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data that enables the training of multimodal foundation models. Our proposed dataset enables conversational models to comprehend the visual and semantic features of insects. Second, we propose Insect-LLaVA, a new general Large Language and Vision Assistant for Visual Insect Understanding. Third, to enhance the learning of insect features, we develop an Insect Foundation Model by introducing a new micro-feature self-supervised learning approach with a Patch-wise Relevant Attention mechanism to capture the subtle differences among insect images. We also present a Description Consistency loss to improve micro-feature learning via text descriptions. Experimental results on our new Visual Insect Question Answering benchmarks illustrate the effectiveness of our proposed approach in visual insect understanding, and the approach achieves State-of-the-Art performance on standard benchmarks of insect-related tasks.

Paper Structure

This paper contains 24 sections, 12 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Our Proposed Multimodal Dataset and Visual Insect Instruction Data. The left panel shows samples from the four subphyla: Chelicerata, Crustacea, Hexapoda, and Myriapoda. The middle panel shows an example of hierarchical descriptions of the Aurantia Species. The right panel shows the corresponding insect instruction data.
  • Figure 2: Our Patch-wise Relevant Attention. Given masked insect images and separated image patches, our model can learn to distinguish patches with minor differences via relevant scores computed between masked images and image patches.
  • Figure 3: Treemap of the Multimodal Dataset. Nested boxes represent classes, orders, and families. The size of the boxes represents the relative number of samples.
  • Figure 4: Our Data Collection Pipeline.
  • Figure 5: The Distribution of Classes in the Multimodal Insect Dataset (Left) and the Distribution of Insecta Orders (Right).
  • ...and 12 more figures