Table of Contents
Fetching ...

Target Prompting for Information Extraction with Vision Language Model

Dipankar Medhi

TL;DR

Target Prompting addresses the challenge of extracting precise information from image-based documents by explicitly guiding vision-language models to focus on designated regions. The method employs the Phi-3-vision-instruct model with a region-aware prompting scheme and a curated dataset of document images paired with targeted prompts. Experimental results, based on manual relevance assessments, show that region-focused prompts improve accuracy and reduce noise relative to general prompts, though further testing is needed. The study highlights the practical potential of targeted prompts for structured information retrieval in multimodal document QA workflows, with plans to expand datasets and evaluation scope in future work.

Abstract

The recent trend in the Large Vision and Language model has brought a new change in how information extraction systems are built. VLMs have set a new benchmark with their State-of-the-art techniques in understanding documents and building question-answering systems across various industries. They are significantly better at generating text from document images and providing accurate answers to questions. However, there are still some challenges in effectively utilizing these models to build a precise conversational system. General prompting techniques used with large language models are often not suitable for these specially designed vision language models. The output generated by such generic input prompts is ordinary and may contain information gaps when compared with the actual content of the document. To obtain more accurate and specific answers, a well-targeted prompt is required by the vision language model, along with the document image. In this paper, a technique is discussed called Target prompting, which focuses on explicitly targeting parts of document images and generating related answers from those specific regions only. The paper also covers the evaluation of response for each prompting technique using different user queries and input prompts.

Target Prompting for Information Extraction with Vision Language Model

TL;DR

Target Prompting addresses the challenge of extracting precise information from image-based documents by explicitly guiding vision-language models to focus on designated regions. The method employs the Phi-3-vision-instruct model with a region-aware prompting scheme and a curated dataset of document images paired with targeted prompts. Experimental results, based on manual relevance assessments, show that region-focused prompts improve accuracy and reduce noise relative to general prompts, though further testing is needed. The study highlights the practical potential of targeted prompts for structured information retrieval in multimodal document QA workflows, with plans to expand datasets and evaluation scope in future work.

Abstract

The recent trend in the Large Vision and Language model has brought a new change in how information extraction systems are built. VLMs have set a new benchmark with their State-of-the-art techniques in understanding documents and building question-answering systems across various industries. They are significantly better at generating text from document images and providing accurate answers to questions. However, there are still some challenges in effectively utilizing these models to build a precise conversational system. General prompting techniques used with large language models are often not suitable for these specially designed vision language models. The output generated by such generic input prompts is ordinary and may contain information gaps when compared with the actual content of the document. To obtain more accurate and specific answers, a well-targeted prompt is required by the vision language model, along with the document image. In this paper, a technique is discussed called Target prompting, which focuses on explicitly targeting parts of document images and generating related answers from those specific regions only. The paper also covers the evaluation of response for each prompting technique using different user queries and input prompts.
Paper Structure (7 sections, 5 figures, 1 table)

This paper contains 7 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of a RAG pipeline.Encoder converts text data to vector embeddings. Retriver fetches relevant chunks from the vector store and feeds the retrieved information to LLM to generate a response for the query.
  • Figure 2: Dataset sample document images. Showcases a few sample document pages from the dataset.
  • Figure 3: Information extraction overview. Model input includes document page image and user query with the system prompt "<image_1>".
  • Figure 4: Overview of general prompt output. The responses put all the details from the document images in a single chunk of text.
  • Figure 5: Target Prompting results. The responses are more accurate and to the point. Gives more control over the model response.