Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering

Sagar Srinivas Sakhinana, Geethan Sannidhi, Venkataramana Runkana

TL;DR

This work proposes a secure, on-premises enterprise solution using a hierarchical, multi-agent Retrieval Augmented Generation (RAG) framework for open-domain question answering (ODQA) tasks, offering enhanced data privacy, explainability, and cost-effectiveness.

Abstract

In the chemical and process industries, Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) are critical for design, construction, and maintenance. Recent advances in Generative AI, in particular Large Multimodal Models (LMMs) such as GPT4 (Omni), have shown promise in understanding and interpreting process diagrams for Visual Question Answering (VQA). However, proprietary models pose data privacy risks, and their computational complexity prevents knowledge editing for domain-specific customization on consumer hardware. To overcome these challenges, we propose a secure, on-premises enterprise solution using a hierarchical, multi-agent Retrieval Augmented Generation (RAG) framework for open-domain question answering (ODQA) tasks, offering enhanced data privacy, explainability, and cost-effectiveness. Our novel multi-agent framework employs introspective and specialized sub-agents that use open-source, small-scale multimodal models with the ReAct (Reason+Act) prompting technique for PFD and P&ID analysis, integrating multiple information sources to provide accurate and contextually relevant answers. Our approach, supported by iterative self-correction, aims to deliver superior performance in ODQA tasks. We conducted rigorous experimental studies, and the empirical results validated the effectiveness of the proposed approach.

Paper Structure

This paper contains 8 sections, 14 equations, 3 figures, 15 tables.

Figures (3)

  • Figure 1: The figure shows the multi-agent framework for ODQA on complex documents, applied to PFD and P&ID analysis. The framework is built around an introspective agent that comprises a main agent and a critique agent. The main agent orchestrates specialized sub-agents and routes each end-user request to the relevant sub-agent. Each sub-agent uses small-scale multimodal models (SMMs) with the ReAct technique in a four-stage workflow (task planning, tool selection, tool calling, and response generation) that gives it dynamic access to external tools and specialized resources beyond its pre-trained knowledge. In task planning, user intent is analyzed and the query is decomposed into sub-tasks. In tool selection, the SMMs choose appropriate tools (APIs, databases, external knowledge repositories) to solve these sub-tasks. In tool calling, the SMMs extract the required parameters from the user query and call the selected tools to retrieve relevant information from document databases, memory databases, web searches, and Wikipedia articles. Memory consists of long-term memory, which stores reusable information for future queries, and short-term memory, which holds session-specific data for immediate processing. Finally, response generation integrates the tool outputs with the SMMs' internal knowledge to create comprehensive and coherent responses, and the critique agent iteratively refines these outputs through reflection and correction cycles. The SMMs are fine-tuned during task-specific adaptation to select appropriate tools and use them accurately, so that responses are accurate and contextually relevant.
  • Figure 2: The figure shows the PFD of a crude oil distillation unit.
  • Figure 3: The figure shows the text detection and OCR results for the PFD.
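
The control flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name in it (`Memory`, `SubAgent`, `route`, `introspective_answer`, the toy tools) is hypothetical, and simple string heuristics stand in for the fine-tuned SMMs that would perform planning, tool selection, and critique in the actual framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tool type: takes a sub-task string, returns retrieved text.
Tool = Callable[[str], str]


@dataclass
class Memory:
    long_term: Dict[str, str] = field(default_factory=dict)  # reusable across queries
    short_term: List[str] = field(default_factory=list)      # session-specific data


@dataclass
class SubAgent:
    name: str
    tools: Dict[str, Tool]
    memory: Memory = field(default_factory=Memory)

    def plan(self, query: str) -> List[str]:
        # Stage 1, task planning. Assumption: splitting on " and "
        # stands in for model-driven query decomposition.
        return [part.strip() for part in query.split(" and ") if part.strip()]

    def select_tool(self, sub_task: str) -> str:
        # Stage 2, tool selection. Assumption: keyword match on the tool
        # name, falling back to the first registered tool.
        for tool_name in self.tools:
            if tool_name in sub_task.lower():
                return tool_name
        return next(iter(self.tools))

    def call_tool(self, tool_name: str, sub_task: str) -> str:
        # Stage 3, tool calling. Long-term memory is checked first so that
        # results for a repeated sub-task are reused instead of re-fetched.
        if sub_task in self.memory.long_term:
            return self.memory.long_term[sub_task]
        result = self.tools[tool_name](sub_task)
        self.memory.short_term.append(result)
        self.memory.long_term[sub_task] = result
        return result

    def respond(self, query: str) -> str:
        # Stage 4, response generation. Assumption: joining observations
        # stands in for the model merging them with its own knowledge.
        observations = [
            self.call_tool(self.select_tool(task), task) for task in self.plan(query)
        ]
        return " | ".join(observations)


def route(query: str, agents: Dict[str, SubAgent]) -> SubAgent:
    # Main agent: send the request to the first sub-agent whose key
    # appears in the query, else to the first registered sub-agent.
    for key, agent in agents.items():
        if key in query.lower():
            return agent
    return next(iter(agents.values()))


def introspective_answer(
    query: str,
    agents: Dict[str, SubAgent],
    critic: Callable[[str], str],       # returns feedback, "" when satisfied
    refine: Callable[[str, str], str],  # applies feedback to the answer
    max_rounds: int = 3,
) -> str:
    answer = route(query, agents).respond(query)
    for _ in range(max_rounds):  # reflection and correction cycles
        feedback = critic(answer)
        if not feedback:
            break
        answer = refine(answer, feedback)
    return answer


# Toy usage with stub tools and a trivial critic.
tools = {"wikipedia": lambda q: f"wiki[{q}]", "document": lambda q: f"doc[{q}]"}
agents = {"pfd": SubAgent("pfd", tools), "p&id": SubAgent("p&id", tools)}
answer = introspective_answer(
    "pfd document lookup and wikipedia summary",
    agents,
    critic=lambda a: "" if a.endswith(".") else "end with a period",
    refine=lambda a, feedback: a + ".",
)
```

Bounding the critique loop with `max_rounds` is a design choice of this sketch; it keeps the reflection cycle from running indefinitely when the critic is never satisfied.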