Assisting humans in complex comparisons: automated information comparison at scale

Truman Yuen; Graham A. Watt; Yuri Lawryshyn

TL;DR

This work tackles the token-length and retrieval constraints of large language models in large-scale information comparison by introducing ASC$^2$End, a pre-retrieval pipeline that combines abstractive summarization, criteria embedding, and retrieval-augmented generation (RAG) to perform cross-document comparisons with minimal domain-specific training. It partitions tasks into machine-level (summarization) and human-level (comparison reasoning) work and selects a suitable model for each (e.g., Mistral 7B for document summarization (DS) and GPT-4 for comparison assessment (CA)) to balance efficiency and reasoning quality. Evaluated on a 1253-document financial news corpus and a 20-page sustainability criteria document, the system demonstrates strong ROUGE performance for summarization and superior CA accuracy with GPT-4, while ablations confirm the necessity of DS, RAG, and CA. The framework achieves time- and cost-efficient scaling for automated information analysis across domains, with practical implications for rapid, evidence-based decision support in finance and other knowledge areas.

Abstract

Generative Large Language Models (LLMs) enable efficient analytics across knowledge domains, rivalling human experts in information comparisons. However, applying LLMs to information comparison faces scalability challenges: it is difficult to maintain information across large contexts and to work within model token limits. To address these challenges, we developed the novel Abstractive Summarization & Criteria-driven Comparison Endpoint (ASC$^2$End) system to automate information comparison at scale. Our system employs Semantic Text Similarity comparisons to generate evidence-supported analyses. We utilize proven data-handling strategies such as abstractive summarization and retrieval-augmented generation to overcome token limitations and retain relevant information during model inference. Prompts were designed using zero-shot strategies to contextualize information for improved model reasoning. We evaluated abstractive summarization using ROUGE scoring and assessed the quality of the generated comparisons using survey responses. Models evaluated within the ASC$^2$End system produced desirable results, which provide insight into the system's expected performance. ASC$^2$End is a novel system and tool that enables accurate, automated information comparison at scale across knowledge domains, overcoming limitations in context length and retrieval.
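As a rough illustration of the similarity-driven retrieval idea mentioned in the abstract, the sketch below ranks criteria passages against a document summary using cosine similarity. It is not the authors' implementation: the bag-of-words vectors are a toy stand-in for a real embedding model and vector database, and the function names (`bag_of_words`, `cosine_similarity`, `retrieve`) and the sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(summary: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k criteria passages most similar to a document summary."""
    query = bag_of_words(summary)
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(query, bag_of_words(p)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical data, for illustration only.
criteria_passages = [
    "Companies should disclose greenhouse gas emissions annually.",
    "Board diversity targets must be reported to shareholders.",
    "Water usage reduction plans are required for manufacturing sites.",
]
summary = "The company reported a reduction in greenhouse gas emissions this year."
top = retrieve(summary, criteria_passages, k=1)
```

In a fuller pipeline, the retrieved passages and the summary would then be passed together to a prompt for the comparison step; a dense embedding model would normally replace the word-count vectors used here.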

Paper Structure (27 sections, 6 equations, 8 figures, 6 tables)

Introduction
The ASC$^2$End System
Method
Machine-Level vs. Human-Level Reasoning
Model Selection
Data Preprocessing of Text Corpus and Criteria
Data Acquisition
Document Summarization (DS)
Criteria Embedding (CE)
Retrieval Augmented Generation (RAG)
Comparison Assessment (CA)
Baseline and Ablation Study Experiment Setup
Experimental Setup
Results
Abstractive Summarization Results
...and 12 more sections

Figures (8)

Figure 1: ASC$^2$End Pipeline. The Document Summarization (DS) and Criteria Embedding (CE) modules preprocess the information provided by the user. Inputs are highlighted at the beginning of each module (document corpus in DS and criteria document in CE). The document summary is supplied to both the Retrieval Augmented Generation (RAG) prompt and the Comparison Assessment (CA) module. The vector database is used to drive the similarity search in the RAG module. The RAG prompt is combined with the results of the similarity search, where the information is relayed to a human-level LLM to enhance the retrieved passages. The CA module uses the information preprocessed from the DS and CE modules to perform the comparison assessment task using the same human-level LLM as the RAG process. Highlighted in the final step are the generated assessments for each document.
Figure 2: Prompt used to perform the summarization task. {split_text} refers to each 2000-token chunk that is provided to the LLM to summarize.
Figure 3: Prompt provided to perform the RAG task. The information provided to {summary} was each summarized document and {target_topic} was a user-defined input to control the scope of the search.
Figure 4: Prompt provided to perform the comparison assessment. The {summary} generated by the Document Summarization module and the {retrieved_text} output by the RAG module are provided to the prompt to perform the information retrieval and comparison tasks. The {target_topic} and {company} are provided by the user to direct the scope of comparison.
Figure 5: ROUGE scores of the four machine-level reasoning LLMs in the first abstractive summarization assessment, with a 2500-token limit. ROUGE-1, ROUGE-2, and ROUGE-L were evaluated; scores range from 0 to 1. The F1 score (right) is computed from the precision (left) and recall (middle) scores.
...and 3 more figures
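
The Figure 5 caption reports precision, recall, and F1 for ROUGE. As a minimal sketch of how those three quantities relate for ROUGE-n, the code below computes them from clipped n-gram overlap with whitespace tokenization. The function names are hypothetical, the scoring tool used in the paper is not stated here, and ROUGE-L (which is based on the longest common subsequence) is not covered.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict[str, float]:
    """ROUGE-n precision, recall, and F1 from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences only, not data from the paper.
scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
```

Precision divides the overlap by the number of n-grams in the generated summary, recall divides it by the number in the reference, and F1 is their harmonic mean, so all three fall between 0 and 1.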
