vitaLITy 2: Reviewing Academic Literature Using Large Language Models

Hongye An, Arpit Narechania, Emily Wall, Kai Xu

TL;DR

vitaLITy 2 introduces an LLM-powered visual analytics system for literature review that uses a Retrieval Augmented Generation architecture to semantically search and analyze a corpus of 66,692 papers. It combines multiple text embeddings (ADA, GloVe, SPECTER) with vector databases (Faiss, ChromaDB) and a LangChain-driven prompt-chaining framework to support natural-language queries, summarization, and literature-review drafting via a chat interface. The system extends vitaLITy 1 with ADA embeddings, an enhanced UI, and novel capabilities to summarize collections of papers and generate literature reviews, all available as open-source software. The work acknowledges limitations such as the lack of full-text access and potential LLM hallucinations, and outlines future enhancements, including full-text chunking and external knowledge integration, to improve accuracy and utility.

Abstract

Academic literature reviews have traditionally relied on techniques such as keyword searches and accumulation of relevant back-references, using databases like Google Scholar or IEEEXplore. However, both the precision and accuracy of these search techniques are limited by the presence or absence of specific keywords, making literature review akin to searching for needles in a haystack. We present vitaLITy 2, a solution that uses a Large Language Model (LLM)-based approach to identify semantically relevant literature in a textual embedding space. We include a corpus of 66,692 papers from 1970 to 2023, searchable through text embeddings created by three language models. vitaLITy 2 contributes a novel Retrieval Augmented Generation (RAG) architecture and supports interaction through an LLM with augmented prompts, including summarization of a collection of papers. vitaLITy 2 also provides a chat interface that allows users to perform complex queries without learning any new programming language. This also enables users to take advantage of the knowledge captured in the LLM from its enormous training corpus. Finally, we demonstrate the applicability of vitaLITy 2 through two usage scenarios. vitaLITy 2 is available as open-source software at https://vitality-vis.github.io.
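
To make the idea of finding semantically relevant literature in an embedding space concrete, the following is a minimal sketch of similarity search over paper embeddings. It is an illustration under stated assumptions, not the system's implementation: the three-dimensional vectors, the paper titles, and the helper names `cosine_similarity` and `top_k_similar` are all hypothetical, and cosine similarity is assumed as the ranking metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, corpus, k=2):
    """Return the k (title, score) pairs whose embeddings are closest to query_vec."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real embeddings have hundreds or thousands of dimensions.
toy_corpus = {
    "Paper A": [0.9, 0.1, 0.0],
    "Paper B": [0.0, 1.0, 0.0],
    "Paper C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # assumed embedding of a seed paper or draft abstract

results = top_k_similar(query, toy_corpus, k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

Unlike a keyword match, the ranking depends only on vector proximity, so two papers can score highly without sharing any specific term. At the scale of the full corpus, a vector database would replace the linear scan shown here.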

Paper Structure (13 sections, 4 figures)

Figures (4)

  • Figure 1: Architecture of vitaLITy 2: (Step 1) User input to the system. (Steps 2 & 3) Retrieve data from the vector database. (Step 4) Combine the result with the user input in the prompt. (Step 5) Receive the result from the LLM. (Step 6) Return the final result to the user.
  • Figure 2: The vitaLITy 2 User Interface. (A) Paper Collection View shows the entire corpus of publications, (B) Similarity Search View shows options to look up publications that are similar to another list of publications or to a work-in-progress title and abstract, (C) Visualization Canvas shows an interactive 2-D UMAP projection of the embedding space of the entire paper collection, (D) Meta View shows summaries of certain attributes with respect to the Paper Collection View (A), (E) opens a Saved Papers View from which the saved papers can be exported as JSON. Extending vitaLITy 1, we added (F) a Chat with your Data view that allows users to ask natural-language questions about the paper corpus. We also added ADA embeddings (in addition to GloVe and SPECTER embeddings) and enabled users to Summarize or write a Literature Review on the Saved Papers using LLMs, including the ability to customize the prompts.
  • Figure 3: Noori's process of doing a literature review using vitaLITy 2. (Step 1) Noori searches the vitaLITy 2 database for articles recommended by her supervisor. (Step 2) Noori selects the ADA embedding as the embedding option used by vitaLITy 2 in "Similarity Search". (Step 3) Noori adds the paper she just searched for as a seed for "Similarity Search". (Step 4) Noori uses "Similarity Search" to find some related papers. (Step 5) Noori saves papers with a similarity score of $>0.1$ and highlights them in the UMAP visualization map. (Step 6) Noori selects an additional paper that interests her in the UMAP visualization map. (Step 7) Noori modifies and tries different prompts and runs "Summarize" and "Literature Review". (Step 8) Noori exports the saved papers to a bib file.
  • Figure 4: Aaron's process for literature search using vitaLITy 2. (Step 1) Aaron uses the "Chat with your data" feature to quickly explore the domain of "grounded theory", an area unfamiliar to him. (Step 2) Aaron plots a paper cited in the LLM feedback onto the UMAP visualization interface. (Step 3) Aaron selects a set of closely related papers from the UMAP visualization and adds these papers to the "Similarity Search". (Step 4) Aaron uses the "Similarity Search" feature to find some semantically similar papers. (Step 5) Aaron saves a subset of papers of particular interest to his "saved papers" list. (Step 6) Aaron employs the "Summarize" and "Literature Review" features to review the saved papers.
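
The six steps in the Figure 1 caption follow the general Retrieval Augmented Generation pattern, which can be sketched as below. This is a hedged illustration rather than the system's code: the function names (`retrieve`, `build_prompt`, `answer`, `echo_llm`), the prompt wording, and the toy papers are hypothetical. Retrieval is approximated here by word overlap as a stand-in for a vector-database lookup, and the LLM is a stub so the example runs without any external service.

```python
def retrieve(question, papers, k=2):
    """Steps 2 & 3 (stand-in): rank papers by word overlap with the question.

    A real RAG system would query a vector database with an embedding of the question.
    """
    q_words = set(question.lower().split())

    def overlap(paper):
        return len(q_words & set(paper["abstract"].lower().split()))

    return sorted(papers, key=overlap, reverse=True)[:k]


def build_prompt(question, retrieved):
    """Step 4: combine the retrieved papers with the user input in one prompt."""
    context = "\n".join(f"- {p['title']}: {p['abstract']}" for p in retrieved)
    return (
        "Answer using only the papers listed below.\n\n"
        f"Papers:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(question, papers, llm, k=2):
    """Steps 1 to 6: take the user input, retrieve, augment the prompt, call the LLM, return."""
    retrieved = retrieve(question, papers, k)
    prompt = build_prompt(question, retrieved)
    return llm(prompt)  # Step 5: `llm` is any callable mapping a prompt string to a response


def echo_llm(prompt):
    """Hypothetical stub standing in for a real LLM call."""
    return f"[stub response to a prompt of {len(prompt)} characters]"


# Hypothetical toy corpus; titles and abstracts are invented for illustration.
toy_papers = [
    {"title": "Paper A", "abstract": "grounded theory in qualitative research"},
    {"title": "Paper B", "abstract": "graph layout algorithms"},
    {"title": "Paper C", "abstract": "visual analytics for qualitative coding"},
]
question = "grounded theory in qualitative research"

print(build_prompt(question, retrieve(question, toy_papers)))
print(answer(question, toy_papers, echo_llm))
```

Passing the LLM in as a callable keeps the retrieval and prompt-assembly steps testable on their own, and customizing the prompt, as the Summarize and Literature Review features allow, amounts to changing the template in `build_prompt`.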