HySem: A context length optimized LLM pipeline for unstructured tabular extraction

Narayanan PP, Anantharaman Palacode Narayana Iyer

TL;DR

This work introduces HySem, a pipeline that uses a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. HySem surpasses its peer open-source models in accuracy and performs competitively when benchmarked against OpenAI GPT-4o.

Abstract

Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance because of their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging because tables are presented in many different ways. Large Language Models (LLMs) show substantial potential for semantic representation, yet they face challenges in accuracy and context size, both of which are crucial considerations for industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. The approach uses a custom fine-tuned model designed specifically for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging open-source models, HySem surpasses its peer open-source models in accuracy and performs competitively when benchmarked against OpenAI GPT-4o. It also effectively addresses context length limitations, a crucial factor in supporting larger tables.

Paper Structure

This paper contains 20 sections, 6 equations, 10 figures, 7 tables, and 2 algorithms.

Figures (10)

  • Figure 1: HySem architecture diagram
  • Figure 2: An illustration of a table optimized by the Context Optimizer. One subfigure presents the original HTML table without any optimization; the other displays the same table after optimization. In these figures, tokens within each cell are highlighted with distinct colors so that the token count per cell is easy to observe. Tokenization is performed using the Llama 3 tokenizer.
  • Figure 3: This table compares various dosage regimens of Glucophage and Glucophage XR. The challenge in converting it to semantic JSON lies in its nested categories, including dosage, measurement types, and time points. Each combination of dose, metric, time point, and statistical details (mean change, confidence intervals) must be accurately mapped to database schema keys. Ensuring data integrity while managing variability in column structures is essential for creating a usable semantic representation.
  • Figure 4: Sample Table to illustrate Semantic JSON
  • Figure 5: HySem JSON
  • ...and 5 more figures
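
The Figure 3 caption describes mapping each combination of dose, metric, time point, and statistic to a key in a nested structure. A minimal Python sketch of that idea follows. It is not HySem's method (HySem uses a fine-tuned LLM); the helper `nest_cells`, the key names, and the numeric values are hypothetical placeholders chosen only to show the shape of a nested semantic JSON.

```python
import json


def nest_cells(cells):
    """Build a nested dict from (path, value) pairs.

    Each path is a tuple of header labels, outermost first; every label
    becomes one level of nesting and the last label holds the value.
    """
    root = {}
    for path, value in cells:
        node = root
        for key in path[:-1]:
            # Create the intermediate level on first use, then descend.
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return root


# Hypothetical placeholder cells: labels and numbers are illustrative only
# and are not taken from the table in Figure 3.
cells = [
    (("dose_A", "metric_X", "timepoint_1", "mean_change"), -1.0),
    (("dose_A", "metric_X", "timepoint_1", "ci_95"), [-2.0, 0.0]),
    (("dose_B", "metric_X", "timepoint_1", "mean_change"), -3.0),
]

semantic = nest_cells(cells)
print(json.dumps(semantic, indent=2))
```

Keeping one nesting level per header level means cells that share a dose, metric, and time point end up under the same parent key, which is the property the caption asks for when it mentions data integrity across varying column structures.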