HySem: A context length optimized LLM pipeline for unstructured tabular extraction

Narayanan PP; Anantharaman Palacode Narayana Iyer

TL;DR

This work introduces HySem, a pipeline that uses a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. It surpasses peer open-source models in accuracy and performs competitively when benchmarked against OpenAI GPT-4o.

Abstract

Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance because of their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging because tables are presented in many different ways. Large Language Models (LLMs) show substantial potential for semantic representation, yet they face challenges with accuracy and context size limitations, both of which are crucial considerations for industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach uses a custom fine-tuned model designed specifically for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging open-source models, HySem surpasses its peer open-source models in accuracy and performs competitively when benchmarked against OpenAI GPT-4o. It also effectively addresses context length limitations, a crucial factor for supporting larger tables.
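
To make the input and output of the task concrete, the following minimal sketch converts a trivial HTML table into a JSON list of records. It is not HySem's method: HySem uses a fine-tuned LLM and a context optimizer, neither of which is reproduced here. The sketch only shows the shape of the problem for the simplest case, a table with a single header row and no merged cells. The names `SimpleTableParser` and `table_to_records` and the placeholder table contents are assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Hypothetical helper: collects the text of each <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being built, or None outside <tr>
        self._cell = None   # text fragments of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Map each body row to a dict keyed by the first row's cells.

    Assumes one header row and no rowspan/colspan; real regulatory tables
    with nested headers need far more than this.
    """
    parser = SimpleTableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Placeholder table contents, invented purely for illustration.
sample = """
<table>
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""
records = table_to_records(sample)
print(json.dumps(records, indent=2))
```

A rule-based converter like this breaks down as soon as headers are nested or cells are merged, which is the kind of table the abstract describes as challenging and the reason an LLM-based approach is considered.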

Paper Structure (20 sections, 6 equations, 10 figures, 7 tables, 2 algorithms)

Introduction
Methodology
Context Optimizer Subsystem
Encoding Phase
Decoding Phase
Semantic Synthesizer
Syntax Corrector
Evaluation Methodology
Intrinsic Evaluation
Extrinsic Evaluation
Results
Baselines and Benchmarking
Token Reduction Efficiency
Conclusion and Future Work
Appendix
...and 5 more sections

Figures (10)

Figure 1: HySem architecture diagram
Figure 2: An illustration of a table optimized by the Context Optimizer. The first subfigure presents the original HTML table without any optimization; the second displays the same table after optimization. In both subfigures, tokens within each cell are highlighted with distinct colors so that the token count per cell is easy to observe. Tokenization is performed using the Llama 3 tokenizer.
Figure 3: This table compares various dosage regimens of Glucophage and Glucophage XR. The challenge in converting it to semantic JSON lies in its nested categories, including dosage, measurement types, and time points. Each combination of dose, metric, time point, and statistical details (mean change, confidence intervals) must be accurately mapped to database schema keys. Ensuring data integrity while managing variability in column structures is essential for creating a usable semantic representation.
Figure 4: Sample Table to illustrate Semantic JSON
Figure 5: HySem JSON
...and 5 more figures
