MPL: Multiple Programming Languages with Large Language Models for Information Extraction

Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng, Shikun Zhang

TL;DR

This research proposes MPL, a novel framework that explores the potential of incorporating different programming languages (PLs), namely Python, C++, and Java, in the supervised fine-tuning (SFT) phase, and introduces function-prompt with virtual running to simulate code-style inputs more effectively and efficiently.

Abstract

Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition is that programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs), which makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.
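
To make the idea of a code-style input concrete, the following is a minimal sketch of how a named entity recognition instance might be rendered as a Python-flavoured prompt. It is an illustration under stated assumptions, not the released MPL implementation: the helper name build_code_style_prompt, the Entity base class, the extract_entities function, and the exact layout are all hypothetical. The "virtual running" part is read here as a call that is written into the prompt but never executed, so that the model is asked to produce what the call would print.

```python
def build_code_style_prompt(text: str, entity_types: list[str]) -> str:
    """Render one NER instance as a code-style prompt.

    Hypothetical layout for illustration only; the paper's actual
    function-prompt format may differ.
    """
    # Schema part: a base class plus one subclass per entity type.
    lines = [
        "class Entity:",
        "    def __init__(self, span: str):",
        "        self.span = span",
        "",
    ]
    for name in entity_types:
        lines += [f"class {name}(Entity):", "    pass", ""]

    # Function definition part: the task is described in a docstring.
    type_list = ", ".join(entity_types)
    lines += [
        "def extract_entities(text: str) -> list[Entity]:",
        f'    """Return every {type_list} mention found in text."""',
        "",
    ]

    # Virtual running part (assumed reading): the call is written out but
    # never executed; the LLM is expected to generate the printed result.
    lines += [
        f"text = {text!r}",
        "result = extract_entities(text)",
        "print(result)",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_code_style_prompt("Alice joined Acme.", ["Person", "Organization"]))
```

An analogous renderer for C++ or Java would swap the class and function syntax while keeping the same three parts, which is one plausible way to obtain the same instance in several PLs for fine-tuning.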

Paper Structure

This paper contains 21 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: The typical procedure of a code-style information extraction system, which mainly contains two components: the code-style simulation and the docstring usage.
  • Figure 2: Our framework utilizes multiple programming languages, i.e., Python, C++, and Java, to convert elements from IE tasks and target textual inputs into code-style formats. To enhance the simulation process and help LLMs process textual inputs and generate outputs more naturally, we introduce the function-prompt with function definition and virtual running components. Better viewed in color.
  • Figure 3: Performance and training statistics for different LLMs with various input formats. Avg. Score represents the average score across all datasets, while Avg. Len denotes the average input length after tokenization with the corresponding LLM's tokenizer. Notably, the same prompt yields different Avg. Len values across models due to variations in their tokenizer configurations. Detailed counts are provided in Appendix D.
  • Figure 4: The detailed input using Python and function-prompt on the ACE05-NER dataset.
  • Figure 5: The detailed input using C++ and function-prompt on the ACE05-NER dataset.
  • ...and 1 more figure