TechGPT-2.0: A large language model project to solve the task of knowledge graph construction

Jiaqi Wang, Yuying Chang, Zhong Li, Ning An, Qi Ma, Lei Hei, Haibo Luo, Yifei Lu, Feiliang Ren

TL;DR

This work addresses the challenge of aligning large language models with knowledge-graph (KG) construction tasks, focusing on named entity recognition (NER) and relation triple extraction in Chinese domains. It introduces TechGPT-2.0, comprising two 7B instruction-tuned models and a QLoRA variant for long-text processing, trained on Huawei Ascend infrastructure and built atop Chinese-adapted LLAMA2 and related architectures. The paper details a two-stage data pipeline totaling ~4 million instruction-fine-tuning instances, domain KG data, and general-purpose tasks, along with extensive Ascend-server debugging and training experiences to guide future researchers. While experimental demonstrations are limited by resources, the work offers practical guidelines for data preparation, model selection, long-context handling, and platform-specific training, contributing to open-source KG-friendly LLM development in the Chinese community.

Abstract

Large language models have exhibited robust performance across diverse natural language processing tasks. This report introduces TechGPT-2.0, a project designed to enhance the capabilities of large language models specifically in knowledge graph construction tasks, including named entity recognition (NER) and relationship triple extraction (RTE) tasks in NLP applications. Additionally, it serves as an LLM accessible for research within the Chinese open-source model community. We offer two 7B large language model weights and a QLoRA weight specialized for processing lengthy texts. Notably, TechGPT-2.0 is trained on Huawei's Ascend server. Inheriting all functionalities from TechGPT-1.0, it exhibits robust text processing capabilities, particularly in the domains of medicine and law. Furthermore, we introduce new capabilities to the model, enabling it to process texts in various domains such as geographical areas, transportation, organizations, literary works, biology, natural sciences, astronomical objects, and architecture. These enhancements also strengthen the model's ability to handle hallucinations, unanswerable queries, and lengthy texts. This report provides a comprehensive and detailed introduction to the full fine-tuning process on Huawei's Ascend servers, encompassing experiences in Ascend server debugging, instruction fine-tuning data processing, and model training. Our code is available at https://github.com/neukg/TechGPT-2.0

Paper Structure

This paper contains 15 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: RTE data proportion chart
  • Figure 2: General task data proportion chart
  • Figure 3: Summary data proportion chart