IEEE Transactions on Cognitive Communications and Networking, 11 December 2024
Large Language Models (LLMs) have transformed various fields with their remarkable ability to comprehend and generate human-like text. Despite these advancements, their effectiveness in specialized domains such as finance, law, medicine, and telecommunications remains limited. To adapt these models to new domains, it is essential to train them on relevant datasets. Fine-tuning is a well-known method for training LLMs on new tasks using specialized datasets. However, generating these specialized datasets presents a critical challenge, as structuring the data appropriately for effective learning is complex. To address this challenge, this paper presents 5G Instruct Forge, an advanced data engineering pipeline designed to create domain-specific datasets for 5G networking, particularly from the 3rd Generation Partnership Project (3GPP) specifications. By processing unstructured documents, i.e., 3GPP Technical Specifications (TSs), into structured formats, our pipeline enables LLMs to be fine-tuned for understanding and generating 5G-related content. As a proof of concept, we generated the OpenAirInterface
(OAI) Instruct dataset using our pipeline, utilizing a subset of the 3GPP TSs used to develop OAI. Evaluation results demonstrate that training generic open-source LLMs on this dataset resulted in new 5G-aware LLMs outperforming OpenAI’s GPT-4 on 5G-specific tasks.
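The core transformation the abstract describes, turning an unstructured TS passage into a structured training example, can be sketched as follows. The schema, field names, and function are illustrative assumptions for a generic instruction-tuning format, not the actual record layout of the OAI Instruct dataset:

```python
import json

def make_instruct_record(spec_id: str, section_title: str, passage: str) -> dict:
    """Wrap a 3GPP TS passage into an instruction-tuning record.

    Hypothetical schema: the real 5G Instruct Forge output format is
    defined in the paper; this only illustrates the idea of converting
    unstructured spec text into structured (instruction, output) pairs.
    """
    return {
        "instruction": f"Explain the following concept from 3GPP {spec_id}, "
                       f"section '{section_title}'.",
        "input": "",
        "output": passage.strip(),
    }

# Example with an invented TS-style passage.
record = make_instruct_record(
    "TS 38.331",
    "RRC connection establishment",
    "The UE initiates RRC connection establishment by sending an "
    "RRCSetupRequest message on SRB0.",
)
print(json.dumps(record, indent=2))
```

A corpus of such records, one per spec section, is the kind of structured dataset a generic open-source LLM can then be fine-tuned on.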
Type:
Journal
Date:
2024-12-11
Department:
Systèmes de Communication
Eurecom Ref:
8009
Copyright:
© 2024 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.