化工学报


大模型时代的石油化工行业高质量数据集:挑战与机遇

罗梦杰, 赵云鹏, 张梦轩, 石孝刚, 蓝兴英

  1. 中国石油大学(北京)重质油全国重点实验室,北京 102249
  • 收稿日期:2025-07-22 修回日期:2025-10-21 出版日期:2025-11-25
  • 通讯作者: 蓝兴英
  • About the author: Mengjie LUO (born 1997), female, PhD candidate, 17864260625@163.com
  • Funding:
    National Key Research and Development Program of China (2024YFE0212400)

High-quality datasets for petrochemical industry in era of large language models: challenges and opportunities

Mengjie LUO, Yunpeng ZHAO, Mengxuan ZHANG, Xiaogang SHI, Xingying LAN

  1. State Key Laboratory of Heavy Oil Processing, China University of Petroleum, Beijing 102249, China
  • Received: 2025-07-22 Revised: 2025-10-21 Online: 2025-11-25
  • Corresponding author: Xingying LAN

摘要:

通用大模型性能的提升与开源生态的完善,正推动人工智能迈入行业深度赋能的新阶段。石油化工行业作为典型的流程工业,长期运行中积累了海量原始数据。然而,这些数据普遍存在碎片化、非结构化和标注不足等问题,难以直接用于大模型训练。因此,通用大模型在石化专业知识的理解和应用上仍存在局限,幻觉问题频出,制约了其在石化行业的应用。聚焦大模型在石化行业应用的关键——行业高质量数据集,系统梳理了石化行业高质量数据集建设面临的挑战与机遇。结合行业数据特征,提出了一个面向石化行业高质量数据集建设与应用的通用框架,旨在为石化行业构建适用性强、可扩展的高质量数据集体系,最后探讨了数据集赋能行业应用的前景与发展方向,以促进大模型与石化行业的深度融合与应用。

关键词: 石油化工, 高质量数据集, 大语言模型, 人工智能, 过程系统, 模型, 化学反应器

Abstract:

The rapid improvement of general-purpose large language models (LLMs) and the maturing open-source ecosystem are driving artificial intelligence (AI) toward deeper integration with industrial domains. As a typical process industry, the petrochemical industry has accumulated vast amounts of raw data over long-term operation. However, these data are often fragmented, unstructured, and insufficiently labeled, which makes them difficult to use directly for training LLMs. As a result, general-purpose LLMs remain limited in understanding and applying specialized petrochemical knowledge and hallucinate frequently, which restricts their deployment in the industry. This paper focuses on high-quality industry datasets, the key to applying LLMs in the petrochemical industry, and systematically reviews the challenges and opportunities in constructing them. Based on the characteristics of industry data, it proposes a general framework for building and applying high-quality petrochemical datasets, with the aim of establishing a practical and scalable dataset system. Finally, it discusses the prospects and development directions of dataset-enabled industrial applications to promote the deep integration of LLMs with the petrochemical industry.

Key words: petrochemical industry, high-quality datasets, large language model, artificial intelligence, process systems, model, chemical reactors

CLC number: