Received: 2025-07-22
Revised: 2025-10-21
Published: 2025-11-25
Corresponding author: Xingying LAN
About the author: Mengjie LUO (born 1997), female, PhD candidate, 17864260625@163.com
Mengjie LUO, Yunpeng ZHAO, Mengxuan ZHANG, Xiaogang SHI, Xingying LAN
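Abstract:
Improvements in the performance of general-purpose large models and the maturing open-source ecosystem are moving artificial intelligence into a new stage of deep, industry-specific application. As a typical process industry, the petrochemical industry has accumulated massive volumes of raw data over long-term operation. However, these data are commonly fragmented, unstructured, and insufficiently labeled, which makes them difficult to use directly for training large models. As a result, general-purpose large models remain limited in understanding and applying petrochemical domain knowledge and hallucinate frequently, which constrains their use in the industry. Focusing on high-quality industry datasets, the key to applying large models in the petrochemical industry, this paper systematically reviews the challenges and opportunities in building such datasets. Based on the characteristics of industry data, it proposes a general framework for constructing and applying high-quality petrochemical datasets, with the aim of establishing a broadly applicable and scalable dataset system for the industry. Finally, it discusses the prospects and development directions of dataset-enabled industry applications, so as to promote the deep integration of large models with the petrochemical industry.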
Cite this article: Mengjie LUO, Yunpeng ZHAO, Mengxuan ZHANG, Xiaogang SHI, Xingying LAN. High-quality datasets for petrochemical industry in era of large language models: challenges and opportunities[J]. CIESC Journal (化工学报), DOI: 10.11949/0438-1157.20250806.