CIESC Journal (化工学报) ›› 2023, Vol. 74 ›› Issue (2): 630-641. DOI: 10.11949/0438-1157.20221060

• Thermodynamics •

A critical discussion on developing molecular property prediction models: density of ionic liquids as example

Jiahui CHEN, Xinze YANG, Guzhong CHEN, Zhen SONG, Zhiwen QI

  1. State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2022-07-27 Revised: 2022-09-22 Online: 2023-02-05 Published: 2023-03-21
  • Contact: Zhen SONG
  • About the author: Jiahui CHEN (born 1998), male, master's student, y30200121@mail.ecust.edu.cn

Abstract:

Molecular property prediction models are powerful tools for screening and designing chemicals to meet specific application requirements. However, key aspects of model development, such as the size and diversity of the dataset, test set partitioning, cross-validation, and algorithm selection, are often not treated with enough rigor, which makes the true predictive performance of the resulting models hard to guarantee. Taking the group contribution method for predicting the density of ionic liquids (ILs) as an example, this work discusses the importance of dataset partitioning and cross-validation in developing molecular property prediction models. An automatic group fragmentation method for ILs is proposed, and the effect of the group occurrence threshold (the number of ILs in the dataset that contain a given group) on prediction accuracy is investigated. Among the five regression algorithms compared (multiple linear regression, ridge regression, random forest, support vector machine, and neural network), the group contribution model based on ridge regression shows the best predictive performance, with an average relative error of 1.88% on a dataset of 23034 data points covering 1078 ILs.

Key words: molecular property prediction, modelling, dataset partitioning, cross-validation, algorithms, ionic liquids, density
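
To make the partitioning and error-metric ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that one sensible partitioning keeps all data points of the same IL in the same cross-validation fold (the dataset has many points per IL), and that the average relative error is the mean of |predicted - measured| / |measured|, expressed in percent. The function names, the IL identifiers, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def average_relative_error(y_true, y_pred):
    """Average relative error in percent: mean(|pred - true| / |true|) * 100."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def il_wise_folds(il_ids, n_folds=5, seed=0):
    """Split data-point indices into folds so that no IL spans two folds.

    il_ids: one (hypothetical) ionic-liquid identifier per data point.
    Returns a list of n_folds lists of data-point indices.
    """
    # Collect the indices of all data points that belong to each IL.
    points_by_il = defaultdict(list)
    for idx, il in enumerate(il_ids):
        points_by_il[il].append(idx)

    # Shuffle the ILs (not the individual points) reproducibly.
    unique_ils = sorted(points_by_il)
    random.Random(seed).shuffle(unique_ils)

    # Deal whole ILs into folds round-robin.
    folds = [[] for _ in range(n_folds)]
    for k, il in enumerate(unique_ils):
        folds[k % n_folds].extend(points_by_il[il])
    return folds


# Toy data (made up): 6 ILs, each measured at 3 conditions, 18 points in total.
il_ids = [f"IL{i}" for i in range(6) for _ in range(3)]
folds = il_wise_folds(il_ids, n_folds=3)

# Made-up densities, only to exercise the metric.
measured = [100.0, 200.0]
predicted = [101.0, 198.0]
are_percent = average_relative_error(measured, predicted)
```

Splitting by IL rather than by data point is a design choice assumed here: a point-wise random split would let measurements of the same IL at different conditions appear in both training and test sets, which tends to make the reported error look better than the error on genuinely unseen ILs.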
