CIESC Journal, 2023, Vol. 74, Issue (3): 1187-1194. DOI: 10.11949/0438-1157.20221216

• Process system engineering

Group2vec: group vector representation and its property prediction applications based on unsupervised machine learning

Xinyuan WU, Qilei LIU, Boyuan CAO, Lei ZHANG, Jian DU

  1. Frontiers Science Center for Smart Materials Oriented Chemical Engineering, Institute of Chemical Process Systems Engineering, School of Chemical Engineering, Dalian University of Technology, Dalian 116024, Liaoning, China
  • Received: 2022-09-06; Revised: 2022-12-21; Online: 2023-04-19; Published: 2023-03-05
  • Contact: Qilei LIU

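  • About the author: Xinyuan WU (born 1997), male, master's student, 806973411@mail.dlut.edu.cn
  • Supported by:
    National Natural Science Foundation of China (22208042); Fundamental Research Funds for the Central Universities (DUT22YG218); China Postdoctoral Science Foundation (2022M710578)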

Abstract:

Quantitative structure-property relationship (QSPR) models play an important role in chemical product design, and deep learning methods based on natural language processing are an effective way to construct them. This work proposes a deep learning framework for property prediction built on a group embedding model, Group2vec. First, a pre-training database for the group embedding model and four property prediction databases are established. Second, the SMILES strings in these databases are converted into group sequences using a group division method. Third, the group sequences are used to pre-train group embeddings with the continuous bag-of-words (CBOW) algorithm, yielding group vectors that encode structural similarity information. Finally, a deep learning model with an attention mechanism is built on the group vectors, tested on the different property databases, and compared with existing models. The comparison shows that the Group2vec-based property prediction model achieves high prediction accuracy and versatility, and also offers a degree of interpretability.

Key words: product engineering, neural networks, prediction, functional group, attention mechanism
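
As a rough illustration of the third and fourth steps described in the abstract, the following minimal sketch trains CBOW embeddings on a handful of group sequences and then pools them with a simple dot-product attention. It is not the authors' implementation. The group sequences in `TOY_SEQUENCES` are written by hand rather than produced by the paper's group division method, and the function names `train_cbow` and `attention_pool`, the hyperparameters, the full-softmax training loop, and the fixed query vector are all assumptions chosen for brevity.

```python
import math
import random

# Hand-written toy group sequences (an assumption for illustration only;
# the paper derives its sequences from SMILES with a group division method).
TOY_SEQUENCES = [
    ["CH3", "CH2", "OH"],
    ["CH3", "CH2", "CH2", "OH"],
    ["CH3", "CH2", "COOH"],
    ["CH3", "CH2", "CH2", "COOH"],
    ["CH3", "CH2", "CH2", "CH3"],
]


def train_cbow(sequences, dim=8, window=1, epochs=200, lr=0.1, seed=0):
    """Train a tiny CBOW model with a full softmax.

    Returns (vectors, history): a dict mapping each group token to its
    embedding, and the summed cross-entropy loss of every epoch.
    """
    rng = random.Random(seed)
    vocab = sorted({g for seq in sequences for g in seq})
    index = {g: i for i, g in enumerate(vocab)}
    n_vocab = len(vocab)
    # Input (embedding) matrix starts random, output matrix starts at zero.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_vocab)]
    w_out = [[0.0] * dim for _ in range(n_vocab)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for seq in sequences:
            ids = [index[g] for g in seq]
            for pos, target in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                context = [ids[j] for j in range(lo, hi) if j != pos]
                if not context:
                    continue
                # Hidden layer: mean of the context group embeddings.
                h = [sum(w_in[c][k] for c in context) / len(context) for k in range(dim)]
                scores = [sum(w_out[v][k] * h[k] for k in range(dim)) for v in range(n_vocab)]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                norm = sum(exps)
                probs = [e / norm for e in exps]
                total += -math.log(probs[target])
                # Backpropagate the softmax cross-entropy error.
                grad_h = [0.0] * dim
                for v in range(n_vocab):
                    err = probs[v] - (1.0 if v == target else 0.0)
                    for k in range(dim):
                        grad_h[k] += err * w_out[v][k]  # uses the pre-update weight
                        w_out[v][k] -= lr * err * h[k]
                for c in context:
                    for k in range(dim):
                        w_in[c][k] -= lr * grad_h[k] / len(context)
        history.append(total)
    vectors = {g: w_in[index[g]] for g in vocab}
    return vectors, history


def attention_pool(group_vectors, query):
    """Softmax dot-product attention over a molecule's group vectors.

    Returns (pooled, weights); the weights sum to 1, one per group.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in group_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    pooled = [sum(w * vec[k] for w, vec in zip(weights, group_vectors))
              for k in range(len(query))]
    return pooled, weights


if __name__ == "__main__":
    vectors, history = train_cbow(TOY_SEQUENCES)
    print("first/last epoch loss:", round(history[0], 3), round(history[-1], 3))
    molecule = TOY_SEQUENCES[0]
    # The all-ones query is a placeholder; a real model would learn it.
    pooled, weights = attention_pool([vectors[g] for g in molecule], [1.0] * 8)
    for group, weight in zip(molecule, weights):
        print(group, round(weight, 3))
```

In a sketch like this, the per-group attention weights are the quantity one would inspect for interpretability, since they indicate how much each group contributes to the pooled molecular representation. In a trained property prediction model the query and downstream layers would be learned from property data rather than fixed by hand.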

