CIESC Journal ›› 2017, Vol. 68 ›› Issue (3): 916-924. DOI: 10.11949/j.issn.0438-1157.20161555

• Process Systems Engineering •


Recursive least-squares TD(λ) learning algorithm based on improved extreme learning machine and its application

XU Yuan, HUANG Bingming, HE Yanlin   

  1. School of Information Science & Technology, Beijing University of Chemical Technology, Beijing 100029, China
  • Received: 2016-11-03  Revised: 2016-11-08  Online: 2017-03-05  Published: 2017-03-05
  • Contact: HE Yanlin, xyfancy@163.com
  • Supported by:

    the National Natural Science Foundation of China (61573051, 61472021), the Open Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2015KF-01), and the Fundamental Research Funds for the Central Universities of China (PT1613-05).

Abstract:

To meet the accuracy and computation-time requirements of value-function approximation, a recursive least-squares temporal difference reinforcement learning algorithm based on an improved extreme learning machine (RLSTD(λ)-IELM) is proposed. First, a recursive scheme is introduced into the least-squares temporal difference (LSTD) algorithm to eliminate the matrix inversion in the least-squares solution, yielding the recursive least-squares temporal difference (RLSTD) algorithm and reducing both the complexity and the computational load. Second, because LSTD(0) converges slowly, eligibility traces are added to raise sample efficiency and accelerate convergence, forming the LSTD(λ) algorithm and ensuring convergence to the true values after experiencing the same number of trajectories. Moreover, since the value functions of most reinforcement learning problems are monotonic, whereas the conventional ELM uses the sigmoid activation function with two-sided suppression, which raises the computational cost, the softplus activation function with one-sided suppression is adopted in its place to reduce the computation and increase the evaluation speed, so the algorithm gains speed while maintaining accuracy. Comparative experiments on the generalized Hop-world problem against the conventional least-squares reinforcement learning algorithm based on radial basis functions (LSTD-RBF) and the least-squares TD algorithm based on the extreme learning machine (LSTD-ELM) show that the proposed algorithm substantially improves the computing speed while meeting the accuracy requirement, and under some conditions even achieves higher accuracy than the other two algorithms.

Key words: reinforcement learning, activation function, recursive least-squares methods, function approximation, generalized Hop-world problem
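
For reference, the recursion the abstract describes — removing the explicit inversion of the LSTD matrix through a rank-one (Sherman-Morrison) update — takes the following standard RLS-TD(λ) form; the symbols (γ: discount factor, λ: trace-decay rate, μ: forgetting factor, φ: feature vector produced by the ELM hidden layer) follow the common convention and are not taken from the paper's own notation:

    z_t = \gamma \lambda z_{t-1} + \phi(s_t)
    \Delta\phi_t = \phi(s_t) - \gamma\,\phi(s_{t+1})
    K_t = \frac{P_{t-1} z_t}{\mu + \Delta\phi_t^{\mathrm{T}} P_{t-1} z_t}
    \theta_t = \theta_{t-1} + K_t \left( r_t - \Delta\phi_t^{\mathrm{T}} \theta_{t-1} \right)
    P_t = \frac{1}{\mu}\left( P_{t-1} - K_t \Delta\phi_t^{\mathrm{T}} P_{t-1} \right)

Each update costs O(n²) in the number of features n, versus O(n³) for re-solving the LSTD normal equations from scratch.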

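As an illustration only — not the authors' code — the following minimal NumPy sketch combines the two ingredients of the abstract: an ELM-style random hidden layer with the one-sided softplus activation serving as the fixed feature map φ(s), and the RLS-TD(λ) recursion above. All class names and default parameter values are hypothetical:

    import numpy as np

    def softplus(x):
        # Numerically stable softplus ln(1 + e^x): output goes to 0 for large
        # negative inputs but grows linearly for large positive inputs, unlike
        # the two-sided saturating sigmoid.
        return np.logaddexp(0.0, x)

    class SoftplusELMFeatures:
        # ELM-style hidden layer: input weights and biases are drawn at random
        # and never trained; it serves only as a fixed feature map phi(s).
        def __init__(self, n_in, n_hidden, seed=0):
            rng = np.random.default_rng(seed)
            self.W = rng.uniform(-1.0, 1.0, (n_hidden, n_in))
            self.b = rng.uniform(-1.0, 1.0, n_hidden)

        def __call__(self, s):
            return softplus(self.W @ np.asarray(s, dtype=float) + self.b)

    class RLSTDLambda:
        # Recursive least-squares TD(lambda): no explicit matrix inversion.
        def __init__(self, n_features, gamma=0.9, lam=0.8, mu=1.0, delta=1.0):
            self.gamma, self.lam, self.mu = gamma, lam, mu
            self.theta = np.zeros(n_features)     # value-function weights
            self.P = np.eye(n_features) / delta   # running inverse of the LSTD matrix
            self.z = np.zeros(n_features)         # eligibility trace

        def update(self, phi_t, reward, phi_next):
            self.z = self.gamma * self.lam * self.z + phi_t    # accumulate trace
            dphi = phi_t - self.gamma * phi_next               # TD feature difference
            Pz = self.P @ self.z
            gain = Pz / (self.mu + dphi @ Pz)                  # scalar denominator
            self.theta += gain * (reward - dphi @ self.theta)  # correct by TD error
            self.P = (self.P - np.outer(gain, dphi @ self.P)) / self.mu

        def value(self, phi):
            return float(phi @ self.theta)

A learning loop would call update(phi(s), r, phi(s_next)) once per transition. The ELM input weights stay fixed at their random draws, so only the output weights θ are adapted, which is what makes the least-squares formulation applicable.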

CLC number: