Multi-actor Deterministic Policy Gradient Algorithm Based on Progressive k-Means Clustering

CLC number: TP18  Document code: A  Article ID: 1671-5489(2025)03-0885-10
Multi-actor Deterministic Policy Gradient Algorithm Based on Progressive k-Means Clustering
LIU Quan¹,², LIU Xiaosong², WU Guangjun², LIU Yuhan³ (1. School of Computer Science and Technology, Kashi University, Kashi 844000, Xinjiang Uygur Autonomous Region, China; 2. School of Computer Science and Technology, Soochow University, Suzhou 215008, Jiangsu Province, China; 3. Academy of Future Education, Xi'an Jiaotong-Liverpool University, Suzhou 215000, Jiangsu Province, China)
Abstract: Aiming at the problems of poor learning performance and high fluctuation of the deep deterministic policy gradient (DDPG) algorithm on tasks with large state spaces, we proposed a multi-actor deep deterministic policy gradient algorithm based on progressive k-means clustering (MDDPG-PK-Means). During training, when an action was selected for the state at each time step, the decision-making of the actor network was assisted by the discrimination results of the k-means clustering algorithm; at the same time, the number of k-means cluster centers gradually increased as the number of training steps grew. The MDDPG-PK-Means algorithm was applied to the MuJoCo simulation platform, and the experimental results show that, compared with DDPG and other algorithms, the MDDPG-PK-Means algorithm achieves better performance on most continuous tasks.
Keywords: deep reinforcement learning; deterministic policy gradient algorithm; k-means clustering; multi-actor
Reinforcement learning (RL) is a method in which an agent continually learns on its own within an environment, seeking regularities that maximize the cumulative future reward and thereby finding an optimal policy that achieves its goal [1]. Because it selects executable actions according to the agent's current state, reinforcement learning is well suited to sequential decision-making problems [2-3].
In traditional reinforcement learning, the value-function-based SARSA (state-action-reward-state-action) and Q-Learning [4-5] algorithms perform well in classic reinforcement learning tasks with low-dimensional state spaces, such as Cart-Pole and Mountain-Car, but their performance degrades in high-dimensional action spaces. With the development of deep learning, deep neural networks have demonstrated an efficient ability to recognize high-dimensional data; deep reinforcement learning (DRL) [6], which combines deep learning (DL) with reinforcement learning, can therefore handle high-dimensional action-space problems, and DRL has become one of the most active research directions in artificial intelligence [7-8].
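To make the value-function baseline concrete, the tabular Q-Learning update mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical two-state, two-action MDP; the state/action sizes, learning rate, and discount factor are assumed values, not taken from the paper.

```python
import numpy as np

# Hypothetical tiny MDP used only to illustrate the update rule.
n_states, n_actions = 2, 2
alpha, gamma = 0.5, 0.9  # assumed learning rate and discount factor

Q = np.zeros((n_states, n_actions))  # tabular action-value function

def q_learning_update(Q, s, a, r, s_next):
    """One off-policy Q-Learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()  # bootstrap with the greedy action
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # -> 0.5 (target is 1.0, table starts at 0, alpha = 0.5)
```

Such a table is tractable only when states can be enumerated, which is exactly the limitation DRL methods such as DDPG address by approximating the value function with a neural network.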