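PPO works with any action distribution, yet every action distribution in OpenAI Five is discrete.
In particular, the targeting offset described below looks like a natural fit for a continuous distribution, but OpenAI Five models it with a discrete distribution as well.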
Offset: A 2D (X, Y) coordinate indicating a spatial offset, used for abilities which target a location on the map. The offset is interpreted relative to the caster or the unit selected by the Unit Selection parameter, depending on the ability. Both X and Y are discrete integer outputs ranging from -4 to +4 inclusive, producing a grid of 81 possible coordinate pairs.
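To make the 9 x 9 grid concrete, here is a minimal sketch of sampling such an offset from two categorical heads. It assumes X and Y are sampled independently from separate 9-way softmax outputs, which the quoted description does not state explicitly. The names `OFFSETS`, `softmax`, and `sample_offset` are hypothetical and are not taken from OpenAI Five.

```python
import numpy as np

# The 9 integer values -4..+4 that each coordinate can take.
OFFSETS = np.arange(-4, 5)

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sample_offset(x_logits, y_logits, rng):
    """Sample X and Y from two 9-way categorical heads.

    Assumption: the two heads are independent of each other.
    """
    x = rng.choice(OFFSETS, p=softmax(x_logits))
    y = rng.choice(OFFSETS, p=softmax(y_logits))
    return int(x), int(y)

rng = np.random.default_rng(0)
x_logits = np.zeros(9)  # uniform logits, for illustration only
y_logits = np.zeros(9)
offset = sample_offset(x_logits, y_logits, rng)

# 9 values per axis give 9 * 9 = 81 coordinate pairs.
num_pairs = len(OFFSETS) ** 2
```

With two 9-way heads the policy emits 18 logits rather than 81, at the cost of not being able to express correlations between X and Y.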
The paper Learning Dexterous In-Hand Manipulation uses a framework similar to that of OpenAI Five. It reports that PPO learns better on discrete action spaces:
While PPO can handle both continuous and discrete action spaces, we noticed that discrete action spaces work much better. This may be because a discrete probability distribution is more expressive than a multivariate Gaussian or because discretization of actions makes learning a good advantage function potentially simpler.
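A continuous control value can be turned into a discrete action by binning it. The sketch below shows one simple way to do that. The range [-1, 1], the choice of 11 bins, and the helper names `make_bins`, `to_bin`, and `from_bin` are illustrative assumptions, not details taken from the quoted text.

```python
import numpy as np

def make_bins(low, high, n_bins):
    """Return n_bins evenly spaced bin centers covering [low, high]."""
    return np.linspace(low, high, n_bins)

def to_bin(value, centers):
    """Map a continuous value to the index of the nearest bin center."""
    return int(np.argmin(np.abs(centers - value)))

def from_bin(index, centers):
    """Map a discrete action index back to a continuous value."""
    return float(centers[index])

# Assumed setup: one action coordinate in [-1, 1], split into 11 bins.
centers = make_bins(-1.0, 1.0, 11)
idx = to_bin(0.33, centers)        # nearest center is 0.4
recovered = from_bin(idx, centers)
```

A categorical distribution over these bins can place probability mass on several separated bins at once, which a single Gaussian cannot do. This is one way to read the remark about expressiveness in the quote.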
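In both OpenAI Five and Tencent's paper, the actions the policy outputs follow discrete distributions.
MultiCategoricalPd in baselines treats the combination of each individual discrete distribution's best choice as the best overall action. I suspect this is not correct in general, since the best joint action need not be made up of each dimension's individually best choice.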
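A small numerical example may help illustrate the concern. It uses a hypothetical joint distribution over two binary action dimensions and does not call baselines. It assumes that a factored distribution picks the most likely value of each dimension independently.

```python
import numpy as np

# Hypothetical joint distribution over two action dimensions (a, b).
# Rows index a, columns index b.
joint = np.array([[0.4, 0.0],
                  [0.3, 0.3]])

# Marginals that a factored (per-dimension) model would represent.
p_a = joint.sum(axis=1)  # [0.4, 0.6]
p_b = joint.sum(axis=0)  # [0.7, 0.3]

# The factored model can only express the product of the marginals.
factored = np.outer(p_a, p_b)

# Most likely action when each dimension is chosen independently.
factored_mode = (int(p_a.argmax()), int(p_b.argmax()))

# Most likely action under the true joint distribution.
joint_mode = tuple(int(i) for i in np.unravel_index(joint.argmax(), joint.shape))
```

Here the per-dimension choice is (1, 0), which has probability 0.3 under the joint distribution, while the joint optimum is (0, 0) with probability 0.4. The two only coincide when the action dimensions are actually independent.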