PPO works much better for discrete action spaces

Introduction

PPO works with arbitrary action distributions, yet in OpenAI Five every action distribution is discrete.

In particular, the movement offset below would seem better suited to a continuous distribution, yet OpenAI Five discretizes it as well:

Offset: A 2D (X, Y) coordinate indicating a spatial offset, used for abilities which target a location on the map. The offset is interpreted relative to the caster or the unit selected by the Unit Selection parameter, depending on the ability. Both X and Y are discrete integer outputs ranging from -4 to +4 inclusive, producing a grid of 81 possible coordinate pairs.
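A minimal sketch of this discretization (not OpenAI Five's actual code): the 2D offset becomes a single categorical over the 9 × 9 = 81 integer pairs in [-4, 4], and the policy head would output one logit per pair. Uniform logits are used here only as a placeholder.

```python
import numpy as np

# Enumerate the 81 discrete (x, y) offset pairs in [-4, 4] x [-4, 4].
offsets = np.array([(x, y) for x in range(-4, 5) for y in range(-4, 5)])

# A policy head would produce 81 logits; uniform logits stand in here.
logits = np.zeros(len(offsets))
probs = np.exp(logits) / np.exp(logits).sum()

# Sample one offset from the categorical distribution.
rng = np.random.default_rng(0)
idx = rng.choice(len(offsets), p=probs)
x, y = offsets[idx]  # a discrete offset, each coordinate in [-4, 4]
```

Sampling a flat index and looking up the (x, y) pair is equivalent to sampling two independent 9-way categoricals only when the logits factorize; a single 81-way head can also represent correlated X/Y choices.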

From the Paper

The paper Learning Dexterous In-Hand Manipulation uses a framework similar to OpenAI Five's. It notes that PPO learns better on discrete action spaces:

While PPO can handle both continuous and discrete action spaces, we noticed that discrete action spaces work much better. This may be because a discrete probability distribution is more expressive than a multivariate Gaussian or because discretization of actions makes learning a good advantage function potentially simpler.
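The expressiveness point in the quote can be illustrated with a toy example: a categorical over discretized actions can be bimodal, while a single Gaussian fit to the same distribution puts its mode between the two peaks, on an action the categorical never takes. The 1D bins and probabilities below are hypothetical.

```python
import numpy as np

# A 1D action discretized into 9 bins, -4 .. 4.
bins = np.arange(-4, 5)

# Bimodal categorical: all mass on the two extreme actions.
probs = np.zeros(9)
probs[0] = 0.5   # action -4
probs[8] = 0.5   # action +4

# A Gaussian fit to these samples has mean 0 ...
mean = float((bins * probs).sum())

# ... so its mode sits on an action with zero probability mass.
prob_at_mean = probs[np.where(bins == round(mean))[0][0]]
```

A multivariate Gaussian has a single mode per dimension, so it cannot represent "either far left or far right, never the middle", whereas a categorical head represents it directly.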

Examples

Both OpenAI Five and the Tencent paper output their actions as discrete distributions.

An Open Question

The MultiCategoricalPd class in baselines factorizes the joint action distribution into independent categoricals, implicitly treating the combination of the per-dimension optima as the joint optimum. I suspect this is not correct.
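A small sketch of the concern, with a hypothetical correlated joint distribution over two binary sub-actions: picking the argmax of each marginal independently, as a factorized policy does, can select an action pair with low joint probability.

```python
import numpy as np

# Hypothetical joint distribution P(a1, a2) with correlated sub-actions:
# the two modes are (0, 1) and (1, 0); (1, 1) is unlikely.
joint = np.array([[0.0, 0.4],
                  [0.4, 0.2]])

# Marginals, which are all a factorized policy can match.
p_a1 = joint.sum(axis=1)   # [0.4, 0.6]
p_a2 = joint.sum(axis=0)   # [0.4, 0.6]

# Per-dimension argmax (factorized choice) vs. the true joint mode.
factored_pick = (int(p_a1.argmax()), int(p_a2.argmax()))        # (1, 1)
joint_best = np.unravel_index(int(joint.argmax()), joint.shape)  # (0, 1)
```

Here the factorized pick (1, 1) carries only 0.2 joint probability while the true mode (0, 1) carries 0.4, and no product of independent categoricals can represent this joint exactly. Whether this independence assumption hurts in practice depends on how correlated the sub-actions are for a given game.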


墨之科技, Copyright © 2017-2027

湘ICP备14012786号 · Email: ai@inksci.com