PPO works much better for discrete action spaces - 墨之科技

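Introduction

PPO works with any action distribution, yet in OpenAI Five every action distribution is discrete.

In particular, the offset action quoted below is a 2D spatial coordinate, which would seem better suited to a continuous distribution. OpenAI Five nevertheless uses a discrete distribution for it too.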

Offset: A 2D (X, Y) coordinate indicating a spatial offset, used for abilities which target a location on the map. The offset is interpreted relative to the caster or the unit selected by the Unit Selection parameter, depending on the ability. Both X and Y are discrete integer outputs ranging from -4 to +4 inclusive, producing a grid of 81 possible coordinate pairs.
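To make the 81-cell grid concrete, here is a minimal sketch of sampling such an offset. It assumes X and Y are drawn independently from two 9-way categorical heads; the quote does not spell out how the two outputs are combined, so that independence, along with the names `OFFSETS`, `sample_categorical` and `sample_offset`, is an illustration only and not OpenAI Five's actual code.

```python
import random

# The 9 integer values from -4 to +4 inclusive, as in the quoted description.
OFFSETS = list(range(-4, 5))


def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_offset(x_probs, y_probs, rng):
    """Sample (X, Y) from two independent 9-way heads (independence is an assumption)."""
    x = OFFSETS[sample_categorical(x_probs, rng)]
    y = OFFSETS[sample_categorical(y_probs, rng)]
    return x, y


rng = random.Random(0)
uniform = [1.0 / len(OFFSETS)] * len(OFFSETS)  # placeholder for network outputs

offset = sample_offset(uniform, uniform, rng)
grid = [(x, y) for x in OFFSETS for y in OFFSETS]

print(len(grid))  # 81
print(offset)
```

In a real policy the two probability vectors would come from a softmax over network logits rather than from a fixed uniform list.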

From the paper

The paper Learning Dexterous In-Hand Manipulation uses a framework similar to that of OpenAI Five. It notes that PPO learns better on discrete action spaces:

While PPO can handle both continuous and discrete action spaces, we noticed that discrete action spaces work much better. This may be because a discrete probability distribution is more expressive than a multivariate Gaussian or because discretization of actions makes learning a good advantage function potentially simpler.
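The first hypothesis in the quote, that a discrete distribution is more expressive than a Gaussian, can be illustrated with a toy case. The sketch below assumes a hypothetical bimodal target over a 1D action in [-1, 1] split into 11 bins. A categorical with one probability per bin can reproduce the target exactly, while a moment-matched Gaussian puts its peak in the middle, where the target has almost no mass. The target, the bin count and all names are made up for illustration; nothing here comes from the paper.

```python
import math

BINS = 11
# Bin centers: -1.0, -0.8, ..., 1.0
centers = [-1.0 + 2.0 * i / (BINS - 1) for i in range(BINS)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical bimodal target: half the mass near -0.8, half near +0.8.
target = normalize([
    math.exp(-((c + 0.8) ** 2) / 0.02) + math.exp(-((c - 0.8) ** 2) / 0.02)
    for c in centers
])

# Categorical policy: one free probability per bin, so it can copy the target.
categorical = list(target)

# Gaussian policy: match the target's mean and variance, then evaluate on the bins.
mean = sum(p * c for p, c in zip(target, centers))
var = sum(p * (c - mean) ** 2 for p, c in zip(target, centers))
gaussian = normalize([math.exp(-((c - mean) ** 2) / (2.0 * var)) for c in centers])


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


print("KL(target || categorical):", kl(target, categorical))
print("KL(target || gaussian):   ", kl(target, gaussian))
```

This only speaks to expressiveness; it says nothing about the second hypothesis in the quote, that discretization may make the advantage function easier to learn.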

Examples

In both the OpenAI Five paper and Tencent's paper, the output actions all follow discrete distributions.

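Open question

MultiCategoricalPd in baselines models a multi-dimensional action as a product of independent categorical distributions, which amounts to assuming that combining the best distribution for each dimension gives the best joint distribution. I suspect this is not correct.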
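A small counterexample shows why the doubt seems reasonable. Assume a hypothetical joint target over two binary action dimensions in which only (0, 0) and (1, 1) are good. Among factorized distributions, the product of the marginals is the closest one to the joint in the sense of KL(joint || factorized), yet it still puts half of its mass on the two combinations the joint never chooses. The code is plain Python written for this illustration, not the baselines implementation; `factorized_logp` only mirrors the idea of summing per-dimension log-probabilities.

```python
import math
from itertools import product

# Hypothetical joint target: the two dimensions must agree.
joint = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

# Marginal distribution of each action dimension.
p_a = [sum(p for (a, _), p in joint.items() if a == i) for i in range(2)]
p_b = [sum(p for (_, b), p in joint.items() if b == j) for j in range(2)]

# Factorized approximation: product of the two marginals.
factorized = {(a, b): p_a[a] * p_b[b] for a, b in product(range(2), repeat=2)}


def factorized_logp(action, heads):
    """Log-probability of an action as a sum of per-dimension log-probabilities."""
    return sum(math.log(head[k]) for head, k in zip(heads, action))


for action in sorted(joint):
    print(action, "joint:", joint[action], "factorized:", factorized[action])

# Mass the factorized form assigns to actions the joint never takes.
wasted = factorized[(0, 1)] + factorized[(1, 0)]
print("mass on unwanted actions:", wasted)  # 0.5
```

The limitation only matters when the desirable actions are correlated across dimensions within a single state; when each dimension can be chosen well on its own, the factorized form loses nothing.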
