Post archive for March 5, 2023 - MoonOut - 博客园 (cnblogs)

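Summary: Motivation: reduce unsafe behavior during the trial-and-error phase of RL. Approach: start with imitation learning, then, during online learning, forcibly override any action that may be unsafe, i.e. post-hoc rectification.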
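To make the post-hoc rectification idea concrete, here is a minimal sketch. It is not the method from the summarized work: the 1-D environment, the bounds `LOW`/`HIGH`, and the functions `is_unsafe` and `rectify` are all hypothetical names chosen for illustration. The agent proposes an action, a safety check inspects it, and an unsafe action is replaced before it reaches the environment.

```python
# Hypothetical 1-D setting: the agent's position must stay within [LOW, HIGH].
LOW, HIGH = 0.0, 10.0


def is_unsafe(position: float, action: float) -> bool:
    """Assumed safety check: an action is unsafe if it moves the agent out of bounds."""
    next_position = position + action
    return next_position < LOW or next_position > HIGH


def rectify(position: float, action: float) -> tuple[float, bool]:
    """Post-hoc rectification: pass safe actions through, override unsafe ones.

    Returns the action to execute and a flag saying whether it was overridden.
    """
    if not is_unsafe(position, action):
        return action, False
    # Override: pick the closest action that keeps the next position in bounds.
    clipped_next = min(max(position + action, LOW), HIGH)
    return clipped_next - position, True


# Usage: the policy (e.g. one pretrained by imitation learning) proposes an action,
# and the rectifier decides what is actually executed.
proposed = 3.0
executed, overridden = rectify(9.0, proposed)
```

In this sketch the override is a simple projection onto the safe range; a real system would need its own definition of "unsafe" and its own fallback rule.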

posted @ 2023-03-05 13:13 MoonOut
