Rereading OpenAI: What Can Professional Dota 2 Teams Learn From It?
It has been nearly a year and a half since "Dota 2 with Large Scale Deep Reinforcement Learning" was published. In that time we have lived through the "Congrats to OG" era, a competitive calendar frozen by the pandemic, and roster and organizational changes across many teams.

After this round of reshuffling, IG took the first Major title of the post-pandemic era, setting off heated discussion across the global community, and Ana once again announced his return to OG. To exaggerate only slightly, this team has become a lingering nightmare for many teams and players, and the patch seems to be drifting back toward the heal-and-push style they excel at.
By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
Let us go back to 2019. In April of that year, the Finals version of OpenAI Five beat the OG of the time with a clean 2:0 on Dota 2 patch 7.21d, declaring that reinforcement learning had reached its stated goal in this game: defeat the strongest human team. The TI9 result confirmed once more that OG were, deservedly, the strongest team of that year.

But is that logic really sound? Even if 99% of OG's second championship is down to the players themselves, is there really not even 1% of credit owed to OpenAI?
In fact, as one of the live spectators, I could clearly see OG imitating OpenAI in several ways: drafts with a clear plan behind them, precise judgment of when to take fights, and a real emphasis on data analysts. By contrast, as far as I know some domestic teams still neglect this part of the work, relying too heavily on the experience of players and coaches for tactical decisions; in some cases their read on the patch and on opponents has been worse than that of professional casters. Of course this may simply be a matter of resources, since a good analyst can cost more than a non-star starting player, but that is a digression.

So it is worth asking two questions:

- What did OG learn from OpenAI?
- What can other teams learn from OpenAI?
The first question has already been touched on, and I suspect OG would be unwilling, or perhaps unable (a non-disclosure agreement may exist), to reveal the details. So the rest of this piece focuses on the second question: rereading the paper in 2021, what is still in there that professional teams could learn from?

Let us skip the obscure, hard-to-explain model details and go straight to the parts most relevant to the Dota 2 game itself:

- Are OG's draft (BP) choices reasonable in OpenAI's view?
- Does pick order (first versus second pick) affect the win rate in OpenAI's view?
- How does OpenAI think the game should be won?

Drafting: the paper gives one case study. After each of OG's picks, OpenAI Five predicted its own win probability. OG's first three picks were judged to be poor, above all the first-pick Earthshaker, each raising OpenAI's predicted win probability (by >10%, 1.8%, and 0.7% respectively), while OG's last two picks, Riki and Shadow Fiend, were judged to be optimal. The paper also notes that because the terrain created by Fissure is hard to evaluate, the reliability of that first jump of more than ten percentage points is debatable.
Figure: OpenAI Five's predicted win probability at each step of the draft (BP) phase.
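To make the idea concrete, here is a minimal sketch of how an analyst could do this kind of per-pick bookkeeping with a model of their own. `predict_win_prob` is a hypothetical stand-in for whatever win-probability predictor a team has access to (OpenAI Five's internal value estimate is not a public API), and only Earthshaker, Riki, and Shadow Fiend are taken from the game discussed above; the other picks are placeholders.

```python
# Sketch: tracking a model's predicted win probability after each pick in a draft.
# predict_win_prob is a hypothetical stand-in for a team's own win-probability model.

from typing import Callable, List, Tuple

Draft = List[Tuple[str, str]]  # (side, hero) pairs in pick order


def evaluate_draft(draft: Draft,
                   predict_win_prob: Callable[[Draft], float]) -> List[float]:
    """Return the change in predicted win probability caused by each pick."""
    deltas = []
    prev = predict_win_prob([])            # baseline before any hero is picked
    for i in range(1, len(draft) + 1):
        cur = predict_win_prob(draft[:i])  # prediction after the first i picks
        deltas.append(cur - prev)
        prev = cur
    return deltas


if __name__ == "__main__":
    # Illustrative draft: only Earthshaker (first pick) and Riki / Shadow Fiend
    # (last two picks) come from the game discussed above; the rest are placeholders.
    draft = [
        ("OG", "earthshaker"),
        ("OG", "pick_2_placeholder"),
        ("OG", "pick_3_placeholder"),
        ("OG", "riki"),
        ("OG", "shadow_fiend"),
    ]

    # Dummy model so the sketch runs end to end; replace with a real predictor.
    def dummy_model(partial_draft: Draft) -> float:
        return 0.5 + 0.02 * len(partial_draft)

    for (side, hero), delta in zip(draft, evaluate_draft(draft, dummy_model)):
        print(f"{side} picks {hero:>20}: delta win_prob = {delta:+.3f}")
```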
Pick order: OpenAI estimated that against OG, before any picks were made, its own win probability was 54% with first pick and 53% with second pick, and that is under a very small hero pool. We can therefore say that in competitive drafts, apart from certain tactical systems that depend so heavily on the draft that they leave no room to flex, pick order does not have a significant effect on the outcome of the game.

How to win: this is the main course of this article, because based on this part I will propose an in-game evaluation approach that I believe is feasible.
The Reward Weight Table
Now to the core part: the reward weight table. This is a hand-designed reward function used to let OpenAI Five weigh the priority of actions and the return on each investment.

Reading it crudely, gaining 10 gold is worth the same reward as gaining 30 experience (0.006 × 10 == 0.002 × 30). That single comparison can already serve as a reference for questions like "with the whole map visible, where does farming yield the highest total return?". Pull in a few more of the weights and you can quantify more decisions, for example "with the two sides at a given net-worth and level stage, should we look for pickoffs or push towers?", "should we trade resources here, and after trading should we walk back to base, reset by dying to neutrals, or simply feed the kill?", and "what is the last-hit versus deny priority in a specific lane matchup?".
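As a concrete illustration, here is that gold-versus-experience arithmetic written out, plus a toy comparison between two farming options. The 0.006 (per gold) and 0.002 (per XP) weights are the values quoted above; the two options and their yields are made-up numbers purely for illustration.

```python
# Toy use of two entries from the reward weight table: gold 0.006, XP 0.002
# (the values quoted in the text). Everything else here is illustrative.

GOLD_WEIGHT = 0.006   # reward per unit of gold
XP_WEIGHT = 0.002     # reward per unit of experience


def reward_value(gold: float, xp: float) -> float:
    """Reward-weighted value of a farming option that yields `gold` and `xp`."""
    return GOLD_WEIGHT * gold + XP_WEIGHT * xp


# The equivalence from the text: 10 gold is worth the same reward as 30 XP.
assert abs(reward_value(gold=10, xp=0) - reward_value(gold=0, xp=30)) < 1e-9

# Hypothetical comparison: a safe jungle camp versus a riskier enemy lane.
options = {
    "ancient_camp": reward_value(gold=120, xp=180),  # made-up yields
    "enemy_lane":   reward_value(gold=160, xp=260),  # made-up yields
}
best = max(options, key=options.get)
print(f"higher reward-weighted return: {best} ({options[best]:.3f})")
```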
Of course, this table alone cannot offer much direct reference value for professional play: the weights were never claimed to be exactly right, and they should shift as the patch changes. So here is the question: could a club's data analyst take these weights as initial values, pull the current patch's daily high-MMR and professional match data through Dota 2's open data interfaces, re-fit the weights to the current patch, and then distill higher-level rules from them to give coaches and players concrete evaluation criteria for in-game tactical choices? An automated draft (BP) evaluation system could be built on similar logic.
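As a sketch of the data-collection step only, the snippet below pulls recent professional match summaries from the OpenDota public API (one of several community mirrors of Valve's match data; any equivalent source would do). The filtering and the idea of feeding these matches into a weight re-fitting pipeline are my own illustration, not something the paper describes.

```python
# Sketch of the data-collection step: fetch recent pro match summaries from the
# OpenDota public API as raw material for re-fitting reward weights per patch.

import requests

OPENDOTA = "https://api.opendota.com/api"


def recent_pro_matches(min_duration_s: int = 15 * 60) -> list:
    """Return recent pro matches, skipping very short (likely forfeited) games."""
    resp = requests.get(f"{OPENDOTA}/proMatches", timeout=10)
    resp.raise_for_status()
    matches = resp.json()
    return [m for m in matches if m.get("duration", 0) >= min_duration_s]


if __name__ == "__main__":
    for m in recent_pro_matches()[:5]:
        winner = "Radiant" if m.get("radiant_win") else "Dire"
        print(m["match_id"], winner, f"{m.get('duration', 0) // 60} min")
    # A full pipeline would pull per-match details (e.g. /matches/{match_id})
    # and fit the reward weights against outcomes for the current patch.
```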
In fact, I doubt I am the first person to think of this; some teams (OG, for example) may already have done similar work. None of it is especially demanding technically: a mid-level ML practitioner plus one engineer could build a prototype. To speculate more boldly, the win-probability prediction system Valve has built on its large-scale match data may also contain a similar dynamic weighting mechanism.

I believe that as the professional scene develops, the factors that decide games will carry ever more weight outside the server, and more and more teams will get results from a tactical system rather than from individual brilliance, much as in traditional sports in recent years: superstars bring their own system, and ordinary athletes shape themselves into whatever the system needs.

The author is neither a reinforcement learning practitioner nor an esports professional; corrections from people in either field are welcome.

Finally, thanks to the OpenAI team for bringing a new direction to this game.
Berner, Christopher, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, et al. "Dota 2 with Large Scale Deep Reinforcement Learning." arXiv preprint arXiv:1912.06680 (2019).