David Silver - Deep Reinforcement Learning from AlphaGo to AlphaStar (Talk back at UAlberta) Part 2


We want to try and find the best possible teacher, so we train the policy network to play the same moves that AlphaZero plays with search. In other words, we try to distill the search. The search takes many steps: it does all this look-ahead, combining the neural network with look-ahead search, and that gives an amazing action selection policy. Now we can distill that down into the neural network, and get the network to predict in one step the move you would otherwise have to run all this planning to find. We do that for the positions encountered during self-play. At the same time we train a new value network to predict the winner of those games of self-play.

So again we play very high-quality games of AlphaZero against itself, not just by having the policy network play against itself as in the original AlphaGo. Here we are actually using search to improve the policy: we run our best possible action selection procedure, really think, look ahead [unclear], and at the end of that we get a very high-quality estimate of who the winner is. So we train the value network to predict the winner of a very high-quality game. Again, AlphaZero itself becomes the teacher. Once we have that, we have a new policy and value network, and that gives us the next iteration.

As for the results: we were taking something away. We had taken away the human data, and we had taken away any sort of human knowledge that was in the earlier system, even things like symmetries, all these kinds of things. And we found that it was still able to not only reach the performance of the previous AlphaGo, but to exceed it dramatically by the end of training. So the take-home message here is that by making the system more general, we take away our human biases as to what's effective or not, we take away the things that were leading it to perhaps misjudge things, and we allow it to learn things for itself and to understand the game itself better, getting better and better until ultimately, when
we played this online against the top human players in the world, it won 60 games to nil.

We also found that this system was able to discover new kinds of Go knowledge. In these positions we see that, over time during training, the system discovers these kinds of sequences, sequences that human players actually use in Go. It discovered them one after another, whole families of very famous, well-known sequences that humans write books about, but eventually it actually discarded them in favor of new sequences that it discovered, which it considers to be superior. And actually, certain human professionals are now using AlphaGo's openings instead of their own.

We also applied AlphaZero to the game of chess. Chess is arguably the most studied domain in the history of AI. It was studied by all of the seminal figures in AI and computing: Babbage, Turing, Shannon and von Neumann were all fascinated by computer chess, and it was considered the drosophila of artificial intelligence for several decades. Chess is interesting to us because it has the most highly specialized systems in the world, because of all the dedicated effort that has been applied to the game of chess. In 1997 Deep Blue was able to defeat the human world champion Garry Kasparov, but something that maybe some of you don't know is that this amazing progress continued to happen beyond that, until the state-of-the-art programs now, such as Stockfish, are indisputably superhuman [unclear].

We also considered the game of Japanese chess, shogi, which is actually more complex than chess because it has a larger board and a larger action space. There is a special move [unclear], which is that you can place captured pieces back down on the board, and so this really leads to enormous tactical complexity, and only recently have programs reached [unclear] ability. And in all of these cases, for all of these versions of chess and similar games, the previous state-of-the-art
engines were based on alpha-beta search, using very powerful evaluation functions that have been optimized by human experts and human programmers, often using a lot of very general techniques but also very game-specific heuristics as well. So if you look at the anatomy of a particular chess engine [unclear], we see a huge amount of knowledge combined with search. With AlphaZero there were none of these heuristics, and the question is how does it perform? And the answer is that it actually outperforms Stockfish: it was able to defeat Stockfish after four hours of training by self-play. Now I should point out that when I say [unclear].

So StarCraft is a real-time strategy game in which you have to gather resources, mine resources, and then use those resources to build up your technology tree, and then use that technology to try and beat your opponent. There's an enormous complexity to this. Indeed, among video games it's considered to be the pinnacle of human ability: if you ask human video game players, serious e-sports players, which is the hardest, it's this one. There have been twenty years of active human play, with prize pots in the millions, and it is still active today [unclear].

From a research point of view, this has resulted in hundreds of submissions, and a lot of this work actually started in Alberta. There have been over twelve years of competition [unclear], various different competitions that have been around for many years, going back to 2006. These competitions have become something of a benchmark for artificial intelligence systems that are able to play this canonical video game.

From an AI point of view, what's interesting? Well, the complexity is immense. I mean, we talked about [unclear], well, here the action space is around 10 to the 26 possibilities. You can think of this as a giant game of chess with hundreds of units on each side, and at each one of these steps you get to
play an action across different kinds of actions, and each of those actions can do something to any subset of your own units. So it's like you get to say: move these 17 different pawns over to some particular location on this huge map. [unclear] And at the end of that we just give it one bit of information, which is whether you won or lost.

It's also very impoverished information, in two particular ways. First of all, there's something called the fog of war, which means that you only see the opponent's units that are within range of your own units. So if your opponent's units are outside the line of sight of your own units, you've got no idea they're even there or what your opponent is doing; the vast majority of the map is unseen. In addition, you have to move your camera around: you can only actually act on units that are on screen.

There are also hard game-theoretic challenges. In particular, it's like a giant game of rock-paper-scissors, in that every strategy has a counter-strategy, partly by design and partly through discovery by pro players and tuning of the game over the years.

OK, here I'm showing observations from this [unclear]. You see this kind of thing flashing here [unclear]: that means that we're actually not only picking the action to play, we're also picking how long to wait until the next observation. So the agent can say: I'll do something, and then wait [unclear]. The action is a compound action that has many parts [unclear].

These are different iterations of training [unclear]. We saw that performance steadily grew over time, and we estimated [unclear] the better players [unclear]. So according to our internal estimates we were doing better. We also saw monotonic progress with respect to what we had seen before. This
idea of [unclear] robust self-play, where we play against our memory of previous versions of ourselves, was working as intended. This plot shows you essentially the dominant strategies, which are computed by looking at the Nash equilibrium in each iteration, that is, which strategies were dominating the other strategies. Across the later iterations of training [unclear], here you see the dominant strategies [unclear].

The last thing was that we actually went back to this question of whether we could address the one final way in which we weren't playing the way humans play, which was being able to move the camera around in the way that humans move the camera around. [unclear] compared to the one we tried before [unclear] online [unclear]. When we evaluated the agent I showed you before, there was some criticism from the community that we were still acting a little bit too fast compared to what humans could do in certain situations as well.

So I just wanted to finish by saying: we've talked about games, and I hope I've [unclear] in some small way [unclear] in general. We want to know that these techniques can have impact beyond games. So I'll just mention one success story [unclear]. Many of the ideas I mentioned [unclear] search [unclear].
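
The training step described near the start of this part, where the network learns to predict the search's move choice in one step and also to predict the winner of the self-play game, can be written as a single loss. The following is a minimal Python sketch under stated assumptions, not DeepMind's implementation. The function names are hypothetical. Using normalized search visit counts as the policy target follows the published AlphaGo Zero description and is an assumption here, since the talk only says that the search's action selection is distilled into the network. Weight regularization is left out.

```python
import math


def softmax(logits):
    """Turn raw network outputs into move probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search_policy_target(visit_counts):
    """Assumed teacher signal: how often the search visited each move, normalized."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]


def training_loss(policy_logits, value_pred, visit_counts, winner):
    """Policy cross-entropy against the search, plus squared error on the game result.

    winner is +1 if the player to move went on to win the self-play game, -1 otherwise.
    """
    target = search_policy_target(visit_counts)
    probs = softmax(policy_logits)
    # Cross-entropy: pushes the one-step network policy toward the search policy.
    policy_loss = -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)
    # Squared error: pushes the value prediction toward the actual winner.
    value_loss = (winner - value_pred) ** 2
    return policy_loss + value_loss


# The search strongly preferred move 0 and the player to move went on to win.
good = training_loss([5.0, 0.0], 1.0, [100, 0], 1)
bad = training_loss([0.0, 5.0], 0.0, [100, 0], 1)
print(good < bad)
```

A network that already agrees with the search and with the game outcome gets a small loss, and one that disagrees gets a large loss. Minimizing this over many self-play positions is what lets the search act as the teacher for the next iteration.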

