David Silver leads the reinforcement learning research group at DeepMind. He was lead researcher on AlphaGo and AlphaZero, co-lead on AlphaStar and MuZero, and has contributed a lot of important work in reinforcement learning. ~From Lex Fridman's Website
I waited a long time for this interview, and it did not disappoint. I learned a lot from it: how David Silver thinks about building a General Intelligence (check out MuZero) and how Reinforcement Learning plays a part in it. I gained tremendously from the interview, which runs for more than 100 minutes, and I strongly recommend that you go through it as well.
In case you did not know, David Silver also taught a Reinforcement Learning course that is available on YouTube. Here are the playlist and course website (for the notes). The recommended textbook for the course is the classic book by Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction". Richard S. Sutton was also David Silver's Ph.D. thesis supervisor.
Note 1: Go-Playing Systems
David Silver described the brain as an information processing system: it takes in information, processes it, and turns it into output. The study of intelligence, then, is about working out the processes inside.
David Silver's Ph.D. thesis was on building a Go-playing system using Reinforcement Learning. At the time of his Ph.D., such systems were built as rule-based systems using inference or reasoning methods (Classical AI). These methods reasoned about which patterns on the board were useful for predicting the win-lose result, and chose moves that led to a win. There was no deep learning involved; many heuristic search algorithms were used instead.
In the game of Go, humans were better because intuition helped them a lot in evaluating the current state of the board: how much territory has been captured and how much is possible or potential territory. To beat humans, such intuition needs to be built into the system. In other words, learning needs to be built into it.
Previous (Classical AI) systems were given a map of knowledge made up of many pieces (or subsystems). Each subsystem comes into play when a certain sub-problem appears on the board, meaning the system has to identify the sub-problem at hand through established rules and bring the necessary subsystem online to solve it. With so many pieces, such a system can be extremely brittle. With learning processes built in, the system can instead learn and verify on its own which subsystem to bring in.
Fun Fact: Check out MoGo, which was mentioned in the interview. It was among the strongest Go programs of its time, built using Classical AI methods (no deep learning). In short, as described in the interview, at each position MoGo plays out games randomly (Monte Carlo) to estimate the win rate for black and for white. This information is then used to evaluate the next move.
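To make the random-playout idea concrete, here is a minimal sketch of flat Monte Carlo move evaluation. It is not MoGo's actual algorithm (MoGo combined playouts with a search tree); it only shows how random games can estimate a win rate. I use tic-tac-toe instead of Go to keep it short, and every name here (`rollout`, `monte_carlo_win_rates`, `best_move`) is my own invention for illustration.

```python
import random

# All eight winning lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    """Indices of the empty cells."""
    return [i for i, cell in enumerate(board) if cell is None]


def rollout(board, to_move, rng):
    """Play uniformly random moves until the game ends.

    Returns the winner, or None for a draw.
    """
    board = list(board)
    while winner(board) is None and legal_moves(board):
        board[rng.choice(legal_moves(board))] = to_move
        to_move = 'O' if to_move == 'X' else 'X'
    return winner(board)


def monte_carlo_win_rates(board, player, n_rollouts=200, seed=0):
    """For each legal move, estimate how often `player` wins random playouts."""
    rng = random.Random(seed)
    opponent = 'O' if player == 'X' else 'X'
    rates = {}
    for move in legal_moves(board):
        after = list(board)
        after[move] = player
        wins = sum(rollout(after, opponent, rng) == player
                   for _ in range(n_rollouts))
        rates[move] = wins / n_rollouts
    return rates


def best_move(board, player, **kwargs):
    """Pick the move with the highest estimated win rate."""
    rates = monte_carlo_win_rates(board, player, **kwargs)
    return max(rates, key=rates.get)


# X holds cells 0 and 1, O holds cells 3 and 4, X to move.
board = ['X', 'X', None, 'O', 'O', None, None, None, None]
rates = monte_carlo_win_rates(board, 'X')
choice = best_move(board, 'X')
```

On this board, completing the top row at cell 2 wins every playout, so it gets the highest estimate. The same idea scales poorly to Go without a smarter search on top, which is part of why learning became so important.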
Note 2: Reinforcement Learning
David Silver was asked, "Is Reinforcement Learning at the core of Intelligence?" He explained that most tasks that reflect or define Intelligence can be formalized as an RL problem, which gives a path to building Intelligence.
To David Silver, Reinforcement Learning is "a study, science, and problem of intelligence in the form of an agent that interacts with the environment. The agent will take different actions in the environment with the environment giving back the reward signal. Reinforcement Learning is to get the agent to learn how to maximize the reward signal."
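The agent-environment interaction described above can be sketched as a simple loop. This is only an illustration under my own assumptions: `CorridorEnv` is a made-up toy environment (walk right along a corridor to earn a reward), and the `reset`/`step` method names are a common convention rather than anything from the interview.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, reward 1.0 at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode at the leftmost cell."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply action -1 (left) or +1 (right); return (state, reward, done)."""
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode of the agent-environment loop; return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent acts
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
reward_right = run_episode(CorridorEnv(), lambda state: 1)
reward_left = run_episode(CorridorEnv(), lambda state: -1)
reward_random = run_episode(CorridorEnv(), lambda state: rng.choice([-1, 1]))
```

A policy that always moves right collects the reward, one that always moves left never does, and a random policy may or may not get there within the step limit. Learning, in this picture, means changing the policy so that the total reward goes up.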
So what is Deep Reinforcement Learning? An RL agent is made up of three components, and each type of RL puts a different emphasis on them. The components are:
- Value Function: a prediction of the value/reward the agent is going to get in the future.
- Policy: a representation of how the agent determines what actions to take.
- Model: a representation of the environment.
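To see how the three components relate, here is a small sketch on a made-up four-state chain, using plain Python dictionaries instead of neural networks. The chain, the discount factor `GAMMA`, and the function names are all my own assumptions for illustration; value iteration is a standard textbook method, not something specific to the interview.

```python
# Hypothetical chain of states 0..3; reaching state 3 ends the episode.
STATES = [0, 1, 2, 3]
ACTIONS = [-1, +1]
TERMINAL = 3
GAMMA = 0.9  # discount factor: how much future reward is worth today


def model(state, action):
    """Model: predicts (next_state, reward) for a state-action pair."""
    next_state = min(max(state + action, 0), TERMINAL)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def value_iteration(n_sweeps=50):
    """Value function: discounted future reward from each state under best play."""
    values = {s: 0.0 for s in STATES}
    for _ in range(n_sweeps):
        for s in STATES:
            if s == TERMINAL:
                continue
            values[s] = max(r + GAMMA * values[s2]
                            for s2, r in (model(s, a) for a in ACTIONS))
    return values


def greedy_policy(values):
    """Policy: in each state, pick the action with the best one-step lookahead."""
    policy = {}
    for s in STATES:
        if s == TERMINAL:
            continue
        policy[s] = max(
            ACTIONS,
            key=lambda a: model(s, a)[1] + GAMMA * values[model(s, a)[0]],
        )
    return policy


values = value_iteration()
policy = greedy_policy(values)
```

Here the values grow as the agent gets closer to the rewarding state, and the greedy policy moves right everywhere. In deep RL, the dictionaries and the hand-written `model` function would be replaced by learned neural networks.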
Deep reinforcement learning comes in when deep learning models are used to represent these components. David Silver mentioned that he was surprised by Deep Learning in that its learning seems to be without bound. Deep Learning optimizes over a bumpy, high-dimensional space (to understand this, I suggest you have a look at Gradient Descent). To humans, who can only picture up to three dimensions, it seems that the search should easily get caught in a local optimum. In reality, Deep Learning manages to go even lower in that space when given enough time and/or additional dimensions. This point is really interesting to me, because in what we have seen so far, after training for many epochs and iterations, the loss seems to be "converging". It seems that adding more dimensions shakes up the space enough to provide a path to a lower optimum.
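The low-dimensional intuition about getting stuck can be shown with plain gradient descent on a bumpy one-dimensional function. The function `f` below is an arbitrary example I picked because it has two valleys of different depth. The sketch only shows the "stuck in a local optimum" intuition, not the high-dimensional behavior David Silver described.

```python
def f(x):
    """A bumpy 1-D function with a deep valley on the left and a shallow one on the right."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, n_steps=1000):
    """Repeatedly step downhill from x0 and return where we end up."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad_f(x)
    return x


left = gradient_descent(-2.0)   # settles in the deeper valley near x = -1.30
right = gradient_descent(2.0)   # settles in the shallower valley near x = 1.13
```

Starting from the right, gradient descent stops in the shallower valley and never finds the deeper one. In one dimension there is no way around the hill in between, which is the intuition that makes the behavior of very high-dimensional networks so surprising.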
Note 3: Self-Play versus Human Data
AlphaGo was initially trained on human data (past games). The reason was not that this was the traditional way; rather, DeepMind was exploring the limits of using Deep Learning in Reinforcement Learning. They wanted to know how far Deep Learning could go before moving on to build AlphaGo Zero, which is trained solely through self-play. Using RL, the system can learn to handle unseen situations and, through trial and error (Explore/Exploit), create robust solutions to them.
The next step was to generalize the learning so that it can be transferred to another domain, and thus AlphaZero was born (here is the paper). The furthest level of generalization so far is seen in MuZero, where the same software can master board games (Go, chess, and shogi) as well as Atari games.
David Silver mentioned that the current success is still limited: the general reinforcement learning systems built thus far perform well in environments where the rules are well-defined and information is perfect. The next step is to break through these barriers, but it is not going to be easy, given that the physical environment humans operate in is messy and its rules are not well-defined at all.
Having said that, it was mentioned in the interview that these systems do demonstrate creativity (David Silver defined it as being able to discover something that was not known before or is out of the norm). We have seen creativity in a few other places too (see Video 1 & Video 2). My suspicion is that this "creative" behavior comes more from the design, which brings me to a point I have been sharing for a long while: AI Ethics is not about building ethics into AI but about the designer, until the circumstances change.
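The Explore/Exploit trade-off mentioned above is easiest to see in a multi-armed bandit. The sketch below uses the epsilon-greedy rule, a standard textbook strategy. It is not how AlphaGo Zero explores; the win probabilities and parameter values are made up for illustration.

```python
import random


def epsilon_greedy_bandit(true_win_probs, epsilon=0.1, n_steps=5000, seed=0):
    """Learn which arm pays best by trial and error.

    With probability epsilon, try a random arm (explore);
    otherwise pull the arm with the best estimate so far (exploit).
    """
    rng = random.Random(seed)
    n_arms = len(true_win_probs)
    counts = [0] * n_arms       # how often each arm was pulled
    estimates = [0.0] * n_arms  # running average reward per arm
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical arms: the agent does not know these probabilities.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
most_pulled = counts.index(max(counts))
```

Without the exploration step, the agent could lock onto the first arm that happens to pay out and never discover the better one. With a little exploration, it ends up pulling the best arm most of the time.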
Note 4: Reward Function
At the end of the day, to build better Artificial Intelligence, all components of Reinforcement Learning should be thought through in greater detail: how to design the value function, policy, and environment (model) to be as close to reality as possible, and, more importantly, the reward function, since it drives the behavior of the agent, much as rewards drive humans.
There were many nuggets of "gold" in the interview on how David Silver sees Reinforcement Learning and how it fits into the broader problem of Intelligence that humans are trying to solve on the way to Artificial General Intelligence. Again, I strongly recommend that readers check out the video. It is definitely time well spent!
Below are further resources I came across while researching the interview.
- AlphaGo Zero explained in One Page
- Exploration in Reinforcement Learning
- Explore, Exploit, and Explode — The Time for Reinforcement Learning is Coming
- Paper on AlphaGo
I hope you find them useful!