Demos of this kind look pretty interesting at the graphical level. Under the hood, however, they are simple mathematics and some logic. The engines that play chess or Go may be complex, but quite a few demos like this one are fairly simple, logical math.
This is an example of reinforcement learning (you may read the paper here). The demo environment is a 5 by 5 square board. There is an agent, A, at the top left corner, and he has to learn:
To reach the bottom right corner, H, and
To get there by the shortest or near-shortest possible path.
The Agent’s World (See the live demo here)
Watch all 15 episodes. It should take about 4-5 minutes.
The agent A can move right, left, up, down, or diagonally (up-left, up-right, down-left, down-right), one square at a time.
Initially the agent doesn't know where H is. In other words, he has no knowledge of the destination and has to learn to get there. What he does have is the ability to make the moves mentioned above and the capacity to learn in this demo environment.
For the sake of simplicity I have kept the destination fixed. We could make the destination dynamic, and the same capacity of the agent, the learning algorithm, would still handle the learning part. The demo is based on the Q-table, the old-school algorithm for reinforcement learning, where the machine learns without the labelled data that we keep hearing about in deep learning or neural nets.
In the demo, A starts out moving at random around the demo world (the 5 by 5 board). His objective is a given; that is, it is hard coded. In other words, he doesn't learn what the objective is. He learns, and gets smarter about, the path to get there.
Where is the learning here?
The demo is not hard coded to take A to H after some random moves. It is coded to find the shortest path to H. As A moves up, down and so on, he learns. A keeps track of his moves and of the state, or board position, he is in. Each move (well, not every one) teaches A more about the environment and hence about the path to his destination.
How does the Q-table algorithm work?
At the heart of the learning is the Q-table, where we record each state and the possible actions the agent can take from that state. For instance, from the start position A can make 3 moves: right, diagonally down, and down.
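Here is a minimal sketch of such a table, assuming states are (row, column) pairs on the 5 by 5 board and every Q-value starts at 0. The names BOARD_SIZE, MOVES, valid_actions and build_q_table are made up for illustration; the demo's own code may be organised differently.

```python
# Illustrative sketch only; these names are not taken from the demo.
BOARD_SIZE = 5

# The eight moves as (row change, column change).
MOVES = {
    "right": (0, 1),
    "left": (0, -1),
    "down": (1, 0),
    "up": (-1, 0),
    "down-right": (1, 1),
    "down-left": (1, -1),
    "up-right": (-1, 1),
    "up-left": (-1, -1),
}


def valid_actions(state):
    """Return the names of the moves that keep the agent on the board."""
    row, col = state
    return [
        name
        for name, (d_row, d_col) in MOVES.items()
        if 0 <= row + d_row < BOARD_SIZE and 0 <= col + d_col < BOARD_SIZE
    ]


def build_q_table():
    """Map every square to its legal moves, each with a starting Q-value of 0."""
    return {
        (row, col): {action: 0.0 for action in valid_actions((row, col))}
        for row in range(BOARD_SIZE)
        for col in range(BOARD_SIZE)
    }


q_table = build_q_table()
print(sorted(q_table[(0, 0)]))  # ['down', 'down-right', 'right']
print(len(q_table[(2, 2)]))     # 8
```

The corner square gets the 3 moves described above, while a square in the middle of the board gets all 8.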
Initially the agent moves randomly and somehow reaches H, thanks to chance and the brute-force computing ability of computers. Once he reaches H, the agent is given an incentive, i.e. the Q-table is updated. The action that took him to that terminal position is given the highest Q-value.
Position of Q-table after some moves
Action codes are as follows:
1 = right, 6 = diagonal down left, 7 = down, 8 = diagonal down right, -1 = left, -6 = diagonal up right, -7 = up, -8 = diagonal up left.
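The codes appear to be offsets into a list of squares stored row by row with seven squares per row, so that "down" adds 7 and "right" adds 1. That stride of 7 is inferred from the numbers and is an assumption; how it maps onto the 5 by 5 board is not stated here. Under that assumption, a small sketch (ROW_STRIDE and decode are hypothetical names) can split each code back into a row change and a column change:

```python
ROW_STRIDE = 7  # assumption inferred from "7 = down"; not stated in the post

ACTION_NAMES = {
    1: "right",
    6: "diagonal down left",
    7: "down",
    8: "diagonal down right",
    -1: "left",
    -6: "diagonal up right",
    -7: "up",
    -8: "diagonal up left",
}


def decode(code):
    """Split an action code into (row change, column change)."""
    d_row = round(code / ROW_STRIDE)      # nearest whole number of rows
    d_col = code - d_row * ROW_STRIDE     # what is left is the column change
    return d_row, d_col


for code, name in ACTION_NAMES.items():
    print(code, name, decode(code))
```

Each code decodes to a step of at most one row and one column, which agrees with the names in the list above.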
Now a fresh episode starts and the agent again moves randomly, by brute force. This time he may reach H, or he may reach the other position, the one from which he reached H in the previous episode. If he reaches H then, just as in the previous episode, the state and action from which he reached it get the same Q-value. If instead he reaches that earlier state (call it H-1), he updates the Q-table for the action that took him there with a slightly discounted Q-value.
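The update described above can be sketched roughly as follows. The exact numbers are not given here, so GOAL_REWARD = 1.0 and DISCOUNT = 0.9 are assumptions, and update_q is a hypothetical helper. It is a simplified form of the Q-learning update that overwrites the old value instead of blending it in with a learning rate.

```python
GOAL_REWARD = 1.0  # assumed Q-value for an action that lands on H
DISCOUNT = 0.9     # assumed discount applied one step further from H


def update_q(q_table, state, action, next_state, reached_goal):
    """Store the new Q-value for taking `action` in `state` and return it."""
    if reached_goal:
        q_table[state][action] = GOAL_REWARD
    else:
        # Discount the best value known so far from the square we landed on.
        best_next = max(q_table[next_state].values(), default=0.0)
        q_table[state][action] = DISCOUNT * best_next
    return q_table[state][action]


# A tiny three-square corridor: H-2 -> H-1 -> H.
q_table = {"H-2": {"right": 0.0}, "H-1": {"right": 0.0}, "H": {}}

# Episode 1: the agent steps from H-1 onto H.
print(update_q(q_table, "H-1", "right", "H", reached_goal=True))     # 1.0

# Episode 2: the agent steps from H-2 onto H-1, which now has a value.
print(update_q(q_table, "H-2", "right", "H-1", reached_goal=False))  # 0.9
```

With these assumed numbers, an action that is one step further from H gets 0.9 times the value of the best action available from the square it leads to.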
As you can see, the Q-table is updated with Q-values in each episode. After a certain number of episodes the Q-table, the table of actions from each state, is fully filled in with Q-values.
So how did A learn?
Okay, there is another built-in capacity of the agent which I forgot to mention. In fact, it is best mentioned here. The agent does not choose his actions randomly. Before he moves, he first checks the Q-values of all the possible actions from the state, or position, he is in, and then chooses the action with the highest Q-value. How, then, did he move randomly at first? Because the Q-table was 0 for all actions early on, so every action was tied and the choice appeared random.
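Putting the pieces together, the whole loop might look something like the sketch below. It is not the demo's actual code: the 0.9 discount, the value of 1.0 for reaching H, the 30 episodes, the step cap and all function names are assumptions chosen for illustration. The agent always takes the highest-valued action and breaks ties at random, which is why the first episode looks like a random walk.

```python
import random

SIZE = 5
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
DISCOUNT = 0.9  # assumed discount factor
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]


def legal_moves(state):
    """Moves, as (row change, column change), that stay on the board."""
    row, col = state
    return [(dr, dc) for dr, dc in MOVES
            if 0 <= row + dr < SIZE and 0 <= col + dc < SIZE]


def choose_action(q_values, rng):
    """Pick the action with the highest Q-value, breaking ties at random."""
    best = max(q_values.values())
    return rng.choice([a for a, q in q_values.items() if q == best])


def run_episode(q_table, rng, max_steps=10_000):
    """Play one episode from START and return the number of moves taken."""
    state = START
    for step in range(1, max_steps + 1):
        move = choose_action(q_table[state], rng)
        next_state = (state[0] + move[0], state[1] + move[1])
        if next_state == GOAL:
            q_table[state][move] = 1.0  # assumed top value for reaching H
            return step
        q_table[state][move] = DISCOUNT * max(q_table[next_state].values())
        state = next_state
    return max_steps  # safety cap so the sketch always terminates


def train(episodes=30, seed=0):
    """Run several episodes on a fresh, all-zero Q-table."""
    rng = random.Random(seed)
    q_table = {(r, c): {m: 0.0 for m in legal_moves((r, c))}
               for r in range(SIZE) for c in range(SIZE)}
    lengths = [run_episode(q_table, rng) for _ in range(episodes)]
    return q_table, lengths


q_table, lengths = train()
print(lengths)  # moves per episode; later episodes should be much shorter
```

Because this agent never explores once it has a non-zero value to follow, the path it settles on is not guaranteed to be the shortest one, only a short one, which matches the "shortest or near-shortest" goal stated at the start.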
