Beware of Machine Learning
There have always been people who believed there is no limit to science. If everyone believed ‘the sky is the limit’, today’s world of technology wouldn’t exist.
Computing power is ever growing. With deep neural networks, conventional machine learning tasks like prediction, classification and clustering have become routine. Though machines haven’t surpassed human performance in every domain, the success achieved so far cannot be denied.
When global leaders like Elon Musk warn of the threat AI poses to humans, it is largely because of reinforcement learning techniques. These are the methods that are going to change the business landscape. Though the field is still in its infancy and we are yet to fully understand how to apply it to real problems, the potential is already visible. Some existing products of reinforcement learning are humanoid robots that can balance themselves and navigate, self-driving cars, voice recognition, and so on.
The concept of reinforcement learning was proposed in the 1950s, and it has undergone multiple improvements ever since. When modern AI caught up with reinforcement learning in recent years, it opened up many new opportunities, as well as nightmares (if someone turns the tools against humans).
To understand the domain completely, we will look at the evolution reinforcement learning has undergone.
First, let us understand the problem statement: what is reinforcement learning? It is the class of algorithms applied wherever a Markov Decision Problem exists. The solution applied is referred to as a Markov Decision Process.
What is a Markov Decision Problem? Wherever there is a discrete, random environment and an objective to be met in that environment, it can be defined as an MDP.
For example, in the case of an autonomous car, the environment in which it has to drive is random, and the objective is to transport from point A to point B.
Or a humanoid robot: just to make it walk like a human, irrespective of the surface it walks on, the robot has to maintain its balance and adjust the positions of its internal motors.
The different components present in an MDP are States, Actions and Rewards.
In the context of the autonomous car, these are:
States – All information about the car, plus the external scene information collected by the car’s sensors.
Actions – The action the car will take next: steer left, steer right, accelerate, brake, etc.
Reward – The benefit score given for every distance covered or unit of time that passes. The reward will be appropriate to the action.
The goal of a Markov Decision Process is to identify the correct action for every state and maximise the total reward. This is called the optimal policy.
The code implementation of a Markov Decision Process is usually split into two separate programs. One is a simulation of the environment, which gives a reward based on the state and action. The other is the agent, which holds the state information and takes the actions. The agent plays against the environment and identifies the optimal policy.
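As a sketch of that two-program split, here is a minimal example in Python. The “corridor” environment (states 0 to 4, goal at state 4) and its reward numbers are invented purely for illustration, and the agent is just a random policy, to show the interaction loop rather than any learning:

```python
import random

# A made-up environment: the agent walks a corridor of states 0..4.
# step() returns (next_state, reward, done) -- the agent/environment contract.
class LineWorld:
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):               # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1    # small cost per step, bonus at the goal
        return self.state, reward, done

# The agent side: a random policy playing against the environment.
random.seed(0)
env = LineWorld()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])
    state, reward, done = env.step(action)
    total_reward += reward
print("reached state", state, "with total reward", round(total_reward, 1))
```

A learning agent would replace `random.choice` with a rule that improves from the rewards it receives; the environment code would stay exactly the same.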
Now, there are multiple methods for deriving an optimal policy, listed here in the order they evolved.
Below are some of the fundamental methods.
Dynamic Programming – This forms the basis of reinforcement problem solving. In this method, complete knowledge of the environment (all states, possible actions and rewards) has to be known.
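A classic dynamic programming algorithm here is value iteration, sketched below on a tiny three-state chain whose transition probabilities and rewards are made up for illustration. Note that the full model `P` is assumed known in advance, which is exactly the requirement mentioned above:

```python
# P[s][a] is the known model: a list of (probability, next_state, reward) tuples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.9, 1, 0.0), (0.1, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.9, 2, 1.0), (0.1, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},          # terminal state, nothing left to earn
}
gamma = 0.9                                # discount factor
V = {s: 0.0 for s in P}                    # value estimate per state

# Repeatedly back up each state's value from its best action's expected outcome.
for _ in range(100):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

print(round(V[1], 3))                      # value of acting optimally from state 1
```

Once `V` stops changing, reading off the best action in each state gives the optimal policy.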
Monte Carlo Method – In this method, only samples of the environment are used to identify the optimal policy, rather than the full model.
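The Monte Carlo idea can be sketched as follows: estimate the value of a state as the average return over many sampled episodes, never touching the transition model. The random-walk dynamics and reward numbers below are invented for the example:

```python
import random

random.seed(1)
gamma = 1.0                                  # no discounting in this sketch

def run_episode():
    """Random walk on states 0..4; the episode ends at state 4."""
    state, rewards = 0, []
    while state != 4:
        state = max(0, min(4, state + random.choice([-1, 1])))
        rewards.append(1.0 if state == 4 else -0.1)
    return rewards

# Average the full episode return from the start state over many samples.
returns = []
for _ in range(2000):
    rewards = run_episode()
    g = sum(gamma ** t * r for t, r in enumerate(rewards))
    returns.append(g)

estimate = sum(returns) / len(returns)       # Monte Carlo estimate of V(start)
```

The key property: each estimate needs a complete episode before it can be used, which is the limitation the next method removes.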
Temporal Difference – This method also considers only samples of the environment, but keeps updating its own value estimates by learning from short sequences.
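A sketch of the simplest temporal-difference rule, TD(0): after every single step, nudge the value estimate toward the bootstrapped target `r + gamma * V[next_state]`, instead of waiting for the episode’s full return as Monte Carlo does. The corridor dynamics and constants are made up for illustration:

```python
import random

random.seed(2)
gamma, alpha = 0.9, 0.1          # discount factor and learning rate
V = [0.0] * 5                    # value estimates for states 0..4; 4 is terminal

for _ in range(500):             # many short episodes of a random walk
    s = 0
    while s != 4:
        s2 = max(0, min(4, s + random.choice([-1, 1])))
        r = 1.0 if s2 == 4 else -0.1
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # the TD(0) update
        s = s2
```

After training, states closer to the goal carry higher value estimates, which is exactly what a policy needs in order to prefer moving toward them.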
The final solution of these fundamental methods is derived from the Bellman Equation.
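Written out in standard notation, the Bellman optimality equation says that the value of a state equals the best expected one-step reward plus the discounted value of the state that follows:

```latex
% V*(s): optimal value of state s; P(s'|s,a): transition probability;
% R(s,a,s'): reward; gamma: discount factor.
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]
```

Dynamic programming applies this backup exactly using the known model, while Monte Carlo and temporal-difference methods approximate it from samples.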
Q-Learning Algorithm: This method evolved in the 1990s and produced better results than the earlier ones. Every action is chosen based on a function whose reward estimates are refined over multiple iterations. This method also uses the Bellman equation.
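A tabular Q-learning sketch on the same kind of toy corridor (states 0 to 4, goal at 4; all constants invented). `Q` maps each (state, action) pair to an estimated return, actions are chosen epsilon-greedily, and each step applies a Bellman-style backup:

```python
import random

random.seed(3)
gamma, alpha, epsilon = 0.9, 0.1, 0.2
Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}   # the Q table

for _ in range(500):
    s = 0
    while s != 4:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice([-1, 1])
        else:
            a = max((-1, 1), key=lambda x: Q[(s, x)])
        s2 = max(0, min(4, s + a))
        r = 1.0 if s2 == 4 else -0.1
        best_next = 0.0 if s2 == 4 else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # Bellman backup
        s = s2

# Read the learned policy off the table: the best action in each state.
policy = {s: max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(4)}
print(policy)
```

After enough episodes, the greedy policy consistently moves right toward the goal, without the agent ever having seen the transition model.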
Deep Q-Learning: As the environment grows bigger and bigger, maintaining and optimising a value for every state becomes difficult, hence neural networks are used to approximate the Q function.
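A real deep Q-network also adds machinery like replay buffers and target networks, and needs a deep learning library. The sketch below substitutes the simplest possible approximator instead, a linear function of a hand-made state feature (all numbers invented), purely to show the core idea of learning Q as a parameterised function rather than a table:

```python
import random

random.seed(4)
gamma, alpha = 0.9, 0.05
w = {-1: [0.0, 0.0], 1: [0.0, 0.0]}       # per-action weights: [bias, position]

def features(s):
    return [1.0, s / 4.0]                 # bias term + normalised position

def q(s, a):
    """Approximate Q(s, a) as a dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w[a], features(s)))

# Sample random one-step transitions of the toy corridor and regress the
# approximator toward the bootstrapped target r + gamma * max_a' Q(s', a').
for _ in range(2000):
    s = random.randint(0, 3)
    a = random.choice([-1, 1])
    s2 = max(0, min(4, s + a))
    r = 1.0 if s2 == 4 else -0.1
    target = r if s2 == 4 else r + gamma * max(q(s2, -1), q(s2, 1))
    error = target - q(s, a)
    # gradient step on the squared TD error (the DQN-style update, minus
    # the replay buffer and target network)
    w[a] = [wi + alpha * error * fi for wi, fi in zip(w[a], features(s))]
```

Swapping the two-weight linear model for a neural network, with the same update rule, is essentially what deep Q-learning does.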
There will be a series of articles following this one, explaining the individual methods.
This is my bio:
I’m Baranikumar, with a 7-year IT career across multiple Fortune 500 companies. I have worked as a software developer and tester and an avionics system engineer, and I am now an educator in machine learning.
I hold a Masters in Data Analytics, am a Toastmaster, and own a franchise of the LIVEWIRE skill training institute (a division of CADD training services, a pioneer in training and skill development for the last 30 years).
I teach regularly at LIVEWIRE. You can email me with any queries (firstname.lastname@example.org).
Alternatively, you can reach me through the contact form below.