
Machine Learning and Blockchain Part 1

The two most disruptive technologies of this decade are artificial intelligence and blockchain technology. While the two came about in different eras and their initial purposes were quite different, it genuinely looks like they were made for each other.

Artificial intelligence depends on data to make machines smarter, while blockchain technology's entire purpose lies in providing honest, tamper-proof data. So, to understand this marriage and its benefits, you need to understand what artificial intelligence is. In this guide, we are going to talk about one of the core concepts of artificial intelligence: machine learning.

Since the topic is so vast we will need to distribute it into several guides. In this part we are going to talk about:

  • Different kinds of machine learning
  • Supervised and unsupervised learning
  • Classification vs Regression
  • Training and test set
  • Hypothesis functions
  • Linear regression
  • Basics of linear algebra
  • Gradient descent
  • Normal equation

Hoo boy, we have a lot of topics to go through. So let’s begin straightaway.

What is Machine Learning?

First and foremost, what is machine learning?

Machine learning (ML) is the study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. In layman’s terms, it is the knowledge source that a computer system uses to become more intelligent.

So, what are the different kinds of learning systems out there? Well, they basically fall under 4 distinct categories:

  • Supervised Learning
  • Unsupervised Learning
  • Semi-Supervised Learning
  • Reinforcement Learning

Supervised Learning

In supervised learning, you feed the machine a series of inputs and its corresponding outputs. The idea is to get them accustomed to the kind of outputs they can expect from certain inputs so that they eventually learn for themselves as to what output they should give to future inputs.

[Input/Output Set] —–> [Machine Algorithm] ——> [Working Model]

This is basically classic spoon feeding. You are telling your machine to learn how to identify specific outputs for inputs. Think of how you learned maths when you were a kid. You went through some examples and then solved new problems of your own.

Unsupervised Learning

In unsupervised learning, you simply feed a set of inputs to the machine without giving them the required outputs. The idea is to help your machine recognize and identify different patterns in the input and cluster them according to similar data.

So, if you feed 1, black, 3, red to the machine, then they will cluster them as:

{red, black}, {1,3}

So, the skeletal framework of this machine will look like this:

[Input Set] —–> [Machine Algorithm] ——> [{Cluster 1}, {Cluster 2}….{Cluster N}]
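The toy example above can be sketched in a few lines of Python. Here the "feature" we cluster on is simply each value's type; a real unsupervised algorithm would have to discover the grouping criterion from the data itself, so treat this as an illustration of the idea only.

```python
# Toy "clustering": group a mixed input set by a single feature (the value's type).
# Real unsupervised algorithms discover such groupings from the data itself.
def cluster_by_type(inputs):
    clusters = {}
    for item in inputs:
        key = type(item).__name__          # the feature we cluster on
        clusters.setdefault(key, []).append(item)
    return list(clusters.values())

print(cluster_by_type([1, "black", 3, "red"]))  # [[1, 3], ['black', 'red']]
```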

Semi-Supervised Learning

Somewhere between the supervised and unsupervised learning, we have semi-supervised learning. Semi-supervised data is a combination of labeled and unlabeled data. Labeled data are the inputs which have the corresponding outputs, while unlabeled data has inputs with no outputs.

Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy over unsupervised learning (where no data is labeled), but without the time and costs needed for supervised learning (where all data is labeled).

Reinforcement Learning

Finally, we have reinforcement learning.

In this kind of learning, we have an agent who interacts with the environment by committing an action. This action changes the state of the environment and, depending on whether the new state is good or bad, the agent will either get a reward or a punishment.

We have ourselves faced this kind of learning multiple times in our life. Whenever we did something good, we got rewarded, while every time we did something bad/wrong, we got punished. This taught us which actions were the right ones to take and which ones weren’t.

Supervised vs Unsupervised Learning

Most of our focus will be on supervised and unsupervised learning. So, let’s take an example and understand the difference between the two.

What we are going to do here is feed two different machines supervised and unsupervised data respectively. The common theme is a basket of fruits. In our case, we have an apple, a banana, and a bunch of grapes.

So, how will we proceed with the supervised and unsupervised approaches?

Supervised Approach

We are going to use “features” to carefully label our inputs. This will, in essence, train our machine and help them recognize the fruits later.

So, the features that we can use here are:

  • Color
  • Size
  • Shape

Using these features we can help describe the inputs. The idea is to train our machine to discern the name of the fruits (output) when faced with certain features of the fruits (inputs). So, in this case, it will be:

  • Red, Big, Round → Apple
  • Yellow, Big, Cylindrical → Banana
  • Green, Small, Round → Grapes


Upon getting all this data, aka the training data, the machine can create an algorithm wherein, in the future, when given a fruit, it will look at its:

  • Colour
  • Size
  • Shape

After that, it will be able to tell if the fruit is Apple, Banana, or Grape.
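A minimal sketch of what the trained machine ends up with, assuming we simply memorize the three labeled examples. A real model would generalize rather than look up exact matches, so this is illustration only:

```python
# Training data: feature tuples (color, size, shape) with their labels.
training_data = {
    ("red", "big", "round"): "Apple",
    ("yellow", "big", "cylindrical"): "Banana",
    ("green", "small", "round"): "Grapes",
}

def predict(color, size, shape):
    """Return the fruit whose features match the input (exact match only here)."""
    return training_data.get((color, size, shape), "unknown")

print(predict("red", "big", "round"))    # Apple
print(predict("blue", "big", "round"))   # unknown
```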

Unsupervised Approach

Now, let’s look at the unsupervised approach.

So, now you just give the fruits to the machine without exactly telling them what the name of the fruit is. So, in this case, there is a bunch of inputs without any labels.

The first thing that the machine will do is to cluster the inputs based on a feature. Let’s start with the size. According to the size, we will have two clusters, {(Big), (Small)}.

With our fruits, the clusters will now look like this:

{(Apple, Banana), (Grapes)}

However, this is not enough; we need to break the clusters down even more to get more specific results. So, now let's arrange the clusters with respect to another feature: color.

So, we have:

  • Red and Big: Apple
  • Yellow and Big: Banana
  • Green and Small: Grapes.

Giving us the combined cluster: {(Apple), (Banana), (Grapes)}

So, this is how our machine will learn with unsupervised learning. Our machine never interacts with any labeled data.

Going Deeper Into Supervised Learning

As we have said before, when it comes to supervised learning you will feed some training data to your machine, which will look something like this:

INSTANCES   X1   X2   X3   X4   …   Xn   Y
Ia          a1   a2   a3   a4   …   an   Ya
Ib          b1   b2   b3   b4   …   bn   Yb
Ic          c1   c2   c3   c4   …   cn   Yc
Id          d1   d2   d3   d4   …   dn   Yd


In the table above, we are only dealing with four instances of the training data: Ia, Ib, Ic, and Id. Each of those instances gives their own corresponding outputs: Ya, Yb, Yc, Yd.

Obviously, there can be "n" instances. Let's take one output from the training data, say Yd.

Upon examining the value of Yd, we need to see whether it is a continuous value or a discrete value.

From a mathematical POV, what are continuous and discrete variables?

Continuous Variables

A continuous variable is any variable whose value can keep changing in a given period. Think of your exact current age: it is increasing every single second.

Think of the graph of a curve like a parabola:

As you travel through the graph, the value of your coordinates (x,y) will be continuously changing.

Discrete Variables

A discrete variable can only take up one specific value in a given time period. Check out this graph:

In the graph above, we are charting specific points in a parabola. As you can see, those specific coordinates on the curve will remain the same no matter how much time passes. This is why, if we were to chart those points, there is no continuous flow in their value. Hence, they are discrete.

So, let’s give you an example. “Between August 15, 2018, to August 17, 2018, how many students were in your university?” That’s a specific number which didn’t change in that time period right?

That’s why this is a discrete value.

Regression vs Classification

Alright, so now let's look at what continuous and discrete values mean in the context of machine learning. In the example above, if the value of Yd is discrete, then we are facing a Classification problem; if it is continuous, then we are facing a Regression problem.

Regression Problem

In regression, our machine predicts the value of the continuous variable Y, based on the input data X. The way it does that is by finding the best relationship that represents the given set of data.

An example of this can be predicting the value of a house in a certain location at a given time.

In fact, let's look at a classic regression problem: selling a used car. In order to predict the price, you will need to look at certain attributes of the used car. In our case, let's assume that you are only looking at the mileage to guess the price.

So, given the mileage and price of several cars, you have the data points and can come up with a function such that, given a new car whose mileage is known, you can predict its price. In regression, you come up with a function which takes the input instance and the parameters of the model.

In this case, the equation that we have come up with is: y = g(x,θ)

“y” is the output while x and θ are the input for the model algorithm function g().
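A sketch of y = g(x, θ) for the used-car example. The parameter values below are made up purely for illustration; in practice they would be learned from the training data:

```python
def g(x, theta):
    """Linear regression hypothesis: predicted price = theta[1]*mileage + theta[0]."""
    return theta[1] * x + theta[0]

# Hypothetical parameters: base price of 20000, losing 0.10 per mile driven.
theta = (20000.0, -0.10)
print(g(50_000, theta))  # 15000.0
```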

Classification Problem

If the output is discrete, then it is deemed a classification problem. A model based on classification can help in determining what a future input data can be labeled as. Classification basically means that given a known relationship, the machine will be able to identify the class/feature that data belongs to. All the data in a classification problem is clearly differentiated into classes.

Let’s look at a classic case of the classification problem. Credit scoring.

Ok, so what we have above is a simple x-y graph. On the x-axis we have the income and on the y-axis, we have the savings of the person. The condition that we have right here is as follows. In order to qualify for a loan, the person must:

  • Have income > θ1
  • Have savings > θ2.

This is why, if you look at the graph again:

All the low-risk data is what qualifies for a loan, while the rest don’t.
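The loan rule above amounts to two threshold checks. Here is a sketch with assumed values for θ1 and θ2; a real model would learn these thresholds from the data:

```python
# Threshold classifier for the credit-scoring example.
# THETA1 and THETA2 are assumed values, not learned ones.
THETA1 = 40_000   # minimum income to qualify
THETA2 = 10_000   # minimum savings to qualify

def eligible_for_loan(income, savings):
    """Classify an applicant as low risk (True) or high risk (False)."""
    return income > THETA1 and savings > THETA2

print(eligible_for_loan(55_000, 12_000))  # True  (low risk)
print(eligible_for_loan(30_000, 20_000))  # False (income too low)
```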

Do you see the difference between classification and regression here? In the classification problem, the results were clearly differentiated into classes. In the case of our credit scoring classification model, the results were differentiated into two classes:

  • Eligible for loan
  • Ineligible for loan

This was not possible in a regression problem.

Can you guess why? The results of a regression problem are continuous in nature. Since they are not discrete, they can't be separated into classes. Simple as that.

Features, Training Set, and Test Set

So, we have talked quite a lot about features so far. Features are what help the machine understand the input value. They are quantifiable properties that can actually be measured. Features fall under 4 categories:

  • Categorical: Blood type is an example of a categorical feature. A blood type can be A, B, AB, or O
  • Ordinal: Large, medium, small.
  • Integer-valued: Simple counting properties. Eg. the number of words in a given page
  • Real-valued: Eg. height

Now, onto training set and test set.

A lot of data needs to be fed into the machine for it to create its model. To create the most efficient model possible, the data set is divided into two sets:

  • Training set
  • Test data set

The training set has the inputs and their corresponding outputs aka labels. The training set is there solely for the training of your model so that they can discern the relationship between X and Y.

The test set, on the other hand, doesn’t feed any labels to the model. It just feeds the input so that the machine can calculate what the output can be. The output of the machine is then checked to see if they are correct or not. If they are wrong by a large margin, then the model is further worked upon via re-training.
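The training/test split described above can be sketched without any library. The 80/20 ratio used here is a common convention, not something prescribed by the text:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle labeled (x, y) pairs and split them into training and test sets."""
    data = data[:]                     # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

labeled = [(i, 2 * i) for i in range(10)]   # toy labeled dataset: y = 2x
train, test = train_test_split(labeled)
print(len(train), len(test))                # 8 2
```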

What is a Hypothesis?

One of the fundamental rules of machine learning is that in order to get the output y from the input x, there must exist a relationship between the two, which can be denoted by a function f() such that y = f(x).

Now, obviously, the machine won't stumble onto the correct optimal function right off the bat, right? So, what machine learning algorithms do is guess, or hypothesize, what the correct function could be.

So that is pretty much what a hypothesis is, it is a certain function that may hopefully be similar to the target function that we want to model. In the context of email spam classification, it would be the rule we came up with that allows us to separate spam from non-spam emails.

In machine learning, the set of all hypotheses that we go through to pinpoint our final function is called a Hypothesis set. Our main goal is to go through these different hypotheses, run various tests, and finally decide on one which can be our final function.

As you can imagine, different machine learning models have different hypothesis sets. For example, the 2-D perceptron has the hypothesis set: H(x) = { sign(w1*x1 + w2*x2 + w0) : for all w0, w1, w2 }
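That hypothesis set is easy to write out in code: every choice of the weights (w0, w1, w2) picks out one hypothesis from H. A sketch, using the convention that sign(0) = +1:

```python
def sign(v):
    """Sign function; ties at 0 are broken toward +1 here."""
    return 1 if v >= 0 else -1

def perceptron_hypothesis(w0, w1, w2):
    """One member of the 2-D perceptron hypothesis set H."""
    return lambda x1, x2: sign(w1 * x1 + w2 * x2 + w0)

# One particular hypothesis: the line x1 + x2 - 1 = 0 splits the plane.
h = perceptron_hypothesis(-1.0, 1.0, 1.0)
print(h(2, 5))   # 1  (above the line)
print(h(0, 0))   # -1 (below the line)
```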

So, how exactly do the training set and the hypothesis set link up with each other?

Well…it looks something like this:

Feature and Hypothesis Space

So, what have we learned so far?

The models use the features of the input to learn what the output can be. Based on that, they create a hypothesis to establish the relationship between the input and the output. So, the features are basically the building blocks of the hypothesis. Obviously, any instance can have n-number of features, but let's assume that for now we are feeding our model an instance which has just two features, x1 and x2.

For an instance with n features, we define an n-dimensional feature space; so, for 2 features, we define a 2-dimensional feature space.

The graph above shows the feature space. The red spot that you see is a point or an instance in the feature space where x1 = 2 and x2= 5.

So, we are feeding instances of data into a model, each of those instances is a point in the feature space for that particular model.

Think about it like this, going back to our fruit example. Suppose we have only two features, "size" and "color". The data of our features is as follows:

  • Size: Big, Small, Medium
  • Colour: Red, Yellow, Orange

Obviously, whatever hypothesis we make needs to be built around this data, right? That's what we mean when we say that a hypothesis is created in a feature space, aka all the data defined by the features.

So, how do we leverage the feature space to come up with our hypothesis? Let’s take a simple classification problem wherein there are two classes of instances:

  • Positive
  • Negative

So, this is what the distribution of instances looks like in our feature space. We have two features which have two classes of outputs: positive and negative.

Now, we need to decide on a hypothesis which will help us decide which class future instances will fall into. One way of solving this problem is to simply run a line through the feature space.

Check out the diagram below.

So, as you can see, the graph is heavy on the “+” class on the left side of the line. So all the data that lies of that side will be labeled as “+”. Similarly, because there is a higher density of “-” class data on the right side, we label all the instances on the right side as “-”.

Alright, now if you check the instances with respect to the hypothesis, you will see lots of "?". Those are instances whose class our model will have to predict during the testing phase. According to our hypothesis, we can make a probable guess that the instances on the left of the line will be "+" while the instances on the right will be "-".

One thing to keep in mind: our model is not deducing these results with 100% certainty. What it is doing is inducing them. Inductive reasoning is a method of reasoning in which the premises are viewed as supplying some evidence for the truth of the conclusion. While the conclusion of a deductive argument is certain, the truth of the conclusion of an inductive argument may only be probable, based upon the evidence given.

Now, you are probably wondering, why did we just draw that one specific line and deemed it as our hypothesis function right? We could have probably taken any other line through feature space and made that our hypothesis right?

You are absolutely correct.

As you can see, there are multiple hypotheses possible, which is in line with what we told you earlier. The goal is to go through multiple possible hypotheses and end up with one which gives us the least probability of failure.

So, in other words, the hypothesis space is the space of all legal hypotheses that you can describe using the features that you have chosen and the language that you have chosen.

Wait what…did we just say language?

What does “language” mean? Well..let’s check it out.

Types/Language of Hypotheses

We have actually talked about one of the types of hypotheses in the last section. Remember that line we drew through the feature space? That’s a linear function which helped us in differentiating between the two classes.

It is important to know what kind of hypothesis you are going to use because along with the features, it will help in describing your hypothesis space. So, what are the different kinds of functions out there?

Let’s take a look.

#1 Decision Tree

A decision tree is, well, a tree. Every node in this tree is a decision that we take based on the value of an attribute. Based on that decision, we go to different branches. The leaf nodes, i.e. the last nodes in the tree (in the example above, the leaf nodes are Bird, Mammal, and Fish), are the labels that we will give to our input values.

If y = f(x) where x is the input, then y are the labels.
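The Bird/Mammal/Fish tree can be sketched as nested decisions. The two attributes used below (feathers, fins) are assumptions for illustration, since the original figure isn't reproduced here:

```python
def classify_animal(has_feathers, has_fins):
    """A hand-written decision tree: each 'if' is an internal node,
    each returned label a leaf (Bird, Fish, Mammal)."""
    if has_feathers:
        return "Bird"
    if has_fins:
        return "Fish"
    return "Mammal"

print(classify_animal(True, False))   # Bird
print(classify_animal(False, True))   # Fish
print(classify_animal(False, False))  # Mammal
```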

#2 Linear Predictor Function

These are the functions that you are already familiar with.

The linear function is the straight line that cuts through the feature space, helping us differentiate between different classes of instances.

#3 Multivariate Linear Function

A multivariate linear function is a linear function which has more than two variables. As such, the hypothesis space exists in >2-dimensions.

#4 Single Layer Perceptron(SLP)

An SLP is the simplest form of the artificial neural network out there. It is a feed-forward network which is based on a threshold transfer function.

#5 Multi-Layer Neural Network

Multilayer networks solve the classification problem for nonlinear sets by employing hidden layers, whose neurons are not directly connected to the output. They give a way of defining a complex, non-linear form of hypotheses.

What is Bias?

Alright, so as you can imagine by looking at the hypothesis space above, there can be a huge number of possible hypothesis functions. If our model were to go through all of them, it would become extremely time-consuming and inefficient.

How do you zero in on a smaller set of hypothesis functions from the larger hypothesis space? You need to specify a bias. There are two kinds of bias that you can specify:

  • Constraint bias: You are not considering certain hypotheses based on a qualification
  • Preference bias: You have a preference for a particular kind of hypothesis.

So, if you went to a restaurant and told them not to give any desserts because you are staying off sugar, then that will be a constraint bias. However, if you said that you wanted a sirloin steak over desserts because you like steaks, then that is a preference bias.

One of the most famous examples of bias is Occam's Razor, which says that the simplest solution consistent with the data is usually the best one.

Working With Unsupervised Data

Till now, the examples that we have seen deal with supervised data, where everything has already been labeled. However, things change when we deal with unsupervised data.

Let’s revisit what supervised data looks like:

As you can see, in our example above, all the data was classified into labels, in this case, two labels “+” and “-”.

However, what if we are suddenly dealing with unlabelled data?

Like this.

Now, as you can see, there is no way to differentiate the data according to labels. So you either:

  • have data with no labels, or
  • have all the data belonging to the same label

So, what you program your model to do, is to cluster data that it thinks will be similar…something like this:

One of the most interesting applications of unsupervised learning lies in news curation. Go to Google News.

Check this out:

You see, Google gets a bucketload of news every single day. Their model uses an unsupervised learning algorithm, which takes in this barrage of news and clusters it according to relevant topics. In our example above, Google was flooded with news related to ISIS's planned suicide attack in India, and those stories were grouped together in one cluster.

Another interesting use-case is in genetics. The model takes in the data of numerous people and their gene code and clusters them accordingly. So if you look below, it has people with similar gene patterns clustered into one.

Along with clusters, unsupervised learning helps in structuring data. To understand this, let’s understand what the Cocktail Party Problem is.

What is The Cocktail Party Problem?

We are pretty sure you have been to some form of cocktail party or similar gathering. The idea is that when so many people at a party talk at the same time, their voices become muddled into one messy noise. So, how about we create a working model which listens to this noise and structures it into outputs that make sense?

Consider the following.

So, we have a hypothetical cocktail party here with just two speakers: Speaker #1, whom we will call A, and Speaker #2, whom we will call B.

Both of them are speaking at the same time, and we have placed two microphones at varying distances from them. Microphone 1 is closer to A than Microphone 2, and Microphone 2 is closer to B than Microphone 1. So, what are the noises being caught by the two microphones?

  • Microphone #1: A+b
  • Microphone #2: a+B

In the “equation” above:

  • A is the dominant voice of Speaker #1 aka A
  • B is the dominant voice of Speaker #2 aka B
  • a is the distant voice of Speaker #1 aka A
  • b is the distant voice of Speaker #2 aka B

Now we will feed the data from these microphones into our model, and our model will structure this noise into A and B. The model isolates the voice of each person at the party.
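If we assume the mixing weights are known, recovering A and B from the two microphones is just inverting a 2x2 system. Real cocktail-party algorithms (such as independent component analysis) have to estimate those weights from the recordings alone, so treat this as a sketch of the structure of the problem, not of the algorithm. The 0.3 "distant voice" weight is an assumed value:

```python
# Cocktail-party sketch: mic1 = 1.0*A + 0.3*B, mic2 = 0.3*A + 1.0*B.
def unmix(mic1, mic2, near=1.0, far=0.3):
    """Invert the 2x2 mixing matrix to recover the two speakers' signals."""
    det = near * near - far * far
    a_voice = [(near * m1 - far * m2) / det for m1, m2 in zip(mic1, mic2)]
    b_voice = [(near * m2 - far * m1) / det for m1, m2 in zip(mic1, mic2)]
    return a_voice, b_voice

# Fabricate two source signals, mix them, then recover them.
A = [1.0, -1.0, 0.5]
B = [0.2, 0.4, -0.6]
mic1 = [1.0 * a + 0.3 * b for a, b in zip(A, B)]
mic2 = [0.3 * a + 1.0 * b for a, b in zip(A, B)]
rec_A, rec_B = unmix(mic1, mic2)
print([round(v, 6) for v in rec_A])  # [1.0, -1.0, 0.5]
```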

Let's take a look at some more tasks you could tackle with unsupervised learning:

  • Group a pile of emails into clusters (say, spam-like and regular-looking) without any spam labels
  • Given a dataset of patients with no diagnosis labels, group patients with similar characteristics together.

What is Linear Regression?

Linear regression is the first proper machine learning algorithm that we are going to take a look at. Linear regression can be used to solve classic regression problems, which require us to predict continuous values.

We have touched on linear regression before, but we will take a deeper look into it in this section.

Ok, so what you see above is the data that we have fed into our model. The data that we are working with deals with housing prices in Oregon. So, this data has two specific features:

  • Price (the output we want to predict)
  • Size (the input we predict it from)

Using these two features, we have already plotted several of those points in our graph. The idea is to strike a line through all the data such that it will be easy for our model to accurately predict the price of a new house based on the size that we are aiming for.

Quite like this.

In order for our model to do this, it needs to be fed a lot of training data. All the points (or the crosses in this case) that we have plotted in our graph prior to drawing the line, are training data.

So, some of the examples of training data that we have used are:

Size (x)   Price (y)
2104       460
1416       232
1534       315
…          …
(30 more rows)

Let's look at some of the notation we will use:

  • Number of training examples: m
  • Input variables: x
  • Output/target variable: y

In our example above:

  • m = 30
  • Size in feet square is “x”
  • Price in 1000 dollars is “y”

So, a training example is denoted as (x, y), while the ith element of the set is denoted as (x(i), y(i)).

In the table above:

  • x(1) = 2104
  • y(1) = 460

If you remember what we told you about the hypothesis function above, feeding this training data into our model will help in finding our hypothesis set. In our graph above, the following four lines can belong to our hypothesis set:

Our goal is to find the one hypothesis function in this set, which can help our model to make the most accurate predictions.

Before we go about finding this, let’s see what equation we are going to use to represent our linear regression.

Linear Regression Representation

Our linear regression with a single feature is nothing more than a 2-dimensional straight-line equation, represented by y = m*x + c, where:

  • m is the slope of the line
  • c is the y-intercept, or where the line is intersecting the y-axis

When it comes to machine learning terminology, we will rename m and c and represent the line like this: h(x) = θ1*x + θ0. Here we are hoping that our hypothesis output h(x) is as close to y as possible.

Anyway, now let’s take a look at what our graph looks like. So, if θ1 = 3 and θ0 = 2, then our graph will look like this:

If θ1 = 4 and θ0 = 0, then:

And, one final example. If θ1 = 0 and θ0 = 3, then the graph is simply a straight line running across 3 in the y coordinate:

Alright, so that’s about it. Now, let’s go back to our hypothesis set:

Our main job is to find the line which best helps our model in making accurate predictions. In other words, we need to find the values of θ1 and θ0 which will help us make the most optimal choice.

As we have told you before, we need to use a bias here to eliminate all other linear functions, and the bias we are going to use here is the cost function.

What is the Cost Function?

To understand what the cost function will look like, let's do some brainstorming. The idea of our cost function is to make h(x) as close to y as possible, so we need h(x) – y to be as low as possible.

Instead of looking for the least value of h(x) – y, we are going to find the least value of (h(x) – y)^2. The reason why we are doing this will be clear in a second. Now, you will need to find the value of h(x) for every value of x from the training set. So, if x(i) is the ith instance of the training set, then the expression is:

∑(h(x(i))−y(i))^2, where the summation goes from i = 1 to m.

So, now do you see why we took squares of the differences? Suppose we had 5 instances in our training set and the differences came out to -2, -1, 0, 1, and 2. Summing those would give you 0, which wouldn't help us in our calculations.

Anyway, we don’t just want the summation of all the values, we need an average so we will divide our value by m, giving us: J(θ0,θ1)= (∑(h(x(i))−y(i))^2)/m

However, many mathematicians insist on dividing by 2m instead of m: when we differentiate ∑(h(x(i))−y(i))^2, the square brings down an extra factor of 2, which the 2 in 2m cancels.

So, the final cost function is: J(θ0,θ1)= (∑(h(x(i))−y(i))^2)/2m

Simplifying the Equation

The important equations that we have so far are:

  • h(x) = θ1*x + θ0
  • J(θ0,θ1)= (∑(h(x(i))−y(i))^2)/2m

Let's simplify this equation a little bit. We will first set θ0 = 0. This turns our first equation into h(x) = θ1*x, which makes our graph look something like this:

This, in turn, changes our second equation to J(θ1)= (∑(θ1*x(i)−y(i))^2)/2m

Now our job becomes much simpler. What we need to do is use the value of θ1*x(i) to find the corresponding value of J(θ1).

Before we do that, let’s assume that the training set that our model has is as follows:

x y
1 1
2 2
3 3

So, nothing fancy. We have three instances in our training set, making m = 3.

Calculating the Cost Function

So, to do this, let's take an example where θ1 = 1, giving us the following hypothesis: h(x) = x. This makes our graph look like this:

Our three points in the training set are falling pretty straightforwardly on the line aka hypothesis function.

Now, let’s calculate our cost function for θ1 = 1.

J(1)= (∑(x(i)−y(i))^2)/2m

So, the value of m is 3 and i will go from 1-3 giving us:

J(1) = ((x(1)-y(1))^2 + (x(2)-y(2))^2 + (x(3)-y(3))^2)/(2*3)

Referring to our table we know that the value of x(i), y(i) are: (1,1), (2,2), (3,3).

So, J(1) = ((1-1)^2 + (2-2)^2 + (3-3)^2)/6 = (0+0+0)/6 = 0.

Meaning, for θ1 = 1, the value of J(1) = 0.

Now, let’s work with another hypothesis function for the same training set, where θ1 = 0.5.

h(x) = 0.5x

Our graph will now look like this:

Unlike last time, the value of the training set isn’t falling on our hypothesis. So, let’s calculate the cost function:

J(0.5)= (∑(0.5*x(i)−y(i))^2)/6

Substituting the values, you will get:

J(0.5) = [(0.5-1)^2 + (1-2)^2 + (1.5-3)^2]/6 = ~0.58

So, for θ1 = 0.5, the value of J(0.5) is 0.58.

Now, we will take the value of θ1 = 0, which will give us: h(x) = 0.

So, the hypothesis basically lies along the x-axis.

J(0) = [(0-1)^2 + (0-2)^2 + (0-3)^2)]/6 = ~2.3.

Similarly, if you plot more values of J(θ1) and trace the graph, then you will see something like this:

The classic parabola graph. Alright, so when we are done plotting this graph, we need to find the value J(θ1) which is the most minimized. One look at the graph above and you can tell that when θ1=1, the value of J(θ1) is minimal.

So, for this case, our model will have θ1 = 1.
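The three calculations above take only a few lines to reproduce:

```python
def cost(theta1, data):
    """J(theta1) = sum((theta1*x - y)^2) / (2m) for the simplified hypothesis h(x) = theta1*x."""
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

training_set = [(1, 1), (2, 2), (3, 3)]
print(cost(1.0, training_set))              # 0.0
print(round(cost(0.5, training_set), 2))    # 0.58
print(round(cost(0.0, training_set), 2))    # 2.33
```

As expected, the cost is minimized at θ1 = 1, where the line passes through every training point.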

Cost Function With Both Parameters

Ok, so in the previous case, we assumed θ0 to be 0. But in real life situations, we won’t always be that lucky. A more real-life hypothesis function will look like this:

Ok, so not much has changed apart from the linear graph having a y-intercept. However, when we calculate the cost function with both the parameters: J(θ0,θ1)= (∑(h(x(i))−y(i))^2)/2m, it completely changes the second graph:

Our ideal value of cost function will be somewhere in the middle of that demarcated area. Obviously, calculating from a 3D graph is going to be a real pain, so what we are going to do, is to use contour plots. Contour plots are an ingenious way of condensing 3d graphs into 2d graphs.

So, this is what you do to make a contour plot. You take periodic cross-sections from the 3d graph. Each cross section has the same value of J(θ0,θ1). Ok so to summarize, this is what we have so far:

  • A 3d graph with the following parameters: J(θ0,θ1), θ0, and θ1
  • Multiple cross-sections are taken of the entire graph. Each cross-section has different values of θ0 and θ1 but the same value of J(θ0,θ1).

So, what we have right now is different cross-sections. Now, these different cross-sections are sort of smushed together to create a contour which sorta looks like this:

Now, let’s look at the contour plot in detail. Each line in the contour shows one particular cross-section. So, all the points in that particular line will have the same value of cost functions. See the three points on the line below?

They will have the same value of cost function.

So, see the red point on the contour above? If we want to get the values of θ0 and θ1, then we simply need to trace the point to its x and y coordinates.

Quite like this.

So, according to our contour plot, the coordinates of that point are (-0.11, 850), giving us θ1 = -0.11 and θ0 = 850. Now that we have the values of θ0 and θ1, we can plot the linear graph of the hypothesis function: h(x) = θ1x + θ0

Substituting values: h(x) = -0.11x + 850, which gives us this graph:

So, by going through different values in the contour plot we can find the hypothesis function which satisfies most of the values in our training set. Imagine that after going through all the different points in the contour plot, you finally come across a hypothesis which strikes through the majority of the points in our training set, which looks like this:

If you are happy with the result, then you can use the values of θ1 and θ0 to get the proper value of our hypothesis function.

What is Gradient Descent?

If you remember what we taught you before, the reason we use the cost function J(θ0,θ1) is to find the values of θ0 and θ1 that minimize it.

So, let’s see how we can find the minimized value. The simple steps that we are going to use are:

  • Starting with some random (θ0,θ1). We usually start with (0,0).
  • Keep changing the values of (θ0,θ1) until we get a minimized J(θ0,θ1)

Suppose the graph that we have to work with is this:

Like we said before, we are going to initialize the value of (θ0,θ1) as (0,0). See, that cross mark in the graph below? That’s the starting point.

Right, so we have marked the starting point. Now let’s do a simple thought experiment. Imagine that the graph that you see above is a landscape with 3 hills in it. The black cross spot is you, you are standing on top of that hill and your main aim is to find the most optimal way of going down. So, this is how you are going to go about it:

  • You will take a look around to see your best possible path
  • Then you will take a baby step towards that path
  • Repeat the first two steps until you reach your destination.

So, in the example above, the most optimal path that you have taken is as follows:

So, for (θ0,θ1) = (0,0) this is the most ideal descent. However, what if we change the value of the parameters and start somewhere else….say here:

So, if we chart the most optimal path for this, it looks like this:

So, as you can see, as we keep changing the value of our parameters, the starting point on our graph changes accordingly, which changes our optimal path.

Ok, so this is what it looks like image-wise. Now let’s look into the mathematics of it.

Repeat until convergence {

θj := θj − α ∂/∂θj J(θ0, θ1)    (for j=0 and j=1)

}

Ok, so that’s the mathematical equation of the gradient descent.

Now, we know that it looks pretty scary so let’s just go through what is happening here.

The “:=” operator

The “:=” operator is an assignment operator, as opposed to “=”, which is a truth declaration. To understand the difference between the two, just think back to your C++ programming.

When you write a = b, you are overwriting the value of a and assigning the value of b to it.

On the other hand, when you write a == b, you are comparing the values of the two and seeing if they are equal or not.

Same thing with “:=” and “=” in this context.

The Learning Rate α

Now let’s look at α, the learning rate. Let’s understand what that means in the context of trekking. If the value of α is high, then we are doing a very aggressive descent by taking large steps. On the other hand, if α is small, then we are taking very small, cautious steps.

α’s value needs to be carefully calculated. If it is too low, then your model will be very slow. If it is too big, then your model will fail to converge and may even diverge.

The Derivative

The derivative is the ∂/∂θj J(θ0, θ1).

This is basically the rate of change of the cost function. As you make your descent down the graph, the value of your cost function keeps changing. The derivative measures the rate at which this change is happening.

Alright, so now we have all the actors in the play. So let’s take a look at the equation again:

θj:= θj − α ∂/∂θj J(θ0, θ1).

So, the new value of θj is the previous value subtracted by the product of the learning rate and the rate of change. To put in the context of trekking, our new location is our original location minus the product of how large our step is and the rate of change in our descent.

However, there is one more thing that you need to take note of here. We are moving in a 3-dimensional space which includes not just one value of θj but two, θ0 and θ1. So, as we move down, the value of both θ0 and θ1 needs to change simultaneously.

This is called a simultaneous update which looks something like this:

temp0:= θ0 − α ∂/∂θ0 J(θ0, θ1)

temp1:= θ1 − α ∂/∂θ1 J(θ0, θ1)

θ0:= temp0

θ1:= temp1

That is the correct method of simultaneous update. The incorrect method would be something like this:

temp0:= θ0 − α ∂/∂θ0 J(θ0, θ1)

θ0:= temp0

temp1:= θ1 − α ∂/∂θ1 J(θ0, θ1)

θ1:= temp1

Can you guess why?

In the first case, we are storing both the values of θ0 and θ1 in temporary variables and updating them simultaneously.

However, in the second case, we are calculating the second temporary variable with the new value of θ0. This is not really a simultaneous update now, is it? We are updating one variable after another.
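The difference between the two update orders can be seen in a few lines of Python. The gradient functions below are placeholders (not the real linear-regression derivatives), used only to show that the order of evaluation changes the result:

```python
# Placeholder partial derivatives, for illustration only.
def grad0(t0, t1):
    return t0 + t1

def grad1(t0, t1):
    return t0 * t1

alpha = 0.1

def simultaneous_update(t0, t1):
    # Correct: both gradients are evaluated at the OLD (theta0, theta1).
    temp0 = t0 - alpha * grad0(t0, t1)
    temp1 = t1 - alpha * grad1(t0, t1)
    return temp0, temp1

def sequential_update(t0, t1):
    # Incorrect: grad1 already sees the NEW theta0.
    t0 = t0 - alpha * grad0(t0, t1)
    t1 = t1 - alpha * grad1(t0, t1)
    return t0, t1

print(simultaneous_update(1.0, 2.0))  # approximately (0.7, 1.80)
print(sequential_update(1.0, 2.0))    # approximately (0.7, 1.86) -- different!
```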

The Relationship between α and θ1

So, what exactly is a derivative? How do you show it in the graph?

The derivative at a point is the slope of the tangent running through it. Now, what happens when the tangent runs through the local minimum aka the local optimum?

The tangent is horizontal there, so its slope, and with it the derivative, is 0. In other words, if the value of θ1 is at the local minimum, the value of ∂/∂θ1 J(θ0, θ1) becomes zero. So, when we do:

θ1 := θ1 − α ∂/∂θ1 J(θ0, θ1), the value of θ1 remains unchanged.

There is one interesting thing to know here especially when it comes to the relationship between θ and α.

As you keep going down the slope and the value of the derivative approaches 0, the size of each step will keep decreasing, even though α itself stays fixed. Visually it will look something like this:

See how the size of the arrow, aka the step, keeps decreasing.

Bringing Together Gradient Descent and Linear Regression

Ok, so now we have the following formulas.

Gradient Descent algorithm:

Repeat until convergence {

θj := θj − α ∂/∂θj J(θ0, θ1)    (for j=0 and j=1)

}

Linear Regression and cost function:

  • h(x) = θ1*x + θ0
  • J(θ0,θ1)= (∑(h(x(i))−y(i))^2)/2m

The idea is to apply the gradient descent to minimize our cost function.

The key to all this lies in figuring out the value of our derivative: ∂/∂θj J(θ0, θ1)

Substituting the value of J(θ0, θ1) from the cost function formula, we get:

∂/∂θj [(∑(h(x(i)) − y(i))^2)/2m]

Ok, now let’s take our linear regression formula and substitute h(x(i)) with θ1*x(i) + θ0:

∂/∂θj [(∑(θ1*x(i) + θ0 − y(i))^2)/2m]

We only need to figure out the derivative for two values: j=0 and j=1, i.e. θ0 and θ1. The values of the derivatives are as follows:

  • ∂/∂θ0 J(θ0, θ1) = [∑(h(x(i)) − y(i))]/m, where the summation goes from i=1 to m
  • ∂/∂θ1 J(θ0, θ1) = [∑(h(x(i)) − y(i))*x(i)]/m, where the summation goes from i=1 to m

So, now that we know the values of the derivatives, let’s substitute those values in the gradient descent formula for the simultaneous update.

θ0 := θ0 − α[∑(h(x(i)) − y(i))]/m

θ1 := θ1 − α[∑(h(x(i)) − y(i))*x(i)]/m
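Putting the two update rules together gives batch gradient descent for one-variable linear regression. Here is a sketch in plain Python on made-up data generated from y = 2x + 1 (the data, learning rate, and iteration count are all our own choices):

```python
# Hypothetical training set generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = 0.0, 0.0   # start at (0, 0), as in the text
alpha = 0.05                # learning rate
m = len(xs)

for _ in range(5000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                              # d/d(theta0)
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # d/d(theta1)
    # Simultaneous update of both parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # converges close to 1.0 2.0
```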

Convex Functions

The cost function for linear regression is always a convex function, which looks like this:

The speciality of this function is that it doesn’t have any local optima; it has just one global optimum.

Do you know what is the convenience of that?

Instead of having multiple points where the gradient descent may converge on, there is always one point of convergence.

So, how does gradient descent help in minimizing the function? Firstly, we will convert our convex graph into a contour one.

We have selected a point on the graph, which has the coordinates (900,-0.1).

The hypothesis function that corresponds to this is h(x) = -0.1x + 900, which looks like this:

Ok, so from this point, we make our journey towards the convergence.

As you can see, we have made our way to the convergence.

Each and every point in that descent is used as data for the training set which creates their own hypothesis function.

The blue line is the hypothesis function for the convergence coordinates. That is our minimized cost function and hence the final hypothesis.

Whatever we have learned so far is the basics of machine learning. Your understanding of linear regression and gradient descent needs to be clear before we go any further.

Anyway, before we get any further, let’s brush up some mathematics, in particular, linear algebra.

Brushing Up on Linear Algebra

So, let’s start up with matrices and vectors? Do you know what the matrix is?

Ok, no. Not that.

The matrix, in the context of mathematics, is a rectangular array of numbers.

The image above is a matrix. Let’s call this matrix A. This matrix has 4 rows and 4 columns, so it is a 4X4 matrix.

We refer to individual elements in the matrix as Aij where “i” is its row number and “j” is the column number.

So, A11 is 9, A12 is 13, A44 is 10, A34 is 4.

In this article, we are going to denote matrices like this: [(row 1), (row 2), (row 3)…(row n)].

So, the matrix above will be denoted like this: [(9,13,5,2), (1, 11, 7, 6), (3, 7, 4, 1), (6, 0 ,7, 10)].

Ok, so what is a vector?

A vector is an N X 1 matrix, i.e. a matrix with N rows and 1 column.

The image above is an example of a vector. We are going to represent the vector like this [1, 2, 3, 4].

Quick thing to note here, the way we have denoted the vector above is 4X1 notation. If we wanted to denote 1X4 notation then it would look like this: [(1,2,3,4)]. The parenthesis denotes a row.

Since this vector is 4X1 vector, we are going to call it a 4-dimensional vector. A 5X1 vector would be a 5-dimensional vector.

Mathematical Operations with Matrices

#1 Addition

If we were to add two matrices, A and B, the first thing that we need to make sure is that they are of the same dimensions. So, if A is a 2X2 matrix, then B must also be a 2X2 matrix.

Suppose the values in A and B are as follows:

A= [(2,3), (4,5)]

B = [(4,3), (2,1)]

Now, let’s say that these two matrices add up to form a new matrix C which is again a 2X2 matrix. The elements of this added matrix will be as follows:

Cij = Aij + Bij

Meaning: C11 = A11 + B11.

Hence, the value of C is [(2+4, 3+3), (4+2, 5+1)] = [(6,6), (6,6)].

#2 Subtraction

Subtraction happens pretty much the same way that addition does. It is the same process but instead of adding, we are subtracting.

With the same matrices as above, i.e. A and B, we can create a new matrix C such that:

Cij = Aij – Bij

Meaning: C11 = A11 – B11.

Hence, the value of C is [(2-4, 3-3), (4-2, 5-1)] = [(-2, 0), (2, 4)].

#3 Scalar Multiplication

A scalar is a single number which doesn’t belong to a matrix. So “2” is a scalar. Now, what happens when we multiply a matrix with a scalar?

Suppose we multiply a scalar k with the matrix A.

Then the new matrix B will be kA, where Bij = k*Aij.

So, 3 * [(1,2), (3,4)] = [(3,6), (9,12)].

#4 Scalar Division

Quite like multiplication, if we are dividing matrix A with scalar k, then the new matrix B will be A/k, where Bij = Aij/k.

So, [(3,0), (9,12)]/3 = [(1,0), (3,4)]
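The four operations above can be checked with NumPy, which applies them element by element (assuming NumPy is available; the matrices are the ones from the examples):

```python
import numpy as np

A = np.array([[2, 3], [4, 5]])
B = np.array([[4, 3], [2, 1]])

print(A + B)                             # addition: [[6 6] [6 6]]
print(A - B)                             # subtraction: [[-2 0] [2 4]]
print(3 * np.array([[1, 2], [3, 4]]))    # scalar multiplication: [[3 6] [9 12]]
print(np.array([[3, 0], [9, 12]]) / 3)   # scalar division: [[1. 0.] [3. 4.]]
```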

#5 Matrix Multiplication

Two matrices, A and B, can be multiplied with each other if and only if the number of columns in A matches the number of rows in matrix B.

So, suppose A is a 3X4 matrix and B is a 4X2 matrix. The multiplication will be possible because the number of columns in matrix A matches the number of rows in matrix B.

The resulting matrix C will have the same number of rows as matrix A and the same number of columns as matrix B. So C will be a 3X2 matrix.

In order to get the new matrix, you will need to do a dot product of the rows and columns of matrices A and B. The “Dot Product” is where we multiply matching members, then sum them up.

To understand what that means, check this out:

(1, 2, 3) • (7, 9, 11) = 1×7 + 2×9 + 3×11 = 58

So, we match the 1st members (1 and 7), multiply them, likewise for the 2nd members (2 and 9) and the 3rd members (3 and 11), and finally sum them up.

Anyway, applying this to full matrices:

[(1,2,3), (4,5,6)] * [(7,8), (9,10), (11,12)] = [(58, 64), (139, 154)]
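The same product, written with NumPy's @ operator, is a quick way to verify the dot products by hand:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])          # 2x3
B = np.array([[7, 8], [9, 10], [11, 12]])     # 3x2

C = A @ B   # each entry is a row-by-column dot product
print(C)    # [[ 58  64] [139 154]]
```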

This matrix multiplication actually comes in pretty handy when we want to make some predictions based on our hypothesis function and features.

Suppose we have a feature named “House size” which has four training instances: 2104, 1416, 1534, 852.

Along with this, we have a hypothesis function which looks like this: h(x) = -40 + 0.25x

We are going to change the training instances into a 4X2 matrix by adding a 1 at the start of each row, to look something like this:

[(1,2104), (1,1416), (1,1534), (1,852)].

We will multiply this to our hypothesis vector which is a 2X1 matrix which looks like this: [-40,0.25].

So, when you multiply the Training Instance with the parameters, the product matrix is the prediction set.

[(1,2104), (1,1416), (1,1534), (1,852)] * [-40,0.25] = [ 486, 314, 343.5, 173]

So, if we were to make an algorithm out of this:

For i = 1 to m {

    prediction(i) = θ0 + (θ1 * instance(i))

}

This algorithm is extremely helpful when you have to go through thousands of data points instead of just 4.
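Here is the same prediction done as one matrix product in NumPy (the hypothesis h(x) = -40 + 0.25x and the four house sizes come from the text above):

```python
import numpy as np

# 4x2 design matrix: a column of 1s followed by the house sizes.
X = np.array([[1, 2104],
              [1, 1416],
              [1, 1534],
              [1,  852]])
theta = np.array([-40, 0.25])   # [theta0, theta1]

preds = X @ theta
print(preds)   # the prediction set: 486, 314, 343.5, 173
```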

Identity, Inverse, and Transpose Matrix

There are 3 kinds of matrices that you should know about. Let’s start with identity.

Identity Matrix

The identity matrix acts as the “1”.

Let us explain.

When you multiply a number with 1, nothing changes right?

Eg. 4*1 = 4.

Similarly, if I is an identity matrix and A is a normal matrix, then A*I = A or I*A = A.

An identity matrix has 1’s in the diagonal positions of the matrix while the rest of the elements are 0.

So, a 2X2 identity matrix looks like this: [(1,0), (0,1)].

A 3X3 identity matrix looks like this: [(1,0,0), (0,1,0), (0,0,1)].

Inverse Matrix

If A is a matrix then its inverse will be A^-1.

A * A^-1 = I(identity)

The inverse matrix is basically a reciprocal of the original matrix.

Eg. if A is [(3,4), (2,16)] then A^-1 will be [(0.4, -0.1), (-0.05, 0.075)]. When you multiply both of them, the result is [(1,0), (0,1)].

A matrix which doesn’t have an inverse is called a singular/degenerate matrix. [(0,0), (0,0)] is an example of a singular matrix.

Transpose Matrix

A matrix B is said to be a transpose of matrix A if the row elements of A are equal to the column elements of B.

So, if A = [(1,2,3), (4,5,6)] then B will be [(1,4), (2,5), (3,6)].

The transpose of matrix A is denoted as A(t).
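All three matrix types can be verified in a few lines of NumPy, using the 2X2 example from above:

```python
import numpy as np

A = np.array([[3.0, 4.0], [2.0, 16.0]])
I = np.eye(2)                    # 2x2 identity matrix

print(np.allclose(A @ I, A))     # True: A * I = A
A_inv = np.linalg.inv(A)         # [[0.4, -0.1], [-0.05, 0.075]]
print(np.allclose(A @ A_inv, I)) # True: A * A^-1 = I
print(A.T)                       # transpose: rows of A become columns
```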

Linear Regression v2.0. Working with Multiple Variables

So, when we were working with linear regression before, we were only dealing with one feature and the output. However, when it comes to real life practical use-cases, we have to deal with multiple variables.

Going back to our house buying linear regression algorithm? This is what we were working with:

Size	Price
2104	460
1416	232
1534	315
… (30 more instances)

However, it looks more like this:

x1	x2	x3	x4	y (price)
2104	6	2	3	460
1416	2	1	2	232
1534	4	2	4	315
… (30 more instances)

The notations are as follows:

  • m = The total number of training instances, which in this case is 33
  • n = The number of features, which here is 4
  • x(i) = The input features of the ith instance.
  • x(i,j) = The jth feature of the ith instance

So, in the table above:

x(2) = [1416, 2, 1, 2]

x(2,2) = 2, aka the second element of the feature set of x(2).

The Hypothesis Function

The hypothesis function with just one feature looked like this:

h(x)= θ0 + θ1x

However, with multiple features like this, the hypothesis function will look like this:

h(x)= θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4

In order to simplify things, we can actually take this one step further.

We have two matrices.

  • The feature matrix = [x0, x1, x2, x3, x4……xn]
  • The parameter(θ) matrix = [θ0, θ1, θ2, θ3, θ4….θn]

The assumption that we are making here is that x0 = 1.

Now, keeping that assumption in mind, the hypothesis function will look like this:

h(x)= θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + ……. + θnxn


h(x) = [(θ0, θ1, θ2, θ3, θ4….θn)] * [x0, x1, x2, x3, x4……xn]

If you think about it, h(x) is basically a product of the θ(t) matrix (transpose) and the feature matrix.

So, we can conclude that: h(x) = θ(t)x

This is called the multivariate hypothesis function.
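As a quick sketch, the multivariate hypothesis is just a dot product between the parameter vector and the feature vector (the θ values and features below are made up):

```python
import numpy as np

theta = np.array([80.0, 0.1, 3.0, 0.01, -2.0])   # theta0 .. theta4 (hypothetical)
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])      # x0 = 1, then the four features

h = theta @ x   # theta-transpose times x
print(h)
```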

Gradient Descent For Multiple Variables

So we have seen the hypothesis function with multiple parameters, the cost function for this is as follows:

J(θ0,θ1,θ2,θ3…..θn)= (∑(h(x(i))−y(i))^2)/2m

We can condense θ0,θ1,θ2,θ3…..θn into just “θ”, an n+1 dimensional vector (since we are starting from 0 and going to n).

So the cost function can be condensed into: J(θ)= (∑(h(x(i))−y(i))^2)/2m

Now the gradient descent equation looks like this:

Repeat until convergence {

θj := θj − α ∂/∂θj J(θ0,θ1,θ2,θ3…..θn)    (simultaneously for j=0, j=1, j=2……j=n)

}


Once again, J(θ0,θ1,θ2,θ3…..θn) gets condensed to J(θ) where, once again, “θ” is an n+1 dimensional vector.

Previously when we were dealing with two parameters θ0 and θ1, we were only doing two simultaneous updates which looked like this:

temp0:= θ0 − α  [∑(h(x(i)) – y(i))]/m

temp1:= θ1 − α [∑(h(x(i)) − y(i))*x(i)]/m

θ0:= temp0

θ1:= temp1

However, we are dealing with multiple parameters here. So, the simultaneous update doesn’t happen for just 2 parameters, it happens for all n+1 parameters (0 to n). So the algorithm looks like this:

Repeat until convergence {

θj := θj − α [∑(h(x(i)) − y(i))*x(i,j)]/m    (simultaneously updated for j=0……n)

}

So, during this loop, the steps look like this:

θ0 := θ0 − α [∑(h(x(i)) − y(i))*x(i,0)]/m (where x(i,0) = 1)

θ1 := θ1 − α [∑(h(x(i)) − y(i))*x(i,1)]/m

θ2 := θ2 − α [∑(h(x(i)) − y(i))*x(i,2)]/m

…

θn := θn − α [∑(h(x(i)) − y(i))*x(i,n)]/m
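With NumPy, the whole simultaneous update for j = 0…n collapses into one vectorized line. A sketch on made-up data (y = 2x + 1, one feature plus the x0 = 1 column):

```python
import numpy as np

# Design matrix with x(i,0) = 1 in the first column; data is hypothetical.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

theta = np.zeros(X.shape[1])   # one parameter per column, all starting at 0
alpha, m = 0.05, len(y)

for _ in range(5000):
    errors = X @ theta - y                        # h(x(i)) - y(i) for every i
    theta = theta - alpha * (X.T @ errors) / m    # all theta_j updated at once

print(theta.round(3))   # converges close to [1. 2.]
```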

Feature Scaling

Before we even initiate gradient descent in our multivariate equation, there is one important thing that you must keep in mind.

Remember our contour graphs?

The thing is that when we have so many features with such diverse values, then the contour graph becomes increasingly convoluted.

Check out the following:

The contour on the right is the one that we desire, it is well rounded and the path that we need to trace to the global optima is manageable.

The contour on the left is, however, what we will get if we try to blindly plot a real-life feature set. The graph is way too elongated and the path that we will take is convoluted and messy.

This is where we use feature scaling to make our values more manageable. Let us show you a little example to see what we mean by that:

Let’s go back to our housing example. Let’s just work with two features, for now, x1 and x2, where x1 is the size of the house varying from 0-2000 square feet and x2 is the number of floors in the house ranging from 1-5. Obviously, the discrepancy between 2000 and 5 is huge which would lead to weird contour graphs.

So, we need to cut them down to size.

To cut down the features into a scale that is more measurable, we are going to divide each feature by its maximum value.

So, since x1 varies from 0-2000, we will be measuring x1/2000. And similarly, instead of x2 we will be measuring x2/5.

Doing this, the value of x1 will now go from 0 <= x1<= 1 and the value of x2 will also go from 0 <= x2 <= 1.

When it comes to feature scaling, the idea is to make sure that the value of xi approximately comes in between -1 and 1.

So, -1 <= xi <= 1

Since we automatically assume that the value of x0 is 1, it already comes within that range. However, keep in mind that this is an approximate range. So, if the value of xi ranges between 0 and 2, it is still ok.

The problem arises when we are dealing with features that may, for example, range between -2000 and 5000.

Another elegant method of handling feature scaling, apart from simply dividing with the largest value is to do mean normalization.

What is Mean Normalization?

In mean normalization, we subtract the mean value of a feature from each instance value of that feature and then divide the result by the maximum value.

So, in the example above, x2 is the “number of floors” feature which can take values between 1-5. Suppose the mean value of this feature is 2 (meaning the houses in our set have 2 floors on average). In that case, instead of just doing x2/5 as we did earlier, we are going to do (x2-2)/5.

With mean normalization, our goal is to land somewhere in the range of -0.5 <= xi <= 0.5.
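A small sketch of mean normalization for the “number of floors” feature (the sample values are invented; the 1-5 range and the mean of 2 come from the text):

```python
floors = [1, 2, 2, 3, 2]           # hypothetical instances, all in the 1-5 range

mean = sum(floors) / len(floors)   # 2.0, matching the text's example
max_val = 5                        # maximum of the 1-5 range
scaled = [(f - mean) / max_val for f in floors]
print(scaled)   # [-0.2, 0.0, 0.0, 0.2, 0.0] -- inside -0.5 <= xi <= 0.5
```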

Customized Features and Hypothesis

One of the coolest things that you can do is to customize your own features. Consider the example of the house pricing. Let’s say that we are working with two features: plot length(x1) and plot depth(x2).

So, the hypothesis function looks like this: h(x) = θ0 + θ1*x1 + θ2*x2


However, instead of making such a needless complication, why not combine the length and depth into just one feature, aka area (length × depth)?

This will give us a new feature called area, which gives us a far more manageable hypothesis function: h(x) = θ0 + θ1*x1

Apart from this, there is another cool piece of customization that we can do:

Consider this plot. Do you see how the instances are plotted in a way which will make it difficult for a single linear function to cut through a majority of the values? Check out the image below and you can see what we mean:

However, you don’t really need to make your hypothesis function a straight line! You can actually use a quadratic function, aka a parabolic function, to cover a majority of the instances like this:

This turns your hypothesis function into this: h(x) = θ0 + θ1*x + θ2*x^2.

However, a parabolic function may not be the best option here. Can you tell why? Check this out:

Do you know what this graph says if we are dealing with our Price-Size function? It basically says that as the size of the house increases, after some time, its price may actually decrease!

Obviously, this is not ideal, so, what you can do is to use a cubic function.

As you can see, the cubic function suits this set of instances much better. Our hypothesis function will now look like this: h(x) = θ0 + θ1*x + θ2*x^2 + θ3*x^3

Cool, so it is all good now right?

Well…not quite.

You see, we are using a feature “size” in three different forms, as x, as x^2, and as x^3.

The thing is, suppose the range of x goes up to 1000. Then x^2 will go up to 10^6 and x^3 up to 10^9!

That’s a huge range, and we will have to do some strict feature scaling to make sure that the contour plot doesn’t go haywire.


We can also choose a square root function to kill two birds with one stone.

Not only are we satisfying the instance set, but we have a feature set that won’t lead to convoluted contour plots.

Our hypothesis function will now look like this: h(x) = θ0 + θ1*x + θ2*x^0.5
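To see why feature scaling matters for these customized features, here is a sketch that builds x and x^2 columns from made-up house sizes and scales each column by its maximum:

```python
import numpy as np

sizes = np.array([500.0, 1000.0, 1500.0, 2000.0])   # hypothetical sizes

# x ranges up to 2e3, while x^2 ranges up to 4e6 -- wildly different scales.
poly = np.column_stack([sizes, sizes ** 2])
scaled = poly / poly.max(axis=0)   # divide each column by its maximum

print(scaled.max(axis=0))   # both features now live in (0, 1]
```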

The Normal Equation Method

So far we have seen that in order to solve for θ you can use the gradient descent method. However, there is one more method that you can use to solve for θ analytically, and it is called the normal equation method.

Let’s take an example set of our house pricing instances.

1 2981 5 2 15 450
1 1432 2 1 12 245
1 3421 7 3 20 600
1 2413 3 1 40 300

So, in this case, since there are 4 instances, m=4. We all know that x0 is, by default, 1.

If we make a feature set for this, it will look something like this. Let’s call this feature set X.

X = [(1, 2981, 5, 2, 15), (1, 1432, 2, 1, 12), (1, 3421, 7, 3, 20), (1, 2413, 3, 1, 40)].

The Y vector will look like this:

Y = [450, 245, 600, 300]

The formula for θ will look like this: θ = (Xt*X)^-1 * Xt * Y.

In words: take the product of the transpose of X and X, invert that product, and then multiply the inverse by the transpose of X and by the target vector Y.

If we were to use this formula in Octave, then the code will look like this: pinv(X'*X) * X' * Y

In Octave, X' is the transpose of X. pinv() is the function that computes the (pseudo-)inverse of a matrix. So pinv(X'*X) will be the inverse of the product of the transpose of X and X.
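The same one-liner translated to NumPy, where np.linalg.pinv plays the role of Octave's pinv(). The data is a tiny made-up set generated by y = 1 + 2x, so we expect θ ≈ [1, 2]:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])   # x0 column of 1s
y = np.array([3.0, 5.0, 7.0, 9.0])                               # y = 1 + 2*x1

# theta = pinv(X' * X) * X' * y  -- no alpha to pick, no iterations to run.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta.round(3))   # close to [1. 2.]
```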

Unlike gradient descent, the normal equation doesn’t really require feature scaling.

The differences between gradient descent and the normal equation look like this:

Gradient Descent:
  • You need to choose the value of α
  • Requires multiple iterations
  • Works well even when there are many features (>10^6)

Normal Equation:
  • No need to choose the value of α
  • Doesn’t require iteration
  • Use this when there are relatively fewer features

There are certain cases where normal equation fails. It usually happens when the product of Xt and X is non-invertible aka you can’t get its inverse.

This happens in usually two cases:

  • You are using redundant features (eg. x1= height in feet, x2= height in inches)
  • You are using more features than you have instances. In this case, the only way you can go forward is by deleting some of the features or by using regularization (more on this later).


So, that’s it with part 1. We have covered a lot of topics today. Please take your time to go through everything. We know that this could be overwhelming so try to go through it section-by-section. We are going to be back soon with part 2!
