Gradient descent: an introduction

A crash course on the nature of gradient descent.

This will just be a short introduction, since I don’t have much energy to spare right now. I decided to split this entry into two parts rather than writing the single, more comprehensive post I had originally planned: the theory now, and my implementation next week.

Finally, I understand gradient descent better than I did when I first heard of it as I was starting out in machine learning.

Gradient descent

An equation that describes the nature of gradient descent:

$$x_{n + 1} = x_n - \gamma \nabla F(x_n)$$

where:

  • $x_n$ is some point
  • $x_{n + 1}$ is a successive point
  • $\gamma$ is the step size, often called the learning rate, which controls how far each update moves along the negative gradient
  • $\nabla{F}$ is the gradient or derivative of a function $F$
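To make the update rule concrete, here is a minimal sketch in Python; the example function $F(x) = x^2$, the starting point, and the step size are my own illustrative choices:

```python
def gradient_descent(grad_F, x, gamma=0.1, steps=50):
    """Repeatedly apply the update x_{n+1} = x_n - gamma * grad_F(x_n)."""
    for _ in range(steps):
        x = x - gamma * grad_F(x)
    return x

# Minimize F(x) = x^2, whose gradient is 2x.
# Each step shrinks x toward 0, the minimizer of F.
print(gradient_descent(lambda x: 2 * x, x=5.0))  # ~0.0
```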

The above equation suggests that $F(x_n) \geq F(x_{n + 1})$, which is to say that each successive update of this form yields a value of $F$ no larger than the last [1]. As such, gradient descent can be used to minimize a cost function $J$. The mean squared error cost function, which operates on a set of $n$ observations, is one such function. It is defined as

$$J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i = 1}^{n} \left( h_\theta(x_i) - \hat{Y_i} \right)^2$$

where $h_\theta(x)$ is a hypothesis and $\hat{Y_i}$ is the actually observed value, which may differ from what the hypothesis predicts.
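As a sketch, computing this cost in Python with NumPy might look like the following; the sample arrays are made-up values for illustration:

```python
import numpy as np

def mse_cost(predictions, observed):
    """Mean squared error over n observations: the average squared
    difference between hypothesis outputs and observed values."""
    return np.mean((predictions - observed) ** 2)

observed = np.array([1.0, 3.0, 5.0, 7.0])      # the Y_hat_i values
predictions = np.array([1.1, 2.9, 5.2, 6.8])   # the h_theta(x_i) values
print(mse_cost(predictions, observed))          # small: hypothesis fits well
```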

The hypothesis here takes the same form as a simple linear regression line, $h_\theta(x) = \theta_0 + \theta_1 x$, with $\theta_0$ and $\theta_1$ as the parameter values that gradient descent tunes; the same idea carries over to selecting the weights of a neural network.

The partial derivatives of the mean squared error cost function $J(\theta_0, \theta_1)$ are

$$\frac{\partial J}{\partial \theta_0} = \frac{2}{n} \sum_{i = 1}^{n} \left( h_\theta(x_i) - \hat{Y_i} \right)$$

$$\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i = 1}^{n} \left( h_\theta(x_i) - \hat{Y_i} \right) x_i$$
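Putting the pieces together, here is a rough sketch of what a full gradient descent loop for this linear hypothesis could look like; the data, learning rate, and iteration count are arbitrary illustrative choices, separate from the implementation coming next week:

```python
import numpy as np

def fit_line(x, y, gamma=0.05, steps=2000):
    """Gradient descent on J(theta0, theta1) for the hypothesis
    h_theta(x) = theta0 + theta1 * x, using the partial derivatives above."""
    theta0, theta1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (theta0 + theta1 * x) - y       # h_theta(x_i) - Y_hat_i
        grad0 = (2.0 / n) * np.sum(error)       # dJ/dtheta0
        grad1 = (2.0 / n) * np.sum(error * x)   # dJ/dtheta1
        theta0 -= gamma * grad0                 # the update rule, applied
        theta1 -= gamma * grad1                 # to each parameter
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                # points on the line y = 1 + 2x
print(fit_line(x, y))            # converges toward (1.0, 2.0)
```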

That’ll be all for now. Next week, I’ll expand upon this and show how I implemented gradient descent.


Footnotes

[1] This holds provided the step size $\gamma$ is small enough; with too large a step, an update can overshoot the minimum and actually increase $F$.

Written on May 18, 2018