# Gradient descent: an introduction

A crash course on gradient descent.

This will be only an introduction, because I don't have much energy to spare right now. I decided to split this entry into two parts instead of writing the more comprehensive text I originally planned. Next week I'll show how I implemented gradient descent.

Finally, I understand gradient descent better than I did when I first heard of it, back when I began doing machine learning.

## Gradient descent

The update rule that defines gradient descent is

$$x_{n + 1} = x_n - \gamma \nabla F(x_n)$$

where:

- $x_n$ is the current point
- $x_{n + 1}$ is the next point
- $\gamma$ is the step size (also called the learning rate), which controls how far each update moves
- $\nabla{F}$ is the gradient of a function $F$ (for a function of one variable, its derivative)
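To make the update rule concrete, here is a minimal sketch in Python that repeatedly applies $x_{n + 1} = x_n - \gamma \nabla F(x_n)$. The function $F(x) = (x - 3)^2$, the starting point, the step size and the number of steps are all assumptions picked for illustration; they are not taken from any particular implementation.

```python
def gradient_descent(grad, x0, gamma=0.1, steps=50):
    """Apply x_{n+1} = x_n - gamma * grad(x_n) repeatedly and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - gamma * grad(xs[-1]))
    return xs


# Hypothetical example function (an assumption): F(x) = (x - 3)^2.
def F(x):
    return (x - 3.0) ** 2


# Its derivative, worked out by hand: F'(x) = 2 * (x - 3).
def grad_F(x):
    return 2.0 * (x - 3.0)


xs = gradient_descent(grad_F, x0=0.0, gamma=0.1, steps=50)
values = [F(x) for x in xs]  # F evaluated at each successive point
```

With this step size the iterates move toward the minimum at $x = 3$ and the entries of `values` never increase. A step size that is too large for the function at hand can overshoot the minimum and make the values grow instead.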

For a sufficiently small step size $\gamma$, the update rule gives $F(x_n) \geq F(x_{n + 1})$, which is to say that repeated updates yield progressively smaller values of $F$ [^{1}]. As such, gradient descent can be used to minimize a cost function $J$. The mean squared error cost function, which operates on a set of $n$ observations, is one such function. It is defined as

$$J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i = 1}^{n} \left( h_\theta(x_i) - \hat{Y_i} \right)^2$$

where $h_\theta(x_i)$ is the hypothesis, that is, the value predicted for the $i$-th observation, and $\hat{Y_i}$ is the value actually observed, which may differ from the prediction.

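When gradient descent is used to fit a model, the simplest hypothesis takes the same form as a simple linear regression line, $h_\theta(x) = \theta_0 + \theta_1 x$, with $\theta_0$ (the intercept) and $\theta_1$ (the slope) as the parameters to be learned. The same idea carries over to the weights of a neural network. Note that $\theta_0$ and $\theta_1$ are parameters; the step size $\gamma$ is a hyperparameter.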
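As a small illustration of the cost function, the sketch below computes the mean squared error for a straight-line hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$. It assumes the common $\frac{1}{n}$ normalization (some texts use $\frac{1}{2n}$ instead), and the function names and toy data are made up for the example.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def mse_cost(theta0, theta1, xs, ys):
    """Mean squared error between predictions and observed values (1/n normalization assumed)."""
    n = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / n


# Toy data (an assumption) generated from the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_cost = mse_cost(1.0, 2.0, xs, ys)  # parameters that match the data exactly
worse_cost = mse_cost(0.0, 0.0, xs, ys)    # a flat line at zero fits poorly
```

The cost is zero when the line passes through every observation and grows as the predictions drift away from the observed values, which is what makes it a sensible quantity to minimize.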

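With the $\frac{1}{n}$ normalization of the mean squared error and the straight-line hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$, the partial derivatives of the cost function $J(\theta_0, \theta_1)$ are

$$\frac{\partial J}{\partial \theta_0} = \frac{2}{n} \sum_{i = 1}^{n} \left( h_\theta(x_i) - \hat{Y_i} \right)$$

$$\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i = 1}^{n} \left( h_\theta(x_i) - \hat{Y_i} \right) x_i$$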
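Putting the pieces together, here is a rough sketch of how these partial derivatives could drive gradient descent for a straight-line fit. It assumes the $\frac{1}{n}$ mean squared error and the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$. The function names, toy data, step size and iteration count are illustrative assumptions.

```python
def mse_gradients(theta0, theta1, xs, ys):
    """Partial derivatives of the 1/n mean squared error for h(x) = theta0 + theta1 * x."""
    n = len(xs)
    residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = 2.0 / n * sum(residuals)
    d_theta1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_theta0, d_theta1


def fit_line(xs, ys, gamma=0.05, steps=2000):
    """Run gradient descent on both parameters, starting from (0, 0)."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        d0, d1 = mse_gradients(theta0, theta1, xs, ys)
        # Update both parameters simultaneously using the same gradients.
        theta0, theta1 = theta0 - gamma * d0, theta1 - gamma * d1
    return theta0, theta1


# Toy data (an assumption) generated from the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = fit_line(xs, ys)
```

On this toy data the fitted parameters approach an intercept of 1 and a slope of 2. The step size here was chosen small enough for this particular dataset; other data may need a different value.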

That'll be all for now. Next week, I'll expand upon this and show how I implemented gradient descent.