Understanding Gradient Descent: SSR example
As we mentioned in the Fitting Functions to Data blog, (Supervised) Machine Learning algorithms are built from the following components:
Training Function (Model)
Loss Function
Optimization
In order to optimize the loss function we have two alternatives; in this post we use Gradient Descent.
Gradient Descent \(\vec{w}^{i+1} = \vec{w}^{i} - \eta \nabla L(\vec{w}^{i})\)
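As a rough sketch of what this update rule looks like in code (assuming the gradient is available as a Python callable; the names `grad_L`, `w0`, `eta`, and `n_steps` are just illustrative):

```python
import numpy as np

def gradient_descent(grad_L, w0, eta=0.1, n_steps=100):
    """Repeatedly apply the update w <- w - eta * grad_L(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_L(w)  # one gradient descent step
    return w
```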
Simple Example
Data
The data set consists of three points: \((x_1, y_{\text{true}}) = (0.5, 1.4),\ (2.3, 1.9),\ (2.9, 3.2)\).
Linear Algebra notation:
\[\text{Sum of Squared Residuals (SSR)} = (\vec{y}_{\text{true}} - \vec{y}_{\text{predicted}})^t (\vec{y}_{\text{true}} - \vec{y}_{\text{predicted}})\]
Since this example has only three data points, we'll use the summation notation:
\[\text{Sum of Squared Residuals (SSR)} = \sum_{i=1}^{3} (y^i_{\text{true}} - y^i_{\text{predicted}})^2\]
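Both notations compute the same quantity. Here is a small NumPy sketch checking that, using the three data points of this example and, purely as an illustration, the predictions produced by the line \(0 + 0.64 \cdot x\):

```python
import numpy as np

y_true = np.array([1.4, 1.9, 3.2])        # observed targets
y_pred = np.array([0.32, 1.472, 1.856])   # example predictions (0 + 0.64 * x)

# Linear algebra notation: (y_true - y_pred)^t (y_true - y_pred)
residuals = y_true - y_pred
ssr_vector = residuals @ residuals

# Summation notation: sum of squared residuals over the three points
ssr_sum = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))

print(ssr_vector, ssr_sum)  # both print the same value
```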
Since \(y_{\text{predicted}} = w_0 + w_1\cdot x_1\) and we fix the slope at \(w_1 = 0.64\), only the intercept \(w_0\) remains to be optimized:
\[\text{Sum of Squared Residuals (SSR)} = (1.4 - (w_0 + 0.64 \cdot x^1_1))^2 + (1.9 - (w_0 + 0.64 \cdot x^2_1))^2 + (3.2 - (w_0 + 0.64 \cdot x^3_1))^2\]
\[\text{Sum of Squared Residuals (SSR)} = (1.4 - (w_0 + 0.64 \cdot 0.5))^2 + (1.9 - (w_0 + 0.64 \cdot 2.3))^2 + (3.2 - (w_0 + 0.64 \cdot 2.9))^2\]
\[L(w_0) = (1.4 - (w_0 + 0.64 \cdot 0.5))^2 + (1.9 - (w_0 + 0.64 \cdot 2.3))^2 + (3.2 - (w_0 + 0.64 \cdot 2.9))^2\]
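To make the expression above concrete, it can be written as a plain Python function of the single unknown \(w_0\) (a sketch; the slope is held fixed at 0.64 as assumed above):

```python
def loss(w0, w1=0.64):
    """SSR as a function of the intercept w0, with the slope w1 held fixed."""
    x = [0.5, 2.3, 2.9]
    y = [1.4, 1.9, 3.2]
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))

print(loss(0.0))  # SSR when the intercept is still 0
```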
The derivative of our loss function
Applying the chain rule to \(L(w_0)\):
\[\frac{\partial L(w_0)}{\partial w_0} = -2(1.4 - (w_0 + 0.64 \cdot 0.5)) - 2(1.9 - (w_0 + 0.64 \cdot 2.3)) - 2(3.2 - (w_0 + 0.64 \cdot 2.9))\]
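As a sanity check on the algebra, the analytic derivative can be compared with a numerical finite-difference estimate; this is only a verification sketch, not part of the derivation:

```python
def loss(w0, w1=0.64):
    x, y = [0.5, 2.3, 2.9], [1.4, 1.9, 3.2]
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))

def d_loss(w0, w1=0.64):
    """Analytic derivative of the SSR with respect to w0 (chain rule)."""
    x, y = [0.5, 2.3, 2.9], [1.4, 1.9, 3.2]
    return sum(-2 * (yi - (w0 + w1 * xi)) for xi, yi in zip(x, y))

h = 1e-6
print(d_loss(0.0))                                # analytic: about -5.7
print((loss(0.0 + h) - loss(0.0 - h)) / (2 * h))  # numerical: about -5.7
```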
\(1^{st}\) Iteration
We need to initialize \(w_0\) to some value. Let's take \(w_0^{(0)} = 0\). Substituting \(w_0 = 0\) into the derivative:
\[\frac{\partial L(w_0)}{\partial w_0} = -2(1.4 - (0 + 0.64 \cdot 0.5)) - 2(1.9 - (0 + 0.64 \cdot 2.3)) - 2(3.2 - (0 + 0.64 \cdot 2.9)) = -5.7\]
Calculating the step size with a learning rate of \(\eta = 0.1\):
\[\text{Step Size} = \eta \nabla L(\vec{w}^{i})\]
\[\text{Step Size} = 0.1 \cdot (-5.7) = -0.57\]
\[w_0^{(1)} = w_0^{(0)} - \text{Step Size} = 0 - (-0.57) = 0.57\]
The \(w_0\) value updates to \(0.57\).
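The same first iteration, spelled out in code (a sketch reusing the derivative from above, with the learning rate \(\eta = 0.1\) used in the step-size calculation):

```python
def d_loss(w0, w1=0.64):
    x, y = [0.5, 2.3, 2.9], [1.4, 1.9, 3.2]
    return sum(-2 * (yi - (w0 + w1 * xi)) for xi, yi in zip(x, y))

eta = 0.1                   # learning rate
w0 = 0.0                    # initial intercept

gradient = d_loss(w0)       # about -5.7
step_size = eta * gradient  # about -0.57
w0 = w0 - step_size         # about  0.57

print(gradient, step_size, w0)
```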
\(2^{nd}\) Iteration
Substituting the updated value in the derivative function:
\[\frac{\partial L(w_0)}{\partial w_0} = -2(1.4 - (0.57 + 0.64 \cdot 0.5)) - 2(1.9 - (0.57 + 0.64 \cdot 2.3)) - 2(3.2 - (0.57 + 0.64 \cdot 2.9)) = -2.3\]
\[\text{Step Size} = 0.1 \cdot (-2.3) = -0.23\]
\[w_0^{(2)} = 0.57 - (-0.23) = 0.8\]
The \(w_0\) value updates to \(0.8\).
\(3^{rd}\) Iteration
Substituting the updated value in the derivative function:
\[\frac{\partial L(w_0)}{\partial w_0} = -2(1.4 - (0.8 + 0.64 \cdot 0.5)) - 2(1.9 - (0.8 + 0.64 \cdot 2.3)) - 2(3.2 - (0.8 + 0.64 \cdot 2.9)) = -0.9\]
\[\text{Step Size} = 0.1 \cdot (-0.9) = -0.09\]
\[w_0^{(3)} = 0.8 - (-0.09) = 0.89\]
The \(w_0\) value updates to \(0.89\).
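Putting the updates into a loop reproduces the three iterations above (0.57, 0.80, 0.89) and, if run a little longer, shows \(w_0\) settling near 0.95, the value that minimizes this SSR. Again, this is just a sketch under the same data and learning rate:

```python
def d_loss(w0, w1=0.64):
    x, y = [0.5, 2.3, 2.9], [1.4, 1.9, 3.2]
    return sum(-2 * (yi - (w0 + w1 * xi)) for xi, yi in zip(x, y))

eta = 0.1
w0 = 0.0
for i in range(1, 11):
    step_size = eta * d_loss(w0)
    w0 = w0 - step_size
    print(f"iteration {i}: w0 = {w0:.4f}")
# first three lines: 0.5704, 0.7986, 0.8898 ... converging toward about 0.95
```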

Acknowledgements
This blog is inspired by the YouTube video Gradient Descent, Step-by-Step by StatQuest with Josh Starmer. BAM!