Beginning of deep learning

AI is impacting our lives more and more, from image classification to ChatGPT, and the field is evolving quickly. Most modern AI models are built with deep learning techniques, which train on large amounts of data to produce better predictions.

A large deep learning model is very complex, but a small one is relatively easy to understand and can still do many things, such as image classification, sales forecasting, and recommendation.

Perceptron

The basic building block of deep learning is the perceptron, which takes input data through a linear equation and produces a prediction output. For instance, an output of 0.3 could be the probability that an image shows a cat. Even a single perceptron can make a prediction. To keep things simple, we can create a perceptron with 2 inputs and 1 output, which forms the equation $f(x) = w_1x_1 + w_2x_2 + b$ as in the following diagram:

A perceptron with 2 inputs.

Here $x_1, x_2$ are the input data, and $w_1, w_2, w_b$ are the weights, including the weight of the bias; they are updated during training so that the equation produces more accurate outputs. The bias can also be written as $b = w_b \cdot 1$.

How can it be used to make predictions? Suppose we have a set of data points with kilograms and heights, labeled as either cats or mice. Plotted on a graph, the equation draws a line that splits the data into the two groups, a task the perceptron can perform: if $f(x) \ge 0$ then it is a cat, otherwise it is a mouse.
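As a rough sketch of this idea in Python (the weight values and example inputs below are made up for illustration, not trained):

```python
# A minimal perceptron with 2 inputs: f(x) = w1*x1 + w2*x2 + b.
# Made-up weights; x1 is kilograms, x2 is height in cm.
w1, w2, b = 0.9, 0.7, -6.0

def f(x1, x2):
    return w1 * x1 + w2 * x2 + b

def classify(x1, x2):
    # If f(x) >= 0 we call it a cat, otherwise a mouse.
    return "cat" if f(x1, x2) >= 0 else "mouse"

print(classify(4.5, 25.0))  # heavier and taller -> "cat"
print(classify(0.03, 6.0))  # lighter and shorter -> "mouse"
```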

Loss function

Continuing with the model, we can define:

$$\hat{y} = \begin{cases} 1 &\text{if } f(x) \ge 0 \\ 0 &\text{if } f(x) < 0 \end{cases}$$

meaning it is a cat if $\hat{y} = 1$, and it is not a cat (in our case, a mouse) if $\hat{y} = 0$. But how do we get the weight values for the perceptron so that its equation splits the data? We could assume that the larger the output of $f(x)$, the more likely it is a cat, and use that to train the model. But there is a problem when some outputs are very large and others very small. We need a function to normalize them; a commonly used one is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
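Here is a quick Python sketch of the sigmoid:

```python
import math

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-5))  # close to 0
print(sigmoid(0))   # exactly 0.5
print(sigmoid(5))   # close to 1
```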

The output of the sigmoid function can be interpreted as a probability. Now we can apply the sigmoid to the perceptron's output to get the probability of being a cat:

$$P(C) = \hat{y} = \sigma(f(x))$$

The probability of the model correctly predicting $m$ items is:

$$P = \prod_{i=0}^m y_i\cdot\hat{y}_i + (1-y_i)(1-\hat{y}_i)$$

Where $y_i$ is the expected result, either 1 or 0, meaning it is a cat or a mouse. If the model's prediction is the opposite of $y_i$, then the contributed probability is $1-\hat{y}_i$. But there is a problem: with many predictions, $P$ becomes very small. To solve this, we can take the $\log$ of $P$, so that:

$$\log(P) = \sum_{i=0}^m y_i\cdot\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)$$

The result is negative, since every sigmoid output is greater than 0 and less than 1. We can negate it and also take the average:

$$-\frac{1}{m}\sum_{i=0}^m y_i\cdot\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)$$

The value of this formula indicates how well a model performs. If the value is high, the model did not predict well; if the value is low, the model predicted well. This formula is binary cross-entropy, a kind of cross-entropy loss function for measuring loss, which can also help optimize the model.
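Here is a small Python sketch of binary cross-entropy over a batch of predictions; the labels and probabilities below are made-up numbers, and the small `eps` is only there to guard against taking the log of zero:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Average of -[y*log(y_hat) + (1-y)*log(1-y_hat)] over all items.
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        y_hat = min(max(y_hat, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / len(y_true)

good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])  # mostly correct -> low loss
bad = binary_cross_entropy([1, 0, 1], [0.2, 0.9, 0.3])   # mostly wrong -> high loss
print(good, bad)
```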

Training

Now we have a loss function that tells us how good a model is, and it can also help to optimize the model. How is that possible? What we need to do is adjust the weights of the model so that the loss function evaluates to near zero. We can update the model's weights so that $\sigma(f(x))$ outputs values greater than or equal to 0.5, meaning a cat, for larger heights and kilograms as inputs, and vice versa for a mouse. But how much should we adjust the weights? The answer is to use the derivative. For example, consider the function $f(x) = x^2$, whose derivative is $2x$. Adding the derivative multiplied by a small difference in $x$ gives ${x_1}^2 \approx {x_0}^2 + 2x_0(x_1-x_0)$. The derivative indicates the direction and magnitude of the change, so we can step in the opposite direction of the derivative, $x_1 = x_0 - \alpha f'(x_0)$ (in this case $x_1 = x_0 - 2\alpha x_0$), to move $x$ toward the minimum of the function. (This is called steepest descent or gradient descent; there are other ways to update weights.)
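Here is a tiny sketch of gradient descent on $f(x) = x^2$; the starting point, learning rate, and number of steps are arbitrary choices:

```python
def gradient_descent_on_square(x0=5.0, learning_rate=0.1, steps=50):
    # f(x) = x^2 has derivative f'(x) = 2x; step against the derivative.
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(gradient_descent_on_square())  # approaches 0, the minimum of x^2
```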

Now we can use the derivative to calculate the gradient of the loss function. To simplify, let:

$$L = -y \cdot \log(\hat{y}) - (1-y)\log(1-\hat{y})$$

Take its partial derivative with respect to a weight $w_j$:

$$\begin{align*} \frac{\partial L}{\partial w_j} &= \frac{\partial}{\partial w_j}\left[-y \cdot \log(\hat{y}) - (1-y)\log(1-\hat{y})\right] \\ &= -y \frac{1}{\hat y} \cdot \frac{\partial \hat y}{\partial w_j} - (1-y) \frac{1}{1-\hat y} \cdot \frac{\partial (-\hat y)}{\partial w_j} \end{align*}$$

The derivative of the sigmoid function:

$$\begin{align*} \sigma'(x) &= \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} \\ &= -1 \cdot (1 + e^{-x})^{-2} \cdot \frac{\partial e^{-x}}{\partial x} \\ &= -1 \cdot (1 + e^{-x})^{-2} \cdot e^{-x} \cdot -1 \\ &= \frac{e^{-x}}{(1 + e^{-x})^2} \\ &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x} + 1 - 1}{1 + e^{-x}} \\ &= \sigma(x)(1 - \sigma(x)) \end{align*}$$
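We can sanity-check this identity numerically with a finite difference; the test points below are arbitrary:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x, h=1e-6):
    # Central finite-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    analytic = sigmoid(x) * (1 - sigmoid(x))
    numeric = numeric_derivative(sigmoid, x)
    print(x, analytic, numeric)  # the two values should match closely
```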

And the partial derivative of $\hat y$ with respect to $w_j$, where $\hat y = \sigma(f(x))$ and $f(x) = Wx+b$:

$$\begin{align*} \frac{\partial \hat y}{\partial w_j} &= \frac{\partial \hat y}{\partial f} \cdot \frac{\partial f}{\partial w_j} \\ &= \hat y(1 - \hat y) \cdot \frac{\partial}{\partial w_j}(w_1x_1 + \cdots + w_jx_j + \cdots + w_nx_n + b) \\ &= \hat y(1 - \hat y)x_j \end{align*}$$

Back to the loss function:

$$\begin{align*} \frac{\partial L}{\partial w_j} &= -y \frac{1}{\hat y} \cdot \hat y(1 - \hat y)x_j + (1-y) \frac{1}{1-\hat y} \cdot \hat y(1 - \hat y)x_j \\ &= -y(1 - \hat y)x_j + (1-y)\hat yx_j \\ &= -(y - \hat y)x_j \end{align*}$$
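The same finite-difference trick can confirm this result for a single weight; the weights, inputs, and label below are made-up values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up example: 2 inputs, 2 weights, a bias, and a label.
w, b = [0.4, -0.3], 0.1
x, y = [2.0, 1.5], 1.0

def loss(w, b):
    y_hat = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# Analytic gradient for w_0: -(y - y_hat) * x_0
y_hat = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
analytic = -(y - y_hat) * x[0]

# Numeric gradient for w_0 via a central finite difference.
h = 1e-6
numeric = (loss([w[0] + h, w[1]], b) - loss([w[0] - h, w[1]], b)) / (2 * h)

print(analytic, numeric)  # should agree to several decimal places
```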

The weights and the bias are then updated as:

$$\begin{align*} w_j' &= w_j + \alpha (y - \hat y)x_j \\ b_j' &= b_j + \alpha (y - \hat y) \end{align*}$$

Strictly speaking, $\alpha$ here stands for $\frac{1}{m}\alpha$, the learning rate multiplied by the averaging factor of the batch, but we use a single constant for convenience.
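Putting the pieces together, here is a minimal Python sketch of training a single perceptron with this update rule on toy data; the data points, learning rate, and epoch count are all made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy data: ([kilograms, height], label), where 1 = cat and 0 = mouse.
data = [([4.0, 25.0], 1), ([5.5, 28.0], 1), ([0.03, 6.0], 0), ([0.05, 7.0], 0)]

w, b = [0.0, 0.0], 0.0
alpha = 0.01  # learning rate (folding in the 1/m averaging for convenience)

for epoch in range(1000):
    for x, y in data:
        y_hat = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # Update rule: w_j' = w_j + alpha*(y - y_hat)*x_j, b' = b + alpha*(y - y_hat)
        for j in range(len(w)):
            w[j] += alpha * (y - y_hat) * x[j]
        b += alpha * (y - y_hat)

for x, y in data:
    y_hat = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
    print(x, y, round(y_hat, 3))  # predicted probabilities should be near the labels
```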

Sample app demo

Here is a sample app to demonstrate how it works. You can think of the blue points as cats and the orange points as mice, with the coordinates representing heights and kilograms. You can also ignore the analogy and just try the model; it is still able to split the points as long as they are roughly linearly separable.