• Krishna Kankipati

Artificial Neural Networks (Part 3) - Loss and Cost Functions, and Gradient Descent

In this part of the ANN series, we will learn what a Loss Function is, how it is used to calculate a Cost Function, and finally what Gradient Descent is and the role it plays in optimisation.

A Loss Function evaluates the performance of a model by measuring the difference between the actual value and the predicted value. In general we use a Mean Squared Error function or a Log Loss error function. The smaller the loss, the closer the model's output is to the actual value.

Log Loss: L(p, y) = -(y log p + (1 - y) log(1 - p)), where y is the actual value (0 or 1) and p is the predicted probability that y = 1. If y = 1, then L(p, y) = -log p. If y = 0, then L(p, y) = -log(1 - p).
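To make the formula concrete, here is a minimal Python sketch of the Log Loss for a single sample. The function name log_loss and the small eps clamp are assumptions added for illustration (the clamp only keeps log() away from 0); they are not part of the formula itself.

```python
import math

def log_loss(p, y, eps=1e-12):
    """Log Loss for one sample: p is the predicted probability of class 1, y is 0 or 1."""
    # Assumed safeguard: clamp p into (0, 1) so math.log never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Same prediction, two different actual values.
print(log_loss(0.9, 1))  # prediction agrees with y = 1, so the loss is small
print(log_loss(0.9, 0))  # prediction disagrees with y = 0, so the loss is large
```

A confident wrong prediction is penalised much more heavily than a confident right one is rewarded, which is the behaviour we want from a classification loss.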

To train the parameters w and b, we need a Cost Function. The Cost Function can be viewed as the sum (or, more commonly, the average) of the losses over all training examples.
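As a rough sketch of the relationship between loss and cost, the snippet below averages the Log Loss over a few samples. The function names and the toy predictions are made up for illustration, and using the average rather than the plain sum is an assumption (it only rescales the value).

```python
import math

def log_loss(p, y):
    """Log Loss for one sample (p must lie strictly between 0 and 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cost(predictions, labels):
    """Cost = average of the per-sample losses (assumed convention; a sum also works)."""
    losses = [log_loss(p, y) for p, y in zip(predictions, labels)]
    return sum(losses) / len(losses)

# Hypothetical predictions and labels, for illustration only.
predictions = [0.9, 0.2, 0.7]
labels = [1, 0, 1]
print(cost(predictions, labels))
```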

Gradient Descent is the algorithm we use to train the model, and its hyperparameters, such as the learning rate, are used to tune the training. The derivative of the loss, for example (y-p)², with respect to the weights and biases tells us how the loss changes for a given sample. We repeatedly take small steps, called Gradient Steps, in the direction that minimises the loss. This strategy is known as Gradient Descent. For a function of one variable, the gradient is simply the slope of the function.
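Here is a small sketch of those derivatives for one sample. It assumes the simplest possible model, p = w*x + b with a single weight, which is an illustrative choice and not something fixed by the text; the function names are hypothetical too.

```python
def squared_error(w, b, x, y):
    """Loss (y - p)^2 for one sample, assuming the model p = w * x + b."""
    p = w * x + b
    return (y - p) ** 2

def squared_error_grad(w, b, x, y):
    """Derivatives of (y - p)^2 with respect to w and b, from the chain rule."""
    p = w * x + b
    dw = -2 * (y - p) * x  # d/dw of (y - (w*x + b))^2
    db = -2 * (y - p)      # d/db of (y - (w*x + b))^2
    return dw, db

w, b, x, y = 0.5, 0.0, 2.0, 3.0
dw, db = squared_error_grad(w, b, x, y)
print(dw, db)  # both negative: increasing w or b would reduce the loss here
```

A negative derivative means the loss goes down as that parameter goes up, so the next step should increase it.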

The Gradient vector has both a direction and a magnitude:

∇f(x₁, ..., xₙ) = (∂f/∂x₁, ..., ∂f/∂xₙ)

Gradient of a function f

The gradient always points in the direction of steepest increase of the Loss Function. Thus, moving in the opposite direction of the gradient gives the steepest decrease in the Loss Function.

i.e. the weights are updated as follows:

w := w - α · ∂J/∂w, where α is the learning rate and J is the Cost Function

Updating the weight using the gradient
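A single update can be sketched in a few lines. The toy loss (w - 3)², whose minimum sits at w = 3, and the names loss, grad and gradient_step are assumptions chosen only to keep the example tiny.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an assumed example function)."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_step(w, learning_rate):
    """Move w against the gradient, scaled by the learning rate."""
    return w - learning_rate * grad(w)

w = 0.0
w_next = gradient_step(w, learning_rate=0.1)
print(w, loss(w))            # before the step
print(w_next, loss(w_next))  # after the step: w moved towards 3 and the loss dropped
```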

In order to reach the point of minimal loss we keep updating the weights. What happens if we update them in smaller or in larger steps?

If we update the weights in steps that are too small, the model takes longer to train; if the steps are too large, the model overshoots the minimum. The size of each step is controlled by the learning rate, so we need to choose the learning rate carefully.
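The effect of the step size is easy to see on a toy problem. The sketch below minimises f(w) = w², whose gradient is 2w and whose minimum is at w = 0; the function and the three learning rates are assumptions picked for illustration.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    """Run a fixed number of gradient steps on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w  # gradient of w**2 is 2*w
    return w

# Too small, reasonable, and too large learning rates (illustrative values).
for lr in (0.01, 0.4, 1.1):
    print(lr, run_gradient_descent(lr))
```

With the smallest rate, w is still far from 0 after 20 steps; with the middle rate it is essentially at the minimum; with the largest rate each step overshoots by more than it corrects, so w moves away from the minimum instead of towards it.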

The ideal Learning Rate for a function of one variable is 1/f''(x), the inverse of the second derivative, and for a function of two or more variables it is the inverse of the Hessian matrix.

H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]] for a function f(x, y)

Hessian Matrix
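Scaling the step by the inverse second derivative (or the inverse Hessian) is what Newton's method does. The sketch below tries it on two quadratic functions, where one such step lands exactly on the minimum; the example functions and helper names are assumptions, and on non-quadratic losses the step is only an approximation.

```python
def newton_step_1d(x, first_derivative, second_derivative):
    """One step with learning rate 1 / f''(x)."""
    return x - first_derivative(x) / second_derivative(x)

def newton_step_2d(point, gradient, hessian):
    """One step with the inverse of a 2x2 Hessian as the 'learning rate'."""
    x, y = point
    gx, gy = gradient(x, y)
    (a, b), (c, d) = hessian(x, y)
    det = a * d - b * c
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    step_x = (d * gx - b * gy) / det
    step_y = (-c * gx + a * gy) / det
    return x - step_x, y - step_y

# 1D example: f(x) = 2 * (x - 3)**2, minimum at x = 3.
x_new = newton_step_1d(0.0, lambda x: 4.0 * (x - 3.0), lambda x: 4.0)

# 2D example: f(x, y) = x**2 + x*y + 3*y**2, minimum at (0, 0).
point_new = newton_step_2d(
    (1.0, 2.0),
    gradient=lambda x, y: (2.0 * x + y, x + 6.0 * y),
    hessian=lambda x, y: ((2.0, 1.0), (1.0, 6.0)),
)
print(x_new, point_new)
```

In practice the Hessian is expensive to compute and invert for a network with many weights, which is one reason a hand-tuned learning rate is used instead.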

Thus Gradient Descent can be implemented as follows:
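The original illustration for this step is not available, so what follows is only a minimal sketch of one way to do it: Gradient Descent on the average Log Loss of a one-feature logistic regression with parameters w and b. The sigmoid model, the function names, the toy data, and the chosen learning rate and epoch count are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit w and b by repeatedly stepping against the gradient of the average Log Loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)  # predicted probability for this sample
            dw += (p - y) * x       # d(Log Loss)/dw for this sample
            db += (p - y)           # d(Log Loss)/db for this sample
        # Gradient step using the average gradient over all samples.
        w -= learning_rate * dw / n
        b -= learning_rate * db / n
    return w, b

# Toy data (made up): small x belongs to class 0, large x to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
print(w, b)
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.0 + b))
```

After training, the predicted probability should be low for small x and high for large x, which is what the two printed probabilities let you check.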

For a deeper understanding, look up Logistic Regression Gradient Descent.