Gradient descent is an optimization algorithm that minimizes a function by moving iteratively in the direction of steepest descent, which is given by the negative of the gradient. In machine learning, gradient descent is used to update a model’s parameters. Parameters are the coefficients in linear regression and the weights in neural networks.
- Gradient descent is an algorithm that finds the values of a function’s parameters that minimize a cost function.
Gradient descent is most useful when the parameters cannot be calculated analytically (for example, using linear algebra) and must instead be found by an optimization procedure.
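To make the update rule concrete, here is a minimal sketch in plain Python. The function name `gradient_descent`, its arguments, and the example function f(x) = (x - 3)^2 are assumptions chosen for illustration and are not taken from any particular library.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_iters=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_iters):
        # Move in the direction of the negative gradient.
        x = x - learning_rate * grad(x)
    return x


# Hypothetical example: f(x) = (x - 3)^2 has gradient 2 * (x - 3)
# and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
print(round(minimum, 4))
```

With these settings each step shrinks the distance to the minimum by a constant factor of 0.8, so the printed value ends up very close to 3.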
How to optimize Gradient Descent
- Plot cost versus time – Collect and plot the cost values calculated by the algorithm at each iteration. In a well-performing gradient descent run, the cost should decrease on every iteration. If it does not, try reducing the learning rate.
- Learning rate – The learning rate is a small real value such as 0.1, 0.001, or 0.0001. Try several values on your problem to see which one works best.
- Rescale inputs – The algorithm reaches the minimum cost faster when the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all input variables (X) to the same range, such as [-1, 1].
- Few passes – Stochastic gradient descent often needs only 1 to 10 passes through the training dataset to converge on good coefficients.
- Plot mean cost – Because stochastic gradient descent updates on every training instance, its cost plot over time can be noisy. Averaging the cost over 10, 100, or 1000 updates gives a better picture of the algorithm’s learning trend.
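The first three tips can be tried together in a short sketch. It assumes a one-feature linear model y = w * x + b with a mean squared error cost; the helper names `rescale`, `cost`, and `run` and the data are made up for illustration. The recorded `history` is what you would plot against the iteration number.

```python
def rescale(values):
    """Linearly map values to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def cost(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run(xs, ys, learning_rate, n_iters=100):
    """Gradient descent on all examples; returns the cost after every iteration."""
    w, b, n = 0.0, 0.0, len(xs)
    history = []
    for _ in range(n_iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(cost(w, b, xs, ys))
    return history


# Made-up data on the line y = 0.5 * x + 2, with x on a large scale.
raw_xs = [100.0, 200.0, 300.0, 400.0, 500.0]
ys = [52.0, 102.0, 152.0, 202.0, 252.0]
xs = rescale(raw_xs)  # now in [-1, 1]

final_costs = {}
for lr in (0.1, 0.01, 0.001):
    history = run(xs, ys, lr)
    decreasing = all(curr <= prev for prev, curr in zip(history, history[1:]))
    final_costs[lr] = history[-1]
    print(f"lr={lr}: final cost={history[-1]:.6f}, cost always decreasing={decreasing}")
```

On this toy problem the largest of the three learning rates reaches the lowest cost within the iteration budget, but that ordering depends on the data and should not be read as a general rule.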
Now, let’s talk about the types of gradient descent algorithms. The key difference between them is how much data we use when computing the gradient for each learning step. The trade-off is between the accuracy of the gradient and the time it takes to perform each update, that is, each learning step.
Stochastic Gradient Descent (SGD)
Rather than going through all of the examples before updating, stochastic gradient descent updates the parameters after every single example. As a result, learning happens on each example.
It has many of the same benefits and drawbacks as the mini-batch variant.
The ones that are specific to SGD are listed below:
- It adds even more noise to the learning process than mini-batch, which can help reduce generalization error. However, the noise also increases the run time.
- We can’t take advantage of vectorization across examples, because each update uses only one example, which makes it slow. Furthermore, because each learning step uses a single example, the variance of the updates becomes large.
In comparison to the mini-batch, the SGD direction is extremely noisy.
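A minimal sketch of SGD for a one-feature linear model follows. The function name `sgd`, its parameters, and the data are assumptions for illustration. It also averages the per-example squared error over blocks of updates, since the raw per-update values are noisy.

```python
import random


def sgd(xs, ys, learning_rate=0.1, n_passes=10, average_over=10, seed=0):
    """Stochastic gradient descent for the line y = w * x + b.

    Returns the fitted w and b, plus the per-update squared errors
    averaged over consecutive blocks of `average_over` updates.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    losses, smoothed = [], []
    for _ in range(n_passes):
        rng.shuffle(indices)  # visit the examples in a new random order each pass
        for i in indices:
            error = w * xs[i] + b - ys[i]
            losses.append(error ** 2)
            # Update immediately, using the gradient of this single example.
            w -= learning_rate * 2.0 * error * xs[i]
            b -= learning_rate * 2.0 * error
            if len(losses) % average_over == 0:
                smoothed.append(sum(losses[-average_over:]) / average_over)
    return w, b, smoothed


# Made-up noiseless data on the line y = 3 * x + 1, with x already in [-1, 1].
xs = [i / 10.0 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, smoothed = sgd(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
print(f"first averaged loss={smoothed[0]:.4f}, last averaged loss={smoothed[-1]:.6f}")
```

Because the toy data has no noise, the parameters settle close to w = 3 and b = 1 within the ten passes; on real data the averaged loss would flatten out at a nonzero level instead.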
Mini-batch Gradient Descent
Rather than going over all of the examples, mini-batch gradient descent sums over a smaller number of examples, determined by the batch size. As a result, learning happens on each mini-batch.
The batch size is a hyperparameter that we can tune. It’s commonly chosen as a power of two, such as 128, 256, or 512, because some hardware, such as GPUs, achieves better run time with batch sizes that are powers of two.
The key benefits are as follows:
- It is faster than batch gradient descent because each update goes through far fewer examples than the full dataset.
- Randomly selecting examples helps avoid redundant or very similar examples that contribute little to learning.
- Although the standard error of the gradient estimate is lower with more examples, the return is less than linear relative to the computational cost.
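The mini-batch update can be sketched as follows, again for a one-feature linear model. The function name `minibatch_gd` and its parameters are assumptions for illustration, and the batch size of 4 is chosen only because the made-up dataset is tiny.

```python
import random


def minibatch_gd(xs, ys, batch_size=4, learning_rate=0.1, n_epochs=50, seed=0):
    """Mini-batch gradient descent for the line y = w * x + b."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # the random batches differ from epoch to epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [w * xs[i] + b - ys[i] for i in batch]
            # Average the gradient over the examples in this mini-batch only.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Made-up noiseless data on the line y = 3 * x + 1.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
print(f"w={w:.4f}, b={b:.4f}")
```

Setting `batch_size=1` in this sketch gives the stochastic variant, and setting it to `len(xs)` gives the batch variant, which is one way to see the three algorithms as points on the same scale.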
Batch Gradient Descent
When performing parameter updates, batch gradient descent sums over all examples on each iteration. As a result, every single update requires a pass over all of the examples.
The key benefits are as follows:
- We can use a fixed learning rate during training without worrying about learning rate decay.
- It follows a direct trajectory toward the minimum and, given a suitably small learning rate, is guaranteed in theory to converge to the global minimum if the loss function is convex and to a local minimum if it is not.
- It gives unbiased estimates of the gradients. The more examples there are, the lower the standard error.
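The point about standard error can be checked empirically with a small experiment. The sketch below assumes a model y = w * x with a mean squared error cost and made-up noisy data; `gradient_w` is a hypothetical helper. It estimates the gradient at w = 0 from random subsets of different sizes and measures how much the estimates vary.

```python
import random
import statistics


def gradient_w(w, xs, ys, indices):
    """Gradient of the mean squared error of y = w * x with respect to w,
    computed over the examples listed in `indices`."""
    return 2.0 * sum((w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)


rng = random.Random(0)
# Made-up noisy data around the line y = 2 * x.
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
all_indices = list(range(len(xs)))

# The gradient over every example, which is what batch gradient descent uses.
full_gradient = gradient_w(0.0, xs, ys, all_indices)

spread = {}
for n in (10, 100, 1000):
    # Repeatedly estimate the same gradient from n randomly chosen examples.
    estimates = [gradient_w(0.0, xs, ys, rng.sample(all_indices, n)) for _ in range(200)]
    spread[n] = statistics.stdev(estimates)
    print(f"n={n}: mean estimate={statistics.mean(estimates):.3f}, spread={spread[n]:.3f}")
print(f"full-batch gradient={full_gradient:.3f}")
```

The spread shrinks as n grows, but using ten times as many examples reduces it by much less than a factor of ten when going from 10 to 100 examples, which is the less-than-linear return mentioned for mini-batches.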
Optimization is a large part of machine learning.
- Gradient descent is a straightforward optimization technique that may be applied to a variety of machine learning methods.
Batch gradient descent calculates the derivative from all of the training data before computing an update.
Stochastic gradient descent calculates the derivative from each training instance and computes the update immediately.