The rectified linear activation unit, or ReLU, is one of the landmark developments of the deep learning revolution. It is simple, yet in deep networks it works far better than earlier activation functions such as sigmoid and tanh.
- The ReLU formula is: f(x) = max(0, x)
Both the ReLU function and its derivative are monotonic. If the function receives any negative input, it returns 0; if it receives any positive value x, it returns that value unchanged. As a result, the output ranges from 0 to infinity.
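As a quick illustration, here is a minimal NumPy sketch of the function and a common convention for its derivative (the names relu and relu_derivative are purely illustrative):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return np.maximum(0.0, x)

def relu_derivative(x):
    # The slope is 0 for x < 0 and 1 for x > 0; at exactly 0 it is undefined,
    # and returning 0 there is a common convention
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # zeros for the negative inputs, identity for the positive ones
print(relu_derivative(x))  # 0 on the negative side, 1 on the positive side
```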
ReLU is the most commonly used activation function in neural networks, especially CNNs, where it serves as the default activation function.
Advantages of the ReLU activation function
Because it involves no complicated arithmetic, the ReLU function is cheap to compute, so the model can train and run in less time. Sparsity is another significant property that we consider an advantage of using the ReLU activation function.
A sparse matrix is one in which most of the entries are zero, and we want an analogous property in our ReLU networks, where many of the activations are zero. Sparsity produces compact models with more predictive power and less overfitting and noise. In a sparse network, the neurons that do fire are more likely to be processing important components of the problem.
For instance, in a model that detects human faces in photos, there may be a neuron that can identify eyes, which obviously should not be activated if the image is not of a face but of, say, a tree or a bridge.
Because ReLU outputs zero for all negative inputs, it’s possible that any particular unit won’t activate at all, resulting in a sparse network.
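A quick way to see this is to apply ReLU to a batch of zero-centred pre-activations and measure how many outputs are exactly zero. The shapes and random pre-activations below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations for one dense layer (batch of 64, 256 units),
# drawn from a zero-mean distribution so roughly half of them are negative.
pre_activations = rng.normal(loc=0.0, scale=1.0, size=(64, 256))

activations = np.maximum(0.0, pre_activations)  # ReLU

# Fraction of activations that are exactly zero -> sparsity of the layer output
sparsity = np.mean(activations == 0.0)
print(f"Fraction of zero activations: {sparsity:.2f}")  # roughly 0.5 here
```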
Let’s look at how the ReLU activation function compares to other well-known activation functions such as sigmoid and tanh.
ReLU vs Sigmoid and Tanh
The activation functions commonly used before ReLU, such as sigmoid and tanh, are saturating: large inputs snap to 1.0, while small inputs snap to -1 (for tanh) or 0 (for sigmoid). Moreover, the functions are only really sensitive to changes in their input near zero, where the output sits at its mid-point (0.5 for sigmoid, 0.0 for tanh).
This resulted in an issue known as the vanishing gradient problem.
Neural networks are trained with gradient descent. The backpropagation step is essentially a repeated application of the chain rule that computes how each weight must change to lower the loss after every epoch. Derivatives therefore play a central role in the weight updates. When we use activation functions like sigmoid or tanh, whose derivatives only take meaningful values for inputs roughly between -2 and 2 and are nearly flat elsewhere, the gradient keeps shrinking as the number of layers increases.
As a result, the gradient reaching the early layers becomes tiny, and those layers are unable to learn correctly. In other words, because of the depth of the network and the activation pushing derivatives toward zero, their gradients tend to vanish.
- ReLU avoids this issue because its slope does not saturate as the input grows larger: the gradient stays at 1 for every positive input. As a result, models that use ReLU tend to converge faster, as the toy comparison below illustrates.
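The following is a toy sketch of that chain-rule effect, not a real backpropagation implementation: it simply multiplies one local derivative per layer, assuming the same positive pre-activation value (0.5, chosen arbitrarily) at every layer:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # never exceeds 0.25 (its value at x = 0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # exactly 1 for any positive pre-activation

x = 0.5  # an arbitrary positive pre-activation used at every layer
for depth in (5, 10, 20):
    # Backprop multiplies one local derivative per layer, so a factor below 1
    # shrinks the signal exponentially with depth, while ReLU keeps it intact.
    print(depth, sigmoid_grad(x) ** depth, relu_grad(x) ** depth)
```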
However, the ReLU function has its own flaws, such as the exploding gradient problem.
It is the polar opposite of the vanishing gradient: it occurs when large error signals accumulate during training, resulting in massive updates to the model weights. The model becomes unstable as a result and is unable to learn from your training data.
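The mechanism can be sketched with the same toy chain-rule product: if the per-layer factor (driven by the weights rather than the activation) is larger than 1, the backpropagated signal grows exponentially with depth. The factor 1.5 below is an arbitrary illustrative value:

```python
# Toy illustration of the exploding gradient: a per-layer factor above 1
# (here an assumed weight-related factor of 1.5) compounds with depth.
factor = 1.5
for depth in (5, 10, 20):
    print(depth, factor ** depth)  # grows quickly: ~7.6, ~57.7, ~3325
```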
Being zero for all negative values also has a drawback, known as the “dying ReLU” problem. If a ReLU neuron gets stuck on the negative side and always outputs 0, it is said to be “dead.” Because the slope of ReLU in the negative range is also 0, it is unlikely that such a neuron will ever recover. These neurons are essentially useless because they no longer play any part in discriminating the input.
Over time, you may find that a large portion of your network is idle. The dying problem is most likely to arise when the learning rate is too high or when there is a large negative bias.
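One rough way to check for this in practice is to look at a layer’s post-ReLU activations over a batch and count the units that never fire. The shapes and the simulated negative bias below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, batch_size = 128, 512
# Simulate pre-activations where the first 32 units carry a large negative bias,
# mimicking units that have been pushed onto the negative side.
bias = np.where(np.arange(n_units) < 32, -5.0, 0.0)
pre_activations = rng.normal(size=(batch_size, n_units)) + bias

activations = np.maximum(0.0, pre_activations)  # ReLU

# A unit counts as "dead" (for this batch) if it outputs 0 for every example.
dead = np.all(activations == 0.0, axis=0)
print(f"Dead units: {dead.sum()} / {n_units}")  # roughly the 32 biased units
```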
This difficulty is often alleviated by using a lower learning rate. We can also use Leaky ReLU, an improved variant of the ReLU activation function. Instead of defining the function as 0 for negative inputs x, Leaky ReLU gives those inputs a small linear slope (a small multiple of x). Its formula is as follows:
- Leaky ReLU: f(x) = max(0.01*x, x)
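For illustration, here is a minimal NumPy sketch of Leaky ReLU with the usual small slope of 0.01 on the negative side (the function names and the alpha parameter name are just illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha * x, x): negative inputs keep a small slope instead of 0
    return np.maximum(alpha * x, x)

def leaky_relu_derivative(x, alpha=0.01):
    # Slope is alpha for x <= 0 and 1 for x > 0, so negative units can still learn
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))             # small negative outputs instead of hard zeros
print(leaky_relu_derivative(x))  # 0.01 on the negative side, 1.0 on the positive side
```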