Neural networks and random initialization
Neural networks have recently become the go-to solution for practically all of our machine learning problems. Because neural nets can synthesize complicated non-linearities, they deliver previously unattainable accuracy on most tasks.
As we progress from one dense layer to the next, they use the features of the given data to build increasingly complicated features. Researchers have tried to investigate these intricate feature-generating mechanisms, but have made little headway to date, and neural networks remain the mysterious black boxes they have always been.
Some scientists are also opposed to the use of neural networks in critical domains such as autonomous vehicles and drones. They claim that in comparison to decision-making frameworks such as support vector machines or random forests, the decisions made by a deep neural network cannot be justified.
If something goes wrong someday, say an autonomous car drives off a cliff on the way to the grocery store, the cause of the problem can be identified and addressed fairly easily if a support vector machine was in charge of the car's behavior.
In contrast, because of the highly intricate structure of neural networks, no one can actually explain why the car made that decision and went off the cliff.
But, all things considered, no other approach today can learn from data as precisely as neural networks. Image recognition is what it is now because of neural networks. Nowadays, large convolutional nets are being developed that grow increasingly accurate at detecting objects, to the point where they can compete with humans.
Weights exist between every two layers in a neural network. To obtain the values of the next layer, a linear transformation involving these weights and the values in the previous layer is passed through a nonlinear activation function.
This process occurs layer by layer during forward propagation, and the optimal values of these weights can be determined by backpropagation so that the network produces correct outputs for a given input.
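The layer-by-layer computation described above can be sketched in a few lines of NumPy; the layer sizes and the choice of a sigmoid activation are illustrative assumptions, not specifics from this article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(x, W, b):
    """Linear transformation of the previous layer's values,
    passed through a nonlinear activation function."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # values of the previous layer (3 units)
W = rng.normal(size=(4, 3))   # weights between the two layers
b = np.zeros(4)               # biases of the next layer

h = forward_layer(x, W, b)    # values of the next layer (4 units)
print(h.shape)
```

During forward propagation this step is simply repeated for each pair of consecutive layers, feeding each layer's output in as the next layer's input.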
Initializing neural networks
Let’s focus on three ways to initialize the weights between the layers:
- Zero Initialization is pointless. It fails to break the symmetry of the neural network. If all of the weights are set to 0, all of the neurons in all of the layers perform the same calculation and produce the same output, rendering the deep net useless. With zero weights, the expressive power of the entire deep net is no better than that of a single neuron, and the predictions are no better than random.
- Random Initialization for neural networks helps break the symmetry and improves accuracy. The weights are initialized to random values very close to zero. As a result, symmetry is broken, and each neuron no longer performs the same computation.
To break symmetry, random initialization is commonly used, and it provides significantly higher accuracy than zero initialization. It stops neurons from learning the same features of their inputs. This matters because a neural network is particularly sensitive and prone to overfitting: it quickly memorizes the training data.
Our goal, however, is for each neuron to learn a different function of its input. But if the randomly initialized weights can be very high or very low, a new difficulty occurs.
When the weights are initialized with large values, the pre-activation term grows large. The sigmoid function then maps it to a value near 1, where the curve is almost flat, so the gradient descent updates become tiny. Learning consumes a significant amount of time!
When the weights are initialized with very small values, the opposite problem occurs: the pre-activations stay close to zero, the activations collapse toward a constant, and the signal shrinks as it passes through each layer, again slowing down the optimization process.
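Both failure modes can be observed numerically. The sketch below, assuming a deep stack of tanh layers (tanh behaves like the sigmoid here: it saturates for large inputs and is nearly linear near zero), with illustrative depth, width, and weight scales, tracks the spread of the activations and the size of the local gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 10

stats = {}
for scale in (0.01, 100.0):
    a = rng.normal(size=n)
    for _ in range(depth):
        # one layer: linear transformation + nonlinear activation
        a = np.tanh(rng.normal(scale=scale, size=(n, n)) @ a)
    # spread of activations, and local gradient tanh'(z) = 1 - a**2
    stats[scale] = (a.std(), (1.0 - a**2).mean())
    print(scale, stats[scale])
```

With the tiny scale the activation spread collapses toward zero after a few layers (the signal vanishes), while with the huge scale the units saturate at ±1 and the local gradient is essentially zero; either way, optimization slows dramatically.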
- He-et-al Initialization: in this technique the weights are initialized with the size of the previous layer in mind, allowing faster and more efficient convergence toward the global minimum of the cost function. The weights are still random, but their range is scaled according to the number of neurons in the previous layer. As a result of this controlled initialization, gradient descent is faster and more efficient.
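A minimal sketch of this scheme, assuming the common variant that draws weights from a normal distribution scaled by sqrt(2 / fan_in), where fan_in is the size of the previous layer (the layer sizes below are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He-et-al initialization: random weights whose scale
    depends on the size of the previous layer (fan_in)."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.std())   # close to sqrt(2 / 512), about 0.0625
```

Because the scale shrinks as the previous layer grows, the pre-activations keep roughly the same variance from layer to layer, which is what keeps the gradients well-behaved.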
So, with zero initialization, every neuron learns the same function at every iteration.
Random initialization is a preferable choice for breaking the symmetry; nonetheless, initializing with very high or very low values can result in slower optimization.
He-et-al initialization partially solves this problem by adding an extra scale factor, based on the size of the previous layer, to random initialization. As a result, it is the most highly recommended weight initialization approach of the three.
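The three schemes compared in this article can be put side by side on a single toy layer; the layer sizes, tanh activation, and seed below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 8, 4
x = rng.normal(size=fan_in)

W_zero = np.zeros((fan_out, fan_in))                       # zero init
W_rand = rng.normal(scale=0.01, size=(fan_out, fan_in))    # small random init
W_he = rng.normal(scale=np.sqrt(2.0 / fan_in),
                  size=(fan_out, fan_in))                  # He-et-al init

def layer(W, x):
    return np.tanh(W @ x)

# Number of distinct neuron outputs: 1 means every neuron is identical
# (symmetry unbroken), fan_out means each neuron computes its own function.
print(np.unique(layer(W_zero, x)).size)
print(np.unique(layer(W_rand, x)).size)
print(np.unique(layer(W_he, x)).size)
```

Zero initialization leaves all four neurons computing the same value, while both random schemes break the symmetry; He-et-al additionally keeps the weight scale tied to the previous layer's size.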