Neural networks have recently become the go-to solution for nearly every machine learning problem. Because neural nets can approximate complicated non-linear functions, they often deliver accuracy that was previously unattainable.
As data flows from one dense layer to the next, the network combines the features of the input into increasingly complicated ones. Researchers have tried to investigate these intricate feature-building mechanisms, but have made little headway so far, and neural networks remain the mysterious black boxes they have always been.
Some scientists also oppose the use of neural networks in safety-critical domains such as autonomous vehicles and drones. They argue that, compared with decision-making frameworks such as support vector machines or random forests, the decisions made by a deep neural network cannot be justified.
If something goes wrong someday, say an autonomous car drives off a cliff on the way to the grocery store, the cause of the problem can be identified and addressed fairly easily if support vector machines were in charge of the car's behavior. With a neural network, by contrast, its highly intricate structure makes it very hard to explain why the car made that decision and went off the cliff.
A matrix of weights sits between every two layers in a neural network. To obtain the values of the next layer, a linear transformation of these weights and the values of the previous layer is passed through a nonlinear activation function. This process repeats layer by layer during forward propagation, and the optimal values of the weights are found by backpropagation, so that the network produces correct outputs for a given input.
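The layer-by-layer process above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the layer sizes, the sigmoid activation, and the random inputs are all assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate an input through the network layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # linear transformation of weights and previous layer
        a = sigmoid(z)  # nonlinear activation produces the next layer's values
    return a

rng = np.random.default_rng(0)
# a toy network with layer sizes 4 -> 5 -> 3 -> 1 (assumed for illustration)
sizes = [4, 5, 3, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

out = forward(rng.normal(size=4), weights, biases)
print(out.shape)
```

Backpropagation would then adjust `weights` and `biases` by following the gradient of a loss on `out`, which is the part omitted here.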
Let's focus on three ways to initialize the weights between the layers: zero initialization, random initialization, and He initialization.
To break symmetry, the random initialization method is commonly used, and it provides significantly higher accuracy than zero initialization: it stops the neurons from all picking up on the same characteristics of their inputs. A neural network initialized this way is still sensitive, however, and because it can quickly memorize the training data, it is prone to overfitting.
Our goal, after all, is for each neuron to learn a different function of its input. But a new difficulty arises when the randomly initialized weights are very high or very low.
When the weights are initialized with large values, the linear term grows large. The sigmoid function then maps the value close to 1, where its slope is nearly flat, so the gradient descent updates become tiny. Learning consumes a significant amount of time!
When the weights are initialized with considerably lower values, a similar problem occurs: the pre-activations stay near zero, the sigmoid outputs barely vary, and the gradients shrink as they propagate backward, again slowing down the optimization. Let's try the same model on the above dataset with random initialization.
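The saturation effect described above can be seen directly from the sigmoid's gradient. The snippet below is a small sketch (the specific pre-activation values `z` are assumptions chosen to span the small-to-large range): as large weights push the pre-activation `z` away from zero, the gradient of the sigmoid collapses toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# gradient of the sigmoid at pre-activations of increasing magnitude,
# as produced by increasingly large weights
for z in [0.0, 2.0, 10.0, 50.0]:
    print(f"z = {z:5.1f}  sigmoid(z) = {sigmoid(z):.6f}  gradient = {sigmoid_grad(z):.2e}")
```

At `z = 0` the gradient is at its maximum of 0.25; by `z = 50` it has all but vanished, which is why gradient descent crawls when the weights start out too large.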
So, with zero initialization, every neuron learns the same function at almost every iteration.
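This symmetry problem can be made concrete with a tiny two-layer network. The sketch below is an assumed toy example (the input, target, and layer sizes are made up for illustration): with all weights zero, every hidden neuron produces the same activation and receives the same gradient, so their weights can never diverge from one another.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# zero-initialized 2-layer network: 4 inputs -> 3 hidden -> 1 output
W1 = np.zeros((3, 4)); b1 = np.zeros(3)
W2 = np.zeros((1, 3)); b2 = np.zeros(1)

x = np.array([0.5, -1.2, 0.3, 0.8])  # assumed toy input
y = 1.0                              # assumed toy target

h = sigmoid(W1 @ x + b1)   # every hidden activation is identical (0.5)
out = sigmoid(W2 @ h + b2)

# one backpropagation step through the squared-error sigmoid output
d_out = (out - y) * out * (1 - out)
d_h = (W2.T @ d_out) * h * (1 - h)
dW1 = np.outer(d_h, x)

# the gradient reaching W1 is identical for every hidden neuron
# (here exactly zero, since W2 is zero), so symmetry is never broken
print(np.allclose(dW1, dW1[0]))
```

However many gradient steps are taken, the rows of `W1` remain equal, which is exactly the "memorizes the same function" behavior described above.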
Random initialization is the preferable choice for breaking symmetry; nonetheless, initializing with very high or very low values can result in slower optimization.
The above problem can be partially solved by He et al. initialization, which adds an extra scale factor to the randomly drawn weights. As a result, it is the most highly recommended of the three weight initialization approaches.
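A minimal sketch of that scale factor, assuming a fully connected layer with ReLU activations: He et al. initialization draws the weights from a normal distribution and scales them by sqrt(2 / fan_in), so the variance of the pre-activations stays roughly constant from layer to layer (the layer sizes below are assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """He et al. initialization: standard normal scaled by sqrt(2 / fan_in)."""
    return rng.normal(size=(fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

# an assumed 512 -> 256 fully connected layer
W = he_init(512, 256)
x = rng.normal(size=512)
a = np.maximum(W @ x, 0.0)  # ReLU activations with a stable scale

# the empirical weight std matches the target sqrt(2/512) ≈ 0.0625
print(float(W.std()))
```

Because the weights are neither too large (which saturates activations) nor too small (which starves the gradients), optimization proceeds much faster than with naive random initialization.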