In data science, missing data is a common concern. It complicates analysis and modeling, so missing values must either be removed or replaced with suitable estimates.
Missing data falls into three categories:
- Missing Completely At Random (MCAR) – the most random level of all: the missing values in a feature are unrelated to the values of any other feature. When data must be missing, this is the preferred scenario.
- Missing At Random (MAR) – the probability that a value is missing depends on the values of other features.
- Missing Not At Random (MNAR) – the most serious case. Here it is prudent to investigate the data collection process and try to work out why the data is missing. For example, why did most respondents in a poll skip a particular question? Was the question ambiguous?
What should you do with the missing values?
Once we have detected the missing values in our data, we need to determine their extent before taking any further action.
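As a quick sketch, pandas makes this detection step a one-liner; the DataFrame below is hypothetical, invented purely for illustration:

```python
import numpy as np
import pandas as pd

# A small illustrative dataset (made-up values) with some gaps.
df = pd.DataFrame({
    "age":    [25.0, np.nan, 31.0, 47.0, np.nan],
    "income": [50_000, 62_000, np.nan, 71_000, 58_000],
    "city":   ["Oslo", "Bergen", None, "Oslo", "Bergen"],
})

# Count of missing values per column...
missing_counts = df.isna().sum()
# ...and the fraction missing, which guides the choice of strategy.
missing_fraction = df.isna().mean()

print(missing_counts)
print(missing_fraction)
```

The fraction is usually the more useful number, since the rules of thumb below are stated as percentages.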
- Ignore the missing values – unless the data is MAR or MNAR, missing data below roughly 10% for an individual instance or observation can usually be ignored. If the incomplete instances are discarded, the number of complete cases that remain must still be adequate for the chosen analytic approach.
- Eliminating a variable – if the data is MCAR or MAR and a feature has a large number of missing values, exclude that feature from the analysis. As a rule of thumb, if more than 5% of a feature's or a sample's data is missing, that feature or sample should probably be excluded. Do not impute the dependent variable(s): if instances or observations are missing values there, remove those cases, to avoid artificially inflating the relationships with the independent variables.
- Deleting – remove every case that has a missing value in one or more features. If the number of instances with missing data is small, simply removing them is preferable. Although the method is simple, it can shrink the sample size considerably, and because the data may not be missing at random, it can produce biased parameter estimates.
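Both elimination strategies above are one-liners in pandas. A minimal sketch on a hypothetical DataFrame (the 50% cut-off is chosen here only so the example keeps a partially missing column to work with; the text above suggests 5% in practice):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "comment" is mostly missing, "age" has one gap.
df = pd.DataFrame({
    "age":     [25.0, np.nan, 31.0, 47.0, 52.0],
    "income":  [50.0, 62.0, 58.0, 71.0, 49.0],
    "comment": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Eliminate features whose missing fraction exceeds a threshold.
threshold = 0.5  # illustrative only; 5% is the rule of thumb quoted above
trimmed = df.loc[:, df.isna().mean() <= threshold]

# Listwise deletion: drop the remaining rows with any missing value.
complete_cases = trimmed.dropna()
```

Note how the two steps compound: dropping the sparse column first preserves rows that would otherwise be deleted for a value that was never going to be used.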
- Imputation – replacing missing data with values estimated by statistical methods. Its benefit is that it preserves every case, substituting an estimate based on the other available data for each missing value.
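The simplest such estimate is the mean of the observed values. A minimal sketch, on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, np.nan, 40.0, 60.0], name="age")

# Replace missing entries with the mean of the observed values
# (the median is often preferred when the distribution is skewed).
imputed = s.fillna(s.mean())
```

Mean imputation keeps every case but shrinks the variance of the feature, which is one reason the model-based methods below exist.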
- Regression methods – variables with missing values are treated as dependent variables, while variables with complete cases serve as predictors (independent variables). A linear equation is fitted to the observed values of the dependent variable using the independent variables, and the missing values are then predicted from this equation.
The downside of this strategy is that the selected independent variables correlate strongly with the dependent variable, so the imputed values fit "too well" and understate the uncertainty about the true value. The approach also presupposes a linear relationship, which may not hold in practice.
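A minimal sketch of regression imputation in NumPy, assuming one complete predictor `x` and one incomplete feature `y` (both arrays are hypothetical): fit a line on the complete cases, then predict the gaps from it.

```python
import numpy as np

# Hypothetical feature with missing entries (y) and a complete predictor (x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, np.nan, 6.2, 8.0, np.nan])

observed = ~np.isnan(y)

# Fit y ≈ a*x + b using only the complete cases.
a, b = np.polyfit(x[observed], y[observed], deg=1)

# Predict the missing entries from the fitted line.
y_imputed = y.copy()
y_imputed[~observed] = a * x[~observed] + b
```

Every imputed point lies exactly on the fitted line, which illustrates the "too good a fit" criticism above: the filled-in values carry none of the scatter that real observations would.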
- KNN – predicts and substitutes missing data using the k-nearest neighbors technique. The k neighbors are selected with a distance metric, and the average of their observed values for the missing feature is used as the imputation estimate. Both qualitative and quantitative attributes can be estimated this way.
To get the best fit, test several values of k with different distance measures; the right measure depends on the data's attributes. If the input variables are similar in type, Euclidean distance is a reasonable metric; if they are not, Manhattan distance is a reasonable choice.
KNN has the benefit of being straightforward to implement, but it suffers from the curse of dimensionality: it works well with a small number of variables, yet becomes computationally inefficient as the number of variables grows large.
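The procedure can be sketched in plain NumPy. This is a simplified, illustrative implementation, not production code: for each missing entry, distances to candidate donor rows are computed over the features both rows share (Euclidean here), and the mean of the k nearest donors' values fills the gap.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs using the mean of the k nearest donor rows.

    Simplified sketch: distances use only the features observed in
    both rows; donors are rows where the target column is observed.
    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float).copy()
    n, _ = X.shape
    for i in range(n):
        for j in np.where(np.isnan(X[i]))[0]:
            # Candidate donors: other rows with column j observed.
            donors = [r for r in range(n) if r != i and not np.isnan(X[r, j])]
            dists = []
            for r in donors:
                shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
                if shared.any():
                    # Euclidean distance over the shared features.
                    dists.append(np.sqrt(np.sum((X[i, shared] - X[r, shared]) ** 2)))
                else:
                    dists.append(np.inf)  # no overlap: worst possible donor
            nearest = np.argsort(dists)[:k]
            X[i, j] = np.mean([X[donors[t], j] for t in nearest])
    return X

X = np.array([[1.0, 2.0],
              [1.0, np.nan],
              [10.0, 20.0],
              [10.0, 22.0]])
filled = knn_impute(X, k=1)  # row 1 borrows the value from its nearest row, row 0
```

For real work, scikit-learn ships a ready-made `KNNImputer` built on the same idea, using a NaN-aware Euclidean distance.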