# Imputation

A popular approach to dealing with missing data is to use a technique called imputation, which seeks to guess the value of the missing data.

## When to use imputation

Imputation is useful when the goal is to estimate averages and percentages from data sets that contain missing values, and the missing values are either Missing At Random or Missing Completely At Random.

It is typically counterproductive to use imputation when conducting multivariate analyses, when computing correlations, or when the missing data is Nonignorable. See How Missing Values are Addressed in Statistical Analysis for more information.

## Imputation methods

### Using averages

A popular approach to imputation is to replace each missing value with the average of the observed values. The only situation where this is a "good" idea is when you want to confuse people about your sample size. Some simple analyses are OK with this approach to imputation (e.g., computing the average), but the vast majority of analyses of survey data become invalid when the average is used. Consider the following table, where we have missing data on beer consumption. Among the people with no missing data, the average is 1.43. If we use this value for those with missing data, we get the new column of data shown as *Replaced by average*.

| ID | Gender | Beer consumption in last week | Replaced by average | Replaced by average for genders |
|---|---|---|---|---|
| 1 | Male | 5 | 5 | 5 |
| 2 | Male | 4 | 4 | 4 |
| 3 | Male | 0 | 0 | 0 |
| 4 | Male | 0 | 0 | 0 |
| 5 | Male | MISSING | 1.43 | 2.25 |
| 6 | Female | 0 | 0 | 0 |
| 7 | Female | 1 | 1 | 1 |
| 8 | Female | MISSING | 1.43 | 0.33 |
| 9 | Female | MISSING | 1.43 | 0.33 |
| 10 | Female | 0 | 0 | 0 |

While it may at first seem sensible to assume that, in the absence of any response, each respondent is ‘average’, doing so has some dire consequences. In this example, we see two such consequences. First, the standard deviation of the measure of beer consumption falls substantially, from 2.15 to 1.75. This is particularly problematic in multivariate techniques, such as regression, cluster analysis and principal components analysis, which are based on analyzing the variances in data.

| Gender | Beer consumption in last week | Replaced by average | Replaced by average for genders |
|---|---|---|---|
| **Average** | | | |
| Total | 1.43 | 1.43 | 1.29 |
| Males | 2.25 | 2.09 | 2.25 |
| Females | 0.33 | 0.77 | 0.33 |
| Male / Female | 6.75 | 2.70 | 6.75 |
| **Standard deviation** | | | |
| Total | 2.15 | 1.75 | 1.84 |
| Males | 2.63 | 2.31 | 2.28 |
| Females | 0.58 | 0.73 | 0.41 |

A second problem is that we have changed the relationship between our variables. Prior to replacing the missing values with the means, the research suggested that males consumed, on average, 6.75 times as much beer as females, but after we have replaced the missing values, a comparison of means leads to the conclusion that males only consume 2.70 times as much as females!
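Both consequences can be reproduced from the ten rows of the first table. The following is a minimal Python sketch using only the standard library. The variable names (`beer`, `gender`, `imputed`) and the use of `None` to mark a missing response are illustrative assumptions, not part of the original analysis.

```python
from statistics import mean, stdev

# Beer consumption from the table above; None marks a missing response.
gender = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

observed = [b for b in beer if b is not None]
overall_mean = mean(observed)  # about 1.43

# Mean imputation: every missing value becomes the overall observed mean.
imputed = [overall_mean if b is None else b for b in beer]

def by_gender(values, g):
    """Return the non-missing values for one gender."""
    return [v for v, s in zip(values, gender) if s == g and v is not None]

# Ratio of the male mean to the female mean, before and after imputation.
ratio_before = mean(by_gender(beer, "M")) / mean(by_gender(beer, "F"))
ratio_after = mean(by_gender(imputed, "M")) / mean(by_gender(imputed, "F"))

print(f"SD before: {stdev(observed):.2f}, after: {stdev(imputed):.2f}")
print(f"Male/female ratio before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

Here `stdev` is the sample standard deviation (dividing by n - 1), which is the convention that matches the figures in the table.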

### Standard implementations of regression and other predictive models

An alternative to using the average for imputation is to create a predictive model of some type. For example, we can see in the example above that gender is a predictor of beer consumption, so we can assign the missing values the average values for each gender, as shown in the final column of the first table above. While less problematic than simply replacing values with the overall mean, this approach is still far from perfect. The analysis now preserves the average differences between the genders, but we still have a substantial reduction in the overall variation in the data, and even greater reductions in the standard deviations within each gender. The consequence is that we inadvertently exaggerate the true relationship between the predictor variables and the variable where the missing data is being replaced. In this example, the correlation between gender and beer consumption increases from .48 to .55.

Many of the automated methods for imputing data are variants of this approach.
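The inflated correlation can be checked directly. The sketch below, again in standard-library Python, treats the per-gender mean as the simplest possible predictive model. The helper `pearson` and the 1/0 coding of gender are illustrative assumptions introduced here.

```python
from statistics import mean, stdev

gender = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# The "model": the observed mean for each gender.
group_mean = {
    g: mean([b for b, s in zip(beer, gender) if s == g and b is not None])
    for g in ("M", "F")
}
imputed = [group_mean[s] if b is None else b for b, s in zip(beer, gender)]

# Code gender as 1 = male, 0 = female so it can be correlated with consumption.
is_male = [1 if s == "M" else 0 for s in gender]
complete = [(m, b) for m, b in zip(is_male, beer) if b is not None]

r_before = pearson([m for m, _ in complete], [b for _, b in complete])
r_after = pearson(is_male, imputed)

# Within-gender spread for males, before and after imputation.
males_before = [b for b, s in zip(beer, gender) if s == "M" and b is not None]
males_after = [b for b, s in zip(imputed, gender) if s == "M"]

print(f"Correlation before: {r_before:.2f}, after: {r_after:.2f}")
print(f"Male SD before: {stdev(males_before):.2f}, after: {stdev(males_after):.2f}")
```

Every imputed value sits exactly on the group mean, so it adds nothing to the within-gender variation while still counting as an extra observation. That is why the standard deviations shrink and the correlation rises.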

### Stochastic predictive models

A solution to the problem of imputing using standard statistical methods is to introduce randomization into the process. The basic idea shared by such approaches is that, rather than imputing a single value to replace the missing value, we impute multiple values. In this example, we have measures of beer consumption for four men, with values of 0, 0, 4 and 5. While we only have one male with missing data, we structure the data file as if we have three, giving each of them one of the distinct values that we observe for the males (i.e., 0, 4 and 5). In addition, we create a weight variable, where we weight the previously missing respondents in proportion to the frequency with which we have observed the values amongst the men (here in the ratio 2 : 1 : 1). This approach causes far fewer problems in the averages, standard deviations and correlations.
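The weighted data file described above might be built as follows for the males. This is a sketch under stated assumptions rather than a definitive implementation: the three pseudo-rows are scaled so that their weights sum to one respondent, and the weighted standard deviation divides by the sum of weights minus one. Neither choice is specified in the text.

```python
from collections import Counter

observed_males = [5, 4, 0, 0]

# One pseudo-row per distinct observed value, weighted by its relative
# frequency, so the three rows together count as a single respondent.
counts = Counter(observed_males)
pseudo_rows = [(value, n / len(observed_males)) for value, n in sorted(counts.items())]

# Observed respondents keep a weight of 1.
rows = [(float(v), 1.0) for v in observed_males] + pseudo_rows

def weighted_mean(rows):
    """Weighted mean of (value, weight) pairs."""
    return sum(v * w for v, w in rows) / sum(w for _, w in rows)

def weighted_sd(rows):
    """Weighted SD with an (assumed) sum-of-weights-minus-one denominator."""
    m = weighted_mean(rows)
    total = sum(w for _, w in rows)
    return (sum(w * (v - m) ** 2 for v, w in rows) / (total - 1)) ** 0.5

print(pseudo_rows)
print(f"Weighted male mean: {weighted_mean(rows):.2f}")
print(f"Weighted male SD:   {weighted_sd(rows):.2f}")
```

Under these assumptions the male mean stays at the observed 2.25. The standard deviation lands between the 2.28 produced by gender-mean imputation and the 2.63 of the observed data, so less of the original variation is lost.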

Where this approach is impractical, other alternatives are to:

- Randomly predict a value. For example, randomly assign the male with missing data one of the four values observed for the other males.
- Create multiple data files, each with a single randomly predicted value, and then average the statistical results obtained from these multiple data files. This is known as *multiple imputation*.
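Both alternatives can be sketched in a few lines of standard-library Python. The function names (`impute_once`, `pooled_estimate`), the number of data files (20) and the fixed seed are illustrative assumptions. The sketch only averages the point estimates, as described in the second bullet. Full multiple-imputation procedures typically also combine the variances of the estimates, which is not shown here.

```python
import random
from statistics import mean

gender = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

# Observed values for each gender act as the pool of possible imputed values.
donors = {
    g: [b for b, s in zip(beer, gender) if s == g and b is not None]
    for g in ("M", "F")
}

def impute_once(rng):
    """First bullet: fill each gap with a random observed value of the same gender."""
    return [rng.choice(donors[s]) if b is None else b for b, s in zip(beer, gender)]

def pooled_estimate(statistic, n_datasets=20, seed=0):
    """Second bullet: compute a statistic on several imputed data sets and average it."""
    rng = random.Random(seed)
    estimates = [statistic(impute_once(rng)) for _ in range(n_datasets)]
    return mean(estimates)

pooled_mean = pooled_estimate(mean)
print(f"Pooled mean over 20 imputed data sets: {pooled_mean:.2f}")
```

Because each data file draws different values, the spread of the estimates across files gives a rough sense of how much uncertainty the missing data contributes.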

This approach is obviously a lot more complex than the previous methods. Although it is better than the simpler methods, it is still undesirable, as it implicitly assumes that the data is Missing At Random, which is often not the case. Furthermore, even when the most sophisticated imputation method is employed, some errors will be introduced (there are always errors in complicated statistical models), and these can lead to misleading conclusions. In general, it is better to use statistical methods that can automatically address the missing values (the only exception is when the interest is primarily in means and percentages).