Missing Values

From Displayr

Observations missing from a set of data for some reason. For example, if a question in a survey asks for people's ages, and the survey database does not record any value for a respondent, then the respondent has a missing value (or, equivalently, missing data).

Causes of missing values

  • The data was never collected (e.g., the question was optional or the questionnaire contained a skip).
  • The collected data was considered uninformative. For example, a question may have asked for somebody's attitude and the respondent may have said "Don't know". Some researchers recode such data as missing values.
  • The database was corrupted.
  • The data was determined to be invalid during data cleaning and was recoded as missing values.
  • The respondents dropped out of a study (e.g., stopped answering questions after completing two-thirds of the questionnaire).

Types of missing values

When conducting an analysis of data that involves missing values it is necessary to make assumptions about what has caused the data to be missing. Making an incorrect assumption can profoundly alter any conclusions.

Missing Completely At Random (MCAR)

ID  Gender  Age         Income
1   Male    Under 30    Low
2   Female  Under 30    Low
3   Female  30 or more  High
4   Female  30 or more  Missing Value
5   Female  30 or more  High

Looking at the table above, we need to ask ourselves: what is the likely income of the fourth observation? The simplest approach is to note that 50% of the other people have high incomes and 50% have low incomes, and therefore to assume that she has a 50% chance of having a high income and a 50% chance of having a low income. This is known as assuming that the missing value is Missing Completely At Random (MCAR).[1] When we make this assumption, we are assuming that whether or not the person has missing data is completely unrelated to the other information in the database.

It is relatively easy to check the assumption that data is Missing Completely At Random. If you can predict which units have missing data (e.g., using common sense, regression or some other method), then the data is not Missing Completely At Random. Thus, many of the common types of missing values in market research, such as "Don't know" responses and questions that have been skipped, are not Missing Completely At Random and any analyses that implicitly make such an assumption can give misleading results (see How Missing Values are Addressed in Statistical Analysis).
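One informal way to run this check is to build an indicator of which rows are missing and see whether it varies with the other variables. The sketch below does this with pandas on a small made-up data set; the column names and values are illustrative assumptions, not data from this page.

```python
import pandas as pd

# Hypothetical survey data: None marks a missing income.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "income": ["Low", "High", "Low", "High", "Low", None, None, None],
})

# 1 where income is missing, 0 where it was recorded.
df["income_missing"] = df["income"].isna().astype(int)

# Share of missing incomes within each gender.
missing_rate = df.groupby("gender")["income_missing"].mean()
print(missing_rate)
```

If the rates differ sharply between groups, as they do in this made-up example, missingness can be predicted from gender and the Missing Completely At Random assumption is doubtful. With real data, a regression of the indicator on the other variables serves the same purpose.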

In survey analysis, the assumption of Missing Completely At Random is only appropriate when randomization has occurred (e.g., if getting people to evaluate three randomly selected brands from a list of 15 brands).

Missing At Random (MAR)

In the case of Missing Completely At Random, the assumption was that there was no pattern. An alternative assumption, known somewhat confusingly as Missing At Random (MAR)[2], [note 1], instead assumes that we can predict the value that is missing based on the other data.

The table from above is reproduced below and we return to the problem of trying to work out the value of the fourth observation on income. A simple predictive model is that income can be predicted based on gender and age. Looking at the table, we note that our missing value is for a Female aged 30 or more, and the other two females aged 30 or more both have a High income; thus we can predict that the missing value should be High. Note that the idea of prediction does not mean we can perfectly predict the missing value. All that is required is a probabilistic relationship (i.e., that we have a better than chance probability of predicting the true value of the missing data).

ID  Gender  Age         Income
1   Male    Under 30    Low
2   Female  Under 30    Low
3   Female  30 or more  High
4   Female  30 or more  Missing Data
5   Female  30 or more  High
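The prediction described above can be sketched in pandas as follows. This is only a minimal illustration of the idea, using the most common observed income among respondents with the same gender and age; the column names and the helper most_common are assumptions made for this example, and real imputation methods are considerably more elaborate.

```python
import pandas as pd

# The table above, with None standing in for the missing income.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Female", "Female", "Female"],
    "age": ["Under 30", "Under 30", "30 or more", "30 or more", "30 or more"],
    "income": ["Low", "Low", "High", None, "High"],
})


def most_common(values):
    """Return the most frequent non-missing value, or None if there is none."""
    counts = values.dropna().value_counts()
    return counts.index[0] if len(counts) else None


# Most common observed income within each gender-by-age group,
# broadcast back to every row of that group.
group_mode = df.groupby(["gender", "age"])["income"].transform(most_common)

# Keep observed incomes; use the group prediction only where income is missing.
df["income_predicted"] = df["income"].fillna(group_mode)

predicted = df.loc[df["id"] == 4, "income_predicted"].iloc[0]
print(predicted)
```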

Missing At Random is a much safer assumption than Missing Completely At Random. Assumptions for dealing with data that is Missing At Random are discussed in How Missing Values are Addressed in Statistical Analysis.

Nonignorable Missing Data

It may be the case that we do not believe we can confidently make any conclusions about the likely value of missing data. For example, it is possible that the person with missing data has no data because they are unemployed and were not asked the question. Or, perhaps people with very low incomes and very high incomes are shy and tend to refuse to answer. Or there could be some other reason and we just do not know. This is known as Nonignorable Missing Data.

As the name suggests, we should not ignore Nonignorable Missing Data. If Nonignorable Missing Data is interpreted as being Missing At Random or Missing Completely At Random the result can be grossly misleading analyses.

If the missing data is Nonignorable then the actual computation of averages and percentages becomes almost meaningless. Consider the following study looking at homelessness.[3] Data was obtained from 31 women, of whom 14 were located six months later. Of these, 3 had exited from homelessness, so the estimated proportion to have exited homelessness is 3/14 = 21%. As there is no data for the 17 women who could not be contacted (i.e., 31 – 14), it is possible that none, some or all of these 17 may have exited from homelessness. This means that the proportion to have exited from homelessness in the sample could be anywhere between 3/31 = 10% and 20/31 = 65%, and thus reporting the 21% as being the correct result is clearly meaningless. Note that in this example the missing data is Nonignorable and treating it as Missing At Random would also be inappropriate, as the inability to contact the women is likely to be causally related to whether or not they have exited from homelessness. Thus, strategies designed for data which is Missing At Random, such as imputation, will not work.
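The arithmetic in the homelessness example can be reproduced with a few lines of Python. The function name proportion_bounds is made up for this sketch; it simply computes the proportion among the located cases and the two extremes obtained by assuming that none, or all, of the unlocated cases had the outcome.

```python
def proportion_bounds(n_total, n_observed, n_positive):
    """Worst-case bounds on a proportion when some outcomes are unobserved."""
    n_missing = n_total - n_observed
    observed_rate = n_positive / n_observed      # located cases only
    lower = n_positive / n_total                 # no missing case has the outcome
    upper = (n_positive + n_missing) / n_total   # every missing case has the outcome
    return observed_rate, lower, upper


observed_rate, lower, upper = proportion_bounds(
    n_total=31, n_observed=14, n_positive=3
)
print(f"observed {observed_rate:.0%}, bounds {lower:.0%} to {upper:.0%}")
```

The width of the interval between the two bounds is driven entirely by the share of cases with missing data, which is why no amount of computation on the 14 located women can narrow it.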

When missing values are determined to be Nonignorable it is often (but not always) impossible to conduct valid analyses. See How Missing Values are Addressed in Statistical Analysis for a discussion of when and how Nonignorable missing data can be addressed in survey analysis.

How missing values are represented in data

Using special missing value codes

SPSS, for example, refers to such values as SYSMIS and presents them as a . while Q shows them as NaN (which stands for Not a Number).

As numbers

Where programs do not have special missing value codes it is common to assign a value. For example, some researchers will assign a value of 99, -9 or -1 to indicate data is missing.

As blanks

For example, typically in Excel missing values are represented as blank cells.
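Before analysis, numeric codes and blanks usually need to be converted to a proper missing value, otherwise a code such as 99 is treated as a real answer. The following is a minimal pandas sketch; the data, the column names and the choice of -9 and 99 as missing-value codes are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw export: -9 and 99 are assumed missing-value codes,
# and an empty string stands in for a blank cell.
raw = pd.DataFrame({
    "age": [34, 99, 51, -9, 28],
    "region": ["North", "", "South", "East", ""],
})

clean = raw.copy()

# Replace the numeric codes with NaN (the column becomes floating point).
clean["age"] = raw["age"].mask(raw["age"].isin([-9, 99]))

# Replace blank strings with NaN.
clean["region"] = raw["region"].mask(raw["region"] == "")

print(clean.isna().sum())
print(raw["age"].mean(), clean["age"].mean())
```

Note that a recoding rule like this would also discard a genuine age of 99, which is one reason to prefer codes that cannot occur as real values.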

How Missing Values are Addressed in Statistical Analysis

By default, most statistical analysis programs make the convenient-but-rarely-plausible assumption that data is Missing Completely At Random. It is routinely made because it is the simplest assumption. In many areas of statistics, assumptions can be broken without dramatic consequences. The treatment of missing data is not such an area and incorrectly assuming data is Missing Completely At Random can lead to massively misleading results (e.g., in the case of regression, it can cause the conclusions of a model to be reversed).

When we have missing values in data, we need to go through the following process:

  1. Try to fix the data (e.g., re-contact respondents and get their answers). Where possible, it is better to work out the correct value of the missing data. Often categories are missing because they are inapplicable: somebody who is listed as a home maker and has income listed as Missing Data probably has no income. Often common sense tells us what the true answers must be. If a respondent has indicated that they never purchase ice cream, they may not have been asked about their frequency of buying Magnums, and we will be safe in replacing the missing value with a value of No. Similarly, if someone has indicated that they Don't Know whether it is important to have a king sized bed in a hotel room, we can be reasonably confident in assuming that it cannot be of great importance to the respondent. And, if someone cannot remember the last time they went to the cinema, we can be reasonably confident it was not in the last week. Where common sense is not enough, we need to look for clues in answers to other questions in order to replace the missing value with a meaningful response. It is not unknown for non-commercial research institutes to have junior researchers and students read through questionnaires to determine what the likely response may have been. This practice may seem suspect, but it is probably less dangerous than ignoring the problem.
  2. Determine whether the missing values are best characterized as being Missing Completely At Random, Missing At Random or Nonignorable.
  3. (Optionally) Data imputation, which involves replacing the missing values with predictions for their likely values. Imputation is always something of a last resort and this step should only be conducted if the next step cannot be conducted appropriately. Most automated imputation methods implicitly assume that the data is Missing At Random.
  4. (Optionally) Weighting, whereby the data is weighted to correct for the missing value pattern. Theoretically this is equivalent to imputation but in practice it is a different process.
  5. Using statistical methods that make appropriate assumptions regarding the type of missing data. Where the statistical methods available make assumptions that are known to be incorrect it is sometimes advisable to use imputation. However, it is always theoretically preferable to use statistical methods which make appropriate assumptions, as inevitably the process of imputation is very inaccurate and these inaccuracies infect any statistical methods.

The rest of this section reviews the most common types of analyses that are conducted in market research and how they can be implemented depending upon the type of missing data.

Averages and percentages

When averages and percentages are computed in standard statistical software the missing values are excluded from the analysis, and this implicitly involves the assumption that the data is Missing Completely At Random. If this assumption is incorrect but the data is Missing At Random, imputation is generally the best solution.

When the missing values are Nonignorable there is little that can be done to compute meaningful averages and percentages.
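To see how the default exclusion of missing values can mislead, consider the made-up data below, in which one group has higher incomes and is also more likely to have missing incomes. The sketch compares the default average with one computed after a very simple imputation (filling each gap with the mean of the respondent's own group). The data and column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: group B has higher incomes and more missing values,
# so missingness depends on group (Missing At Random given group).
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "income": [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, None, None],
})

# Default behaviour: missing values are silently dropped.
complete_case_mean = df["income"].mean()

# Simple imputation: fill each gap with the mean of the respondent's group.
group_means = df.groupby("group")["income"].transform("mean")
imputed_mean = df["income"].fillna(group_means).mean()

print(complete_case_mean, imputed_mean)
```

Because group B is under-represented among the observed incomes, the default average is pulled towards group A. The imputed average gives each group its proper share of the sample, but it is only as good as the assumption that the missing incomes in a group resemble the observed ones.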

Correlations

Correlations implicitly assume that the data is either Missing Completely At Random or Missing At Random. It is generally not appropriate to compute correlations with data that is imputed. This is because:

  • Good imputations use the observed correlations in the data to infer the values of the missing values, and thus using imputed data to compute correlations involves circular logic.
  • Most imputations are not very good and the correlations computed using the imputed values will be biased, whereas without the imputation they may not be biased at all, even when the data is Missing At Random.

A simple example helps in understanding this problem. Imagine that the true values for 10 respondents are as follows. These variables clearly have a perfect correlation of 1.

x y
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10

Now consider a situation where the y variable is missing for respondents who have x values of 6 or more, which is an example of data that is Missing At Random. Using only the data that is available, we still observe a perfect correlation and thus our analysis is not ruined by the data being Missing At Random.

x y
1 1
2 2
3 3
4 4
5 5
6 Missing Data
7 Missing Data
8 Missing Data
9 Missing Data
10 Missing Data

The next table shows the results computed using the SPSS Missing Value Analysis module, using the EM algorithm. Note that SPSS has done a pretty good job (and, if we had played around with the options in SPSS, we could have got it to do a better job). However, the correlation is now estimated as 0.994. At first glance that may seem almost the same as the correct value of 1, but the missing data pattern was a really obvious one and still the algorithm has gotten it wrong, and the consequence is that we have underestimated the true relationship. With weaker correlations and more variables the problem becomes much greater, and it is thus, in general, best not to use imputed values when computing correlations. If you read the Imputation page you will see another example of correlations, where the imputation causes the correlation to be exaggerated.

x y
1 1
2 2
3 3
4 4
5 5
6 5.50
7 6.35
8 7.20
9 8.05
10 8.90
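The comparison can be checked directly with NumPy by computing the correlation once from the respondents with observed y values and once from the filled-in values in the table above. This is only a sketch of the comparison; the figure obtained from the filled-in values in this way need not equal the estimate reported by SPSS.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y_observed = np.array([1, 2, 3, 4, 5, np.nan, np.nan, np.nan, np.nan, np.nan])
y_imputed = np.array([1, 2, 3, 4, 5, 5.50, 6.35, 7.20, 8.05, 8.90])

# Correlation using only respondents with an observed y.
seen = ~np.isnan(y_observed)
r_complete = np.corrcoef(x[seen], y_observed[seen])[0, 1]

# Correlation after substituting the imputed values from the table.
r_imputed = np.corrcoef(x, y_imputed)[0, 1]

print(round(r_complete, 3), round(r_imputed, 3))
```

The correlation from the observed cases is 1, while the one based on the imputed values falls below 1, which is the attenuation described above.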

When the missing values are Nonignorable there is little that can be done to compute meaningful correlations.

Principal Components Analysis

Different statistical programs make different assumptions about missing values when conducting principal components analysis. To understand the differences between these implementations it is important to understand that principal components analysis is computed from the correlation matrix (i.e., the correlations between each of the pairs of variables).[note 2]

SPSS by default has a setting of Exclude cases pairwise, which means that it computes the correlation for each pair of variables using all respondents who have data on both variables in that pair. This involves an implicit assumption that the data is Missing Completely At Random.

An alternative approach is to compute correlations using only the respondents who have no missing values on any of the variables. This is the default in R (where it is referred to as na.exclude) and is the only option in Q. This approach to missing data is consistent with the assumption that the data is Missing Completely At Random and sometimes with the assumption that it is Missing At Random.[note 3] SPSS can also be set to use this approach (Options : Missing Values : Exclude cases listwise). In terms of its assumptions about the nature of the missing data, this approach is generally preferable to pairwise deletion. However, with large amounts of missing values it is often impossible to use this method.
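The difference between the two treatments can be illustrated in pandas, whose DataFrame.corr method excludes missing values pair by pair, while calling dropna first gives the listwise version. The data below is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on three variables with scattered missing values.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, np.nan, 5.0],
    "c": [1.0, np.nan, 2.0, 5.0, 4.0, 6.0],
})

# Pairwise deletion: each correlation uses every row where both
# variables in the pair are present.
pairwise = df.corr()

# Listwise deletion: drop any row with a missing value, then correlate.
complete = df.dropna()
listwise = complete.corr()

print(len(df), len(complete))
print(pairwise.round(3))
print(listwise.round(3))
```

With pairwise deletion each cell of the correlation matrix can be based on a different set of respondents, so the matrix handed to the principal components analysis does not describe any single group of people. Listwise deletion avoids this at the cost of a smaller sample.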

As principal components analysis is based on correlations, and correlations are typically invalid when data imputation is involved, imputation is also not typically appropriate prior to principal components analysis. Various versions of principal components analysis have been developed which can accommodate missing values by making either Missing At Random or Missing Completely At Random assumptions, but they are not available as standard options in commonly used statistical software.

Cluster Analysis and Latent Class Analysis

See Missing Values in Cluster Analysis and Latent Class Analysis.

Regression

By default, most regression models exclude all respondents for which there is any missing data. This is consistent with Missing Completely At Random and can be consistent with Missing At Random as well.[note 4] To appreciate how it is consistent with Missing At Random, review the earlier discussion of correlation and consider the regression model which predicts a straight line through the points (i.e., you get the same correct results if using the data which has missing values, even though they are Missing At Random).
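The point can be verified with the x and y values from the correlation example. The sketch below drops the respondents with a missing y and fits a straight line by least squares with NumPy.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = np.array([1, 2, 3, 4, 5, np.nan, np.nan, np.nan, np.nan, np.nan])

# Listwise deletion: keep only respondents with both x and y observed.
keep = ~np.isnan(x) & ~np.isnan(y)

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x[keep], y[keep], deg=1)

print(round(slope, 6), round(intercept, 6))
```

The fitted slope is 1 and the intercept is 0 (up to rounding error), the same line that would be obtained from the full data, even though the respondents with missing y values are exactly those with the largest x values.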

For the same reasons as discussed with correlation, regression using imputed data is generally a bad idea. It is difficult to envisage a situation where it is appropriate.

As with principal components analysis, regression can be conducted using an option such as the SPSS option of Exclude cases pairwise, which essentially works by computing correlations between all the variables based on all the available data. This involves making an assumption that the data is Missing Completely At Random, which is a much stronger and less plausible assumption than the one made when all the observations that contain any missing values are deleted. It is important to appreciate that, except when randomization explains the missing values, the use of the Exclude cases pairwise option is extremely difficult to justify, and is impossible to justify in situations where the missing data is caused by skips in the questionnaire or Don't know options.

When the missing data is Nonignorable, the simplest solution for regression is the same as for latent class analysis: treat the missing values as additional categories. Where the data is numeric, this can be done by creating additional variables. For example, if you have a predictor with values of 1, 2, NaN, 1, 3, NaN, you can replace the missing values with values of 0 and include a separate dummy variable in the regression to model the missing data. That is, the one variable is replaced by two in the analysis:


x y
1 0
2 0
0 1
1 0
3 0
0 1
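The recoding shown in the table can be produced with pandas as follows. The column name x_missing for the dummy variable is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

predictor = pd.Series([1, 2, np.nan, 1, 3, np.nan], name="x")

recoded = pd.DataFrame({
    # Original values, with missing entries set to 0.
    "x": predictor.fillna(0),
    # 1 where the original value was missing, 0 otherwise.
    "x_missing": predictor.isna().astype(int),
})

print(recoded)
```

Both columns are then entered into the regression as predictors, so that the coefficient of the dummy variable captures how respondents with missing data differ from the rest.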

Notes

  1. The logic of this term is that the data is considered to be random conditional upon the values of the other variables, whereas with MCAR the assumption is that it is unconditionally random.
  2. Or, more accurately, PCA can be computed from a correlation matrix. There are other algorithms.
  3. It is not clear whether this is always the case or not.
  4. It is not clear whether this is always the case or not.

References

  1. Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. Brisbane, John Wiley & Sons.
  2. Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. Brisbane, John Wiley & Sons.
  3. Manski, C. F. (1995). Identification Problems in the Social Sciences. Harvard University Press.