The Analysis of Constant-Sum Data

From Displayr

Simple analyses of constant-sum data involve tabulating responses and presenting them either as quantities (e.g., "On average, 3.2 tokens were allocated to Coke") or as proportions (e.g., "On average, 32% of tokens were allocated to Coke"), commonly with comparisons between sub-groups.
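As a minimal sketch of such a tabulation, the following computes the average number of tokens and the average share of tokens per option. The respondents, brands, and token counts are hypothetical and are not taken from any real study.

```python
# Hypothetical data: each respondent allocates 10 tokens across three brands.
responses = [
    {"Coke": 5, "Pepsi": 3, "Fanta": 2},
    {"Coke": 2, "Pepsi": 6, "Fanta": 2},
    {"Coke": 4, "Pepsi": 4, "Fanta": 2},
    {"Coke": 1, "Pepsi": 0, "Fanta": 9},
]


def mean_allocation(rows):
    """Average number of tokens allocated to each option."""
    options = list(rows[0].keys())
    return {opt: sum(r[opt] for r in rows) / len(rows) for opt in options}


def mean_share(rows):
    """Average proportion of each respondent's tokens allocated to each option."""
    options = list(rows[0].keys())
    return {
        opt: sum(r[opt] / sum(r.values()) for r in rows) / len(rows)
        for opt in options
    }


means = mean_allocation(responses)   # quantities, in tokens
shares = mean_share(responses)       # proportions, between 0 and 1

for option in means:
    print(f"{option}: {means[option]:.2f} tokens on average, "
          f"{shares[option]:.1%} of tokens on average")
```

Sub-group comparisons follow the same pattern: filter the list of respondents first, then call the same two functions on each subset.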

The challenge arises when constant-sum data needs to be used in multivariate analyses. There is no standard approach to doing this; each of the different approaches suffers from sizable problems.

The problem with constant-sum data (and the reason that there is no standard approach)

The fundamental problem with constant-sum data in multivariate techniques is that the data is often ambiguous. Where the responses relate to the frequency of current or potential behavior, nothing is known about the situation in which each behavior occurs, which makes the data difficult to interpret (i.e., relevant data is missing; this is discussed in more detail below in the section on stacking). Where the data is intended to measure preference, the measurement properties are unclear. Although a constant-sum question has the appearance of collecting Ratio Scale data, this seems very unlikely to be true (and, even if it is true, the data is discrete rather than continuous, so all the standard models for analyzing continuous data are inappropriate).

Alternative approaches

Treating the data as numeric

The simplest approach is to consider the data to be numeric. For example, if there are 10 categories, then each category is represented as a separate Numeric Variable. In practice, this approach is often problematic, as:

  1. The resulting data is unambiguously not Normally distributed (e.g., the variables are all non-negative Count Variables), and this causes problems with statistical methods that assume the data is Continuous.
  2. The variables are Linearly Dependent (because they always add up to the same total), which makes it impossible to apply many standard multivariate analysis techniques, such as Regression and Principal Components Analysis. A solution to this is to omit one of the variables, but this solution is often unsatisfactory, as the choice of which variable to omit will change the results of the statistical analysis.
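The linear dependency in the second point can be illustrated with a short sketch. Because every respondent's allocations add up to the same total, each row of the covariance matrix sums to zero, so the matrix is singular and cannot be inverted. The data below is hypothetical.

```python
# Hypothetical data: one row per respondent, one column per option,
# and every row sums to 10 tokens.
data = [
    [5, 3, 2],
    [2, 6, 2],
    [4, 4, 2],
    [1, 0, 9],
    [3, 3, 4],
]


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n = len(rows)
    k = len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [
        [
            sum((r[i] - col_means[i]) * (r[j] - col_means[j]) for r in rows) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


cov = covariance_matrix(data)

# The sum of a row is the covariance of one variable with the row total.
# The row total is constant, so this covariance is zero (up to rounding).
row_sums = [sum(row) for row in cov]
print(row_sums)

# Any one column can be rebuilt exactly from the others.
rebuilt_last = [10 - r[0] - r[1] for r in data]
print(rebuilt_last == [r[2] for r in data])
```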

Transforming the data

Various transformations have been proposed. The simplest is the Logit Transformation. Although this partially solves the problem of assuming that the data is Normal, it does not address the problem of linear dependency.
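A minimal sketch of a logit transformation applied to one respondent's allocation is shown below. The function name and the `eps` continuity adjustment are assumptions introduced for illustration: some adjustment is needed because the logit is undefined for proportions of exactly 0 or 1, but the value 0.5 is arbitrary and is not prescribed by the text above.

```python
import math


def logit_shares(allocation, eps=0.5):
    """Convert token counts to proportions and apply logit(p) = log(p / (1 - p)).

    `eps` is a hypothetical continuity adjustment added to every count so that
    options given 0 tokens (or all of the tokens) still yield finite values.
    """
    k = len(allocation)
    total = sum(allocation.values()) + eps * k
    transformed = {}
    for option, tokens in allocation.items():
        p = (tokens + eps) / total
        transformed[option] = math.log(p / (1 - p))
    return transformed


transformed = logit_shares({"Coke": 5, "Pepsi": 3, "Fanta": 2})
edge_case = logit_shares({"Coke": 10, "Pepsi": 0})

for option, value in transformed.items():
    print(f"{option}: {value:.3f}")
```

The transformed values are unbounded in both directions, which is why the transformation helps with the Normality assumption; the choice of adjustment for zeros will, however, affect the results.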

Stacking the data and treating the responses as weights

In Choice Modeling studies it is commonplace to Stack the data, use the responses as weights, and estimate a model with a single Categorical Variable (e.g., a Multinomial Logit model), with the alternatives represented as a Predictor Variable.
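The stacking step can be sketched as follows. This covers only the reshaping of the data, not the estimation of the model; the respondent identifiers, field names, and the decision to drop zero-weight rows are assumptions made for illustration.

```python
def stack(responses):
    """Reshape one record per respondent into one record per
    respondent-alternative pair, with the token count as the weight."""
    stacked = []
    for respondent_id, allocation in responses.items():
        for alternative, tokens in allocation.items():
            if tokens > 0:  # assumption: zero-weight rows are dropped
                stacked.append(
                    {
                        "respondent": respondent_id,
                        "choice": alternative,
                        "weight": tokens,
                    }
                )
    return stacked


# Hypothetical constant-sum responses (4 tokens per respondent).
responses = {
    "r1": {"Burger King": 3, "McDonald's": 1},
    "r2": {"Burger King": 0, "McDonald's": 4},
}

stacked = stack(responses)
for row in stacked:
    print(row)
```

The resulting rows have a single categorical column (`choice`) and a `weight` column, which is the layout that weighted categorical models in standard software generally expect.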

The key benefit of this approach is that it permits constant-sum data to be used in standard software. However, such analyses are problematic in that they assume that the differences in a respondent's allocations reflect randomness. That is, if a respondent has allocated 3 tokens to Burger King and 1 token to McDonald's, the assumption is that at any given moment, 3 out of 4 times the respondent will choose Burger King and 1 out of 4 times they will choose McDonald's. It seems highly unlikely that such an assumption is ever true. In particular:

  • Often constant-sum questions capture differences between consumers in terms of what they want in different situations. For example, the respondent may have chosen McDonald's 1 out of 4 times because they know that 1 out of 4 times they are eating in a town that has no Burger King.
  • Respondents may interpret constant-sum scales as measuring degree-of-preference. For example, 3 tokens for Burger King may mean that Burger King is 3 times as good as McDonald's (1 token), or, that it is simply much better (i.e., an ordinal interpretation).
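
The randomness assumption can be made concrete with a short derivation. As a simplifying assumption for illustration, consider a single respondent and a model with no predictors other than the alternatives themselves. With weights $w_j$ (the tokens given to alternative $j$) and choice probabilities $p_j$, the weighted log-likelihood and its maximizer are:

```latex
\ell(p) = \sum_{j} w_j \log p_j
\quad \text{subject to} \quad \sum_{j} p_j = 1,
\qquad
\hat{p}_j = \frac{w_j}{\sum_{k} w_k}

% Burger King: 3 tokens, McDonald's: 1 token
\hat{p}_{\text{Burger King}} = \frac{3}{3 + 1} = 0.75,
\qquad
\hat{p}_{\text{McDonald's}} = \frac{1}{3 + 1} = 0.25
```

In other words, the tokens are read as relative choice frequencies, which is precisely the interpretation that the two points above call into question.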

Where constant-sum data captures differences in situation, there is the possibility that employing the stacking approach will result in grossly incorrect predictions (e.g., a model may be estimated indicating that the respondent will be more likely to switch to McDonald's with lower prices, whereas the reality is that the respondent will never switch because their choice is driven by the availability of stores). It is unclear what the magnitude of the problem will be where constant-sum scales measure degree-of-preference.