Variable Standardization

From Displayr

Variable standardization is the rescaling of variables to change their variance (or, equivalently, their range) prior to conducting cluster analysis.

The reason for standardizing is that cluster analysis implicitly gives more weight to variables with a larger variance. For example, when clustering on age, measured in years, together with ratings from an 11-point scale, the ratings would play a much smaller role in the cluster analysis, because the larger range of ages translates to a larger variance.
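The weighting effect can be illustrated with a minimal sketch. The respondents below are invented for illustration, and squared Euclidean distance is assumed as the dissimilarity measure (the measure used by k-means, for example). The function `variable_shares` is a hypothetical helper, not part of any library.

```python
from itertools import combinations
import statistics

# Hypothetical respondents: (age in years, rating on a 0-10 scale).
respondents = [(22, 9), (35, 2), (48, 8), (61, 1), (74, 10)]

ages = [age for age, _ in respondents]
ratings = [rating for _, rating in respondents]

# Population variances of the two variables.
age_var = statistics.pvariance(ages)
rating_var = statistics.pvariance(ratings)

def variable_shares(rows):
    """Share of the total squared pairwise distance contributed by each variable."""
    n_vars = len(rows[0])
    totals = [0.0] * n_vars
    for p, q in combinations(rows, 2):
        for k in range(n_vars):
            totals[k] += (p[k] - q[k]) ** 2
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

age_share, rating_share = variable_shares(respondents)
print(f"age: {age_share:.1%}, rating: {rating_share:.1%}")
# prints "age: 96.0%, rating: 4.0%"
```

In this made-up data, each variable's share of the total squared distance equals its share of the total variance, so age accounts for almost all of the dissimilarity between respondents even though the ratings vary across nearly their whole scale.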

A number of different methods of standardizing are common:

  1. Multiplying each variable by a different constant such that it has a standard deviation of 1. This method is sometimes referred to simply as standardizing.
  2. Using dimension-reduction techniques like principal components analysis (see Tandem Clustering). Note that many dimension-reduction techniques automatically create variables with a standard deviation of 1.
  3. Multiplying each variable by a different constant such that all the variables have a common range (e.g., a range of 1). For example, if one variable had a range of 1 to 7 and another had a range of 1 to 10, they could be multiplied by 1/6 and 1/9 respectively to attain a unit range.
  4. Multiplying each variable by a different constant such that all the variables have a common possible range. This is the same idea as the previous one, except that the possible range of values is used rather than the observed range (e.g., if on a 10-point scale respondents have only selected 6 or 7, then standardization uses the full span of the scale rather than the 1-point span between 6 and 7).
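Methods 1, 3 and 4 in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the population standard deviation is used (a sample standard deviation would work equally well), the 10-point scale is assumed to run from 1 to 10, and a variable with no variation is not handled. Method 2 is omitted because it depends on the dimension-reduction technique chosen.

```python
import statistics

def scale_to_unit_sd(values):
    """Method 1: multiply by 1/SD so the result has a standard deviation of 1."""
    sd = statistics.pstdev(values)  # assumption: population SD
    return [v / sd for v in values]

def scale_to_unit_observed_range(values):
    """Method 3: multiply by 1/(observed maximum - observed minimum)."""
    width = max(values) - min(values)
    return [v / width for v in values]

def scale_to_unit_possible_range(values, scale_min, scale_max):
    """Method 4: multiply by 1/(possible maximum - possible minimum)."""
    width = scale_max - scale_min
    return [v / width for v in values]

def spread(values):
    """Observed range of a list of values."""
    return max(values) - min(values)

# Hypothetical answers on a 1-10 scale where respondents only chose 6 or 7.
answers = [6, 7, 6, 7, 7, 6]

by_sd = scale_to_unit_sd(answers)
by_observed = scale_to_unit_observed_range(answers)
by_possible = scale_to_unit_possible_range(answers, 1, 10)

# Observed spread of the rescaled answers under each method.
print(spread(by_sd), spread(by_observed), round(spread(by_possible), 3))
```

With these invented answers, the first two methods stretch the 1-point gap between 6 and 7 to a spread of 2 and 1 respectively, while scaling by the possible range keeps it at one ninth of the scale.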

In segmentation studies, it is arguable that only the last of these methods is routinely appropriate, as variables that have high standard deviations are variables that people differ on, whereas variables with low standard deviations measure things that people feel the same way about. Standardizing variables consequently exaggerates small differences while reducing the impact of real points of difference between consumers, and thus works against the whole purpose of segmentation.
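A small sketch of this argument, using two invented ratings on an assumed 1 to 10 scale: one on which respondents genuinely disagree and one on which they largely agree. The helper names are hypothetical, and variance share is used here as a rough proxy for how much each variable influences the clustering.

```python
import statistics

# Hypothetical 1-10 ratings from six respondents.
polarizing = [1, 10, 2, 9, 1, 10]  # respondents genuinely disagree
consensus = [6, 7, 6, 7, 7, 6]     # respondents feel much the same

def variance_share(first, second):
    """Share of the combined variance contributed by the first variable."""
    var_first = statistics.pvariance(first)
    var_second = statistics.pvariance(second)
    return var_first / (var_first + var_second)

def to_unit_sd(values):
    """Rescale to a standard deviation of 1 (population SD assumed)."""
    sd = statistics.pstdev(values)
    return [v / sd for v in values]

def to_unit_possible_range(values, scale_min=1, scale_max=10):
    """Rescale by the possible range of the (assumed) 1-10 scale."""
    width = scale_max - scale_min
    return [v / width for v in values]

raw_share = variance_share(polarizing, consensus)
sd_share = variance_share(to_unit_sd(polarizing), to_unit_sd(consensus))
range_share = variance_share(
    to_unit_possible_range(polarizing), to_unit_possible_range(consensus)
)

print(f"raw: {raw_share:.1%}, unit SD: {sd_share:.1%}, possible range: {range_share:.1%}")
```

In this example, scaling to a standard deviation of 1 gives the consensus variable the same weight as the polarizing one, whereas scaling by the possible range leaves their relative weights unchanged because both variables share the same scale.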

This criticism of variable standardization is specific to survey research. In many other areas of statistical analysis, variable standardization of some form is appropriate. For example, when clustering measurements of the weight and height of different animals, the variables are clearly measured on different scales, and some acknowledgement of this should occur when analyzing the data. Even in these applications, the practice of scaling variables to have a standard deviation of 1 is likely inferior to scaling to a unit range.[1]

References

  1. Milligan, G. W. (1996). Clustering Validation: Results and Implications for Applied Analyses. In P. Arabie, L. J. Hubert and G. De Soete (Eds.), Clustering and Classification (pp. 341-375). River Edge, New Jersey: World Scientific Publishers.