Preparing Data for Cluster Analysis
The scale of a variable determines its ‘importance’ in the cluster analysis. The larger the scale of a variable (that is, the larger its standard deviation or, roughly equivalently, the wider the range between its smallest and largest values), the more the clusters will differ on this variable and the less they will differ on all the other variables in the analysis.
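To see why scale matters, consider the Euclidean distance that most cluster analysis methods rely on. The sketch below uses two made-up respondents and a hypothetical helper, `squared_shares`, to show how much of the squared distance each variable contributes; the variables and values are assumptions chosen purely for illustration.

```python
import math

def squared_shares(p, q):
    """Share of the squared Euclidean distance contributed by each variable."""
    squared = [(a - b) ** 2 for a, b in zip(p, q)]
    total = sum(squared)
    return [s / total for s in squared]

# Hypothetical respondents: (age in years, satisfaction on a 1-5 scale).
respondent_a = (25, 1)
respondent_b = (45, 5)

distance = math.dist(respondent_a, respondent_b)
shares = squared_shares(respondent_a, respondent_b)

# Age accounts for roughly 96% of the squared distance, even though the
# two respondents sit at opposite ends of the satisfaction scale.
print(distance, shares)
```

Because age is measured on a much wider scale, it dominates the distance, and any clusters built from these distances would mostly reflect age.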
The most common recommendations for rescaling involve changing the range or standard deviation of each of the input variables. Most academic studies recommend the use of a unit range, which is a fancy way of saying that each variable should have the same minimum and the same maximum. Generally, most market research data used in cluster analysis (e.g., attitude scales) are automatically set up this way. Practitioners often favour scaling the variables so that they all have a standard deviation of 1 (this is usually called standardizing, and is sometimes referred to as normalizing).
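Both rescalings can be sketched in a few lines. The function names `to_unit_range` and `to_unit_sd` and the example data are hypothetical. The sketch assumes no variable is constant (otherwise the divisor would be zero) and uses the sample standard deviation. `to_unit_sd` also subtracts the mean, which is conventional and does not change the distances between respondents.

```python
from statistics import mean, stdev

def to_unit_range(values):
    """Rescale so the smallest value becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def to_unit_sd(values):
    """Rescale to a mean of 0 and a (sample) standard deviation of 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical input variables measured on very different scales.
income = [20_000, 35_000, 50_000, 120_000]
rating = [1, 2, 4, 5]

income_range, rating_range = to_unit_range(income), to_unit_range(rating)
income_sd, rating_sd = to_unit_sd(income), to_unit_sd(rating)
```

After either transformation the two variables are on comparable scales, so neither dominates the distance calculation simply because of its units.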
Principal components analysis and correspondence analysis can be used to reduce the number of dimensions prior to running cluster analysis. Multiple correspondence analysis has the added benefit that it can be used to turn categorical variables into numeric variables (which are thus consistent with the assumptions of cluster analysis).
Although this approach, which is known as tandem clustering, is popular in industry, it is by no means guaranteed to result in a superior cluster analysis, as:
- The dimension reduction creates variables that reflect the strongest patterns in the data. In doing this, some of the variance is removed from the data. This variance may be important and a better segmentation may result if it is left in the data prior to the cluster analysis.
- The dimension reduction increases the focus of the cluster analysis on variables that are not highly correlated with the other variables. That is, it down-weights the strongest pattern in the data.
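The first point can be illustrated with a toy case of just two correlated variables, where the variances of the two principal components have a closed form (the eigenvalues of the 2×2 covariance matrix). The data and the helper names `covariance` and `component_variances` are assumptions for illustration only; a real analysis would involve many more variables and a proper PCA routine.

```python
import math
from statistics import mean

def covariance(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def component_variances(x, y):
    """Variances of the two principal components of (x, y), largest first."""
    a, b, c = covariance(x, x), covariance(x, y), covariance(y, y)
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mid + half_gap, mid - half_gap

# Hypothetical, highly correlated variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.0]

first, second = component_variances(x, y)
share_kept = first / (first + second)     # variance retained by component 1
share_dropped = second / (first + second) # variance lost if component 2 is dropped
```

Keeping only the first component discards `share_dropped` of the variance. Whether that discarded variance matters for the segmentation cannot be known from the dimension reduction alone.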
Respondent scaling is used when respondents are believed to differ in how they use the response scale. For example, some respondents commonly give systematically higher answers than others. To correct for this (and it is not always the case that one should), the usual practice is to rescale the data so that each respondent’s answers have a mean of 0 and a standard deviation of 1.
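A minimal sketch of respondent scaling follows. It assumes each respondent's answers are centred on that respondent's own mean and then divided by their own sample standard deviation. The centring step is what removes a systematically high or low level. The function name `scale_respondent` and the answers are hypothetical, and the sketch assumes no respondent gives the same answer to every question.

```python
from statistics import mean, stdev

def scale_respondent(answers):
    """Rescale one respondent's answers to mean 0 and standard deviation 1."""
    m, s = mean(answers), stdev(answers)
    return [(a - m) / s for a in answers]

# Two hypothetical respondents with the same pattern of answers,
# one of whom rates everything three points higher.
low_scorer = [1, 2, 2, 1, 2]
high_scorer = [4, 5, 5, 4, 5]

low_scaled = scale_respondent(low_scorer)
high_scaled = scale_respondent(high_scorer)
```

After scaling, the two respondents have identical data, so a cluster analysis would treat them as the same. This is also why the correction is not always appropriate: if the difference in level is genuine rather than a response bias, it has been removed.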