# Data Preparation for Cluster-Based Segmentation

When using cluster-based segmentation to form segments, there are several forms of data preparation that can, in some situations, assist in forming the segments.

## Variable Transformations

There are two basic types of transformation that are relevant:

- Changing the range of a variable, which is known as *variable standardization* and is discussed in the next section; and
- Changing the shape of the distribution, which is discussed in this section.

Transforming variables to modify the shape of the distribution prior to cluster analysis is motivated by the same basic concerns as in other areas of statistics: extreme departures from normality can cause the resulting analyses to be misleading. The key is to identify and remove *long tails* from variables (i.e., small numbers of respondents with values substantially higher or lower than the average). Techniques for achieving this include:

- Taking the log of the variables.
- Taking the square root of the variables.
- Winsorizing the variables.
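
The three techniques listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended implementation: the function names and the `spend` data are hypothetical, the log is taken as log(1 + x) so that zeros are handled, and winsorizing is done by clamping a fixed proportion of values in each tail.

```python
import math


def log_transform(values):
    """Compress a long right tail with log(1 + x); assumes values >= 0."""
    return [math.log1p(v) for v in values]


def sqrt_transform(values):
    """A milder tail compression than the log; assumes values >= 0."""
    return [math.sqrt(v) for v in values]


def winsorize(values, proportion=0.1):
    """Clamp the lowest and highest `proportion` of values to the
    nearest retained value (assumes proportion < 0.5)."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values clamped in each tail
    lower, upper = ordered[k], ordered[-k - 1]
    return [min(max(v, lower), upper) for v in values]


# Hypothetical spend data with one extreme respondent (500).
spend = [10, 12, 15, 11, 14, 13, 12, 16, 11, 500]
winsorized_spend = winsorize(spend)
logged_spend = log_transform(spend)
```

Winsorizing leaves the bulk of the data untouched and only pulls in the tails, whereas the log and square root change every value; which is preferable depends on how the variable will be interpreted afterwards.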

## Variable Standardization

This involves multiplying each variable by a constant, such that its range (and variance) changes. The logic of this is only applicable to cluster analysis and self-organizing maps, as these algorithms implicitly weight variables according to their range (and variance). See Variable Standardization for more information.
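
As a rough sketch of what multiplying by a constant looks like in practice, the following Python rescales each variable to either a unit standard deviation or a unit range. The function names and the example data are hypothetical, and the use of the population standard deviation is an assumption; a constant variable would cause a division by zero and would need separate handling.

```python
import statistics


def scale_to_unit_sd(column):
    """Multiply by 1/sd so the rescaled variable has a standard deviation of 1."""
    factor = 1.0 / statistics.pstdev(column)  # population sd: an assumption
    return [v * factor for v in column]


def scale_to_unit_range(column):
    """Multiply by 1/range so the rescaled variable has a range of 1."""
    factor = 1.0 / (max(column) - min(column))
    return [v * factor for v in column]


# Hypothetical variables measured on very different scales.
income = [20000, 40000, 60000, 80000]
satisfaction = [1, 2, 3, 4]

income_scaled = scale_to_unit_range(income)
satisfaction_scaled = scale_to_unit_range(satisfaction)
```

Before rescaling, distances between respondents would be dominated by `income`; afterwards both variables span the same range and so contribute comparably.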

## Variable Selection and Weighting

If attempting to discover *natural clusters* - that is, clusters which are broadly homogeneous with large gaps between them - it is beneficial to weight variables, as inevitably some variables will contain more information about the clusters than others. The most extreme form of variable weighting is to exclude certain variables. Various algorithms have been developed for automatically weighting data, but none are widely used in market segmentation (presumably in part because in market research the interest is generally in finding segments with useful strategic implications rather than finding "natural" but uninteresting segments).

In addition to searching for natural clusters, the weighting of variables is often a useful way of avoiding the problem of a segmentation identifying unhelpful segments (e.g., if using a combination of behavioural and attitudinal data, sometimes the segments are formed entirely using the attitudinal data, with the behavioural data being ignored). Increasing the relative weighting of the behavioural data can increase the extent to which it differs between the segments.

There are a number of alternative approaches to weighting:

- Including variables in the analysis multiple times.
- Changing the range of a variable. The greater a variable's range, the greater its potential impact in the segmentation. (Note that some latent class algorithms will automatically model differences in the variance of the variables, which will cause this form of weighting to have no impact.)
- Tandem Clustering.
- Modifying the algorithm used to form the segments to explicitly take into account the desired weight of different variables. This approach is available in Q.
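
The first two approaches in the list are closely related. Assuming the segmentation algorithm uses squared Euclidean distance (as *k*-means does), including a variable *w* times has the same effect on distances as multiplying it by the square root of *w*. The sketch below illustrates this with hypothetical helper names and made-up data; it does not apply to algorithms that use other distance measures or that model variances directly.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two respondents."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def duplicate_variable(row, index, times):
    """Include the variable at `index` a total of `times` times."""
    return list(row) + [row[index]] * (times - 1)


def rescale_variable(row, index, weight):
    """Multiply the variable at `index` by sqrt(weight)."""
    scaled = list(row)
    scaled[index] = row[index] * math.sqrt(weight)
    return scaled


# Two hypothetical respondents measured on two variables.
a = [1.0, 5.0]
b = [4.0, 1.0]

# Weight the first variable three times as heavily, in two different ways.
d_duplicated = squared_distance(duplicate_variable(a, 0, 3), duplicate_variable(b, 0, 3))
d_rescaled = squared_distance(rescale_variable(a, 0, 3), rescale_variable(b, 0, 3))
```

Both distances agree (up to floating-point rounding), which is why changing a variable's range acts as a form of weighting. Rescaling also allows non-integer weights, which duplication does not.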

## Within-Respondent Scaling

This involves scaling the data separately for each respondent. Most commonly, each respondent's data is rescaled so that it has a mean of 0 and a standard deviation of 1. The logic of this is to remove *response bias* from the data.
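
A minimal sketch of this row-wise rescaling follows. The function name and ratings are hypothetical, the population standard deviation is an assumption, and returning zeros for a respondent who gives the same answer to every question is one possible convention rather than an established rule.

```python
import statistics


def scale_within_respondent(ratings):
    """Rescale one respondent's ratings to mean 0 and standard deviation 1."""
    mean = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)  # population sd: an assumption
    if sd == 0:
        # Assumed convention: no variation means no information.
        return [0.0 for _ in ratings]
    return [(r - mean) / sd for r in ratings]


# Two hypothetical respondents with the same pattern of preferences,
# but one uses the top of the scale and the other the bottom.
high_rater = [8, 9, 10, 9]
low_rater = [2, 3, 4, 3]

high_scaled = scale_within_respondent(high_rater)
low_scaled = scale_within_respondent(low_rater)
```

After scaling, the two respondents have the same profile, so a cluster analysis would group them by their relative preferences rather than by how they use the scale.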

## Outlier Removal

This involves removing extreme observations that are considered likely to distort any segmentation.

A variety of methods have been developed, including using hierarchical cluster analysis to identify “singleton” clusters (i.e., clusters that contain only a single observation), variants of *k*-means which automatically remove missing data, and *robust* versions of cluster analysis (e.g., *k*-medoids). It is not clear that any of these approaches are helpful with cluster analysis and self-organizing maps, as both techniques are extremely good at automatically identifying outliers (i.e., where outliers exist they are quickly identified in the form of small segments, which can be filtered out and the cluster analysis re-run).
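
The filter-and-re-run step can be sketched as follows. This assumes segment labels have already been produced by some clustering routine; the function name, the `min_size` threshold, and the data are all hypothetical, and what counts as a "small" segment is a judgment call for the analyst.

```python
from collections import Counter


def drop_small_segments(rows, labels, min_size=2):
    """Split observations into those in segments of at least `min_size`
    members (kept) and those in smaller segments (treated as outliers)."""
    sizes = Counter(labels)
    kept = [row for row, label in zip(rows, labels) if sizes[label] >= min_size]
    removed = [row for row, label in zip(rows, labels) if sizes[label] < min_size]
    return kept, removed


# Hypothetical data: three similar respondents and one extreme one,
# with labels as they might come back from a two-segment solution.
rows = [[1, 1], [1, 2], [2, 1], [50, 50]]
labels = [0, 0, 0, 1]

kept, removed = drop_small_segments(rows, labels)
# `kept` would then be passed back into the cluster analysis.
```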

With latent class models that involve mixtures of general linear models, such as latent class logit, outliers are less likely to be discovered automatically, as mixtures of general linear models suffer from the same sensitivity to outliers as regression. In theory, all the traditional tools for identifying outliers in regression can be applied to mixtures of general linear models. In practice, this is rarely done (presumably because the tools are not easy to implement due to the complexity of the models).