MaxDiff Analysis Case Study Using R
This case study illustrates alternative approaches to analyzing MaxDiff data using R. See MaxDiff for links to key resources and concepts relating to MaxDiff; that page should be read prior to this page.
R setup
- Paste into R Setup code at Analyzing MaxDiff Using the Rank-Ordered Logit Model With Ties Using R.
- Read the following libraries into R:
library(BiasedUrn)
library(foreign)
library(mlogit)
Reading in and setting up the data
This code reads a data file which contains the data in the Wide Layout and converts it to the Stacked Layout:
itData = read.spss("http://surveyanalysis.org/images/0/06/ItMaxDiff.sav", use.value.labels = FALSE, to.data.frame = TRUE)
# Selecting the variables containing the max-diff data
z = itData[,-1:-5]
# stacking the data (one set per row)
alternativeNames = c("Apple","Microsoft","IBM","Google","Intel","HewlettPackard","Sony","Dell","Yahoo","Nokia")
nAlternatives = length(alternativeNames)
nBlocks = ncol(z) / nAlternatives
nAltsPerSet = 5
n = nrow(z)
nObservations = n * nBlocks
itMaxDiffData = matrix(as.numeric(t(z)),ncol = nAlternatives,byrow = TRUE, dimnames = list(1:nObservations, alternativeNames))
Computing the overall counts for the whole sample
The following code computes a counts analysis for the entire sample and then ranks these, with a 1 for the highest brand:
counts = apply(itMaxDiffData, 2, mean, na.rm = TRUE)
ranks = nAlternatives + 1 - rank(counts)
cbind(Counts = counts, Ranks = ranks)
This generates the following output:
Counts Ranks Apple 0.14316239 4 Microsoft -0.01175214 5 IBM -0.30128205 10 Google 0.35790598 1 Intel -0.08760684 6 HewlettPackard -0.11538462 7 Sony 0.26495726 2 Dell -0.20512821 9 Yahoo -0.18910256 8 Nokia 0.14423077 3
This aggregate counts analysis suggests that Google is the most preferred of the brands, followed by Sony, Nokia and then Apple in fourth position.
Computing individual-level counts
The following code conducts a separate analysis for every respondent:
id = rep(1:n,rep(nBlocks,n))
individualCounts = aggregate(itMaxDiffData,list(id),mean, na.rm = TRUE)[,-1]
round(individualCounts[1:10,],1) #show at data for first 10 respondents
For example, the parameters for the first 10 responents are:
round(individualCounts[1:10,],1) #show at data for first 10 respondents Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia 1 0.3 -0.3 -0.3 0.3 0.0 -1.0 0.3 0.3 0.0 0.3 2 1.0 0.3 -0.3 0.0 -1.0 0.0 0.3 -0.3 -0.3 0.3 3 -0.7 0.0 0.0 0.0 0.0 0.3 -0.7 0.7 -0.7 1.0 4 -0.3 0.0 0.0 -0.7 0.3 -0.3 1.0 -0.7 0.3 0.3 5 1.0 -1.0 0.0 0.7 0.0 0.0 0.0 -0.3 -0.7 0.3 6 0.3 0.7 -1.0 0.0 0.3 0.0 0.3 0.0 -0.3 -0.3 7 1.0 -0.3 -0.3 0.3 0.0 -0.7 0.7 -0.3 -0.3 0.0 8 -0.3 0.0 -0.7 0.3 0.3 -0.3 0.3 -0.7 0.3 0.7 9 0.3 0.0 -1.0 0.3 -0.3 0.7 0.3 0.0 -0.7 0.3 10 1.0 -1.0 -0.7 0.0 0.0 0.0 0.7 0.0 0.3 -0.3
Computing individual-level ranks from the counts
As discussed here, computing the average of the individual counts is not valid. Consequently, to interpret the counts we need to compute the rankings:
set.seed(0) # setting the random number seed to enhance comparability
indidualCountsNoTies = individualCounts + matrix(runif(n * nAlternatives)/100000, n) #adding random numbers to break ties
ranks = nAlternatives + 1 - apply(indidualCountsNoTies,1,rank) #ranks
rankProportions = t(apply(ranks,1,table) / n * 100)
round(rankProportions,1)
The following table show the resulting ranks. The first column of numbers shows the proportion of people that have their highest count for each of the different brands. Note that Apple comes in first place by this measure, whereas with the aggregate analysis it was fourth. The explanation for this apparent contradiction appears in the right-most columns, which reveal that Apple has many more people that gave it its lowest rating than did any of Google, Nokia and Sony, and for this reason the aggregate analysis gave a misleading conclusion.
1 2 3 4 5 6 7 8 9 10 Apple 22.8 11.9 11.9 5.8 9.3 6.1 6.4 4.8 9.3 11.9 Microsoft 5.4 9.0 11.9 13.5 14.1 12.2 10.3 9.6 5.1 9.0 IBM 1.9 1.6 5.1 7.4 9.0 7.4 12.8 17.6 20.5 16.7 Google 21.8 20.5 15.7 10.3 7.1 11.5 4.8 4.2 2.9 1.3 Intel 3.5 5.8 8.3 14.1 13.8 12.8 11.2 12.2 10.3 8.0 HewlettPackard 4.2 8.7 9.3 7.4 8.7 11.5 10.9 16.0 11.2 12.2 Sony 16.0 15.7 16.0 13.8 9.0 10.6 7.7 6.4 3.2 1.6 Dell 5.8 8.0 3.5 8.3 10.3 7.4 11.5 7.4 19.6 18.3 Yahoo 1.9 6.4 7.7 6.1 8.0 13.8 14.7 14.1 12.5 14.7 Nokia 16.7 12.5 10.6 13.5 10.9 6.7 9.6 7.7 5.4 6.4
The following analysis produces the cumulative proportions:
rankCumProportions = t(apply(rankProportions,1,cumsum))
round(rankCumProportions,1)
The cumulative analysis better communicates the performance of Apple: while it (just) wins in terms of being most preferred, if we instead focus on which brands are in people's top 2 (i.e., the column headed by 2, Google is a clear winner.
1 2 3 4 5 6 7 8 9 10 Apple 22.8 34.6 46.5 52.2 61.5 67.6 74.0 78.8 88.1 100 Microsoft 5.4 14.4 26.3 39.7 53.8 66.0 76.3 85.9 91.0 100 IBM 1.9 3.5 8.7 16.0 25.0 32.4 45.2 62.8 83.3 100 Google 21.8 42.3 58.0 68.3 75.3 86.9 91.7 95.8 98.7 100 Intel 3.5 9.3 17.6 31.7 45.5 58.3 69.6 81.7 92.0 100 HewlettPackard 4.2 12.8 22.1 29.5 38.1 49.7 60.6 76.6 87.8 100 Sony 16.0 31.7 47.8 61.5 70.5 81.1 88.8 95.2 98.4 100 Dell 5.8 13.8 17.3 25.6 35.9 43.3 54.8 62.2 81.7 100 Yahoo 1.9 8.3 16.0 22.1 30.1 43.9 58.7 72.8 85.3 100 Nokia 16.7 29.2 39.7 53.2 64.1 70.8 80.4 88.1 93.6 100
The issue of which rank should be more relevant (e.g., first, top 2 or something else) is not a statistical issue. Such a choice needs to be determined by the problem being studied. For example, if wanting to predict market share, looking at most preferred may be more relevant, whereas if interested in measuring overall appeal or brand equity then some other metric, such as top 4 or the average rank, may be more relevant.
We can compute the average of the ranks as follows:
aveRank = rankProportions %*% (1:10)/100
cbind(aveRank, Rank = rank(aveRank))
Remembering that lower average ranks indicate higher appeal, we can see that these average ranks essentially tell the same story as the earlier aggregate analysis:
Rank Apple 4.737179 4 Microsoft 5.410256 5 IBM 7.211538 10 Google 3.612179 1 Intel 5.907051 6 HewlettPackard 6.185897 7 Sony 4.089744 2 Dell 6.596154 8 Yahoo 6.608974 9 Nokia 4.641026 3
Setting up the data for a logit model
Prior to running a logit model it is necessary to first set up the data to 'trick' the logit software. For this example, the following code does this:
nRows = sum(!is.na(itMaxDiffData)) * 2 longData = matrix(0, nRows,nAlternatives + 3) counter = 0 setCounter = 0 for (rr in 1:nObservations){ nAlts = 0 alternatives = NULL respondent = floor(rr/nBlocks) + 1 for (cc in 1:nAlternatives){ v = itMaxDiffData[rr,cc] if (!is.na(v)){ nAlts = nAlts + 1 alternatives[nAlts] = cc if (v == 1) best = cc if (v == -1) worst = cc } } setCounter = setCounter + 1 for (a in 1:nAlts){ counter = counter + 1 this_a = alternatives[a] if (this_a == best) longData[counter,3] = 1 else if (this_a == worst) longData[counter + nAlts,3] = 1 longData[counter, 1] = respondent longData[counter + nAlts,1] = respondent longData[counter, 2] = setCounter longData[counter + nAlts, 2] = setCounter + 1 longData[counter,3 + this_a] = 1 longData[counter + nAlts,3 + this_a] = -1 } setCounter = setCounter + 1 counter = counter + nAlts } longData[1:20,] longData = as.data.frame(longData) names(longData) = c("ID","Set","Choice",alternativeNames)
Aggregate logit model
The following code estimates an aggregate logit model:
logitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, alt.levels = paste(1:nAltsPerSet), shape = "long")
summary(logitModel)
The key output from the analysis is the following table of parameters and p-Values:
Coefficients : Estimate Std. Error t-value Pr(>|t|) Microsoft -0.302971 0.083896 -3.6113 0.0003047 *** IBM -1.200637 0.078807 -15.2351 < 2.2e-16 *** Google 0.640305 0.081811 7.8266 5.107e-15 *** Intel -0.674497 0.081935 -8.2321 2.220e-16 *** HewlettPackard -0.776266 0.081835 -9.4857 < 2.2e-16 *** Sony 0.462314 0.081620 5.6642 1.477e-08 *** Dell -0.933707 0.083080 -11.2387 < 2.2e-16 *** Yahoo -0.994617 0.079781 -12.4668 < 2.2e-16 *** Nokia 0.199250 0.080301 2.4813 0.0130915 *
Note that this analysis also tells a similar story to the earlier aggregate analysis shown above, with Google, Sony and Nokia all out-performing Apple (where Apple has a coefficient of 0, as discussed here).
Tricked Random Parameters Logit Model
The previous model is an aggregate model and can be interpreted as assuming that all people are the same. An alternative is to estimate a mixture model. One of the most commonly used mixture models in analyzing max-diff data is the random parameters logit model (this is more commonly known by the brand name given to it by Sawtooth Software, which is Hierarchical Bayes). This model assumes that differences between people can be described by a multivariate normal distribution which permits correlations between variables, and can be estimated using:
longData$ID = rep(1:n,rep(nBlocks * 2 * nAltsPerSet,n))
rpLogitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, alt.levels = paste(1:nAltsPerSet), shape = "long", id="ID", panel = TRUE, rpar = c(Microsoft = "n", IBM = "n", Google = "n", Intel = "n", HewlettPackard = "n", Sony = "n", Dell = "n", Yahoo = "n", Nokia = "n"), correlation = TRUE)
summary(rpLogitModel)
The key output from this model in this context is the following table, which shows the estimated distributions of the parameters. Here, the Median indicates the mean value estimated in the population (assuming a normal distribution). As with the previous model, Apple has a value of 0 and, relative to the averages in the table we can see that this model again has Apple in fourth place and we have similar conclusions to those obtained from the aggregate analysis.
random coefficients Min. 1st Qu. Median Mean 3rd Qu. Max. Microsoft -Inf -2.893669 -0.68160013 -0.68160013 1.5304686 Inf IBM -Inf -4.090256 -1.89205403 -1.89205403 0.3061484 Inf Google -Inf -1.381164 0.77260038 0.77260038 2.9263643 Inf Intel -Inf -2.908703 -0.98501344 -0.98501344 0.9386766 Inf HewlettPackard -Inf -3.419995 -1.19590592 -1.19590592 1.0281828 Inf Sony -Inf -1.298854 0.50529248 0.50529248 2.3094385 Inf Dell -Inf -3.796453 -1.40240323 -1.40240323 0.9916462 Inf Yahoo -Inf -3.732578 -1.57151880 -1.57151880 0.5895402 Inf Nokia -Inf -2.068556 0.07106841 0.07106841 2.2106928 Inf
It is possible to do further analysis, computing the ranks from this model (although such an analysis is not simple to do using the mlogit software). However, this is not necessary with most max-diff studies where it is possible to instead estimate a separate model for each respondent (and thus avoid making assumptions about how people differ); this is discussed in the next section.
Respondent-level logit models
The following code estimates the logit model at the respondent level and then computes the ranks (as done earlier with the counts analysis):
individualLogit = individualCounts stackedID = rep(1:n,rep(nBlocks*2 * nAltsPerSet,n)) getCoefficients = function(id){c(0,as.numeric(mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, subset = stackedID==id, data = longData, alt.levels = paste(1:nAltsPerSet), shape = "long")$coef))} nAlternatives = 10 for (i in 1:n) individualLogit[i,] = getCoefficients(i) #adding random numbers to break ties set.seed(0) # setting the random number seed to enhance comparability ranks = nAlternatives + 1 - t(apply(individualLogit + matrix(runif(n * nAlternatives)/100000, n),1,rank)) #ranks rankProportions = t(apply(t(ranks),1,table) / n * 100) round(rankProportions,1)
The rankings computed using the respondent-level logit models are broadly similar to those from the respondent-level counts analyses, but they are not identical (we might except that the worse the experimental design, the greater the difference between the two methods, as the logit model takes the design into acccount whereas the counts analysis does not).
1 2 3 4 5 6 7 8 9 10 Apple 22.1 10.3 11.2 7.4 4.5 9.9 6.7 6.7 9.0 12.2 Microsoft 5.8 9.3 12.5 15.7 15.4 11.5 7.7 7.4 6.1 8.7 IBM 1.9 1.9 4.2 9.9 6.7 10.6 10.3 18.9 18.9 16.7 Google 21.8 18.6 17.3 9.6 8.0 7.7 8.7 3.8 3.2 1.3 Intel 3.5 5.1 6.7 7.1 15.7 18.9 12.2 11.5 10.9 8.3 HewlettPackard 3.8 7.7 9.9 5.4 7.7 13.5 13.1 14.7 11.2 12.8 Sony 15.7 17.9 16.0 9.3 13.8 9.3 5.8 7.4 3.5 1.3 Dell 7.1 7.7 3.5 11.5 9.6 6.1 11.5 7.7 16.7 18.6 Yahoo 1.9 5.4 6.1 9.6 8.0 6.4 21.2 11.9 14.7 14.7 Nokia 16.3 16.0 12.5 14.4 10.6 6.1 2.9 9.9 5.8 5.4
We can compute choice probabilities from the logit parameters using the inverse logit transformation:
shares = prop.table(as.matrix(exp(individualLogit)),1)
round(shares[1:10,]*100,1)
The probabilities (out of 100) for the first 10 respondents are:
Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia [1,] 14.4 0.0 0 17.1 0.0 0.0 21.2 21.7 5.4 20.2 [2,] 100.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [3,] 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 [4,] 0.0 0.0 0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 [5,] 100.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [6,] 13.8 42.5 0 2.7 15.4 4.5 15.8 3.4 0.8 1.1 [7,] 100.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [8,] 0.0 0.0 0 11.3 12.2 0.0 20.7 0.0 9.2 46.6 [9,] 16.7 0.0 0 12.8 0.0 38.8 13.8 0.0 0.0 17.9 [10,] 100.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
And, the average probabilities and their ranks are computed using:
aveShare = apply(shares, 2, mean)
cbind(round(100 * aveShare, 1), Rank = nAlternatives + 1 - rank(aveShare))
and we again get a similar conclusion to the other respondent-level analyses, with Google and Apple being approximately equal in terms of their shares of preference:
Rank Apple 21.9 2 Microsoft 6.1 6 IBM 2.3 9 Google 22.3 1 Intel 3.8 8 HewlettPackard 3.8 7 Sony 16.4 3 Dell 6.8 5 Yahoo 1.8 10 Nokia 14.9 4
Rank-Ordered Logit Model with Ties
This model is estimated using:
rankModel = max.diff.rank.ordered.logit.with.ties(itMaxDiffData)
ranks = nAlternatives + 1 - rank(rankModel$coef)
cbind(Parameters = rankModel$coef, Ranks = ranks)
Note that although the resulting estimates are broadly the same as those for the tricked aggregate logit model they are not identical (and, they are different enough to be meaningful, with this analysis showing Microsoft to have almost equivalent appeal to Apple):
Parameters Ranks Apple 0.00000000 4 Microsoft -0.05898306 5 IBM -0.69459763 10 Google 0.68248255 1 Intel -0.28927006 6 HewlettPackard -0.42412444 7 Sony 0.52070672 2 Dell -0.58473506 9 Yahoo -0.56228486 8 Nokia 0.24070573 3
Respondent-level rank-ordered logit model with ties
The following code estimates a rank-ordered logit model for each respondent and then computes the ranks, as done in the previous examples:
individualRankLogit = individualCounts
stackedID = rep(1:n,rep(nBlocks,n))
getRankCoefficients = function(id){max.diff.rank.ordered.logit.with.ties(itMaxDiffData[stackedID == id,])$coef}
nAlts = 10
for (i in 1:n)
individualRankLogit[i,] = getRankCoefficients(i)
set.seed(0) #setting the random number seed so that random numbers are consistently applied across the examples
ranks = nAlts + 1 - t(apply(individualRankLogit+ matrix(runif(n * nAlts)/100000, n),1,rank)) #ranks
rankProportions = t(apply(t(ranks),1,table) / n * 100)
round(rankProportions,1)
Note that this, the most rigorous of the analyses, is almost the same as the individual-level logit models:
1 2 3 4 5 6 7 8 9 10 Apple 22.1 9.3 10.3 9.0 10.9 8.0 4.8 6.1 6.1 13.5 Microsoft 5.1 9.6 13.8 14.1 17.6 9.0 8.7 7.7 5.8 8.7 IBM 1.9 1.6 4.8 8.7 5.8 12.5 9.9 18.3 19.6 17.0 Google 21.8 19.2 17.0 8.7 7.4 9.3 8.3 4.2 2.9 1.3 Intel 3.5 5.1 7.1 9.9 13.1 16.0 14.1 11.2 11.2 8.7 HewlettPackard 3.5 7.7 9.9 6.7 8.0 11.5 13.5 14.7 12.8 11.5 Sony 16.3 18.3 15.4 10.3 10.9 10.3 7.7 5.8 3.5 1.6 Dell 7.4 6.7 3.8 13.1 7.7 8.0 9.6 8.0 17.3 18.3 Yahoo 2.2 5.1 6.4 8.3 7.1 8.0 19.2 15.1 14.7 13.8 Nokia 16.0 17.3 11.5 11.2 11.5 7.4 4.2 9.0 6.1 5.8
See also
See [:category: MaxDiff|MaxDiff] for links to key resources and concepts relating to MaxDiff.