MaxDiff Analysis Case Study Using R

From Displayr
Jump to navigation Jump to search

This case study illustrates alternative approaches to analyzing MaxDiff data using R. See MaxDiff for links to key resources and concepts relating to MaxDiff; that page should be read prior to this page.

R setup

  1. Paste into R Setup code at Analyzing MaxDiff Using the Rank-Ordered Logit Model With Ties Using R.
  2. Read the following libraries into R:
library(BiasedUrn)
library(foreign)
library(mlogit)

Reading in and setting up the data

This code reads a data file which contains the data in the Wide Layout and converts it to the Stacked Layout:

itData = read.spss("http://surveyanalysis.org/images/0/06/ItMaxDiff.sav", use.value.labels = FALSE, to.data.frame = TRUE)
# Selecting the variables containing the max-diff data
z = itData[,-1:-5]
# stacking the data (one set per row)
alternativeNames = c("Apple","Microsoft","IBM","Google","Intel","HewlettPackard","Sony","Dell","Yahoo","Nokia")
nAlternatives = length(alternativeNames)
nBlocks = ncol(z) / nAlternatives
nAltsPerSet = 5
n = nrow(z)
nObservations = n * nBlocks 
itMaxDiffData = matrix(as.numeric(t(z)),ncol = nAlternatives,byrow = TRUE, dimnames = list(1:nObservations, alternativeNames))

Computing the overall counts for the whole sample

The following code computes a counts analysis for the entire sample and then ranks these, with a 1 for the highest brand:

counts = apply(itMaxDiffData, 2, mean, na.rm = TRUE)
ranks = nAlternatives + 1 - rank(counts)
cbind(Counts = counts, Ranks = ranks)

This generates the following output:

                    Counts Ranks
Apple           0.14316239     4
Microsoft      -0.01175214     5
IBM            -0.30128205    10
Google          0.35790598     1
Intel          -0.08760684     6
HewlettPackard -0.11538462     7
Sony            0.26495726     2
Dell           -0.20512821     9
Yahoo          -0.18910256     8
Nokia           0.14423077     3

This aggregate counts analysis suggests that Google is the most preferred of the brands, followed by Sony, Nokia and then Apple in fourth position.

Computing individual-level counts

The following code conducts a separate analysis for every respondent:

id = rep(1:n,rep(nBlocks,n))
individualCounts = aggregate(itMaxDiffData,list(id),mean, na.rm = TRUE)[,-1]
round(individualCounts[1:10,],1) #show at data for first 10 respondents

For example, the parameters for the first 10 responents are:

 round(individualCounts[1:10,],1) #show at data for first 10 respondents
   Apple Microsoft  IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia
1    0.3      -0.3 -0.3    0.3   0.0           -1.0  0.3  0.3   0.0   0.3
2    1.0       0.3 -0.3    0.0  -1.0            0.0  0.3 -0.3  -0.3   0.3
3   -0.7       0.0  0.0    0.0   0.0            0.3 -0.7  0.7  -0.7   1.0
4   -0.3       0.0  0.0   -0.7   0.3           -0.3  1.0 -0.7   0.3   0.3
5    1.0      -1.0  0.0    0.7   0.0            0.0  0.0 -0.3  -0.7   0.3
6    0.3       0.7 -1.0    0.0   0.3            0.0  0.3  0.0  -0.3  -0.3
7    1.0      -0.3 -0.3    0.3   0.0           -0.7  0.7 -0.3  -0.3   0.0
8   -0.3       0.0 -0.7    0.3   0.3           -0.3  0.3 -0.7   0.3   0.7
9    0.3       0.0 -1.0    0.3  -0.3            0.7  0.3  0.0  -0.7   0.3
10   1.0      -1.0 -0.7    0.0   0.0            0.0  0.7  0.0   0.3  -0.3

Computing individual-level ranks from the counts

As discussed here, computing the average of the individual counts is not valid. Consequently, to interpret the counts we need to compute the rankings:

set.seed(0) # setting the random number seed to enhance comparability
indidualCountsNoTies = individualCounts + matrix(runif(n * nAlternatives)/100000, n) #adding random numbers to break ties
ranks = nAlternatives + 1 - apply(indidualCountsNoTies,1,rank) #ranks
rankProportions = t(apply(ranks,1,table) / n * 100)
round(rankProportions,1)

The following table show the resulting ranks. The first column of numbers shows the proportion of people that have their highest count for each of the different brands. Note that Apple comes in first place by this measure, whereas with the aggregate analysis it was fourth. The explanation for this apparent contradiction appears in the right-most columns, which reveal that Apple has many more people that gave it its lowest rating than did any of Google, Nokia and Sony, and for this reason the aggregate analysis gave a misleading conclusion.

                  1    2    3    4    5    6    7    8    9   10
Apple          22.8 11.9 11.9  5.8  9.3  6.1  6.4  4.8  9.3 11.9
Microsoft       5.4  9.0 11.9 13.5 14.1 12.2 10.3  9.6  5.1  9.0
IBM             1.9  1.6  5.1  7.4  9.0  7.4 12.8 17.6 20.5 16.7
Google         21.8 20.5 15.7 10.3  7.1 11.5  4.8  4.2  2.9  1.3
Intel           3.5  5.8  8.3 14.1 13.8 12.8 11.2 12.2 10.3  8.0
HewlettPackard  4.2  8.7  9.3  7.4  8.7 11.5 10.9 16.0 11.2 12.2
Sony           16.0 15.7 16.0 13.8  9.0 10.6  7.7  6.4  3.2  1.6
Dell            5.8  8.0  3.5  8.3 10.3  7.4 11.5  7.4 19.6 18.3
Yahoo           1.9  6.4  7.7  6.1  8.0 13.8 14.7 14.1 12.5 14.7
Nokia          16.7 12.5 10.6 13.5 10.9  6.7  9.6  7.7  5.4  6.4

The following analysis produces the cumulative proportions:

rankCumProportions = t(apply(rankProportions,1,cumsum))
round(rankCumProportions,1)

The cumulative analysis better communicates the performance of Apple: while it (just) wins in terms of being most preferred, if we instead focus on which brands are in people's top 2 (i.e., the column headed by 2, Google is a clear winner.

                  1    2    3    4    5    6    7    8    9  10
Apple          22.8 34.6 46.5 52.2 61.5 67.6 74.0 78.8 88.1 100
Microsoft       5.4 14.4 26.3 39.7 53.8 66.0 76.3 85.9 91.0 100
IBM             1.9  3.5  8.7 16.0 25.0 32.4 45.2 62.8 83.3 100
Google         21.8 42.3 58.0 68.3 75.3 86.9 91.7 95.8 98.7 100
Intel           3.5  9.3 17.6 31.7 45.5 58.3 69.6 81.7 92.0 100
HewlettPackard  4.2 12.8 22.1 29.5 38.1 49.7 60.6 76.6 87.8 100
Sony           16.0 31.7 47.8 61.5 70.5 81.1 88.8 95.2 98.4 100
Dell            5.8 13.8 17.3 25.6 35.9 43.3 54.8 62.2 81.7 100
Yahoo           1.9  8.3 16.0 22.1 30.1 43.9 58.7 72.8 85.3 100
Nokia          16.7 29.2 39.7 53.2 64.1 70.8 80.4 88.1 93.6 100

The issue of which rank should be more relevant (e.g., first, top 2 or something else) is not a statistical issue. Such a choice needs to be determined by the problem being studied. For example, if wanting to predict market share, looking at most preferred may be more relevant, whereas if interested in measuring overall appeal or brand equity then some other metric, such as top 4 or the average rank, may be more relevant.

We can compute the average of the ranks as follows:

aveRank = rankProportions %*% (1:10)/100
cbind(aveRank, Rank = rank(aveRank))

Remembering that lower average ranks indicate higher appeal, we can see that these average ranks essentially tell the same story as the earlier aggregate analysis:

                        Rank
Apple          4.737179    4
Microsoft      5.410256    5
IBM            7.211538   10
Google         3.612179    1
Intel          5.907051    6
HewlettPackard 6.185897    7
Sony           4.089744    2
Dell           6.596154    8
Yahoo          6.608974    9
Nokia          4.641026    3

Setting up the data for a logit model

Prior to running a logit model it is necessary to first set up the data to 'trick' the logit software. For this example, the following code does this:

nRows = sum(!is.na(itMaxDiffData)) * 2
longData = matrix(0, nRows,nAlternatives + 3)
counter = 0
setCounter = 0
for (rr in 1:nObservations){
  nAlts = 0
  alternatives = NULL
  respondent = floor(rr/nBlocks) + 1
  for (cc in 1:nAlternatives){
    v = itMaxDiffData[rr,cc]
    if (!is.na(v)){
        nAlts = nAlts + 1
        alternatives[nAlts] = cc
        if (v == 1)
           best = cc
        if (v == -1)
           worst = cc
    }
  }
  setCounter = setCounter + 1
  for (a in 1:nAlts){
	counter = counter + 1
        this_a = alternatives[a]
        if (this_a == best)
	   longData[counter,3] = 1
        else if (this_a == worst)
	   longData[counter + nAlts,3] = 1
	longData[counter, 1] = respondent 
	longData[counter + nAlts,1] = respondent 
	longData[counter, 2] = setCounter 
	longData[counter + nAlts, 2] = setCounter + 1
	longData[counter,3 + this_a] = 1
	longData[counter + nAlts,3 + this_a] = -1
  }
  setCounter = setCounter + 1
  counter = counter + nAlts
}
longData[1:20,]
longData = as.data.frame(longData)
names(longData) = c("ID","Set","Choice",alternativeNames)

Aggregate logit model

The following code estimates an aggregate logit model:

logitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, alt.levels = paste(1:nAltsPerSet), shape = "long")
summary(logitModel)

The key output from the analysis is the following table of parameters and p-Values:

Coefficients :
                Estimate Std. Error  t-value  Pr(>|t|)    
Microsoft      -0.302971   0.083896  -3.6113 0.0003047 ***
IBM            -1.200637   0.078807 -15.2351 < 2.2e-16 ***
Google          0.640305   0.081811   7.8266 5.107e-15 ***
Intel          -0.674497   0.081935  -8.2321 2.220e-16 ***
HewlettPackard -0.776266   0.081835  -9.4857 < 2.2e-16 ***
Sony            0.462314   0.081620   5.6642 1.477e-08 ***
Dell           -0.933707   0.083080 -11.2387 < 2.2e-16 ***
Yahoo          -0.994617   0.079781 -12.4668 < 2.2e-16 ***
Nokia           0.199250   0.080301   2.4813 0.0130915 *  

Note that this analysis also tells a similar story to the earlier aggregate analysis shown above, with Google, Sony and Nokia all out-performing Apple (where Apple has a coefficient of 0, as discussed here).

Tricked Random Parameters Logit Model

The previous model is an aggregate model and can be interpreted as assuming that all people are the same. An alternative is to estimate a mixture model. One of the most commonly used mixture models in analyzing max-diff data is the random parameters logit model (this is more commonly known by the brand name given to it by Sawtooth Software, which is Hierarchical Bayes). This model assumes that differences between people can be described by a multivariate normal distribution which permits correlations between variables, and can be estimated using:

longData$ID = rep(1:n,rep(nBlocks * 2 * nAltsPerSet,n))

rpLogitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, alt.levels = paste(1:nAltsPerSet), shape = "long", id="ID", panel = TRUE, rpar = c(Microsoft = "n", IBM = "n", Google = "n", Intel = "n", HewlettPackard = "n", Sony = "n", Dell = "n", Yahoo = "n", Nokia = "n"), correlation = TRUE)
summary(rpLogitModel)

The key output from this model in this context is the following table, which shows the estimated distributions of the parameters. Here, the Median indicates the mean value estimated in the population (assuming a normal distribution). As with the previous model, Apple has a value of 0 and, relative to the averages in the table we can see that this model again has Apple in fourth place and we have similar conclusions to those obtained from the aggregate analysis.

random coefficients
               Min.   1st Qu.      Median        Mean   3rd Qu. Max.
Microsoft      -Inf -2.893669 -0.68160013 -0.68160013 1.5304686  Inf
IBM            -Inf -4.090256 -1.89205403 -1.89205403 0.3061484  Inf
Google         -Inf -1.381164  0.77260038  0.77260038 2.9263643  Inf
Intel          -Inf -2.908703 -0.98501344 -0.98501344 0.9386766  Inf
HewlettPackard -Inf -3.419995 -1.19590592 -1.19590592 1.0281828  Inf
Sony           -Inf -1.298854  0.50529248  0.50529248 2.3094385  Inf
Dell           -Inf -3.796453 -1.40240323 -1.40240323 0.9916462  Inf
Yahoo          -Inf -3.732578 -1.57151880 -1.57151880 0.5895402  Inf
Nokia          -Inf -2.068556  0.07106841  0.07106841 2.2106928  Inf

It is possible to do further analysis, computing the ranks from this model (although such an analysis is not simple to do using the mlogit software). However, this is not necessary with most max-diff studies where it is possible to instead estimate a separate model for each respondent (and thus avoid making assumptions about how people differ); this is discussed in the next section.

Respondent-level logit models

The following code estimates the logit model at the respondent level and then computes the ranks (as done earlier with the counts analysis):

individualLogit = individualCounts
stackedID = rep(1:n,rep(nBlocks*2 * nAltsPerSet,n))
getCoefficients = function(id){c(0,as.numeric(mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, subset = stackedID==id, data = longData, alt.levels = paste(1:nAltsPerSet), shape = "long")$coef))}
nAlternatives  = 10
for (i in 1:n)
  individualLogit[i,] = getCoefficients(i)
 #adding random numbers to break ties
set.seed(0) # setting the random number seed to enhance comparability
ranks = nAlternatives + 1 - t(apply(individualLogit + matrix(runif(n * nAlternatives)/100000, n),1,rank)) #ranks
rankProportions = t(apply(t(ranks),1,table) / n * 100)
round(rankProportions,1)

The rankings computed using the respondent-level logit models are broadly similar to those from the respondent-level counts analyses, but they are not identical (we might except that the worse the experimental design, the greater the difference between the two methods, as the logit model takes the design into acccount whereas the counts analysis does not).

                  1    2    3    4    5    6    7    8    9   10
Apple          22.1 10.3 11.2  7.4  4.5  9.9  6.7  6.7  9.0 12.2
Microsoft       5.8  9.3 12.5 15.7 15.4 11.5  7.7  7.4  6.1  8.7
IBM             1.9  1.9  4.2  9.9  6.7 10.6 10.3 18.9 18.9 16.7
Google         21.8 18.6 17.3  9.6  8.0  7.7  8.7  3.8  3.2  1.3
Intel           3.5  5.1  6.7  7.1 15.7 18.9 12.2 11.5 10.9  8.3
HewlettPackard  3.8  7.7  9.9  5.4  7.7 13.5 13.1 14.7 11.2 12.8
Sony           15.7 17.9 16.0  9.3 13.8  9.3  5.8  7.4  3.5  1.3
Dell            7.1  7.7  3.5 11.5  9.6  6.1 11.5  7.7 16.7 18.6
Yahoo           1.9  5.4  6.1  9.6  8.0  6.4 21.2 11.9 14.7 14.7
Nokia          16.3 16.0 12.5 14.4 10.6  6.1  2.9  9.9  5.8  5.4

We can compute choice probabilities from the logit parameters using the inverse logit transformation:

shares = prop.table(as.matrix(exp(individualLogit)),1)
round(shares[1:10,]*100,1)

The probabilities (out of 100) for the first 10 respondents are:

      Apple Microsoft IBM Google Intel HewlettPackard  Sony Dell Yahoo Nokia
 [1,]  14.4       0.0   0   17.1   0.0            0.0  21.2 21.7   5.4  20.2
 [2,] 100.0       0.0   0    0.0   0.0            0.0   0.0  0.0   0.0   0.0
 [3,]   0.0       0.0   0    0.0   0.0            0.0   0.0  0.0   0.0 100.0
 [4,]   0.0       0.0   0    0.0   0.0            0.0 100.0  0.0   0.0   0.0
 [5,] 100.0       0.0   0    0.0   0.0            0.0   0.0  0.0   0.0   0.0
 [6,]  13.8      42.5   0    2.7  15.4            4.5  15.8  3.4   0.8   1.1
 [7,] 100.0       0.0   0    0.0   0.0            0.0   0.0  0.0   0.0   0.0
 [8,]   0.0       0.0   0   11.3  12.2            0.0  20.7  0.0   9.2  46.6
 [9,]  16.7       0.0   0   12.8   0.0           38.8  13.8  0.0   0.0  17.9
[10,] 100.0       0.0   0    0.0   0.0            0.0   0.0  0.0   0.0   0.0

And, the average probabilities and their ranks are computed using:

aveShare = apply(shares, 2, mean)
cbind(round(100 * aveShare, 1), Rank = nAlternatives + 1 - rank(aveShare))

and we again get a similar conclusion to the other respondent-level analyses, with Google and Apple being approximately equal in terms of their shares of preference:

                    Rank
Apple          21.9    2
Microsoft       6.1    6
IBM             2.3    9
Google         22.3    1
Intel           3.8    8
HewlettPackard  3.8    7
Sony           16.4    3
Dell            6.8    5
Yahoo           1.8   10
Nokia          14.9    4

Rank-Ordered Logit Model with Ties

This model is estimated using:

rankModel = max.diff.rank.ordered.logit.with.ties(itMaxDiffData)
ranks = nAlternatives + 1 - rank(rankModel$coef)
cbind(Parameters = rankModel$coef, Ranks = ranks)

Note that although the resulting estimates are broadly the same as those for the tricked aggregate logit model they are not identical (and, they are different enough to be meaningful, with this analysis showing Microsoft to have almost equivalent appeal to Apple):

                Parameters Ranks
Apple           0.00000000     4
Microsoft      -0.05898306     5
IBM            -0.69459763    10
Google          0.68248255     1
Intel          -0.28927006     6
HewlettPackard -0.42412444     7
Sony            0.52070672     2
Dell           -0.58473506     9
Yahoo          -0.56228486     8
Nokia           0.24070573     3

Respondent-level rank-ordered logit model with ties

The following code estimates a rank-ordered logit model for each respondent and then computes the ranks, as done in the previous examples:

individualRankLogit = individualCounts
stackedID = rep(1:n,rep(nBlocks,n))
getRankCoefficients = function(id){max.diff.rank.ordered.logit.with.ties(itMaxDiffData[stackedID == id,])$coef}
nAlts  = 10
for (i in 1:n)
  individualRankLogit[i,] = getRankCoefficients(i)
set.seed(0) #setting the random number seed so that random numbers are consistently applied across the examples
ranks = nAlts + 1 - t(apply(individualRankLogit+ matrix(runif(n * nAlts)/100000, n),1,rank)) #ranks
rankProportions = t(apply(t(ranks),1,table) / n * 100)
round(rankProportions,1)

Note that this, the most rigorous of the analyses, is almost the same as the individual-level logit models:

                  1    2    3    4    5    6    7    8    9   10
Apple          22.1  9.3 10.3  9.0 10.9  8.0  4.8  6.1  6.1 13.5
Microsoft       5.1  9.6 13.8 14.1 17.6  9.0  8.7  7.7  5.8  8.7
IBM             1.9  1.6  4.8  8.7  5.8 12.5  9.9 18.3 19.6 17.0
Google         21.8 19.2 17.0  8.7  7.4  9.3  8.3  4.2  2.9  1.3
Intel           3.5  5.1  7.1  9.9 13.1 16.0 14.1 11.2 11.2  8.7
HewlettPackard  3.5  7.7  9.9  6.7  8.0 11.5 13.5 14.7 12.8 11.5
Sony           16.3 18.3 15.4 10.3 10.9 10.3  7.7  5.8  3.5  1.6
Dell            7.4  6.7  3.8 13.1  7.7  8.0  9.6  8.0 17.3 18.3
Yahoo           2.2  5.1  6.4  8.3  7.1  8.0 19.2 15.1 14.7 13.8
Nokia          16.0 17.3 11.5 11.2 11.5  7.4  4.2  9.0  6.1  5.8

See also

See [:category: MaxDiff|MaxDiff] for links to key resources and concepts relating to MaxDiff.