Weights in Displayr

Weighting is a technique that adjusts the results of a survey to bring them into line with known characteristics of the population. For example, if a sample contains 40% males and the population contains 49% males, weighting can be used to correct for this discrepancy.
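The arithmetic behind this example can be sketched in base R. The sample size of 100 and all variable names below are made up for illustration, and this is only a simple post-stratification calculation, not a description of how Displayr computes weights internally.

```r
# Hypothetical sample of 100 respondents: 40% male, 60% female.
gender <- c(rep("Male", 40), rep("Female", 60))

# Known population shares (the 49% figure comes from the example above).
population_share <- c(Female = 0.51, Male = 0.49)

# Observed sample shares.
sample_share <- c(Female = mean(gender == "Female"),
                  Male = mean(gender == "Male"))

# Each respondent's weight is the population share of their group
# divided by the sample share of their group.
weight <- unname(population_share[gender] / sample_share[gender])

# After weighting, the share of males matches the population share.
weighted_male_share <- sum(weight[gender == "Male"]) / sum(weight)
```

Here each male receives a weight above 1 and each female a weight below 1, so that the weighted share of males equals the population share.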

A weight variable, which is often simply referred to as the weights, is a variable used when weighting data during analysis. In most situations, when people refer to weights they are referring to sampling weights. However, there are other types of weights.

See How to Weight Survey Data for more information.

Most outputs in Displayr that have been computed using a Variable Set as an input can be weighted by selecting the output and choosing a weight.


Creating simple weights

Weights can be created in Displayr by:

  • Selecting an output and then, in the object inspector, selecting Inputs > FILTERS AND WEIGHT > Weight > New. This creates a simple weight from one variable.
  • Going to Anything > Weighting > Multiple Variables > Configure Weight from Variable(s). This allows you to create rim and target weights, to weight using numeric targets, and to weight to an existing weight variable. See How to Weight Survey Data for more information.

Creating advanced weights

More advanced weights can be created using Anything > Weighting > Multiple Variables > Configure Weight from Variable(s) and Anything > Weighting > Multiple Variables > Save Weight Variable from Configuration.

For details on how to use this tool, see our eBook on data weighting: How to Weight Survey Data.

Applying weights

A variable is applied as a weight by:

  1. Selecting the object (e.g., page) to be weighted.
  2. Choosing the appropriate weight from Inputs > FILTERS AND WEIGHT > Weight.

In order for a variable to be available as a weight, it needs to be set as Usable as a weight in the Properties tab of the Object Inspector.

R Outputs and weights

R Outputs have access to two weighting objects:

  • QPopulationWeight. This contains the values of the weight variable.
  • QCalibratedWeight. This contains the Calibrated Weight.

Where no weight is applied to an R Output, each of these will return a NULL.
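Because both objects can be NULL, code in an R Output usually needs a fallback. The following is a minimal sketch of one way to handle this; resolve_weight is a hypothetical helper, not part of Displayr, and the weights are passed in explicitly here because QPopulationWeight only exists inside Displayr.

```r
# Hypothetical helper: return unit weights when no weight has been applied
# (i.e., when the weighting object is NULL), otherwise the supplied weights.
resolve_weight <- function(weight, n) {
    if (is.null(weight)) {
        return(rep(1, n))
    }
    if (length(weight) != n) {
        stop("The weight must have one value per case.")
    }
    weight
}

x <- c(2, 4, 6, 8)

# Inside a Displayr R Output, the first argument would be QPopulationWeight.
w_none <- resolve_weight(NULL, length(x))           # no weight applied
w_some <- resolve_weight(c(1, 1, 2, 0), length(x))  # a weight is applied

unweighted_mean <- weighted.mean(x, w_none)
weighted_mean <- weighted.mean(x, w_some)
```

Falling back to unit weights lets the rest of the code follow a single weighted path regardless of whether the user has selected a weight.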

Approaches to using weights when writing R code

In R, there is no standard way of addressing weights. While many R functions have a weights parameter, there is no consistency in how they are interpreted:

  1. Most commonly, weights in R are interpreted as frequency weights.
  2. Occasionally they are interpreted as sampling weights (e.g., in the survey package).

How to adapt existing R functions developed for frequency weights to deal with sampling weights

Using sampling weights in a function written for frequency weights will typically have the following consequences:

  • Parameter estimates will be appropriate. For example, if using sampling weights in lm or glm, you will obtain correct parameter estimates, even though the weights parameter assumes the weights are frequency weights.
  • Computations of inference, such as p-values and standard errors, will be wrong.
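The two consequences above can be seen in a small simulation. This is a sketch on made-up data: a binomial glm is fitted twice, with the second fit using every weight doubled. Since the scale of a sampling weight is arbitrary, a method suited to sampling weights should give the same inference in both cases, but a function that treats weights as frequencies does not.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 * x))

# Integer weights are used only to avoid glm's warning about
# non-integer counts in a binomial model.
w <- sample(1:3, n, replace = TRUE)

fit1 <- glm(y ~ x, family = binomial, weights = w)
fit2 <- glm(y ~ x, family = binomial, weights = 2 * w)

# The parameter estimates are unaffected by the scale of the weights...
coef_diff <- max(abs(coef(fit1) - coef(fit2)))

# ...but the standard errors shrink by a factor of about sqrt(2),
# as if twice as much data had been collected.
se1 <- sqrt(diag(vcov(fit1)))
se2 <- sqrt(diag(vcov(fit2)))
se_ratio <- se1 / se2
```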

There are a variety of solutions to this problem.

Rewriting the functions to deal with sampling weights

Using Taylor series expansions to compute standard errors is the best approach, and has been implemented in the survey package, but it is the most complex of the approaches.
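As an illustration of using the survey package, the sketch below fits a weighted linear model with svyglm and compares it with lm. The data and weights are simulated, and the design (independent sampling, with no clusters or strata) is an assumption made for the example.

```r
if (requireNamespace("survey", quietly = TRUE)) {
    set.seed(1)
    n <- 200
    d <- data.frame(x = rnorm(n))
    d$y <- 1 + 2 * d$x + rnorm(n)
    d$w <- runif(n, 0.5, 2)  # made-up sampling weights

    # Declare the design: no clustering (ids = ~1), with sampling weights.
    des <- survey::svydesign(ids = ~1, weights = ~w, data = d)

    # Design-based fit; standard errors are computed by linearization.
    fit_svy <- survey::svyglm(y ~ x, design = des)

    # lm with the same weights gives the same point estimates,
    # but its standard errors are computed differently.
    fit_lm <- lm(y ~ x, data = d, weights = w)
    se_svy <- sqrt(diag(vcov(fit_svy)))
    se_lm <- sqrt(diag(vcov(fit_lm)))
}
```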

Weight calibration

This approach involves scaling the weights in a manner such that the inference is not "too bad".

This can be the most pragmatic approach when dealing with weights in multivariate methods, where inference is only a secondary concern. In Standard R, this is used for most multivariate methods. Among the regression methods, it is used only for Multinomial Logit, Ordered Logit, and NBD Regression; the rest employ Taylor series expansions.

The function CalibrateWeight in flipData calibrates weights.
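The idea can be sketched in base R under one assumption: that calibration means rescaling the weights so that they sum to Kish's approximation of the effective sample size. The functions below are hypothetical illustrations and are not the flipData implementation, which may differ in its details.

```r
# Kish's approximation to the effective sample size.
effective_sample_size <- function(w) sum(w)^2 / sum(w^2)

# Sketch of calibration (assumed definition): rescale the weights so that
# they sum to the effective sample size, keeping their relative sizes.
calibrate_weight <- function(w) {
    if (anyNA(w) || any(w < 0)) {
        stop("Weights must be non-missing and non-negative.")
    }
    w * effective_sample_size(w) / sum(w)
}

w <- c(0.5, 1, 1, 1.5, 4)
w_cal <- calibrate_weight(w)

# The calibrated weights sum to the effective sample size, which is
# smaller than the number of cases whenever the weights are unequal.
sum_cal <- sum(w_cal)
```

When such weights are passed to a function that treats them as frequencies, the function behaves as if the sample size were the effective sample size rather than the sum of the original weights.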

Note that QCalibratedWeight and the weights computed using CalibrateWeight will not necessarily be the same, as:

  • QCalibratedWeight is computed on the entire data file. If the calibration is instead computed on a subset of the data, the results will be different. Consequently, it is often appropriate to apply the CalibrateWeight function to QCalibratedWeight once cases have been filtered and missing values removed.
  • QCalibratedWeight will automatically assign a weight of 0 to negative and missing values of weight variables. By contrast, CalibrateWeight will produce an error if such values are encountered.
  • QCalibratedWeight takes other settings in a project into account, such as design effects (see Weights, Effective Sample Size and Design Effects).

Stratified weight calibration

This approach involves stratifying the data and applying calibration within each stratum. For example, if performing a two-sample t-test with the assumption of unequal variances, the calibration can be performed within each of the samples.
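A minimal sketch of calibrating within strata is shown below, using the two-sample case as the example. It assumes that calibration means rescaling weights to sum to Kish's effective sample size; the function name and data are hypothetical.

```r
# Assumed definition of calibration: rescale the weights so that they
# sum to Kish's effective sample size, sum(w)^2 / sum(w^2).
calibrate_weight <- function(w) w * sum(w) / sum(w^2)

# Hypothetical two-sample data: a group indicator and a sampling weight.
group <- c("A", "A", "A", "B", "B", "B", "B")
w <- c(1, 2, 3, 0.5, 0.5, 1, 4)

# ave() applies the function within each group and returns the
# results in the original case order.
w_cal <- ave(w, group, FUN = calibrate_weight)

# The calibrated weights in each group sum to that group's
# effective sample size.
ess_by_group <- tapply(w_cal, group, sum)
```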

Resampling

This approach involves creating a new synthetic data set by randomly selecting cases, with replacement, from an existing data set. Cases are selected with probability proportional to the weight. That is, a weighted bootstrap is used to create the data set. This can be done using flipTransformations::AdjustDataToReflectWeights.

This approach is often better than weight calibration where the goal is inference, but the parameter estimates are less precise (due to the noise added from the randomization). This approach is always inferior to rewriting the functions to correctly deal with weights (e.g., via Taylor series linearization).

When applying resampling, it is a good idea to:

  1. Calibrate the weight, as otherwise the sample size will be exaggerated.
  2. Set the random number seed, so that the same answer is given each time the function is used.
  3. Give the user a way of changing the seed so that they can assess sensitivity.
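The three recommendations above can be sketched as follows. resample_by_weight is a hypothetical helper written for this illustration and is not the flipTransformations implementation. It assumes that the resampled size should be the rounded effective sample size, computed with Kish's approximation.

```r
# Hypothetical helper: a weighted bootstrap with a user-controllable seed.
resample_by_weight <- function(data, weights, seed = 123) {
    # Resample to the rounded effective sample size, so that the
    # sample size is not exaggerated.
    n_out <- round(sum(weights)^2 / sum(weights^2))
    # Fix the seed so repeated calls return the same data set.
    # Note that this resets R's global random number state.
    set.seed(seed)
    rows <- sample.int(nrow(data), size = n_out, replace = TRUE, prob = weights)
    data[rows, , drop = FALSE]
}

d <- data.frame(id = 1:10, y = c(3, 5, 4, 6, 8, 7, 9, 2, 5, 6))
w <- c(0.5, 0.5, 1, 1, 1, 1, 1.5, 1.5, 2, 0)

r1 <- resample_by_weight(d, w, seed = 1)
r2 <- resample_by_weight(d, w, seed = 1)  # same seed, same result
r3 <- resample_by_weight(d, w, seed = 2)  # different seed, to gauge sensitivity
```

Comparing results across a few seeds gives a rough sense of how much of a conclusion is driven by the randomization rather than by the data.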

In Standard R, this approach is typically used for items in the Test sub-menu wherever Taylor series expansions have not been computed. Where a test is being conducted, the resampled sample size will typically equal the rounded effective sample size (after removal of cases with missing values, if applicable). Where a test is not being conducted (e.g., Random Forest), the resampled sample size will match the original sample size.

Ignoring weights

For a small number of analysis methods, such as hierarchical cluster analysis and distance calculations, weights are ignored, and should be, because the calculations are based on differences between individual cases.
