The Relationship Between Value Attributes, Data Reductions, Variables, and Variable Sets

From Displayr

Value Attributes

All Variables have properties. One of these is the Value Attributes: the metadata associated with each unique value of a variable. For example, the Value Label of 18 to 24 may be associated with the Value of 1.

These attributes may be derived from the original data file or may have been set by the user. The user can modify them in two ways:

  • Using the graphical user interface (e.g., pressing one of the buttons in Object Inspector > DATA VALUES).
  • Using QScript (see the Q Wiki).

The labels and values in the data file are referred to as the Source Label and Source Value respectively.

A key distinction to be aware of when writing code is that only Source Values are guaranteed to be mutually exclusive and exhaustive, so they should generally be used whenever an iterator is required. However, although Source Values are appropriate for iterating, calculations should typically use the Label and Value rather than their source counterparts.
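
The distinction can be sketched in plain JavaScript. The `valueAttributes` array, its field names, and the recoded values below are hypothetical stand-ins for illustration, not Displayr's or QScript's actual API: categories are looked up by Source Value, while the calculation uses the Value.

```javascript
// Hypothetical value attributes for an age variable; field names are illustrative only.
const valueAttributes = [
  { sourceValue: 1, sourceLabel: "18 to 24", value: 21, label: "18 to 24" },
  { sourceValue: 2, sourceLabel: "25 to 29", value: 27, label: "25 to 29" },
  { sourceValue: 3, sourceLabel: "30 or more", value: 40, label: "30 or more" },
];

// Raw data as stored in the data file (source values).
const rawData = [1, 3, 2, 2, 1];

// Look up categories by Source Value, which is unique per category.
const bySourceValue = new Map(valueAttributes.map((a) => [a.sourceValue, a]));

// Calculate with the Value (here, a user-set midpoint), not the Source Value.
const recoded = rawData.map((sv) => bySourceValue.get(sv).value);
const mean = recoded.reduce((sum, v) => sum + v, 0) / recoded.length;

console.log(recoded);
console.log(mean);
```

Averaging the Source Values directly would instead yield a number with no substantive meaning, since the Source Values here are only category identifiers.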

Which value attributes are relevant when interpreting a variable depends upon the Structure of the Variable Set. In particular, with Binary - Multi and Binary - Grid variable sets, the Count This Value setting determines which categories are combined, and the Values are ignored (i.e., with binary variable sets, the values are 1 if Count This Value is checked and 0 or missing otherwise).
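
As a rough illustration of the binary case, the following plain JavaScript sketch maps raw source values to 1, 0, or missing. The `countThisValue` and `missing` fields and the `toBinary` helper are hypothetical names, and treating a category as missing only when it is flagged as such is an assumption of this sketch rather than documented behavior.

```javascript
// Hypothetical value attributes for one variable in a Binary - Multi set.
const attributes = [
  { sourceValue: 0, label: "No", value: 0, countThisValue: false, missing: false },
  { sourceValue: 1, label: "Yes", value: 5, countThisValue: true, missing: false },
  { sourceValue: 9, label: "Don't know", value: 9, countThisValue: false, missing: true },
];

// Convert a raw source value to 1, 0, or null (missing).
// Note that the `value` field is never consulted.
function toBinary(sourceValue, attrs) {
  const attr = attrs.find((a) => a.sourceValue === sourceValue);
  if (!attr || attr.missing) return null;
  return attr.countThisValue ? 1 : 0;
}

const binary = [1, 0, 9, 1].map((sv) => toBinary(sv, attributes));
console.log(binary); // [ 1, 0, null, 1 ]
```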

Data Reduction

All Variable Sets have Data Reductions, which store metadata that dictates how categories are created on tables and in some analyses. Here a category may refer to either:

  • The original categories of a single variable (as uniquely identified by the Source Value).
  • The original variables in a variable set.

A data reduction consists of a code frame and, if it is a grid (including Ordinal - Multi and Nominal - Multi), a secondary code frame. Code frame and code are the underlying technical terms; documentation often refers to a code as a category.

Each code consists of a Label and a set of Source Values, where:

  • The Label is whatever appears when the variable set is used in analysis (e.g., on a table).
  • The Label is inherited from the Label of the underlying variable(s). That is, if you modify the label of a variable by modifying its value attributes, the data reduction is updated as well.
  • For variable sets consisting of a single variable, the Source Values correspond to the Source Values of the underlying variable.
  • For variable sets consisting of multiple variables, the Source Values are typically 0-based integers that index the variables in the set (0 indicates the first variable) or, in the case of Binary - Grid and Number - Grid variable sets, the rows or columns of the table that is formed when all the variables are used in their original order, prior to any being merged.
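
The structure described above can be pictured with a small plain JavaScript sketch. The object shapes, field names, labels, and the `codesContaining` helper are all hypothetical and chosen for illustration; they are not the representation Displayr uses internally.

```javascript
// Hypothetical data reduction for a single-variable set (e.g., gender).
// Source values match those of the underlying variable.
const singleReduction = {
  codeFrame: [
    { label: "Male", sourceValues: [1] },
    { label: "Female", sourceValues: [2] },
    { label: "NET", sourceValues: [1, 2] }, // a code spanning both categories
  ],
};

// Hypothetical data reduction for a multi-variable set.
// Source values are 0-based positions of the variables in the set.
const multiReduction = {
  codeFrame: [
    { label: "Brand A", sourceValues: [0] },
    { label: "Brand B", sourceValues: [1] },
    { label: "Any brand", sourceValues: [0, 1] },
  ],
};

// List the labels of every code that includes a given source value.
function codesContaining(reduction, sourceValue) {
  return reduction.codeFrame
    .filter((code) => code.sourceValues.includes(sourceValue))
    .map((code) => code.label);
}

console.log(codesContaining(singleReduction, 1)); // [ 'Male', 'NET' ]
console.log(codesContaining(multiReduction, 1)); // [ 'Brand B', 'Any brand' ]
```

Because one source value can belong to several codes, a code frame is not necessarily mutually exclusive, which is why Source Values rather than codes are the safe choice for iteration.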

JavaScript Variables

JavaScript variables ignore data reductions.

R Variables

Nominal and Ordinal variable sets

Nominal and Ordinal variable sets are represented as factors in R. The levels of the factors are derived as follows:

  • Where the data reduction's code frame is mutually exclusive and exhaustive, it is used as the levels of the factor.
  • Where two codes share a source value, the larger of the two codes is ignored. For example, if a variable has codes of Male, Female, and NET, then only Male and Female are used as levels.
  • If the resulting code frame is not exhaustive, additional codes are created from the value attributes. For example, if the value attributes have labels of 18 to 24, 25 to 29, and 30 or more, and the code frame of the data reduction has codes of 18 to 24, 18 to 29, 30 or more, and NET, then the resulting factor will have levels of 18 to 24, 25 to 29, and 30 or more. Here 18 to 29 and NET are removed from the code frame because they overlap with smaller codes, and 25 to 29 is taken from the value attributes.
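
The three rules above can be sketched as follows. This is one possible reading written in plain JavaScript for illustration, not Displayr's implementation: the `deriveLevels` function and the data shapes are hypothetical, "largest" is taken to mean the code with the most source values, and both the tie-breaking rule and the final ordering by smallest source value are assumptions.

```javascript
// Derive factor levels from a code frame and value attributes (illustrative sketch).
function deriveLevels(codeFrame, valueAttributes) {
  // Drop any code that shares a source value with a smaller code.
  // Assumption: on a tie in size, the earlier code wins.
  const kept = codeFrame.filter((code, i) =>
    !codeFrame.some((other, j) => {
      if (i === j) return false;
      const overlaps = other.sourceValues.some((v) => code.sourceValues.includes(v));
      const otherWins =
        other.sourceValues.length < code.sourceValues.length ||
        (other.sourceValues.length === code.sourceValues.length && j < i);
      return overlaps && otherWins;
    })
  );

  // Fill any gaps from the value attributes so the levels are exhaustive.
  const covered = new Set(kept.flatMap((code) => code.sourceValues));
  const filled = valueAttributes
    .filter((a) => !covered.has(a.sourceValue))
    .map((a) => ({ label: a.label, sourceValues: [a.sourceValue] }));

  // Assumption: levels are ordered by their smallest source value.
  return [...kept, ...filled]
    .sort((a, b) => Math.min(...a.sourceValues) - Math.min(...b.sourceValues))
    .map((code) => code.label);
}

// The age example from the text.
const ageAttributes = [
  { sourceValue: 1, label: "18 to 24" },
  { sourceValue: 2, label: "25 to 29" },
  { sourceValue: 3, label: "30 or more" },
];
const ageCodeFrame = [
  { label: "18 to 24", sourceValues: [1] },
  { label: "18 to 29", sourceValues: [1, 2] },
  { label: "30 or more", sourceValues: [3] },
  { label: "NET", sourceValues: [1, 2, 3] },
];

const ageLevels = deriveLevels(ageCodeFrame, ageAttributes);
console.log(ageLevels); // [ '18 to 24', '25 to 29', '30 or more' ]
```

In this run, 18 to 29 and NET are dropped because each overlaps a smaller code, which leaves source value 2 uncovered, so 25 to 29 is restored from the value attributes.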

Multiple variable variable sets

Variable sets involving multiple variables are represented as data.frames.

Missing values

Missing values may be represented as either NA or NaN. The default representation in R is NA. They can be set to NaN by either:

  • Modifying via QScript.
  • Using the graphical user interface, either by setting Missing Values to Include in percentages (but not averages) or by setting Value to NaN.