Writing Functions for the flip Project

From Displayr

The following should be used as a guide when writing functions and packages for the flip Project.


Standard R function signatures

A function signature defines how the inputs and outputs of a function should be written. Functions that are likely to be directly called by users should be written as described below.

Formulas

Where a formula is used in Standard R, the signature should be of the form:

FunctionName <- function(formula, data,...)

The data parameter should be a data.frame.
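
As a minimal sketch of this signature, the function below accepts a formula and a data.frame and resolves the formula's variables with base R's model.frame. The function name MeanByGroup and the example data are hypothetical and are not part of any flip package.

```r
# Hypothetical function illustrating the formula/data signature.
MeanByGroup <- function(formula, data, ...)
{
    if (!is.data.frame(data))
        stop("'data' must be a data.frame.")
    # model.frame looks up the variables named in the formula within 'data'.
    mf <- model.frame(formula, data = data, ...)
    outcome <- mf[[1]]
    group <- mf[[2]]
    tapply(outcome, group, mean)
}

example.data <- data.frame(y = c(1, 2, 3, 4), g = c("a", "a", "b", "b"))
result <- MeanByGroup(y ~ g, data = example.data)
```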

Filters: subset

Where a function takes variables or Variable Sets as inputs, a subset parameter should be provided, which can be used for filtering (see also Filters in R).

Weights: weights

Weights are discussed in detail in Weights in Displayr. In general, functions that take variables or Variable Sets as inputs should allow users to pass in weights, and should treat them as sampling weights. Furthermore, the functions should be designed to take weights of an arbitrary scale (i.e., the scale of the weights should not be a factor in inference).
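
The sketch below illustrates one way a function could accept subset and weights so that the scale of the weights does not matter. WeightedMean is a hypothetical helper, and the use of Kish's effective sample size is an assumption chosen because it is unchanged when all weights are multiplied by a constant.

```r
# Hypothetical helper: weighted mean with an optional filter and sampling weights.
WeightedMean <- function(x, subset = NULL, weights = NULL)
{
    n <- length(x)
    if (is.null(subset))
        subset <- rep(TRUE, n)
    if (is.null(weights))
        weights <- rep(1, n)
    # Keep cases that pass the filter, have data, and have a positive weight.
    keep <- subset %in% TRUE & !is.na(x) & !is.na(weights) & weights > 0
    x <- x[keep]
    w <- weights[keep]
    # Both quantities are ratios, so rescaling the weights leaves them unchanged.
    list(mean = sum(w * x) / sum(w),
         effective.n = sum(w)^2 / sum(w^2))
}

x <- c(10, 20, 30, 40, NA)
in.filter <- c(TRUE, TRUE, TRUE, FALSE, TRUE)
w <- c(1, 2, 1, 5, 3)
a <- WeightedMean(x, subset = in.filter, weights = w)
b <- WeightedMean(x, subset = in.filter, weights = w * 1000)
```

Here a and b are the same, because multiplying every weight by 1000 cancels out of both ratios.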

Missing data: missing

flip Project functions should be explicitly written to give the user a choice of how to treat missing data. The options should:

  • Be informative. For example, the most common default is "Exclude cases with missing data". This verbosity is appropriate, because the terseness of missing data treatment in R causes problems (e.g., the meaning of na.omit in predict, and the non-uniform application of the na functions, such as cor using use instead of na.rm).
  • Address missing data within the function, rather than as a separate pre-processing step (e.g., "Multiple imputation").

See The missing data options in Q for examples.
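
A minimal sketch of a missing argument with informative option strings follows. HandleMissing is a hypothetical helper, and the second option string is an assumption made for illustration. Only "Exclude cases with missing data" is taken from the text above.

```r
# Hypothetical helper showing verbose, self-describing 'missing' options.
HandleMissing <- function(data, missing = "Exclude cases with missing data")
{
    switch(missing,
           "Exclude cases with missing data" =
               data[complete.cases(data), , drop = FALSE],
           # Assumed option name, for illustration only.
           "Error if missing data" = {
               if (anyNA(data))
                   stop("The data contains missing values. Remove them, or set ",
                        "'missing' to \"Exclude cases with missing data\".")
               data
           },
           stop("Unknown 'missing' option: ", missing))
}

survey <- data.frame(q1 = c(1, NA, 3), q2 = c("a", "b", NA))
complete.survey <- HandleMissing(survey)
```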

Random number seed: seed

Where a function involves a random component, a parameter should be provided called seed. This should come with a default argument, so that the function always returns the same results unless the user has modified it (e.g., seed = 1223).
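
For instance, a function with a random component might look like the sketch below. RandomSplit is a hypothetical name. Note that set.seed changes the global random number state, which callers may need to be aware of.

```r
# Hypothetical function: randomly assigns cases to two groups.
RandomSplit <- function(n, seed = 1223)
{
    # Fixing the seed makes repeated calls return the same result by default.
    set.seed(seed)
    sample(c(TRUE, FALSE), n, replace = TRUE)
}

first <- RandomSplit(10)
second <- RandomSplit(10)
</gr_insert_placeholder>
```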

Arguments

Arguments that are intended to be regularly used by a user are written in 'English', with capitalization of the first word. This is so that graphical user interfaces can be more easily hooked up.

Transforming data

Before Displayr variables are used, they must be passed through a function called ProcessQVariables in flipTransformations. This function processes the variables so that they are in a suitable form to be used in Standard R. For example, date variables are processed such that they become factors whose levels are the date periods specified for the variable.

Matrix

Sometimes analysis methods ultimately require the data to be a matrix, or to be in a form that can be converted to a matrix (e.g., kmeans). flipTransformations::AsNumeric has been designed as a general-purpose utility for making categorical variables numeric.
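
To illustrate the general idea with base R only, the sketch below uses model.matrix as a stand-in for converting a data.frame containing a factor into a numeric matrix before calling kmeans. This is an assumption for illustration and does not show how flipTransformations::AsNumeric itself behaves.

```r
survey <- data.frame(age = c(25, 40, 31, 58, 22, 47),
                     gender = factor(c("Male", "Female", "Female",
                                       "Male", "Female", "Male")))

# model.matrix expands the factor into indicator columns; "- 1" drops the intercept.
numeric.data <- model.matrix(~ . - 1, data = survey)

set.seed(1223)
clusters <- kmeans(scale(numeric.data), centers = 2)
```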

Often it is best that the data input should be a data.frame and not a matrix, as this generally makes it more straightforward to implement Show labels.

Labels: name, label, question and show.labels

As discussed in Data Sets in R, a variable in a Data Set in Q has name, label, and question attributes. Standard R functions should make use of these when presenting results. Typically, it is appropriate to:

  • Use these labels as defaults in R Outputs. The most straightforward way to do this is using flipFormat::Labels(variables), which reverts back to the variable names when there are no labels. Labels for parameter/coefficient names can also be created using a call to flipFormat::Labels(data, names), where data is a data frame, and names are the names as automatically created by R (e.g., a variable name, such as q2, or a factor's name, such as q2Male).
  • Use these labels as defaults in any warnings and error messages. Usually it is appropriate to also show the variable name, which can be achieved using flipFormat::Labels(variables, show.name = TRUE).
  • Have a parameter called show.labels which defaults to FALSE, which allows users to toggle between showing names and labels. Care needs to be taken to ensure that when this is FALSE, the original variable names are returned, rather than any temporary variable names created when the variables are piped through functions. The function flipFormat::Names first looks to see if the variable name is stored in an attribute called name and, if it is not, seeks to ascertain the name from the original environment. This function should always be used when extracting the name and, when it fails, the solution is to assign the variables with a name attribute (e.g., attr(variable, "name") <- "Q5"). Ensure that this name is retained through all future transformations. The best way to do this is to use, and if necessary create, transformations of data that retain the name, label, and question attributes (e.g., using flipTransformations::Factor rather than the default R function factor). Alternatively, the attributes may be copied using flipU::CopyAttributes.
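
The base R sketch below shows why attribute retention needs care: factor drops custom attributes, so they have to be copied back. LabelOrName is a hypothetical helper written for this illustration. In practice the flipFormat and flipU functions named above would be used instead.

```r
# Hypothetical helper: return the label when requested and available,
# otherwise the stored variable name.
LabelOrName <- function(variable, show.labels = FALSE)
{
    label <- attr(variable, "label", exact = TRUE)
    if (show.labels && !is.null(label))
        return(label)
    attr(variable, "name", exact = TRUE)
}

q5 <- c(1, 2, 2, 1)
attr(q5, "name") <- "Q5"
attr(q5, "label") <- "Gender"

# Base R's factor() does not keep the custom attributes...
dropped <- factor(q5)

# ...so they are copied back by hand after the transformation.
q5.factor <- factor(q5)
for (a in c("name", "label", "question"))
    attr(q5.factor, a) <- attr(q5, a, exact = TRUE)
```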

Limiting the size of objects

When an R Output is computed, the data and instructions are sent to the R Server, and then the results are returned. The bigger the object that is created, the longer it will take to return. Consequently, it can be beneficial to put a bit of effort into reducing the size of the objects to be returned. Where it is not clear whether the end-user requires all or only some of the information, the parameter return.all may be used. For example, in flipAnalysisOfVariance::OneWayMANOVA, when return.all is TRUE, all details of the underlying regression contrasts are returned, whereas when FALSE, only the table of results is returned. Generally, the argument should default to FALSE.
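
A minimal sketch of the return.all pattern is shown below, using lm as the underlying model. FitModel and the element names table and original are hypothetical and do not reflect how OneWayMANOVA is implemented.

```r
# Hypothetical function: returns only the coefficient table unless return.all is TRUE.
FitModel <- function(formula, data, return.all = FALSE)
{
    fit <- lm(formula, data = data)
    result <- list(table = coef(summary(fit)))
    # The full fitted object is much larger, so it is included only on request.
    if (return.all)
        result$original <- fit
    result
}

example.data <- data.frame(x = 1:50, y = (1:50) * 2 + sin(1:50))
small <- FitModel(y ~ x, example.data)
full <- FitModel(y ~ x, example.data, return.all = TRUE)
sizes <- c(small = as.numeric(object.size(small)),
           full = as.numeric(object.size(full)))
```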

When an external library is used by a function, the returned object should be a list where the first item is the output of the external function.

Print statements: output

A function should create R outputs with class set to the function name and should have a parameter called output, where:

  • Each class should have a print function that handles the output cases.
  • The output arguments should be expressed in English (e.g., "Summary").
  • Where one of the outputs is a typical R output, it should be called "R". Typically, this will correspond to the output from the R summary function.
  • The default output should be in some sense complete, which means that:
    1. The output should be attractive enough for a user to be able to show it to others (i.e., it should not be an ugly output in courier font).
    2. Any tabular data should use best-practice principles for assisting the viewer in working out which numbers to focus on.
    3. Any key metrics (e.g., goodness-of-fit, predictive accuracy), should be shown in the subtitle or footer.

The print function is automatically called on the final object created by an R Output in Displayr.
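
The structure described in this section could be sketched as follows. SimpleRegression is a hypothetical function, and the plain-text summary is only a stand-in for the formatted output the guidelines call for.

```r
# Hypothetical function: the class of the result matches the function name.
SimpleRegression <- function(formula, data, output = "Summary")
{
    fit <- lm(formula, data = data)
    result <- list(original = fit, output = output)
    class(result) <- "SimpleRegression"
    result
}

# The print method handles each value of 'output'.
print.SimpleRegression <- function(x, ...)
{
    if (x$output == "R")
        print(summary(x$original), ...)
    else
    {
        r.squared <- summary(x$original)$r.squared
        cat("Coefficients\n")
        print(round(coef(x$original), 3))
        # Key metric shown as a footer.
        cat("R-squared:", formatC(r.squared, format = "f", digits = 3), "\n")
    }
    invisible(x)
}

example.data <- data.frame(x = 1:20, y = (1:20) * 3 + cos(1:20))
model <- SimpleRegression(y ~ x, example.data)
printed <- capture.output(print(model))
```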

Warnings

Warnings should:

  • Contain instructions. That is, in addition to reporting a possible problem, they should provide a way for the user to avoid the problem (e.g., suggest how to fix the algorithm, or a change in arguments).
  • Be written in a way that a non-technical person can understand.
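
As an illustration, the warning below states the problem in plain language and suggests what to do about it. CheckSampleSize and the threshold of 10 cases per predictor are assumptions chosen for this sketch.

```r
# Hypothetical check: warns when there are few cases per predictor.
CheckSampleSize <- function(n.cases, n.predictors)
{
    enough <- n.cases >= 10 * n.predictors
    if (!enough)
        warning("The sample size (", n.cases, ") is small relative to the number ",
                "of predictors (", n.predictors, "), so the results may be ",
                "unreliable. Consider removing some predictors or adding more data.")
    invisible(enough)
}

message.text <- tryCatch(CheckSampleSize(20, 5),
                         warning = function(w) conditionMessage(w))
```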

Functions for creating new variables: predict, fitted, resid

Typically, R functions for creating new variables, such as predict and fitted, only create new values for observations that were used in the actual model fitting. For example, if 100 cases were passed into a model, and 50 were excluded due to a subset argument, and another 25 were excluded due to having missing values for the Outcome Variable, predict.lm produces a vector of 25 cases.

A flip Project function should default the newdata parameter in predict and other similar functions to the complete data set that was passed into the original function. Thus, in the example just discussed, a vector of 100 cases should be returned. Special cases are fitted and resid, which should still produce a vector of the same length as the input data, but should show NAs for any cases not used in the fitting (e.g., in the example that has been discussed, they will contain 75 missing values, whereas predict will contain no missing values).
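
The base R sketch below shows the intended behavior on a small data set of 8 cases, where 2 are filtered out and 2 more have a missing outcome. The variable names are hypothetical, and the approach of padding with NA via row names is one possible way to get full-length vectors, not necessarily the one used in flip packages.

```r
example.data <- data.frame(x = 1:8,
                           y = c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0))
in.filter <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

fit <- lm(y ~ x, data = example.data, subset = in.filter)

# Rows used in fitting: inside the filter and with a non-missing outcome.
used <- as.integer(rownames(model.frame(fit)))

# fitted and resid: same length as the input, NA for cases not used in fitting.
fitted.full <- rep(NA_real_, nrow(example.data))
fitted.full[used] <- fitted(fit)
resid.full <- rep(NA_real_, nrow(example.data))
resid.full[used] <- resid(fit)

# predict: passing the complete data as newdata gives a value for every case.
predicted.full <- predict(fit, newdata = example.data)
```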

New generic methods: Observed, Probabilities

The flipData package contains two new generic methods for creating variables:

  • Observed, which returns the observed values of the Outcome Variable.
  • Probabilities, which returns a matrix of the probabilities of each category of a categorical variable by case.

Displaying statistical tests

Statistical test results should be displayed by passing the output object of significance testing (usually of class htest) to the function SignificanceTest in flipFormat. This ensures that test results are well formatted and consistent between different types of tests, as well as avoiding duplication of formatting-related code.

Displaying numeric values

Numeric values that are between -1 and 1 should not be displayed with a zero before the decimal point (e.g., .123 instead of 0.123). In addition, numbers shown together should have a consistent number of decimal places, with trailing zeros added where necessary for consistency (e.g., .19, .20, .21 instead of .19, .2, .21).
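
One way to apply both rules with base R is sketched below. FormatDecimals is a hypothetical helper: formatC fixes the number of decimal places, and a regular expression then removes the zero before the decimal point.

```r
# Hypothetical helper: fixed decimal places, no leading zero for values between -1 and 1.
FormatDecimals <- function(x, digits = 2)
{
    out <- formatC(x, format = "f", digits = digits)
    # Remove a lone zero before the decimal point, keeping any minus sign.
    sub("^(-?)0\\.", "\\1.", out)
}

formatted <- FormatDecimals(c(0.19, 0.2, 0.21))
```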

R GUI Controls

The R code for a Standard R page should be as simple as possible, so that a casual user can manipulate the code without understanding all the technical details. It may be useful to expose parameters not provided in the form, to give users extra control without cluttering the interface. To make this easier, use self-explanatory variable names and inline assignment where possible. The R function exists is useful for Conditional controls. See Visualization - Area Chart for an example.
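
For example, exists can be used to fall back to a default when a conditional control has not been created. The control name formShowLabels below is hypothetical, and the sketch assumes a conditional control's variable is simply absent when the control is not shown.

```r
# Before the control exists, the default is used.
show.labels <- if (exists("formShowLabels")) formShowLabels else FALSE
before <- show.labels

# Once the (hypothetical) control variable is defined, its value is picked up.
formShowLabels <- TRUE
show.labels <- if (exists("formShowLabels")) formShowLabels else FALSE
after <- show.labels
```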

Templates and other resources