Flat Data File

From Displayr

A flat data file is a computer file that contains raw data structured so that:

  • Each row represents an individual observation (e.g., transactions in a transaction database, customers in a customer database, respondents in a database of survey responses).
  • Each column represents some property of the individual cases. In the example below, a very small data set of 14 people who completed a survey, the columns hold a unique ID number and other characteristics of the survey respondents, such as their age, education, and employment status. Each column is referred to as a variable. In this example every column contains numbers, but columns can also contain text.
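
As a rough illustration of the row and column structure described above, the following Python sketch reads a small flat file using only the standard library. The variable names and values are hypothetical and are not taken from the example data set; they only show the shape of the data.

```python
import csv
import io

# Hypothetical flat data: one row per respondent, one column per variable.
RAW = """ID,Age,Education,Employment
1,3,2,1
2,5,4,2
3,2,1,1
"""

def read_flat_file(text):
    """Return (variable_names, rows); each row is a dict keyed by variable name."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames, rows

variables, observations = read_flat_file(RAW)

# Each observation is one row; each key is one variable.
for row in observations:
    print(row["ID"], row["Age"])
```

Note that a plain text file like this one stores every value as text, so the numbers come back as strings and must be converted by the person analyzing the data.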

Representation of categories and metadata

Not all data files are created equal. They come in three basic flavors: data files with text representations of categories, data files with numeric representations of categories, and data files with metadata. The key distinctions are how categories are represented and what metadata is provided, and these distinctions greatly influence the time taken to analyze the resulting data file. Data files with metadata are ideal.

The example on the left shows some of the data for 14 people from a survey where text is used to represent categorical information. The same data is shown on the right, except that numbers represent the categories (e.g., in the variable Q2. Gender, a value of 2 indicates that a person is Female). While the text representation is easier for the human eye to interpret, the numeric representation is typically a lot better. This is because the text representation can miss important information (e.g., we can tell from the numbers that Male has been ordered before Female in the list of categories for Q2, whereas the text does not tell us this). An additional problem with text representations is that they often break down when labels are changed (e.g., if Female is changed to Females in only some rows, the data file ends up with two categories that mean the same thing).
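
The two points above (category order and label changes) can be sketched in Python. The code frame below assumes that 1 means Male and 2 means Female, as the Q2 example implies; the column values and function names are made up for illustration.

```python
# Assumed code frame for Q2. Gender: numeric value -> label.
GENDER_LABELS = {1: "Male", 2: "Female"}

# The same hypothetical responses in both representations.
text_column = ["Female", "Male", "Female", "Male"]
numeric_column = [2, 1, 2, 1]

def categories_from_codes(codes, labels):
    """Category order is recoverable: sort by the numeric value."""
    return [labels[code] for code in sorted(set(codes))]

def categories_from_text(values):
    """Text carries no order of its own; alphabetical is the only fallback."""
    return sorted(set(values))

order_from_numbers = categories_from_codes(numeric_column, GENDER_LABELS)
order_from_text = categories_from_text(text_column)

# Relabeling with numeric codes touches only the code frame, not the data.
relabeled = {**GENDER_LABELS, 2: "Females"}
after_relabel = categories_from_codes(numeric_column, relabeled)

# With text, a label changed in only some rows creates a spurious category.
partly_changed = ["Females", "Male", "Female", "Male"]
messy_categories = categories_from_text(partly_changed)
```

Here the numeric column yields Male before Female, matching the intended order, while the text column can only be sorted alphabetically. After the partial label change, the text column reports three categories instead of two.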

The disadvantage of the numeric representation of categories is that the person analyzing the data needs to re-enter the information about the meaning of the numbers after the data file is imported into a data science app, and this can be a very time-consuming process. For this reason, many specialist data file formats have been developed for data science which contain both the numeric representation of the categories and all the metadata required to interpret that data. For example, in the case of the age data, the metadata may include the label we want to use to refer to the data (i.e., Age), the wording used to collect the data (What is your age?), and the categories that correspond to each value (e.g., 1 is "18 to 24", etc.).
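
To make the idea of metadata concrete, here is a minimal Python sketch of the information such a file might carry for the age variable. It is not the layout of any particular file format. The class, the variable name Q1, and every category other than 1 = "18 to 24" are assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass
class VariableMetadata:
    name: str           # hypothetical column name in the data file
    label: str          # label used to refer to the data
    question: str       # wording used to collect the data
    value_labels: dict  # numeric value -> category label

AGE = VariableMetadata(
    name="Q1",
    label="Age",
    question="What is your age?",
    # Only 1 = "18 to 24" comes from the text; the rest are hypothetical.
    value_labels={1: "18 to 24", 2: "25 to 34", 3: "35 or more"},
)

def decode(values, meta):
    """Translate numeric codes to labels; unknown codes become None."""
    return [meta.value_labels.get(value) for value in values]

decoded = decode([1, 3, 2, 9], AGE)
```

With the metadata stored alongside the numbers, the labels can be recovered automatically rather than re-entered by hand, and a value with no matching category (9 here) is easy to spot.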

See also