SPSS Data File Specifications

From Displayr

The following specifications are designed to ensure that the resulting data file can be quickly, easily and accurately analyzed. It is usually cost-effective to provide these specifications to data collectors before they commence scripting the questionnaire for data collection.

When these specifications are not followed, considerable work may be required to analyze the data appropriately and, in some cases, it may be impossible to analyze the data file appropriately at all.

Alternatives to SPSS Data Files

SPSS Data Files have the file extension .sav and are often referred to as 'dot SAV' files. There are a variety of other SPSS data files with different extensions.

IBM SPSS Data Collection Model files, which are also known as SPSS Dimensions files, have a file extension of .mdd, and are an improvement on SPSS Data Files. These files can be opened in more modern data analysis packages, such as Q and Survey Reporter.

Portable SPSS Files have a file extension of .por. This format is rarely used anymore and tends to only be supported by quite old programs. They can be converted to an SPSS Data File using Save as in SPSS Statistics.

SPSS Syntax Files, which have the extension .sps, contain commands that can be used to convert a text file into an SPSS Data File. This is usually best done in SPSS (Run > All). However, if the syntax file is written in a certain way it can also be converted into an SPSS Data File using Q (Tools > Convert .sps File to .sav…; see [1]). Converting a syntax file into an SPSS data file is a last resort and should be avoided where possible, as bugs are common.

CSV Files and other Text Files are often the most readily available file formats for data. However, they should generally be avoided, as they provide data in a format that makes correct analysis difficult-to-impossible (see Getting a Data File on MktResearch.org for more detail about this).

Dates

Date variables, such as interview time, should be stored as date variables in the SPSS data file. If the intent is to change dates when reporting (e.g., move the last couple of days of interviewing in a month into the next wave’s reporting), a second date variable should be created which contains this recoded date data.

Non-Response and other types of missing data

Respondents who were not asked a particular question (i.e., were intentionally or unintentionally skipped) should have an SPSS SYSTEM-MISSING VALUE or be flagged with an SPSS USER-DEFINED MISSING VALUE. It is never appropriate to record all missing values in a data file as having a value of 0. This is very important because, for many binary variables, the No response is coded as a 0, making it impossible to determine which respondents said No and which were not asked the question.

Where there are multiple different types of missing data (e.g., where some questions were not asked of some respondents while others were asked but not answered), they should be coded with different values. E.g., the SPSS SYSTEM-MISSING VALUE should be used where a question was not asked of a respondent, and a value such as -99, flagged as an SPSS USER-DEFINED MISSING VALUE, where the question was asked but not answered. Sometimes it is appropriate to treat missing values for some of the questions as being equivalent to a “No” response (e.g., giving them a value of 0). For example, if people are asked which brands they have consumed, but are only shown brands that they are aware of, then this would be appropriate. In this instance, the question should be included in the data file twice, once with the SPSS SYSTEM-MISSING VALUE codes and once with the “No” responses instead.
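
As a minimal sketch of the "include the question twice" approach, the following Python snippet derives the second copy of a variable without altering the original. The names are hypothetical, `None` stands in for the SPSS system-missing value, and -99 is assumed to be the code that would be flagged as user-defined missing.

```python
# Hypothetical codes, for illustration only.
NOT_ASKED = None      # stands in for the SPSS system-missing value
NOT_ANSWERED = -99    # would be flagged as a user-defined missing value


def fill_not_asked_as_no(values):
    """Return a copy in which respondents who were not asked are coded 0 (No).

    Asked-but-not-answered (-99) and real answers are left unchanged.
    """
    return [0 if v is NOT_ASKED else v for v in values]


# "Have you consumed brand X?", only asked of respondents aware of brand X.
consumed = [1, 0, NOT_ASKED, NOT_ANSWERED]
consumed_filled = fill_not_asked_as_no(consumed)

print(consumed)         # original variable, kept in the data file
print(consumed_filled)  # second variable, with "not asked" treated as No
```

Keeping both variables means the analyst can still tell which respondents said No and which were never asked.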

Don’t Know

The Don't Know code needs to be different to the non-response code.

Single Response Questions

Single response questions need to be represented in SPSS as one variable. A data file that uses a different variable for each unique response code of a single response question is not useful. Where there are NETs in a single response question, these should be exported as additional variables. For example, if a question asks about a person's model of car and then this has been used to create NETs of manufacturers, the data should be exported as two variables, one which shows the model of car and another which shows the manufacturer.

Multiple Response Questions

Where there are multiple response questions, a binary variable should be created for each possible response, unless there is a huge code frame. For example, a question in a questionnaire may have been:

Q1	Which of the following products do you own?
	MULTIPLE RESPONSE
	Savings account		1
	Checking Account	2
	Credit Card		3
	Home loan		4

but the data file should be structured as if you had asked the following four questions.

Q1a	Do you have a savings account?
	No	0
	Yes	1

Q1b	Do you have a checking account?
	No	0
	Yes	1

Q1c	Do you have a credit card?
	No	0
	Yes	1

Q1d	Do you have a home loan?
	No	0
	Yes	1
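
The binary structure above can be illustrated with a short Python sketch. It assumes, hypothetically, that the data collection software records each respondent's answer to Q1 as a list of selected codes, with `None` meaning that the respondent was not asked the question.

```python
# Answer codes from the example questionnaire (Q1), mapped to the
# variable names used in the restructured version.
CODE_TO_VARIABLE = {1: "Q1a", 2: "Q1b", 3: "Q1c", 4: "Q1d"}


def to_binary(selected_codes):
    """Convert one respondent's selected codes into 0/1 variables.

    None means the respondent was not asked the question, so every
    variable is left missing rather than being coded 0.
    """
    if selected_codes is None:
        return {name: None for name in CODE_TO_VARIABLE.values()}
    chosen = set(selected_codes)
    return {name: int(code in chosen) for code, name in CODE_TO_VARIABLE.items()}


print(to_binary([1, 3]))  # savings account and credit card selected
print(to_binary(None))    # question not asked: all variables missing
```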

Where a multiple response question contains an other specify option, the resulting text variable should appear after all of the numeric variables (i.e., if it appears in the middle, it will prevent creation of multiple response sets).

Ideally, multiple response questions should be marked as Multiple Response Sets in the SPSS data file. If this is not done, some programs, such as Displayr and Q, will automatically guess whether or not variables need to be combined into multiple response questions. Such guessing may be inaccurate and require additional work by the user of the data.

Common problems with multiple response questions

Common serious problems with the set up of multiple response questions are:

  • Failing to distinguish between missing values versus values that were not selected by respondents. See Non-Response and other types of missing data.
  • Failing to address piping/randomization when creating the data file. See Rotations and randomizations – between questions.
  • Using the max-multi format. This is generally only sensible for huge code frames. The reason it is generally not sensible in other situations is that this format does not distinguish between missing data and options not selected by respondents (see Non-Response and other types of missing data).
  • Confusing the Variable Label with the Value Label. See the example below.
  • Having inconsistent values, such as representing the Yes values by a 1 for the first option, a 2 for the second option, etc. The following labeling scheme may appear at first glance to be sensible, but it creates difficulty for the user, particularly one using automatic tools to set up data files: there is no consistent set of values or labels, which makes automatic recognition of the multiple response set problematic.
Q1a	Do you have a savings account?
	No		0
	Savings 	1

Q1b	Do you have a checking account?
	No		0
	Checking	2

Q1c	Do you have a credit card?
	No		0
	CC		3

Multiple response questions with huge code frames

Some data files contain multiple response questions with extremely large code frames (e.g., 6,000 models of cars). Exporting these in the binary format results in data files that can be excessively large. In this situation they can be exported in max-multi format (i.e., essentially as multiple single response variables, each one recording a separate response). However, this format is, in general, much less flexible than the binary format and should be avoided where possible. In particular, with data in this format there is no way of recording missing values (e.g., if a respondent is not shown as having selected an option, this may mean that they saw the option but did not choose it, or it may mean that they were never shown the option).
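
To make the limitation concrete, here is a small Python sketch that expands one respondent's max-multi record into binary flags. The function name and the use of `None` for an empty max-multi column are assumptions made for illustration.

```python
def max_multi_to_binary(responses, codes):
    """Convert max-multi columns (one selected code per column) to 0/1 flags.

    responses: the codes recorded for one respondent, padded with None.
    codes:     every code in the code frame.
    """
    chosen = {code for code in responses if code is not None}
    return {code: int(code in chosen) for code in codes}


# The respondent chose codes 3 and 1; the third max-multi column is empty.
flags = max_multi_to_binary([3, 1, None], codes=[1, 2, 3, 4])
print(flags)
```

Every code that is absent from the record becomes a 0, whether the respondent saw the option and declined it or was never shown it. The max-multi record simply does not contain the information needed to tell these apart.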

Rotations and randomizations – between questions

Where different respondents see questions in different orders, this order needs to be removed from the data prior to creating the data file. For example, if the respondents have been asked to rate the appeal of a random selection of three of four different products, and the order has been randomized or rotated, such as in this table:

ID	Q1		Q2		Q3
1	Microsoft	Apple		IBM
2	Apple		Microsoft	IBM
3	IBM		Google		Apple
4	Google		Microsoft	IBM

then the data should be exported as if people had been asked four different questions and all respondents had seen them in the same order (where SYSMIS is SPSS’s missing value code):

ID	Q. Microsoft	Q. Apple	Q. IBM		Q. Google
1	Data from Q1	Data from Q2	Data from Q3	SYSMIS
2	Data from Q2	Data from Q1	Data from Q3	SYSMIS
3	SYSMIS		Data from Q3	Data from Q1	Data from Q2
4	Data from Q2	SYSMIS		Data from Q3	Data from Q1

Further, the order with which the data was collected should also be exported as additional variables. For example:

ID	Order Microsoft	Order Apple	Order IBM	Order Google
1	1		2		3		SYSMIS
2	2		1		3		SYSMIS
3	SYSMIS		3		1		2
4	2		SYSMIS		3		1
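
The restructuring shown in the tables above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, `None` stands in for SYSMIS, and the ratings are made-up values.

```python
BRANDS = ["Microsoft", "Apple", "IBM", "Google"]
SYSMIS = None  # stands in for the SPSS system-missing value


def unrotate(shown, ratings):
    """Reorder one respondent's data into a fixed brand order.

    shown:   brands in the order presented, e.g. ["IBM", "Google", "Apple"]
    ratings: the answers, in the same (presentation) order
    Returns (data, order): two dicts keyed by brand.
    """
    data = {brand: SYSMIS for brand in BRANDS}
    order = {brand: SYSMIS for brand in BRANDS}
    for position, (brand, rating) in enumerate(zip(shown, ratings), start=1):
        data[brand] = rating
        order[brand] = position
    return data, order


# Respondent 3 from the tables above, with made-up ratings.
data, order = unrotate(["IBM", "Google", "Apple"], [7, 4, 9])
print(data)   # one rating variable per brand
print(order)  # one presentation-order variable per brand
```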

Looped questions and grids

Care needs to be taken with the creation of labels for looped questions and some grid questions. Consider a study containing the following three questions:

Q1a	When you think of soft drinks that are sexy, which ones come to mind?  MULTIPLE RESPONSE
	Coke
	Pepsi
	Fanta
	Other

Q1b	When you think of soft drinks that are masculine, which ones come to mind?  MULTIPLE RESPONSE
	Coke
	Pepsi
	Fanta
	Other

Q1c	When you think of soft drinks that are powerful, which ones come to mind?  MULTIPLE RESPONSE
	Coke
	Pepsi
	Fanta
	Other

If the variable labels set up for such questions follow identical structures, this will make the use of the file considerably more straightforward. Some programs, such as Displayr and Q, will automatically detect the structure in the data and present it as a grid. For example, the following labels make the interpretation of the grid straightforward.

Variable Name		Variable Label
Q1a1			Q42. Brand attitude Sexy brands: Coke
Q1a2			Q42. Brand attitude Sexy brands: Pepsi
Q1a3			Q42. Brand attitude Sexy brands: Fanta
Q1a4			Q42. Brand attitude Sexy brands: Other
Q1b1			Q42. Brand attitude Masculine brands: Coke
Q1b2			Q42. Brand attitude Masculine brands: Pepsi
Q1b3			Q42. Brand attitude Masculine brands: Fanta
Q1b4			Q42. Brand attitude Masculine brands: Other
Q1c1			Q42. Brand attitude Powerful brands: Coke
Q1c2			Q42. Brand attitude Powerful brands: Pepsi
Q1c3			Q42. Brand attitude Powerful brands: Fanta
Q1c4			Q42. Brand attitude Powerful brands: Other

Common problems with the setup of grid questions

As an example, the following labels contain inconsistencies which prevent any auto-detection of the underlying structure:

Q1a1	Q42. Brand attitude Sexy brands: Coke
Q1a2	Q42. Brand attitude Sexy brands: Pepsi
Q1a3	Q42. Brand attitude Sexy brands: Fanta
Q1a4	Q42. Brand attitude Sexy brand: Other
Q1b1	Q42. Brand attitude - Masculine brands: Coke
Q1b2	Q42. Brand attitude - Masculine brands:  Pepsi
Q1b3	Q42. Brand attitude - Masculine brands: Fanta
Q1b4	Q42. Brand attitude - Masculine brands: Other
Q1c1	Brand attitude - Powerful brands: Coke
Q1c2	Brand attitude - Powerful brands: Pepsi
Q1c3	Brand attitude - Powerful brands: Fanta
Q1c4	Brand attitude - Powerful brands: Others

Common problems with the setup of grid questions include:

  • Any of the problems with multiple response questions. See Common problems with multiple response questions.
  • The Label field has been set up with contradictory or inconsistent information. Two common causes of this are:
    • Typographical errors. While these may seem like minor issues, they prevent data analysis programs from automatically identifying the looped structures in the data. In the example above:
      • An additional space precedes Pepsi for Q1b2.
      • The s is missing from brands in Q1a4.
      • An s has been added to Others in Q1c4.
      • Q42. is absent from the labels for Q1c.
      • A hyphen follows Brand attitude in the labels for Q1b and Q1c, but not in those for Q1a.
    • Truncation of the Label field by the software used to create the data file. For example, the label may read Which of the following brands do you typically consume on a hot day? with the specific brands not listed, so there is no way to deduce the correct labeling of the rows and/or columns of the grid (other than assuming they are consistently ordered which, if an incorrect assumption, will result in incorrect analyses).
  • Repeated labels. For example, if there are two Other/Specify options in the questionnaire, they should be given distinct labels such as Other 1 and Other 2. Duplicated labels can prevent the automatic detection of grids, as there is no way to tell the difference between the two options. Each label in the set must be unique.
  • There are inconsistencies in terms of the number of alternatives (brands) or attributes in the grid (e.g., some brands may not be shown with some attributes). The solution to this problem is to create new variables with no data.
  • The order of the variables is inconsistent. In the example above, the four brands are shown in the same order for each attitude statement, and this is required for successful automatic identification of the grid layout.
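
A rough check for several of these problems can be sketched in Python. This is not how Displayr or Q detect grids; it is a hypothetical illustration that assumes each label ends with ": " followed by the item (brand) name, and it only tests whether every label stem is followed by the same items in the same order.

```python
def grid_items_by_stem(labels):
    """Group variable labels by the text before the last ': '.

    Returns a dict mapping each stem (e.g. 'Q42. Brand attitude Sexy brands')
    to the list of item names that follow it, in file order.
    """
    groups = {}
    for label in labels:
        stem, _, item = label.rpartition(": ")
        groups.setdefault(stem, []).append(item)
    return groups


def is_consistent_grid(labels):
    """True if every stem is followed by the same items in the same order."""
    item_lists = list(grid_items_by_stem(labels).values())
    return all(items == item_lists[0] for items in item_lists)


good = [
    f"Q42. Brand attitude {attribute} brands: {brand}"
    for attribute in ("Sexy", "Masculine", "Powerful")
    for brand in ("Coke", "Pepsi", "Fanta", "Other")
]
bad = list(good)
bad[3] = "Q42. Brand attitude Sexy brand: Other"  # missing "s", as in Q1a4

print(is_consistent_grid(good))
print(is_consistent_grid(bad))
```

A check like this catches typos in the stem or item text, extra spaces, and inconsistent ordering. It does not catch every problem listed above; for example, it will not notice a question number that is missing from all labels of one attribute.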

Where multiple questions are asked in a loop, it is usually best if all the data appears question-by-question (i.e., all the looped variables for one question, then all the variables for the next, etc.). However, if the intent is to create Stacked Data, it is instead usually better to structure the data by loop iteration (i.e., first show all the data from the first iteration of the loop, then from the second, etc.).

Rankings

Ranking questions need to be recorded with a single variable for each item being ranked. Ideally, the most preferred item will have the highest value and the least preferred the lowest, except where the questionnaire expressly indicates an alternative coding.
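
As a minimal sketch of this layout, the Python function below turns one respondent's ranking into one variable per item. The scoring rule (the most preferred item receives a value equal to the number of items) is an assumption chosen for illustration, as are the function name and the use of `None` for items that were not ranked.

```python
def ranking_to_variables(ranked_items, all_items):
    """One variable per item; the most preferred item gets the highest value.

    ranked_items: items in order of preference, most preferred first.
    Items that were not ranked are left as None (missing).
    """
    top = len(all_items)
    values = {item: None for item in all_items}
    for position, item in enumerate(ranked_items):
        values[item] = top - position
    return values


ranking = ranking_to_variables(["Pepsi", "Coke"], ["Coke", "Pepsi", "Fanta"])
print(ranking)
```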

Un-coded open-ended questions and “other specify”

Verbatim responses to open-ended questions and “other specify” options should be stored as String variables if the data is text and Numeric variables if numeric.

Coded open-ended questions

Where open-ended questions have been coded, these are then included in the data file as if they are standard single or multiple response variables (in particular, the binary format is appropriate for multiple response questions). An additional string variable should store the raw responses.

Variable Labels

Variable labels should communicate the information contained in the variable. Variable labels such as "How important is this on a scale of 1 to 10", provided for each of a set of variables, are of no use as it is impossible to determine what is being rated without referring to the questionnaire. A better variable label would be "Importance: Price". Where practical, the variable labels should correspond to the actual wording used in the questions. Most programs that write SPSS data files automatically truncate variable labels to 120 characters, which can cause automatically generated labels from looped questions to be uninformative (e.g., the first 120 characters may not include all of the information about the loop).
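
One way to spot labels that would become uninformative after truncation is sketched below in Python. The 120-character limit is taken from the text above; the function name is hypothetical.

```python
MAX_LABEL_LENGTH = 120  # truncation length mentioned in the text


def labels_lost_by_truncation(labels, limit=MAX_LABEL_LENGTH):
    """Return the labels that are no longer unique once cut to `limit` characters."""
    seen = {}
    for label in labels:
        seen.setdefault(label[:limit], []).append(label)
    return [label for group in seen.values() if len(group) > 1 for label in group]


long_stem = "x" * 120  # stands in for a very long question text
example_labels = [long_stem + ": Coke", long_stem + ": Pepsi", "Importance: Price"]
print(labels_lost_by_truncation(example_labels))
```

Labels flagged this way usually need the distinguishing information (e.g., the brand or attribute) moved to the front of the label.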

Value Labels

Value labels should be taken directly from the questionnaire, provided that their length is 60 characters or less (this is because most programs that write SPSS data files automatically truncate to 60 characters).

Value labels should be included even for options that were not selected.

When setting up the value labels for multiple response questions and for image grids, it is important that the same value labels be used for all the options. In particular, the value label should not contain the name of the option being evaluated. For example, if the question asks “Which of the following brands are masculine?”, the values and value labels should be set up for each variable similar to:

SYSMIS		Option not shown
0		Not selected
1 		Selected 

Variable Names

Variable names should relate to the question numbers. It is often useful if separate question numbering is used for screeners, general questions and classification variables (e.g., S1, S2, …, Q1, Q2, …, C1, C2, …). Where a question is represented by multiple variables, please use a common prefix (e.g., Q4a, Q4b, Q4c), rather than each variable having a different question number (e.g., Q231, Q232, Q233). Where a question is a loop of a multiple response question, this is generally best represented via a common prefix and two separate looping suffixes (e.g., Q4a1, Q4a2, Q4b1, Q4b2). While these are only guidelines, the core principle is to employ a convention that is easily understandable, whereby the variable names are informative as to the structure of the data.

Variables To Be Excluded

Variables that have no possible meaning to the user of the data file should be excluded from the data file. Some data collection programs automatically export useless variables that only relate to the way in which the questionnaire was set up. Examples of variables with no possible meaning that may be exported include:

  • Looped variables, where one variable will have a value of 1 for every respondent, another will have a value of 2 for all respondents, and so on.
  • Variables representing unused codes in multiple response questions.
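
Candidates for exclusion can be found with a simple check for variables that never vary, sketched here in Python. The function name and data layout are hypothetical. A constant variable is not always meaningless, so the result should be treated as a list to review rather than a list to delete.

```python
def constant_variables(data):
    """Return names of variables holding the same value for every respondent.

    data: dict mapping variable name -> list of values (one per respondent).
    """
    return [name for name, values in data.items() if len(set(values)) <= 1]


example = {
    "loop1": [1, 1, 1],  # loop counter: always 1
    "loop2": [2, 2, 2],  # loop counter: always 2
    "Q1": [1, 0, 1],
}
print(constant_variables(example))
```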

Variable Type

Quantitative variables (e.g., estimates of number of flights taken in the last 12 months) should be formatted as Numeric Variables in SPSS, not as String. This will prevent illegal numeric values such as the always popular "-" (dash).
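
Before converting a text column to a numeric variable, entries that cannot be stored as numbers can be located with a check like the following Python sketch. The function name is hypothetical, and treating empty strings as missing is an assumption.

```python
def non_numeric_entries(raw_values):
    """Return (row index, text) pairs that cannot be stored as a number.

    Empty strings are treated as missing rather than as errors.
    """
    problems = []
    for index, text in enumerate(raw_values):
        if text.strip() == "":
            continue
        try:
            float(text)
        except ValueError:
            problems.append((index, text))
    return problems


flights = ["3", "-", "", "12.5", "ten"]
print(non_numeric_entries(flights))
```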

Measure

Truly numeric variables (e.g., estimates of number of flights taken in the last 12 months) should have their Measure set to Scale. Categorical variables should have their Measure set to Nominal. Ordered categorical variables should be set as Ordinal.

ID variable

This should be a variable that uniquely identifies each case (typically, each respondent). That is, each case should have a value that is different to those of the other cases, even if the same person has provided multiple cases of data. If respondents do provide multiple cases then a respondent identifier should be included as an additional variable.

The ID variable in the file should have a Measure of Scale. That is, it should be a numeric, rather than a text or categorical, field.
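
Uniqueness of the ID variable is easy to verify before the file is delivered. A minimal Python sketch, with a hypothetical function name:

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the ID values that appear on more than one case."""
    return [value for value, count in Counter(ids).items() if count > 1]


case_ids = [101, 102, 103, 102]
print(duplicate_ids(case_ids))
```

An empty result means every case has its own ID.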

Weighting

Any weighting variables constructed by the data collectors should ideally contain a Variable Label which describes the weighting procedure (e.g., Age-by-gender-by-country 2012). If the weight variable is given the Variable Name of weight in the data file, it will automatically be available as a weight in some programs (e.g., Q).