Vignette: Simulating a minimal SPSS dataset from R

Written on April 30, 2020

14 min read

What this is about 📖

I will simulate a minimal labelled survey dataset that can be exported as a SPSS (.SAV) file (with full variable and value labels) in R. I will also attempt to fabricate ‘meaningful patterns’ to the dataset such that it can be more effectively used for creating demo examples.

image from Giphy

Background

Simulating data is one of the most useful skills to have in R. For one, it is helpful when you’re debugging code, and you want to create a reprex (reproducible example) to ask for help more effectively (help others help you , as the saying goes.)¹ However, regardless of whether you’re a researcher or a business analyst, the data associated with your code is likely to be either confidential so you cannot share it on Stack Overflow, or way too large or complex for you to upload anyway. Creating an example dataset from a few lines of code which you can safely share is an effective way to get around this problem.

Data simulation is slightly more tricky with survey datasets, which are characterised by (1) labels on both variable and values/codes, and (2) a large proportion of ordinal / categorical variables.

For instance, a Net Promoter Score (NPS) variable is usually accompanied with the variable label “On a scale of 0-10, how likely are you to recommend X to a friend or family?” (i.e. the actual question asked in a survey), and is itself an instance of an ordinal variable. If you are trying to produce an example that hinges on an issue where labels are relevant, you would also need to simulate the labels as well.

There are also educational reasons for simulating data: it is useful to simulate data to demo an analysis or a function, because this makes it easy for the audience to reproduce the example. For this purpose, it would be especially beneficial if you can simulate a dataset where there you can introduce some arbitrary relationships between the variables, rather than them being completely random (sample() all the way).

Personally, I have in the past found it a pain to simulate datasets which are suited for demo-ing survey related functions, especially when I was working on examples for the {surveytoolbox} package 📦. Hence, this is partly an attempt to simulate a labelled dataset that is minimally sufficient for demonstrating some of the {surveytoolbox} functions.

🏷 For more information specifically on manipulating labels in R, do check out a previous post I’ve written on working with SPSS labels in R.

Getting started

To run this example, we’ll need to load {tidyverse}, {surveytoolbox}, and {haven}. Specifically, I’m using {tidyverse} for its data manipulation functions, {surveytoolbox} for functions to set up variable/value labels, and finally {haven} to export the data as a .SAV file.

Note that {surveytoolbox} is currently not available on CRAN yet, but you can install this by running devtools::install_github("martinctc/surveytoolbox"). You’ll need {devtools} installed, if you haven’t got it already.

In addition to loading the packages, we will also set the seed² with set.seed() to make the simulated numbers reproducible:

library(tidyverse)
library(surveytoolbox) # Install with devtools::install_github("martinctc/surveytoolbox")
library(haven)

set.seed(100) # Enable reproducibility - 100 is arbitrary

Create individual vectors

For the purpose of clarity and ease of debugging, my approach will be to first set up each simulated variable as individual labelled vectors, and then bind them together into a data frame at the end. To adorn variable and value labels to a numeric vector, I will use set_varl() and set_vall() from {surveytoolbox} to do these tasks respectively.

I want to create a dataset with 1000 observations, so I will start with creating v_id as an ID variable running from 1 to 1000, which can simply be generated with the seq() function.³ I will then use set_varl() from {surveytoolbox} to set a variable label for the v_id vector. The second argument of set_varl() takes in a character vector and assigns it as the variable label of the target variable - super straightforward.

## Record Identifier
v_id <-
  seq(1, 1000) %>%
  set_varl("Record Identifier")

The same goes for v_gender, but this time I want to also (1) apply an arbitrary probability to the distribution, and (2) give each value in the vector a value label (“Male”, “Female”, “Other”).

To do (1), I pass a numeric vector to the prob argument to represent the probabilities that 1, 2, and 3 will fall out for n = 1000.

To do (2), I run set_vall() and pass the desired labels to the value_labels argument. set_vall() acccepts a named character vector to be assigned as value labels.

Finally, I run set_varl() again to make sure that a variable label is present.

## Gender
v_gender <-
  sample(x = 1:3,
         size = 1000, replace = TRUE,
         prob = c(.48, .48, .04)) %>% # arbitrary probability
  set_vall(value_labels = c("Male" = 1,
                            "Female" = 2,
                            "Other" = 3)) %>%
  set_varl("Q1. Gender")

Now that we’ve got our ID variable and a basic grouping variable (gender), let’s also create some mock metric variables.

I want to create a 5-point scale KPI variable (which could represent customer satisfaction or likelihood to recommend). One way to do this is to simply run sample() again, and do the same thing we did for v_gender:

## KPI - #1 simple sampling
v_kpi <-
  sample(x = 1:5,
         size = 1000,
         replace = TRUE) %>%
  set_vall(value_labels = c("Extremely dissatisfied" = 1,
                            "Somewhat dissatisfied" = 2,
                            "Neither" = 3,
                            "Satisfied" = 4,
                            "Extremely satisfied" = 5)) %>%
  set_varl("Q2. KPI")

Whilst the above approach is straightforward, the downside is that the numbers are likely to look completely random if we try to actually analyse the results - which is what sample() is supposed to do - but clearly isn’t ideal.

I want to simulate numbers that are more realistic, i.e. data which will form a discernible pattern when grouping and summarising by gender. What I’ll therefore do is to iterate through each number in v_gender, and sample numbers based on the gender of the ‘respondent’.

The values that are passed below to the prob argument within sample() are completely arbitrary, but are designed to generate results where a bigger KPI value is more likely if v_gender == 1, followed by v_gender == 3, then v_gender == 2.

Note that I’ve used map2_dbl() here (from the {purrr} package, part of {tidyverse}), which “loops” through v_gender and returns a numeric value for each iteration.

## KPI - #2 gender-dependent sampling
v_kpi <-
  v_gender %>%
  map_dbl(function(x){
    if(x == 1){
      sample(1:5,
             size = 1,
             prob = c(10, 17, 17, 28, 28)) # Sum to 100
    } else if(x == 2){
      sample(1:5,
             size = 1,
             prob = c(11, 22, 28, 22, 17)) # Sum to 100

    } else {
      sample(1:5,
             size = 1,
             prob = c(13, 20, 20, 27, 20)) # Sum to 100
    }
  }) %>%
  set_vall(value_labels = c("Extremely dissatisfied" = 1,
                            "Somewhat dissatisfied" = 2,
                            "Neither" = 3,
                            "Satisfied" = 4,
                            "Extremely satisfied" = 5)) %>%
  set_varl("Q2. KPI")

To add a level of complexity, let me also simulate a mock NPS variable. One way to do this is to punch in random numbers like how it is done above with v_kpi, but this will involve a lot more random punching than is desirable for a 11-point scale NPS variable.

I will therefore instead write a custom function called skew_inputs() that ‘expands’ three arbitrary input numbers into 11 numbers, which will then serve as the probability anchors for my sample() functions later on.

## Generate skew inputs for sample probability
##
## `value1`, `value2` and `value3`
## generate the skewed probabilities
##
skew_inputs <- function(value1, value2, value3){
  
  all_n <-
  c(rep(value1, 7), # 0 - 6
    rep(value2, 2), # 7 - 8
    rep(value3, 2)) # 9 - 10
  
  return(sort(all_n))
}

## Outcome KPI - NPS
v_nps <-
  v_gender %>%
  map_dbl(function(x){
    if(x == 1){

      sample(0:10, size = 1, prob = skew_inputs(1, 1, 8))

    } else if(x == 2){

      sample(0:10, size = 1, prob = skew_inputs(2, 3, 5))

    } else if(x == 3){

      sample(0:10, size = 1, prob = skew_inputs(1, 3, 6))

    } else {

      stop("Error - check x")

    }
  }) %>%
  set_varl("Q3. NPS")

Admittedly that the above procedure isn’t minimal, but note that this is a trade-off to introduce some arbitrary patterns to the data. A ‘quick and dirty’ alternative simulation would simply be to run sample(x = 0:10, size = 1000, replace = TRUE) for v_nps.

There is one slight technicality: the so-called NPS question is strictly speaking a likelihood to recommend question which ranges from 0 to 10, and the Net Promoter Score itself is calculated on a recoded version of that question where Detractors (scoring 0 to 6) have to be coded as -100, Passives (scoring 7 to 8) as 0, and Promoters (scoring 9 to 10) as +100. The Net Promoter Score is simply calculated as a mean of those recoded values.

Fortunately, the {surveytoolbox} package comes shipped with a as_nps() function that does this recoding for you, and also automatically applies the value labels. let’s call this new variable v_nps2:

## Outcome KPI - Recoded NPS (NPS2)

v_nps2 <- as_nps(v_nps) %>% set_varl("Q3X. Recoded NPS")

Combine vectors

Now that all the individual variables are set up, I can simply combine them all into a tibble in one swift movement⁴:

#### Combine individual vectors ####
combined_df <-
  tibble(id = v_id,
         gender = v_gender,
         kpi = v_kpi,
         nps = v_nps,
         nps2 = v_nps2)

Results!

image from Giphy

Let’s run a few checks on our dataset to confirm that everything has worked out okay.

The classic {dplyr} glimpse():

combined_df %>% glimpse()

## Observations: 1,000
## Variables: 5
## $ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ gender <int+lbl> 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2,...
## $ kpi    <dbl+lbl> 3, 2, 5, 2, 5, 5, 2, 2, 5, 3, 3, 4, 1, 5, 4, 4, 4, 1, 2,...
## $ nps    <dbl> 10, 5, 10, 8, 10, 9, 7, 5, 2, 5, 1, 4, 5, 9, 9, 10, 9, 3, 10...
## $ nps2   <dbl+lbl> 100, -100, 100, 0, 100, 100, 0, -100, -100, -100, -100, ...

Then head() to see the first five rows:

combined_df %>% head()

## # A tibble: 6 x 5
##      id     gender                       kpi   nps             nps2
##   <int>  <int+lbl>                 <dbl+lbl> <dbl>        <dbl+lbl>
## 1     1 2 [Female] 3 [Neither]                  10  100 [Promoter] 
## 2     2 2 [Female] 2 [Somewhat dissatisfied]     5 -100 [Detractor]
## 3     3 1 [Male]   5 [Extremely satisfied]      10  100 [Promoter] 
## 4     4 2 [Female] 2 [Somewhat dissatisfied]     8    0 [Passive]  
## 5     5 2 [Female] 5 [Extremely satisfied]      10  100 [Promoter] 
## 6     6 1 [Male]   5 [Extremely satisfied]       9  100 [Promoter]

So it appears that the value labels have been properly attached, and the range of values are what we’d expect. Now what about the “fake patterns”?

Looking at the topline result of the data, we seem to have succeeded in fabricating some sensible patterns in the data. It appears that this company X will need to work harder at winning over its female customers, who have rated them lower on two KPI metrics:

combined_df %>%
  group_by(gender) %>%
  summarise(n = n_distinct(id),
            kpi = mean(kpi),
            nps2 = mean(nps2))

## # A tibble: 3 x 4
##       gender     n   kpi  nps2
##    <int+lbl> <int> <dbl> <dbl>
## 1 1 [Male]     490  3.49 31.0 
## 2 2 [Female]   464  3.07 -8.62
## 3 3 [Other]     46  3.15 17.4

Check the labels 🏷🏷🏷

Finally I’d like to share a couple of functions that enable you to explore the labels in a labelled dataset. surveytoolbox::varl_tb() accepts a labelled data frame, and returns a two-column data frame with the variable name and its corresponding variable label:

combined_df %>% varl_tb()

## # A tibble: 5 x 2
##   var    var_label        
##   <chr>  <chr>            
## 1 id     Record Identifier
## 2 gender Q1. Gender       
## 3 kpi    Q2. KPI          
## 4 nps    Q3. NPS          
## 5 nps2   Q3X. Recoded NPS

surveytoolbox::data_dict() takes this further, and shows also the value labels as a third column. This is what effectively what’s typically referred to as a code frame in a market research context:

combined_df %>%
  select(-id) %>%
  data_dict()

##      var        label_var
## 1 gender       Q1. Gender
## 2    kpi          Q2. KPI
## 3    nps          Q3. NPS
## 4   nps2 Q3X. Recoded NPS
##                                                                                label_val
## 1                                                                    Male; Female; Other
## 2 Extremely dissatisfied; Somewhat dissatisfied; Neither; Satisfied; Extremely satisfied
## 3                                                                                       
## 4                                            Detractor; Passive; Promoter; Missing value
##                              value
## 1                          1; 2; 3
## 2                    1; 2; 3; 4; 5
## 3 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10
## 4                     -100; 0; 100

I would also highly recommend the view_df() function from {sjPlot}, which exports a similar overview of variables and labels in a nicely formatted HTML table. For huge labelled datasets, this offers a fantastic light-weight way to browse through your variables and labels.

combined_df %>% sjPlot::view_df()

Once we’ve checked all the labels and we’re happy with everything, we can then export our dataset with haven::write_sav()! If everything’s worked properly, all the labels should appear properly if you choose to open your example dataset in SPSS, or Q:

combined_df %>% haven::write_sav("Simulated Dataset.sav")

End notes

I hope you’ve found this vignette useful!

If you ever get a chance to try out {surveytoolbox}, I would really appreciate if you can submit any issues/feedback on GitHub, or get in touch with me directly. I’m looking for collaborators to make the package more user-friendly and powerful, so if you’re interested, please don’t be shy and give me a shout! 😄

Check out this RStudio Community thread to learn more about reprex (the portmanteau reprex is coined by Romain Francois)↩︎
If you’re not familiar with this concept / approach, I’d recommend checking out this Stack Overflow thread.↩︎
For those who are more ambitious, I would recommend checking out the {uuid} package for generating proper GUIDs (Globally Unique Identifier). However, this then wouldn’t be minimal, so I would just stick with running a simple seq() sequence.↩︎
I shouldn’t need to footnote this, but here’s a Rocky Flintstone tribute for any Belinkers out there. 🤣↩︎

Browse by topic

r vignettes rstats tidyverse learning-r rdatatable surveys open-source interviews excel LLMs GenAI rvest rmarkdown statistics shiny political-analytics package-reviews gpx data-collection