Missingness

Multiple Imputation

Level of Difficulty1: Advanced
What You’ll Mainly Do: Load an external CSV file, type in syntax, and consider interpretations
Language(s) We’ll Use: R

1 This is not an indicator of your personal level or abilities. You may experience varying levels of ease and/or difficulty, and that is OK!

Prerequisites

Materials

The following download contains files that are provided to help you understand and/or complete the walkthrough.

Prepping

  1. Go through this week’s slideshow before starting the walkthrough below, unless you already have a strong familiarity with missingness.

  2. Like last week, open up RStudio and create a new script by going to File > New File > R Script. Save this in an easily accessible folder. Now unzip this week’s data set and take the files (NerdyDataMissing.csv, NerdyCodebook_PM.csv, NerdyMeasures_PM.csv, Nerdy install.R, and Nerdy script.R) and drop them all in a folder of their own. The remaining files have additional data you can practice with.

  3. Open the file Nerdy install.R and run the syntax to install the packages you will need for this task. If you are asked to update any packages, please select 1: All.

  4. Now please go ahead and load up the following libraries

library("tidyverse")
library("mice")
library("naniar")
library("reactable")
  5. Set the working directory to the location of the script by running
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

in your console or by selecting Session > Set Working Directory > To Source File Location.

Sampling Frame

We’ll be looking at responses from a sample of 10 people who took this test in 2018.

Loading a Local Data Set

If you remember how to load a local data set, then please skip this section.

nerdy_data <- 
  read_csv("NerdyDataMissing.csv")
## Rows: 10 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (26): Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nerdy_pm_codebook <- 
  read_csv("NerdyCodebook_PM.csv")
## Rows: 26 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Index, Item
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nerdy_pm_measures <- 
  read_csv("NerdyMeasures_PM.csv")
## Rows: 5 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Description
## dbl (1): Measure
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

and take a look at the data

nerdy_data %>%
  head()
## # A tibble: 6 × 26
##      Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9   Q10   Q11   Q12   Q13
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     3     5     3     3     5     5     5    NA    NA     5     4     5     5
## 2    NA    NA     4     3     5    NA     5     1     4    NA     1    NA     4
## 3     5    NA     5     5     5    NA     5     5     5     5     5     4     5
## 4    NA     5    NA     5     5    NA     5    NA     5     5     5    NA     4
## 5     4    NA    NA     4    NA     4    NA     4     4     5     4     4    NA
## 6     4     4     4     4     4     4     5     3    NA    NA     4     4     3
## # … with 13 more variables: Q14 <dbl>, Q15 <dbl>, Q16 <dbl>, Q17 <dbl>,
## #   Q18 <dbl>, Q19 <dbl>, Q20 <dbl>, Q21 <dbl>, Q22 <dbl>, Q23 <dbl>,
## #   Q24 <dbl>, Q25 <dbl>, Q26 <dbl>

or via the reactable package1

reactable(nerdy_data,
          defaultPageSize = 5,
          showPageSizeOptions = TRUE)

as well as the codebook and measures

nerdy_pm_codebook
## # A tibble: 26 × 2
##    Index Item                                                                   
##    <chr> <chr>                                                                  
##  1 Q1    I am interested in science.                                            
##  2 Q2    I was in advanced classes.                                             
##  3 Q3    I like to play RPGs. (Ex. D&D)                                         
##  4 Q4    My appearance is not as important as my intelligence.                  
##  5 Q5    I collect books.                                                       
##  6 Q6    I prefer academic success to social success.                           
##  7 Q7    I watch science related shows.                                         
##  8 Q8    I spend recreational time researching topics others might find dry or …
##  9 Q9    I like science fiction.                                                
## 10 Q10   I would rather read a book than go to a party.                         
## # … with 16 more rows
nerdy_pm_measures
## # A tibble: 5 × 2
##   Measure Description      
##     <dbl> <chr>            
## 1       5 Agree            
## 2       4 Somewhat Agree   
## 3       3 Neutral          
## 4       2 Somewhat Disagree
## 5       1 Disagree

Missingness

You may be asking why should I care about missing data? and can’t I just delete them? Great questions!

Some analyses, like the linear regression you covered this week, require complete observations. Even so, most statistical software won’t throw an error when you feed it data containing missing values (cough cough, but R does tell you there’s an error, cough cough).

Instead, most non-R software automatically deletes incomplete cases…and silently at that, which is pretty rude. However, if you plan to, say, draw conclusions about the entire data set, that interpretation just wouldn’t be appropriate since you can no longer guarantee you’re describing the same observations. Even worse, by dropping the observations completely we lose statistical power, which in a nutshell makes it more likely that you’ll conclude there isn’t an effect when there is (aka a Type II error). Oh, but it gets even better: the dropped observations could hold crucial information about the problem of interest, resulting in biased results.

In practice, this can be the difference between being able to conclude a treatment likely works and reporting maybe something happened but we don’t know. So missing data is a big problem that we still haven’t been able to tackle head-on2.

To tackle this issue, there are two prevailing philosophical approaches. On one side are those who believe that the data is what it is and you have to deal with what you get. On the other side are those who believe a machine, whether using an algorithm by itself or coupled with artificial intelligence, can predict the likely values that should have been there, provided certain assumptions are met (which we cover in the next section)3.

So for the latter group, there are a number of tactics that can be used to tackle this missingness issue, but the main three are

  1. pairwise deletion: a method in which data for a variable pertinent to a specific assessment are included, even if values for the same individual on other variables are missing.

  2. listwise deletion: a method in which an entire case record is excluded from statistical analysis if values are found to be missing for any variable of interest.

  3. imputation: a procedure for filling in missing values in a data set before analyzing the resultant completed data set.

Most people in this camp use the deletion techniques as a last resort because there is wariness about removing data. In this walkthrough, we’ll cover the last approach: imputation.
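To make the first two concrete, here is a minimal sketch of what each deletion approach looks like in R (my own illustration, not part of the provided script); drop_na() comes from tidyr, which loads with the tidyverse, and cor() is base R.

# Listwise deletion: keep only respondents who answered every single item.
# With this data set, that likely throws away most (if not all) of the 10 rows!
nerdy_data %>%
  drop_na()

# Pairwise deletion: each statistic uses whatever cases are complete for
# that particular pair of variables, e.g. a correlation matrix
cor(nerdy_data,
    use = "pairwise.complete.obs")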

Note

Please be aware that addressing missingness in a data set is a multifaceted issue; there is a lot that is not covered here, and in practice it is very easy to go down a rabbit hole. This walkthrough is written in a way that hopefully brings a high-level topic down to an understandable level. In doing that, some information has been lost or set aside for the sake of simplicity. With that said, view this through the lens of a problem that you might, or even likely will, come across. The issue is that few people know what to do about missing data, so they either ignore it or delete it without considering the ramifications. However, if you are interested in learning more or need clarifications, please reach out!

Assumptions

While there are approaches to imputing discrete data, we’ll only be covering items whose scales are considered continuous (e.g. Likert-type)!

If you have taken statistics in the past, then you likely know about these things called assumptions, or criteria that have to be satisfied before running a statistical test (e.g. t-test, ANOVA, etc.). This is also true when dealing with missing data points, in that we have to determine if our data is

Missing Completely At Random (MCAR): The locations of missing values are random and not dependent on other data.

  • Traits: strong assumptions; difficult or possibly impossible to detect
  • What can you do? Deletion techniques or imputation, with confidence in reporting

Missing At Random (MAR): The locations of missing values are random BUT depend on some other observed data.

  • Traits: neutral(ish) assumptions; possible to detect
  • What can you do? Advanced imputation, with caution in reporting

Missing Not At Random (MNAR): There is a pattern to the missing values.

  • Traits: weak assumptions; easy to detect
  • What can you do? Nothing meaningful can be reported (get new or more data because this is bad)
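If those distinctions feel abstract, here is a toy sketch (my own, not part of the walkthrough files) that manufactures each mechanism on made-up data; every object name below is purely for illustration.

set.seed(42)

toy <- tibble(x = rnorm(100),  # a fully observed covariate
              y = rnorm(100))  # the variable we'll poke holes in

# MCAR: every value of y has the same 20% chance of going missing
y_mcar <- ifelse(runif(100) < 0.2, NA, toy$y)

# MAR: the chance that y is missing depends only on the observed x
y_mar <- ifelse(runif(100) < plogis(toy$x), NA, toy$y)

# MNAR: the chance that y is missing depends on y itself, which we
# can't see once it's gone
y_mnar <- ifelse(toy$y > 1, NA, toy$y)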

What to Do About Missing Data

In the real world, data involving humans rarely results in a complete set of values. As you can imagine, this is nearly always the case when conducting surveys where respondents are free to skip questions4. So what can you do?

Testing for MCAR

While it is difficult, if not impossible, to figure out whether a data set is MCAR, we can use probability to determine if it could be. The most rigorous way is to specify a regression model for the missingness and test it, but we’re not going to take that approach given the makeup of the course. A simpler, yet less robust, way is a technique developed by Little (1988) that gives us a probability that a data set is MCAR. But remember, there is always an associated probability that it may not be!

Using the naniar package, we can run the test. In a callback to your introductory statistics course, the null hypothesis ($H_0$) is that the data is MCAR, while the test statistic is a chi-squared ($\chi^2$) value.

mcar_test(nerdy_data)
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      90.0   169    1.00               10

Given the high \(p\)-value, we can say that nerdy_data could be MCAR. To get an idea of the number of items that have missing values, you can use

n_var_miss(nerdy_data)
## [1] 26
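While we’re at it, naniar ships a few companion helpers (not used later in this walkthrough, but handy while exploring) for the overall count and percentage of missing cells, plus a per-item summary:

# total number of missing cells, their percentage, and a per-item breakdown
n_miss(nerdy_data)
pct_miss(nerdy_data)
miss_var_summary(nerdy_data)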

To view the missingness by item, we can use the following

gg_miss_var(nerdy_data) +
  guides(color = "none")

which just says all 26 columns have at least one missing value. To see common missingness by items, we can use

gg_miss_upset(nerdy_data,
              nsets = n_var_miss(nerdy_data))

where nsets is the number of columns to include, which shows us

  • all of the variables with missing values
  • all of the common values that are missing
  • how often each combination of missing values occurs (here, each occurs once)

but that is really difficult to read, and since we’re just getting a feel for the missingness, let’s try limiting the number of variables to, say, the top 5

gg_miss_upset(nerdy_data,
              nsets = 5)

OK, that’s better! Now it’s easier to see that questions 15, 22, and 23 share a common missing value, and so forth.

We can even use ggplot to visualize this differently by item. First we have to add a column for the respondents by

nerdy_data_rowid <- 
  nerdy_data %>% 
  rowid_to_column("respondent")

which just adds a column in front with unique IDs.

nerdy_data_rowid
## # A tibble: 10 × 27
##    respondent    Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9   Q10   Q11
##         <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1          1     3     5     3     3     5     5     5    NA    NA     5     4
##  2          2    NA    NA     4     3     5    NA     5     1     4    NA     1
##  3          3     5    NA     5     5     5    NA     5     5     5     5     5
##  4          4    NA     5    NA     5     5    NA     5    NA     5     5     5
##  5          5     4    NA    NA     4    NA     4    NA     4     4     5     4
##  6          6     4     4     4     4     4     4     5     3    NA    NA     4
##  7          7     5     3     4     3    NA     4     4    NA     5    NA     3
##  8          8    NA     3     4     5     5     3    NA     5     3     5     4
##  9          9    NA     5     5    NA     2     3     0    NA     4     4    NA
## 10         10     4     5     5     5     4     3     5     4     4     5     1
## # … with 15 more variables: Q12 <dbl>, Q13 <dbl>, Q14 <dbl>, Q15 <dbl>,
## #   Q16 <dbl>, Q17 <dbl>, Q18 <dbl>, Q19 <dbl>, Q20 <dbl>, Q21 <dbl>,
## #   Q22 <dbl>, Q23 <dbl>, Q24 <dbl>, Q25 <dbl>, Q26 <dbl>

And then let’s visualize it for Q1

ggplot(data = nerdy_data_rowid, 
       aes(x = respondent,
       y = Q1)) +
  geom_miss_point()

or maybe it’s easier without the default grey background and with bigger points

ggplot(data = nerdy_data_rowid, 
       aes(x = respondent,
       y = Q1)) +
  geom_miss_point(size = 2) +
  theme_minimal()

On a side note, yes the x-axis looks odd and we could fix it, but we’re just exploring.
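If you did want to tidy it up, one option (my own tweak, not something the provided script does) is to force a labeled break at each respondent ID:

ggplot(data = nerdy_data_rowid, 
       aes(x = respondent,
       y = Q1)) +
  geom_miss_point(size = 2) +
  scale_x_continuous(breaks = 1:10) +  # one axis label per respondent
  theme_minimal()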

Multiple Imputation

Run the command mice() from the aptly named mice package to impute a data set. Here m is the number of imputed data sets the algorithm generates, and complete(1) pulls out the first of those completed data sets

nerdy_data_imputed <- 
  mice(nerdy_data, 
       m = 15, 
       method = 'pmm')  %>%
  complete(1) %>%
  as_tibble()
## Warning: Number of logged events: 3448
nerdy_data_imputed
## # A tibble: 10 × 26
##       Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9   Q10   Q11   Q12   Q13
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     3     5     3     3     5     5     5     3     4     5     4     5     5
##  2    NA     4     4     3     5     5     5     1     4    NA     1     5     4
##  3     5     5     5     5     5     5     5     5     5     5     5     4     5
##  4    NA     5     4     5     5     3     5     4     5     5     5     4     4
##  5     4     5     5     4     4     4     5     4     4     5     4     4     5
##  6     4     4     4     4     4     4     5     3     3    NA     4     4     3
##  7     5     3     4     3     4     4     4     3     5    NA     3     3     4
##  8    NA     3     4     5     5     3     4     5     3     5     4     4     3
##  9    NA     5     5     4     2     3     0     4     4     4     1     4     5
## 10     4     5     5     5     4     3     5     4     4     5     1     4     4
## # … with 13 more variables: Q14 <dbl>, Q15 <dbl>, Q16 <dbl>, Q17 <dbl>,
## #   Q18 <dbl>, Q19 <dbl>, Q20 <dbl>, Q21 <dbl>, Q22 <dbl>, Q23 <dbl>,
## #   Q24 <dbl>, Q25 <dbl>, Q26 <dbl>

Now there are a lot of ways to impute data, some of which will do a better job than others depending on what type of data you have. To see a list, just run ?mice::mice and scroll about halfway down the help page. You can change the approach by replacing pmm in method = 'pmm' above, as in the sketch below.
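For instance, this sketch swaps in classification and regression trees ('cart', one of the methods listed on that help page) and fixes the seed so the run is reproducible; your numbers will differ from the pmm run above.

nerdy_data_imputed_cart <- 
  mice(nerdy_data, 
       m = 15, 
       method = 'cart',
       seed = 1) %>%  # fixing the seed makes the imputation reproducible
  complete(1) %>%
  as_tibble()

# check how many items still contain missing values
n_var_miss(nerdy_data_imputed_cart)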

Anyway, is our new pmm-imputed data set any better? It looks like there are still missing values, so let’s take a look

n_var_miss(nerdy_data_imputed)
## [1] 5
gg_miss_var(nerdy_data_imputed) +
  guides(color = "none")
gg_miss_upset(nerdy_data_imputed)

Oh that looks a lot better! We could change m or the method to get a more complete data set.
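As an aside, the real power of multiple imputation comes from using all m data sets rather than just the first: the canonical mice workflow fits your model to each imputed set with with() and combines the estimates with pool() via Rubin’s rules. Here is a minimal sketch with an arbitrary illustrative model (Q1 predicted by Q2) that is not part of this walkthrough:

nerdy_mids <- 
  mice(nerdy_data, 
       m = 15, 
       method = 'pmm',
       seed = 1)

# fit the same linear model to each of the 15 completed data sets...
nerdy_fits <- with(nerdy_mids, lm(Q1 ~ Q2))

# ...then pool the 15 sets of estimates into a single set
pool(nerdy_fits)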

The Data

I randomly introduced missing data into the original subset, which can be seen by opening the NerdyDataSubset.csv file. To take a look at the original full data, open NerdyDataFull.csv. The remaining files contain the measures and data from that set.

More About the Packages

  1. The naniar package has a lot of functionality not covered here. Take a look at the following for more help

  2. The mice package is confusing at times and can hurt your head, but it’s by far one of the most versatile and powerful packages for imputing data. So if you find yourself with missing data, it may be beneficial to dive in headfirst. To get acquainted, a couple of vignettes can be found using the link below


  1. Scroll to the right to see all columns ↩︎

  2. …though recent advances in machine learning have brought us closer in the past three years than in the previous 50 ↩︎

  3. In all fairness, I am in this group and absolutely biased ↩︎

  4. Your first reaction may be to require respondents to address items (aka a forced response), but that almost always backfires, resulting in a higher rate of nonresponses ↩︎