The example is intended as a realistic demonstration of the use of multiple imputation based on the principle of fully conditional specification. While the example does not show all aspects (e.g., specifying independent variables for each dependent variable, more complex relationships, passive imputation), it does show most of the points that need attention in such an analysis. The data is available in the file “dataForMice.RDS”.

1 Task

A statistician has data for a country that includes the following variables (labels shown in parentheses):

  • Gender (gndr)
  • Age (age)
  • Whether the person voted in the last election (B11) → converted to “vote,” where 1 means the person voted, and 0 means they did not. This variable also has missing values. In addition to those who did not respond, the statistician also recoded the value “not eligible to vote” as missing.
  • Place of residence (F5)
  • Education level (F6)
  • Years of education (F7)
  • Trust in political parties (B8)
  • Several variables regarding trust in various political/government institutions (parliament, government, police, judiciary, politicians, etc.) (B4-B10)
  • Variables on household income (F31-F33), employment (G59, G60 and G90a), and, if employed, the gross and net pay (salary) (G91 and G92).

The statistician wants to estimate a logistic regression model to explain how (or if) gender, age, years of education, and trust in political parties influence voter turnout. The total number of observations is 1336, and the number of observations with valid values for all variables included in the model is 575.

The statistician conducted two analyses. First, the analysis was performed only on complete cases, and second, using multiple imputations. Based on the analyses, which are not shown below, we can infer that the missing data mechanism is not MCAR.

2 Data Overview

##     vote age gndr  B8  F7    
## 575    1   1    1   1   1   0
## 407    1   1    1   1   0   1
## 122    1   1    1   0   1   1
## 66     1   1    1   0   0   2
## 56     1   1    0   1   1   1
## 38     1   1    0   1   0   2
## 15     1   1    0   0   1   2
## 5      1   1    0   0   0   3
## 28     1   0    1   1   1   1
## 14     1   0    1   1   0   2
## 2      1   0    1   0   1   2
## 3      1   0    1   0   0   3
## 3      1   0    0   1   1   2
## 1      1   0    0   1   0   3
## 1      1   0    0   0   1   3
##        0  52  119 214 534 919

3 Analysis Based on Available Units

Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.553 0.553 -6.422 0.000
gndrFemale -0.206 0.192 -1.073 0.283
age 0.052 0.006 9.015 0.000
B8 0.131 0.046 2.838 0.005
F7 0.146 0.037 3.957 0.000

4 MICE

4.1 “Naive”/Incorrect Approach

In this approach, we included only the variables in the model in the imputation model. For the number of iterations and the number of imputed datasets, we left the default values.

We also did not address whether all values labeled as NA are reasonable to impute.

4.1.1 Imputation

We see that it is not clear whether the imputation model has stabilized or converged, so it is necessary to increase the number of iterations. Additionally, only 5 imputed datasets were created, which is insufficient.

4.1.2 Analysis

term estimate std.error statistic df p.value
(Intercept) -3.098 0.473 -6.548 26.5 0.0000
gndrFemale -0.225 0.134 -1.678 177.9 0.0951
age 0.046 0.004 10.993 153.0 0.0000
B8 0.117 0.040 2.940 19.2 0.0083
F7 0.146 0.032 4.526 22.9 0.0002

4.2 Correcting the Number of Iterations and Datasets (Some Issues Remain)

Previously, we found that it is necessary to increase the number of iterations and imputed datasets. We increase the number of iterations to 20 (10 might also work, but more is even better) and the number of datasets to 30 (more would also be better).

4.2.1 Imputation

The number of iterations is probably appropriate; we do not see any clear patterns, and 20 iterations are already sufficient for such patterns to emerge.

From the visualizations, it is not evident, but the analysis is still incorrect because we have not used all the information contained in the dataset.

4.2.2 Analysis

term estimate std.error statistic df p.value
(Intercept) -3.312 0.449 -7.374 234.5 0.0000
gndrFemale -0.225 0.131 -1.713 864.4 0.0870
age 0.047 0.004 11.239 658.0 0.0000
B8 0.118 0.034 3.507 337.4 0.0005
F7 0.161 0.031 5.187 176.9 0.0000

4.3 Including Additional Variables (Some Issues Remain)

Another mistake we made is that we included only the variables from the regression model in the imputation model, even though the task description clearly indicates that there are variables in the dataset that are obviously related to the variables with missing values. From the data overview, we observed that many missing values occur primarily in variables F7 (years of education) and B8 (trust in political parties). Therefore, we will look for variables that might be related to these variables. Variables potentially related to F7 (years of education) include:

  • Education level (F6)
  • Variables related to income/salary (individual or household) and employment status –> Salary presents a slight issue here because we do not want to impute values that do not exist. This issue could be resolved by creating a new variable, “salary,” which will have a value of 0 if the person is unemployed. In this case, we would also need to include the variable indicating employment status.

Since F6 is a very good substitute (predictor) for F7 and also has very few missing values (see the graph/output below), we used only F6 here.

In this specific case, an even better solution might be to replace F7 directly with F6 in the regression model (which we must treat as an ordinal variable), but the purpose of this document is to demonstrate handling missing data, so we will not delve into or modify the model itself.

Variables potentially related to B8 (trust in political parties) include variables related to trust in other political “subjects,” such as politicians (B7), the national parliament (B4), and possibly others. For simplicity, we will focus here on these two most closely related variables.

##     vote F6 age gndr  B7  B8  B4  F7     
## 424    1  1   1    1   1   1   1   1    0
## 302    1  1   1    1   1   1   1   0    1
## 74     1  1   1    1   1   1   0   1    1
## 51     1  1   1    1   1   1   0   0    2
## 80     1  1   1    1   1   0   1   1    1
## 46     1  1   1    1   1   0   1   0    2
## 26     1  1   1    1   1   0   0   1    2
## 10     1  1   1    1   1   0   0   0    3
## 58     1  1   1    1   0   1   1   1    1
## 42     1  1   1    1   0   1   1   0    2
## 14     1  1   1    1   0   1   0   1    2
## 10     1  1   1    1   0   1   0   0    3
## 13     1  1   1    1   0   0   1   1    2
## 5      1  1   1    1   0   0   1   0    3
## 1      1  1   1    1   0   0   0   1    3
## 3      1  1   1    1   0   0   0   0    4
## 39     1  1   1    0   1   1   1   1    1
## 27     1  1   1    0   1   1   1   0    2
## 8      1  1   1    0   1   1   0   1    2
## 6      1  1   1    0   1   1   0   0    3
## 10     1  1   1    0   1   0   1   1    2
## 4      1  1   1    0   1   0   1   0    3
## 1      1  1   1    0   1   0   0   1    3
## 8      1  1   1    0   0   1   1   1    2
## 5      1  1   1    0   0   1   1   0    3
## 1      1  1   1    0   0   1   0   1    3
## 3      1  1   1    0   0   0   1   1    3
## 17     1  1   0    1   1   1   1   1    1
## 11     1  1   0    1   1   1   1   0    2
## 5      1  1   0    1   1   1   0   1    2
## 1      1  1   0    1   1   1   0   0    3
## 2      1  1   0    1   1   0   1   1    2
## 2      1  1   0    1   1   0   1   0    3
## 1      1  1   0    1   1   0   0   0    4
## 4      1  1   0    1   0   1   1   1    2
## 2      1  1   0    1   0   1   0   1    3
## 2      1  1   0    1   0   1   0   0    4
## 3      1  1   0    0   1   1   1   1    2
## 1      1  1   0    0   1   1   1   0    3
## 1      1  1   0    0   1   0   0   1    4
## 4      1  0   1    1   1   1   1   1    1
## 1      1  0   1    1   1   1   1   0    2
## 1      1  0   1    1   1   0   1   1    2
## 1      1  0   1    1   1   0   1   0    3
## 1      1  0   1    1   1   0   0   0    4
## 1      1  0   1    1   0   1   1   1    2
## 1      1  0   1    1   0   1   0   0    4
## 1      1  0   1    1   0   0   0   1    4
## 1      1  0   1    0   1   0   1   1    3
## 1      1  0   1    0   1   0   0   0    5
##        0 13  52  119 174 214 221 534 1327

From the display, we can see that there are only 5 units where both F6 and F7 have missing values and 5 units where all three variables B4, B7, and B8 have missing values.

4.3.1 Imputation

4.3.2 Analysis

term estimate std.error statistic df p.value
(Intercept) -3.501 0.394 -8.883 473.5 0.0000
gndrFemale -0.277 0.131 -2.117 991.7 0.0345
age 0.047 0.004 11.792 908.3 0.0000
B8 0.104 0.031 3.352 920.9 0.0008
F7 0.168 0.024 6.919 407.0 0.0000

4.4 MICE - Correct (Probably)

We still have not corrected a key mistake, which is that units without values (“not eligible to vote”) are treated as if they do have values. To fix this error, we exclude those without voting eligibility from the data.

Since there are relatively few such units, we do not expect this to have a significant impact.

##     vote F6 age gndr  B7  B8  B4  F7     
## 394    1  1   1    1   1   1   1   1    0
## 294    1  1   1    1   1   1   1   0    1
## 74     1  1   1    1   1   1   0   1    1
## 51     1  1   1    1   1   1   0   0    2
## 75     1  1   1    1   1   0   1   1    1
## 46     1  1   1    1   1   0   1   0    2
## 24     1  1   1    1   1   0   0   1    2
## 10     1  1   1    1   1   0   0   0    3
## 56     1  1   1    1   0   1   1   1    1
## 42     1  1   1    1   0   1   1   0    2
## 14     1  1   1    1   0   1   0   1    2
## 10     1  1   1    1   0   1   0   0    3
## 13     1  1   1    1   0   0   1   1    2
## 5      1  1   1    1   0   0   1   0    3
## 1      1  1   1    1   0   0   0   1    3
## 3      1  1   1    1   0   0   0   0    4
## 39     1  1   1    0   1   1   1   1    1
## 26     1  1   1    0   1   1   1   0    2
## 8      1  1   1    0   1   1   0   1    2
## 6      1  1   1    0   1   1   0   0    3
## 9      1  1   1    0   1   0   1   1    2
## 4      1  1   1    0   1   0   1   0    3
## 1      1  1   1    0   1   0   0   1    3
## 7      1  1   1    0   0   1   1   1    2
## 5      1  1   1    0   0   1   1   0    3
## 1      1  1   1    0   0   1   0   1    3
## 3      1  1   1    0   0   0   1   1    3
## 15     1  1   0    1   1   1   1   1    1
## 11     1  1   0    1   1   1   1   0    2
## 5      1  1   0    1   1   1   0   1    2
## 1      1  1   0    1   1   1   0   0    3
## 2      1  1   0    1   1   0   1   1    2
## 2      1  1   0    1   1   0   1   0    3
## 1      1  1   0    1   1   0   0   0    4
## 4      1  1   0    1   0   1   1   1    2
## 2      1  1   0    1   0   1   0   1    3
## 2      1  1   0    1   0   1   0   0    4
## 3      1  1   0    0   1   1   1   1    2
## 1      1  1   0    0   1   1   1   0    3
## 1      1  1   0    0   1   0   0   1    4
## 4      1  0   1    1   1   1   1   1    1
## 1      1  0   1    1   1   0   1   1    2
## 1      1  0   1    1   1   0   1   0    3
## 1      1  0   1    1   1   0   0   0    4
## 1      1  0   1    1   0   1   1   1    2
## 1      1  0   1    1   0   1   0   0    4
## 1      1  0   1    1   0   0   0   1    4
## 1      1  0   1    0   1   0   1   1    3
## 1      1  0   1    0   1   0   0   0    5
##        0 12  50  116 171 206 219 524 1298

4.4.1 Imputation

4.4.2 Analysis

term estimate std.error statistic df p.value
(Intercept) -2.857 0.396 -7.221 742.9 0.0000
gndrFemale -0.301 0.132 -2.274 988.0 0.0232
age 0.040 0.004 9.733 1026.9 0.0000
B8 0.112 0.031 3.566 931.5 0.0004
F7 0.143 0.024 5.997 612.0 0.0000

5 Comparison of Results

Below, we present all the results calculated so far, including coefficient values and \(p\)-values. However, based on these results, we cannot determine which of them is the correct one. What we can observe is that the results are generally quite similar. The largest differences are found in the estimation of the effect of gender and, to some extent, the effect of variables B8 and F7. Differences in regression coefficients are relatively small but not negligible (particularly for gender and B8). This suggests that the mechanism of missing values is not MCAR but at least MAR. Differences in \(p\)-values for gender and B8 are also noticeable. These are partly due to the change in the amount of used information (number of “usable” values) and partly due to differences in coefficient values.

As mentioned, we cannot determine which result is closer to the true one. However, based on theory, we can infer that the most accurate result is likely the one obtained from the final analysis based on multiple imputations since we systematically addressed “errors” throughout this task.

(Intercept)
gndrFemale
age
B8
F7
beta p beta p beta p beta p beta p
listwise -3.553 0.000 -0.206 0.283 0.052 0.000 0.131 0.005 0.146 0.000
miceWrong1 -3.098 0.000 -0.225 0.095 0.046 0.000 0.117 0.008 0.146 0.000
miceWrong2 -3.312 0.000 -0.225 0.087 0.047 0.000 0.118 0.001 0.161 0.000
miceWrong3 -3.501 0.000 -0.277 0.034 0.047 0.000 0.104 0.001 0.168 0.000
miceCorrect -2.857 0.000 -0.301 0.023 0.040 0.000 0.112 0.000 0.143 0.000

6 “Truth”

The dataset in this task is not real. It is based on real data but has been adjusted in the following ways: * The effect of the gender variable has been altered. * Units with originally missing values have been removed from the dataset. * Missing values have been added to the data using the MAR mechanism.

The dataset after the first two steps is effectively a dataset without missing values.

6.1 Analysis

The results of the analysis on this dataset can be considered “true” and are shown in the table below. Note that in real-world analysis, we do not have access to such “true” datasets.

Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.786 0.369 -7.558 0.000
gndrFemale -0.319 0.128 -2.502 0.012
age 0.039 0.004 9.838 0.000
B8 0.116 0.030 3.818 0.000
F7 0.137 0.021 6.464 0.000

6.2 Comparison with True Results

In this case, we can compare the results to the “true” ones (shown in the table below), where we find that the results of the final analysis are closest to the true values. However, as emphasized earlier, such a comparison is not something we can do in real-world scenarios.

(Intercept)
gndrFemale
age
B8
F7
beta p beta p beta p beta p beta p
true -2.786 0.000 -0.319 0.012 0.039 0.000 0.116 0.000 0.137 0.000
listwise -3.553 0.000 -0.206 0.283 0.052 0.000 0.131 0.005 0.146 0.000
miceWrong1 -3.098 0.000 -0.225 0.095 0.046 0.000 0.117 0.008 0.146 0.000
miceWrong2 -3.312 0.000 -0.225 0.095 0.047 0.000 0.118 0.008 0.161 0.000
miceWrong3 -3.501 0.000 -0.277 0.095 0.047 0.000 0.104 0.008 0.168 0.000
miceCorrect -2.857 0.000 -0.301 0.023 0.040 0.000 0.112 0.000 0.143 0.000