The example is intended as a realistic demonstration of the use of multiple imputation based on the principle of fully conditional specification. While the example does not show all aspects (e.g., specifying independent variables for each dependent variable, more complex relationships, passive imputation), it does show most of the points that need attention in such an analysis. The data is available in the file “dataForMice.RDS”.
A statistician has data for a country that includes the following variables (labels shown in parentheses):
- gender (gndr) and age (age),
- whether the person voted in the last election (B11), converted to “vote,” where 1 means the person voted and 0 means they did not. This variable also has missing values; in addition to those who did not respond, the statistician also recoded the value “not eligible to vote” as missing,
- education (F5, F6 and F7, where F7 is the number of years of education),
- trust in political parties (B8) and in other political institutions and actors (B4–B10),
- variables F31–F33,
- employment (G59, G60 and G90a) and, if employed, the gross and net pay (salary) (G91 and G92).

The statistician wants to estimate a logistic regression model to explain how (or if) gender, age, years of education, and trust in political parties influence voter turnout. The total number of observations is 1336, and the number of observations with valid values for all variables included in the model is 575.
The statistician conducted two analyses: first, an analysis performed only on complete cases, and second, an analysis using multiple imputation. Based on analyses that are not shown here, we can infer that the missing-data mechanism is not MCAR.
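A minimal sketch of how the data could be read and the missing-data patterns inspected with the mice package; the variable names follow the description above, while object names such as `dat` and `modelVars` are only illustrative:

```r
library(mice)

# Read the data and select the variables used in the regression model
dat <- readRDS("dataForMice.RDS")
modelVars <- c("vote", "gndr", "age", "B8", "F7")

# Overview of missing-data patterns for the model variables
md.pattern(dat[, modelVars])
```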
```
## vote age gndr B8 F7
## 575 1 1 1 1 1 0
## 407 1 1 1 1 0 1
## 122 1 1 1 0 1 1
## 66 1 1 1 0 0 2
## 56 1 1 0 1 1 1
## 38 1 1 0 1 0 2
## 15 1 1 0 0 1 2
## 5 1 1 0 0 0 3
## 28 1 0 1 1 1 1
## 14 1 0 1 1 0 2
## 2 1 0 1 0 1 2
## 3 1 0 1 0 0 3
## 3 1 0 0 1 1 2
## 1 1 0 0 1 0 3
## 1 1 0 0 0 1 3
## 0 52 119 214 534 919
```
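The complete-case (listwise) analysis can then be obtained directly with `glm()`, which by default drops rows with missing values and therefore uses the 575 complete cases; `fitListwise` is again an illustrative name:

```r
# Complete-case (listwise) logistic regression
fitListwise <- glm(vote ~ gndr + age + B8 + F7,
                   family = binomial, data = dat)
summary(fitListwise)
```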
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -3.553 | 0.553 | -6.422 | 0.000 |
| gndrFemale | -0.206 | 0.192 | -1.073 | 0.283 |
| age | 0.052 | 0.006 | 9.015 | 0.000 |
| B8 | 0.131 | 0.046 | 2.838 | 0.005 |
| F7 | 0.146 | 0.037 | 3.957 | 0.000 |
In this first approach, we included in the imputation model only the variables that appear in the regression model. For the number of iterations and the number of imputed datasets, we kept the default values. We also did not consider whether all values labeled as NA are actually reasonable to impute.
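A sketch of this first, naive attempt, continuing from the objects above; `impWrong1`, `fitWrong1`, and the seed are illustrative, and all `mice()` settings are left at their defaults (5 imputed datasets, 5 iterations):

```r
# Impute using only the variables from the regression model,
# with all mice() arguments at their default values
impWrong1 <- mice(dat[, modelVars], seed = 123)

# Trace plots for judging convergence of the imputation algorithm
plot(impWrong1)

# Fit the model on each imputed dataset and pool the results
fitWrong1 <- with(impWrong1,
                  glm(vote ~ gndr + age + B8 + F7, family = binomial))
summary(pool(fitWrong1))
```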
We see that it is not clear whether the imputation algorithm has stabilized (converged), so it is necessary to increase the number of iterations. In addition, only 5 imputed datasets were created, which is too few.
| term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|
| (Intercept) | -3.098 | 0.473 | -6.548 | 26.5 | 0.0000 |
| gndrFemale | -0.225 | 0.134 | -1.678 | 177.9 | 0.0951 |
| age | 0.046 | 0.004 | 10.993 | 153.0 | 0.0000 |
| B8 | 0.117 | 0.040 | 2.940 | 19.2 | 0.0083 |
| F7 | 0.146 | 0.032 | 4.526 | 22.9 | 0.0002 |
Previously, we found that it is necessary to increase the number of iterations and imputed datasets. We increase the number of iterations to 20 (10 might also work, but more is even better) and the number of datasets to 30 (more would also be better).
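The corresponding sketch; apart from `m` and `maxit`, everything is as before, and the object names and seed remain illustrative:

```r
# Same imputation model, but with 30 imputed datasets and 20 iterations
impWrong2 <- mice(dat[, modelVars], m = 30, maxit = 20, seed = 123)
plot(impWrong2)  # check convergence again

fitWrong2 <- with(impWrong2,
                  glm(vote ~ gndr + age + B8 + F7, family = binomial))
summary(pool(fitWrong2))
```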
The number of iterations now seems appropriate: we do not see any clear trends in the trace plots, and 20 iterations should be enough for such trends to emerge if the algorithm had not converged. The visualizations do not reveal it, but the analysis is still flawed, because we have not used all the information contained in the dataset.
| term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|
| (Intercept) | -3.312 | 0.449 | -7.374 | 234.5 | 0.0000 |
| gndrFemale | -0.225 | 0.131 | -1.713 | 864.4 | 0.0870 |
| age | 0.047 | 0.004 | 11.239 | 658.0 | 0.0000 |
| B8 | 0.118 | 0.034 | 3.507 | 337.4 | 0.0005 |
| F7 | 0.161 | 0.031 | 5.187 | 176.9 | 0.0000 |
Another mistake we made is that we included in the imputation model only the variables from the regression model, even though the task description clearly indicates that the dataset contains variables that are obviously related to the variables with missing values. From the data overview, we saw that missing values occur primarily in F7 (years of education) and B8 (trust in political parties), so we look for variables that might be related to these two. For F7 (years of education), the obvious candidate is F6: it is a very good substitute (predictor) for F7 and has very few missing values (see the output below), so we used only F6 here.
In this specific case, an even better solution might be to replace F7 in the regression model directly with F6 (which would then have to be treated as an ordinal variable), but the purpose of this document is to demonstrate the handling of missing data, so we will not modify the model itself.
Variables potentially related to B8 (trust in political parties) include variables related to trust in other political “subjects,” such as politicians (B7), the national parliament (B4), and possibly others. For simplicity, we will focus here on these two most closely related variables.
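The missing-data patterns for this extended set of variables can be inspected in the same way as before; `auxVars` is an illustrative name:

```r
# Auxiliary variables added to the imputation model
auxVars <- c("F6", "B7", "B4")

# Missing-data patterns for the model and auxiliary variables together
md.pattern(dat[, c(modelVars, auxVars)])
```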
```
## vote F6 age gndr B7 B8 B4 F7
## 424 1 1 1 1 1 1 1 1 0
## 302 1 1 1 1 1 1 1 0 1
## 74 1 1 1 1 1 1 0 1 1
## 51 1 1 1 1 1 1 0 0 2
## 80 1 1 1 1 1 0 1 1 1
## 46 1 1 1 1 1 0 1 0 2
## 26 1 1 1 1 1 0 0 1 2
## 10 1 1 1 1 1 0 0 0 3
## 58 1 1 1 1 0 1 1 1 1
## 42 1 1 1 1 0 1 1 0 2
## 14 1 1 1 1 0 1 0 1 2
## 10 1 1 1 1 0 1 0 0 3
## 13 1 1 1 1 0 0 1 1 2
## 5 1 1 1 1 0 0 1 0 3
## 1 1 1 1 1 0 0 0 1 3
## 3 1 1 1 1 0 0 0 0 4
## 39 1 1 1 0 1 1 1 1 1
## 27 1 1 1 0 1 1 1 0 2
## 8 1 1 1 0 1 1 0 1 2
## 6 1 1 1 0 1 1 0 0 3
## 10 1 1 1 0 1 0 1 1 2
## 4 1 1 1 0 1 0 1 0 3
## 1 1 1 1 0 1 0 0 1 3
## 8 1 1 1 0 0 1 1 1 2
## 5 1 1 1 0 0 1 1 0 3
## 1 1 1 1 0 0 1 0 1 3
## 3 1 1 1 0 0 0 1 1 3
## 17 1 1 0 1 1 1 1 1 1
## 11 1 1 0 1 1 1 1 0 2
## 5 1 1 0 1 1 1 0 1 2
## 1 1 1 0 1 1 1 0 0 3
## 2 1 1 0 1 1 0 1 1 2
## 2 1 1 0 1 1 0 1 0 3
## 1 1 1 0 1 1 0 0 0 4
## 4 1 1 0 1 0 1 1 1 2
## 2 1 1 0 1 0 1 0 1 3
## 2 1 1 0 1 0 1 0 0 4
## 3 1 1 0 0 1 1 1 1 2
## 1 1 1 0 0 1 1 1 0 3
## 1 1 1 0 0 1 0 0 1 4
## 4 1 0 1 1 1 1 1 1 1
## 1 1 0 1 1 1 1 1 0 2
## 1 1 0 1 1 1 0 1 1 2
## 1 1 0 1 1 1 0 1 0 3
## 1 1 0 1 1 1 0 0 0 4
## 1 1 0 1 1 0 1 1 1 2
## 1 1 0 1 1 0 1 0 0 4
## 1 1 0 1 1 0 0 0 1 4
## 1 1 0 1 0 1 0 1 1 3
## 1 1 0 1 0 1 0 0 0 5
## 0 13 52 119 174 214 221 534 1327
```
From the display, we can see that there are only 5 units where both F6 and F7 have missing values and 5 units where all three variables B4, B7, and B8 have missing values.
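A sketch of the imputation with the auxiliary variables included; because they are part of the data passed to `mice()`, they are used as predictors in the imputation model by default, while the regression model itself stays unchanged:

```r
# Imputation model now also uses F6, B7 and B4 as predictors
impWrong3 <- mice(dat[, c(modelVars, auxVars)],
                  m = 30, maxit = 20, seed = 123)
plot(impWrong3)

fitWrong3 <- with(impWrong3,
                  glm(vote ~ gndr + age + B8 + F7, family = binomial))
summary(pool(fitWrong3))
```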
| term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|
| (Intercept) | -3.501 | 0.394 | -8.883 | 473.5 | 0.0000 |
| gndrFemale | -0.277 | 0.131 | -2.117 | 991.7 | 0.0345 |
| age | 0.047 | 0.004 | 11.792 | 908.3 | 0.0000 |
| B8 | 0.104 | 0.031 | 3.352 | 920.9 | 0.0008 |
| F7 | 0.168 | 0.024 | 6.919 | 407.0 | 0.0000 |
We still have not corrected a key mistake: respondents for whom a value genuinely does not exist (those “not eligible to vote”) are treated as if they merely had a missing value, and their turnout is imputed. To fix this error, we exclude respondents without voting eligibility from the data. Since there are relatively few such units, we do not expect this to have a large impact on the results.
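A sketch of the final analysis. How the ineligible respondents are identified depends on how the data were prepared; here we assume that the original voting variable (B11) is still in the dataset and that “not eligible to vote” has its own code (3 in this sketch) — both the variable and the code are assumptions:

```r
# Keep only respondents who were eligible to vote
# (the variable B11 and the code 3 for "not eligible" are assumptions)
datEligible <- dat[is.na(dat$B11) | dat$B11 != 3, ]

# Missing-data patterns on the reduced data
md.pattern(datEligible[, c(modelVars, auxVars)])

# Repeat the imputation with the auxiliary variables and pool the results
impCorrect <- mice(datEligible[, c(modelVars, auxVars)],
                   m = 30, maxit = 20, seed = 123)
fitCorrect <- with(impCorrect,
                   glm(vote ~ gndr + age + B8 + F7, family = binomial))
summary(pool(fitCorrect))
```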
```
## vote F6 age gndr B7 B8 B4 F7
## 394 1 1 1 1 1 1 1 1 0
## 294 1 1 1 1 1 1 1 0 1
## 74 1 1 1 1 1 1 0 1 1
## 51 1 1 1 1 1 1 0 0 2
## 75 1 1 1 1 1 0 1 1 1
## 46 1 1 1 1 1 0 1 0 2
## 24 1 1 1 1 1 0 0 1 2
## 10 1 1 1 1 1 0 0 0 3
## 56 1 1 1 1 0 1 1 1 1
## 42 1 1 1 1 0 1 1 0 2
## 14 1 1 1 1 0 1 0 1 2
## 10 1 1 1 1 0 1 0 0 3
## 13 1 1 1 1 0 0 1 1 2
## 5 1 1 1 1 0 0 1 0 3
## 1 1 1 1 1 0 0 0 1 3
## 3 1 1 1 1 0 0 0 0 4
## 39 1 1 1 0 1 1 1 1 1
## 26 1 1 1 0 1 1 1 0 2
## 8 1 1 1 0 1 1 0 1 2
## 6 1 1 1 0 1 1 0 0 3
## 9 1 1 1 0 1 0 1 1 2
## 4 1 1 1 0 1 0 1 0 3
## 1 1 1 1 0 1 0 0 1 3
## 7 1 1 1 0 0 1 1 1 2
## 5 1 1 1 0 0 1 1 0 3
## 1 1 1 1 0 0 1 0 1 3
## 3 1 1 1 0 0 0 1 1 3
## 15 1 1 0 1 1 1 1 1 1
## 11 1 1 0 1 1 1 1 0 2
## 5 1 1 0 1 1 1 0 1 2
## 1 1 1 0 1 1 1 0 0 3
## 2 1 1 0 1 1 0 1 1 2
## 2 1 1 0 1 1 0 1 0 3
## 1 1 1 0 1 1 0 0 0 4
## 4 1 1 0 1 0 1 1 1 2
## 2 1 1 0 1 0 1 0 1 3
## 2 1 1 0 1 0 1 0 0 4
## 3 1 1 0 0 1 1 1 1 2
## 1 1 1 0 0 1 1 1 0 3
## 1 1 1 0 0 1 0 0 1 4
## 4 1 0 1 1 1 1 1 1 1
## 1 1 0 1 1 1 0 1 1 2
## 1 1 0 1 1 1 0 1 0 3
## 1 1 0 1 1 1 0 0 0 4
## 1 1 0 1 1 0 1 1 1 2
## 1 1 0 1 1 0 1 0 0 4
## 1 1 0 1 1 0 0 0 1 4
## 1 1 0 1 0 1 0 1 1 3
## 1 1 0 1 0 1 0 0 0 5
## 0 12 50 116 171 206 219 524 1298
```
| term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|
| (Intercept) | -2.857 | 0.396 | -7.221 | 742.9 | 0.0000 |
| gndrFemale | -0.301 | 0.132 | -2.274 | 988.0 | 0.0232 |
| age | 0.040 | 0.004 | 9.733 | 1026.9 | 0.0000 |
| B8 | 0.112 | 0.031 | 3.566 | 931.5 | 0.0004 |
| F7 | 0.143 | 0.024 | 5.997 | 612.0 | 0.0000 |
Below, we present all the results calculated so far, including coefficient values and \(p\)-values. Based on these results alone, we cannot determine which of them is correct. What we can observe is that the results are generally quite similar. The largest differences appear in the estimated effect of gender and, to a lesser extent, in the effects of B8 and F7. The differences in the regression coefficients are relatively small but not negligible (particularly for gender and B8), which suggests that the missing-data mechanism is not MCAR but at least MAR. The differences in \(p\)-values for gender and B8 are also noticeable; they are partly due to the change in the amount of information used (the number of “usable” values) and partly due to the differences in the coefficient values.
As mentioned, we cannot determine which result is closer to the true one. However, based on theory, we can infer that the most accurate result is likely the one obtained from the final analysis based on multiple imputation, since we systematically addressed the “errors” throughout this task.
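One way such a comparison table can be assembled, reusing the hypothetical objects from the sketches above (only the coefficients and \(p\)-values are kept):

```r
# Collect coefficients and p-values from the pooled analyses
extractCoefP <- function(pooledFit) {
  s <- summary(pool(pooledFit))
  round(cbind(beta = s$estimate, p = s$p.value), 3)
}

comparison <- list(
  listwise    = round(summary(fitListwise)$coefficients[, c(1, 4)], 3),
  miceWrong1  = extractCoefP(fitWrong1),
  miceWrong2  = extractCoefP(fitWrong2),
  miceWrong3  = extractCoefP(fitWrong3),
  miceCorrect = extractCoefP(fitCorrect)
)
comparison
```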
| | (Intercept) beta | (Intercept) p | gndrFemale beta | gndrFemale p | age beta | age p | B8 beta | B8 p | F7 beta | F7 p |
|---|---|---|---|---|---|---|---|---|---|---|
| listwise | -3.553 | 0.000 | -0.206 | 0.283 | 0.052 | 0.000 | 0.131 | 0.005 | 0.146 | 0.000 |
| miceWrong1 | -3.098 | 0.000 | -0.225 | 0.095 | 0.046 | 0.000 | 0.117 | 0.008 | 0.146 | 0.000 |
| miceWrong2 | -3.312 | 0.000 | -0.225 | 0.087 | 0.047 | 0.000 | 0.118 | 0.001 | 0.161 | 0.000 |
| miceWrong3 | -3.501 | 0.000 | -0.277 | 0.034 | 0.047 | 0.000 | 0.104 | 0.001 | 0.168 | 0.000 |
| miceCorrect | -2.857 | 0.000 | -0.301 | 0.023 | 0.040 | 0.000 | 0.112 | 0.000 | 0.143 | 0.000 |
The dataset in this task is not real. It is based on real data but has been adjusted in the following ways:

* The effect of the gender variable has been altered.
* Units with originally missing values have been removed from the dataset.
* Missing values have been added to the data using the MAR mechanism.
The dataset after the first two steps is effectively a dataset without missing values.
The results of the analysis on this dataset can be considered “true” and are shown in the table below. Note that in real-world analysis, we do not have access to such “true” datasets.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.786 | 0.369 | -7.558 | 0.000 |
| gndrFemale | -0.319 | 0.128 | -2.502 | 0.012 |
| age | 0.039 | 0.004 | 9.838 | 0.000 |
| B8 | 0.116 | 0.030 | 3.818 | 0.000 |
| F7 | 0.137 | 0.021 | 6.464 | 0.000 |
In this case, we can compare the results to the “true” ones (shown in the table below), where we find that the results of the final analysis are closest to the true values. However, as emphasized earlier, such a comparison is not something we can do in real-world scenarios.
| | (Intercept) beta | (Intercept) p | gndrFemale beta | gndrFemale p | age beta | age p | B8 beta | B8 p | F7 beta | F7 p |
|---|---|---|---|---|---|---|---|---|---|---|
| true | -2.786 | 0.000 | -0.319 | 0.012 | 0.039 | 0.000 | 0.116 | 0.000 | 0.137 | 0.000 |
| listwise | -3.553 | 0.000 | -0.206 | 0.283 | 0.052 | 0.000 | 0.131 | 0.005 | 0.146 | 0.000 |
| miceWrong1 | -3.098 | 0.000 | -0.225 | 0.095 | 0.046 | 0.000 | 0.117 | 0.008 | 0.146 | 0.000 |
| miceWrong2 | -3.312 | 0.000 | -0.225 | 0.087 | 0.047 | 0.000 | 0.118 | 0.001 | 0.161 | 0.000 |
| miceWrong3 | -3.501 | 0.000 | -0.277 | 0.034 | 0.047 | 0.000 | 0.104 | 0.001 | 0.168 | 0.000 |
| miceCorrect | -2.857 | 0.000 | -0.301 | 0.023 | 0.040 | 0.000 | 0.112 | 0.000 | 0.143 | 0.000 |