--- title: "Mice Example" output: html_document: keep_md: no toc: true toc_float: toc_collapsed: false toc_depth: 3 number_sections: true date: "`r Sys.Date()`" editor_options: chunk_output_type: console params: fromRDS: TRUE --- The example is intended as a realistic demonstration of the use of multiple imputation based on the principle of fully conditional specification. While the example does not show all aspects (e.g., specifying independent variables for each dependent variable, more complex relationships, passive imputation), it does show most of the points that need attention in such an analysis. The data is available in the file "dataForMice.RDS". ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE, warning = FALSE) options(kableExtra.auto_format = FALSE) # Loading libraries library(foreign) library(ggplot2) library(ggblanket) library(mice) library(knitr) library(multiUS) library(kableExtra) ``` ```{r readingData} if(!file.exists("dataForMice.RDS")|!params$fromRDS){ dataset<-read.spss("Ess2e03_Slovenija.sav", to.data.frame=TRUE,use.missings = TRUE,use.value.labels = TRUE,max.value.labels = 4) varLabs<-attr(dataset,which = "variable.labels") dataset$B11<-factor(dataset$B11) dataset$vote<-(dataset$B11=="Yes")*1 #age computation dataset$age<-2004-dataset$yrbrn } ``` ```{r generateArtificialDataWithoutNA} # The dataset is not real. It is based on real data but has been modified in two ways: # - The effect of the gender variable has been modified # - Units with (original) missing values have been removed. # - Missing values have been added using the MAR mechanism. # This R-chunk implements the first two points; the third is in the next chunk. if(!file.exists("dataForMice.RDS")|!params$fromRDS){ missProp<-colMeans(is.na(dataset[c("gndr", "age","B4","B7","F6")])) set.seed(2021) indFemVote<-which(!is.na(dataset$vote)&dataset$vote==1&dataset$gndr%in%c("Female")) ageFem<-dataset$age[indFemVote] ageFem[is.na(ageFem)]<-sample(ageFem[!is.na(ageFem)], size = sum(is.na(ageFem))) femToNotVote<-sample(indFemVote,size = round(length(indFemVote)*0.05), prob = ageFem) dataset$B11[femToNotVote]<-"No" dataset$vote<-(dataset$B11=="Yes")*1 cmp<-complete.cases(dataset[,c("B11", "gndr", "age","B8","F7")]) dataset<-dataset[cmp,] use<-dataset$B11!="Not eligible to vote" use[is.na(use)]<-TRUE datasetOrg<-dataset[use,] } ``` ```{r addMissingValues} if(!file.exists("dataForMice.RDS")|!params$fromRDS){ # This R-chunk adds missing values using the MAR mechanism. pr<-as.integer(dataset$F6) pr[is.na(pr)]<-0 indNA<-sample(nrow(dataset),size = nrow(dataset)*0.4,prob = pr^2) #table(dataset$F7,indNA,exclude = NULL) dataset$F7[indNA]<-NA tmp<-sum(dataset$gndr%in%c("Female")) dataset$B8[dataset$gndr%in%c("Female")][sample(tmp,size=tmp*0.3)]<-NA age2<-dataset$age age2[is.na(age2)]<-0 nNew<-nrow(dataset) for(iVar in names(missProp)){ dataset[sample(nNew,size = round(nNew*missProp[iVar]*4), prob = age2),iVar]<-NA } dataset$vote<-(dataset$B11=="Yes")*1 # age<-dataset$age # age[is.na(age)]<-0 # tmp<-sample(1:nrow(dataset), prob=age, size = 0.4*nrow(dataset)) # dataset$B8[tmp]<-NA saveRDS(dataset, "dataForMice.RDS") saveRDS(datasetOrg, "dataForMiceOrg.RDS") } else{ dataset<-readRDS("dataForMice.RDS") datasetOrg<-readRDS("dataForMiceOrg.RDS") } ``` # Task A statistician has data for a country that includes the following variables (labels shown in parentheses): * Gender (`gndr`) * Age (`age`) * Whether the person voted in the last election (`B11`) → converted to "vote," where 1 means the person voted, and 0 means they did not. This variable also has missing values. In addition to those who did not respond, the statistician also recoded the value "not eligible to vote" as missing. * Place of residence (`F5`) * Education level (`F6`) * Years of education (`F7`) * Trust in political parties (`B8`) * Several variables regarding trust in various political/government institutions (parliament, government, police, judiciary, politicians, etc.) (`B4-B10`) * Variables on household income (`F31-F33`), employment (`G59`, `G60` and `G90a`), and, if employed, the gross and net pay (salary) (`G91` and `G92`). The statistician wants to estimate a logistic regression model to explain how (or if) gender, age, years of education, and trust in political parties influence voter turnout. The total number of observations is `r nrow(dataset)`, and the number of observations with valid values for all variables included in the model is `r sum(complete.cases(dataset[c("vote", "gndr", "age", "B8", "F7")]))`. The statistician conducted two analyses. First, the analysis was performed only on complete cases, and second, using multiple imputations. Based on the analyses, which are not shown below, we can infer that the missing data mechanism is not MCAR. # Data Overview ::::: columns ::: column ```{r overview} mdp1<-md.pattern(dataset[c("vote","gndr","age","B8","F7")]) ``` ::: ::: column ```{r overviewText} mdp1 ``` ::: ::::: # Analysis Based on Available Units ```{r AnalysisAvailable} tmp<-summary(glm(vote~gndr+age+B8+F7,family =binomial,data = dataset)) kable(tmp$coef,digits = c(3,3,3,3,1,4)) ``` # MICE ## "Naive"/Incorrect Approach In this approach, we included only the variables in the model in the imputation model. For the number of iterations and the number of imputed datasets, we left the default values. We also did not address whether all values labeled as `NA` are reasonable to impute. ### Imputation ```{r MiceWrong, include = FALSE} if(file.exists("dataMiceWrong.RDS")¶ms$fromRDS){ dataMiceWrong<-readRDS("dataMiceWrong.RDS") }else{ dataMiceWrong<-mice(dataset[c("vote","gndr","age","B8","F7")]) saveRDS(dataMiceWrong, file="dataMiceWrong.RDS") } ``` ```{r MiceWrongPlot} plot(dataMiceWrong, layout = c(2,5)) ``` We see that it is not clear whether the imputation model has stabilized or converged, so it is necessary to increase the number of iterations. Additionally, only 5 imputed datasets were created, which is insufficient. ### Analysis ```{r MiceAnalysisWrong} resMiceWrong<-with(glm(vote~gndr+age+B8+F7,family =binomial),data = dataMiceWrong) kable(summary(pool(resMiceWrong)),digits = c(3,3,3,3,1,4)) ``` ## Correcting the Number of Iterations and Datasets (Some Issues Remain) Previously, we found that it is necessary to increase the number of iterations and imputed datasets. We increase the number of iterations to 20 (10 might also work, but more is even better) and the number of datasets to 30 (more would also be better). ### Imputation ```{r MiceWrong2, fig.height=10, include = FALSE} if(file.exists("dataMiceWrong2.RDS")¶ms$fromRDS){ dataMiceWrong2<-readRDS("dataMiceWrong2.RDS") }else{ dataMiceWrong2<-mice(dataset[c("vote","gndr","age","B8","F7")],m = 30, maxit = 20 ) saveRDS(dataMiceWrong2, file="dataMiceWrong2.RDS") } ``` ```{r MiceWrong2Plot, fig.height=10} plot(dataMiceWrong2, layout = c(2,5)) ``` The number of iterations is probably appropriate; we do not see any clear patterns, and 20 iterations are already sufficient for such patterns to emerge. From the visualizations, it is not evident, but the analysis is still incorrect because we have not used all the information contained in the dataset. ### Analysis ```{r MiceAnalysisWrong2} resMiceWrong2<-with(glm(vote~gndr+age+B8+F7,family =binomial),data = dataMiceWrong2) kable(summary(pool(resMiceWrong2)),digits = c(3,3,3,3,1,4)) ``` ## Including Additional Variables (Some Issues Remain) Another mistake we made is that we included only the variables from the regression model in the imputation model, even though the task description clearly indicates that there are variables in the dataset that are obviously related to the variables with missing values. From the data overview, we observed that many missing values occur primarily in variables F7 (years of education) and B8 (trust in political parties). Therefore, we will look for variables that might be related to these variables. Variables potentially related to F7 (years of education) include: * Education level (F6) * Variables related to income/salary (individual or household) and employment status --> Salary presents a slight issue here because we do not want to impute values that do not exist. This issue could be resolved by creating a new variable, "salary," which will have a value of 0 if the person is unemployed. In this case, we would also need to include the variable indicating employment status. Since F6 is a very good substitute (predictor) for F7 and also has very few missing values (see the graph/output below), we used only F6 here. In this specific case, an even better solution might be to replace F7 directly with F6 in the regression model (which we must treat as an ordinal variable), but the purpose of this document is to demonstrate handling missing data, so we will not delve into or modify the model itself. Variables potentially related to B8 (trust in political parties) include variables related to trust in other political "subjects," such as politicians (B7), the national parliament (B4), and possibly others. For simplicity, we will focus here on these two most closely related variables. ```{r preparation2} dataset$F6<-makeFactorLabels(dataset$F6) ``` ::::: columns ::: column ```{r overview2, fig.height=10} mdp2<-md.pattern(dataset[c("vote","gndr","age","B8","B4","B7","F7","F6")]) ``` ::: ::: column ```{r overview2Text} mdp2 nBothF7<-sum(as.numeric(rownames(mdp2)[apply(mdp2[,c("F6","F7")]==0, 1, all)])) nAllB8<-sum(as.numeric(rownames(mdp2)[apply(mdp2[,c("B4","B7","B8")]==0, 1, all)])) ``` ::: ::::: From the display, we can see that there are only `r nBothF7` units where both F6 and F7 have missing values and `r nAllB8` units where all three variables B4, B7, and B8 have missing values. ### Imputation ```{r MiceWrong3, fig.height=14, include = FALSE} set.seed(2021) if(file.exists("dataMiceWrong3.RDS")¶ms$fromRDS){ dataMiceWrong3<-readRDS("dataMiceWrong3.RDS") }else{ dataMiceWrong3<-mice(dataset[c("vote","gndr","age","B8","B4","B7","F7","F6")],m = 30, maxit = 20 ) saveRDS(dataMiceWrong3, file="dataMiceWrong3.RDS") } ``` ```{r MiceWrong3Plot, fig.height=14} plot(dataMiceWrong3, layout = c(2,8)) ``` ### Analysis ```{r MiceAnalysisWrong3} resMiceWrong3<-with(glm(vote~gndr+age+B8+F7,family =binomial),data = dataMiceWrong3) kable(summary(pool(resMiceWrong3)),digits = c(3,3,3,3,1,4)) ``` ## MICE - Correct (Probably) We still have not corrected a key mistake, which is that units without values ("not eligible to vote") are treated as if they do have values. To fix this error, we exclude those without voting eligibility from the data. Since there are relatively few such units, we do not expect this to have a significant impact. ::::: columns ::: column ```{r overview3, fig.height=10} dataset$F6<-makeFactorLabels(dataset$F6) use<-dataset$B11!="Not eligible to vote" use[is.na(use)]<-TRUE dataset3<-dataset[use,] mdp3<-md.pattern(dataset3[c("vote","gndr","age","B8","B4","B7","F7","F6")]) ``` ::: ::: column ```{r overview3Text} mdp3 ``` ::: ::::: ### Imputation ```{r Mice, fig.height=14, include = FALSE} if(file.exists("datasetMice.RDS")¶ms$fromRDS){ datasetMice<-readRDS("datasetMice.RDS") }else{ datasetMice<-mice(dataset3[c("vote","gndr","age","B8","B4","B7","F7","F6")],m = 30, maxit = 20 ) saveRDS(datasetMice, file="datasetMice.RDS") } ``` ```{r MicePlot, fig.height=14} plot(datasetMice, layout = c(2,8)) ``` ### Analysis ```{r MiceAnalysis} resMice<-with(glm(vote~gndr+age+B8+F7,family =binomial),data = datasetMice) kable(summary(pool(resMice)),digits = c(3,3,3,3,1,4)) ``` # Comparison of Results Below, we present all the results calculated so far, including coefficient values and $p$-values. However, based on these results, we cannot determine which of them is the correct one. What we can observe is that the results are generally quite similar. The largest differences are found in the estimation of the effect of gender and, to some extent, the effect of variables B8 and F7. Differences in regression coefficients are relatively small but not negligible (particularly for gender and B8). This suggests that the mechanism of missing values is not MCAR but at least MAR. Differences in $p$-values for gender and B8 are also noticeable. These are partly due to the change in the amount of used information (number of "usable" values) and partly due to differences in coefficient values. As mentioned, we cannot determine which result is closer to the true one. However, based on theory, we can infer that the most accurate result is likely the one obtained from the final analysis based on multiple imputations since we systematically addressed "errors" throughout this task. ```{r Comparison} listwise<-summary(glm(vote~gndr+age+B8+F7,family =binomial,data = dataset)) miceWrong1<-summary(pool(resMiceWrong)) miceWrong2<-summary(pool(resMiceWrong2)) miceWrong3<-summary(pool(resMiceWrong3)) miceCorrect<-summary(pool(resMice)) tab<-cbind( beta=listwise$coefficients[,1],p=listwise$coefficients[,4], beta=miceWrong1$estimate, p=miceWrong1$p.value, beta=miceWrong2$estimate, p=miceWrong2$p.value, beta=miceWrong3$estimate, p=miceWrong3$p.value, beta=miceCorrect$estimate, p=miceCorrect$p.value ) tab<-t(tab) r1<-seq(1, nrow(tab), by=2) r2<-seq(2, nrow(tab), by=2) tab2<-matrix(NA, nrow = nrow(tab)/2, ncol=ncol(tab)*2) c1<-seq(1, ncol(tab2), by=2) c2<-seq(2, ncol(tab2), by=2) tab2[,c1]<-tab[r1,] tab2<-as.data.frame(tab2) tab2[,c2]<-formatC(tab[r2,], digits=3,format = "f" ) colnames(tab2)<-rep(c("beta","p"), length.out=ncol(tab2)) rownames(tab2)<-c("listwise", "miceWrong1", "miceWrong2", "miceWrong3", "miceCorrect") ktab<-kable(tab2, digits = rep(3,8), format = "html", align="r") h<-rep(2, ncol(tab2)/2) names(h)<-colnames(tab) add_header_above(ktab, c(" ", h)) %>% kable_styling() ``` # "Truth" The dataset in this task is not real. It is based on real data but has been adjusted in the following ways: * The effect of the gender variable has been altered. * Units with originally missing values have been removed from the dataset. * Missing values have been added to the data using the MAR mechanism. The dataset after the first two steps is effectively a dataset without missing values. ## Analysis The results of the analysis on this dataset can be considered "true" and are shown in the table below. Note that in real-world analysis, we do not have access to such "true" datasets. ```{r AnalysisAvailableOriginal} tmp<-summary(glm(vote~gndr+age+B8+F7,family =binomial,data = datasetOrg)) kable(tmp$coef,digits = c(3,3,3,3,1,4)) ``` ## Comparison with True Results In this case, we can compare the results to the "true" ones (shown in the table below), where we find that the results of the final analysis are closest to the true values. However, as emphasized earlier, such a comparison is not something we can do in real-world scenarios. ```{r ComparisonTrue} true<-summary(glm(vote~gndr+age+B8+F7,family =binomial,data = datasetOrg)) listwise<-summary(glm(vote~gndr+age+B8+F7,family =binomial,data = dataset)) miceWrong1<-summary(pool(resMiceWrong)) miceWrong2<-summary(pool(resMiceWrong2)) miceWrong3<-summary(pool(resMiceWrong3)) miceCorrect<-summary(pool(resMice)) tab<-cbind(beta=true$coefficients[,1],p=true$coefficients[,4], beta=listwise$coefficients[,1],p=listwise$coefficients[,4], beta=miceWrong1$estimate, p=miceWrong1$p.value, beta=miceWrong2$estimate, p=miceWrong1$p.value, beta=miceWrong3$estimate, p=miceWrong1$p.value, beta=miceCorrect$estimate, p=miceCorrect$p.value ) tab<-t(tab) r1<-seq(1, nrow(tab), by=2) r2<-seq(2, nrow(tab), by=2) tab2<-matrix(NA, nrow = nrow(tab)/2, ncol=ncol(tab)*2) c1<-seq(1, ncol(tab2), by=2) c2<-seq(2, ncol(tab2), by=2) tab2[,c1]<-tab[r1,] tab2<-as.data.frame(tab2) tab2[,c2]<-formatC(tab[r2,], digits=3,format = "f" ) colnames(tab2)<-rep(c("beta","p"), length.out=ncol(tab2)) rownames(tab2)<-c("true", "listwise", "miceWrong1", "miceWrong2", "miceWrong3", "miceCorrect") ktab<-kable(tab2, digits = rep(3,8), format = "html", align = "r") h<-rep(2, ncol(tab2)/2) names(h)<-colnames(tab) add_header_above(ktab, c(" ", h)) %>% kable_styling() ```