Incomplete data are ubiquitous in the social sciences; as a consequence, available data are inefficient (ineffective) and often biased. In the literature, multiple imputation is known as the standard method for handling missing data. While the theory of multiple imputation has been known for decades, its implementation is difficult because random draws from the posterior distribution are complicated to generate. Thus, several computational algorithms are implemented in software: Data Augmentation (DA), Fully Conditional Specification (FCS), and Expectation-Maximization with Bootstrapping (EMB). Although the literature is full of comparisons between joint modeling (DA, EMB) and conditional modeling (FCS), little is known about the relative performance of the MCMC algorithms (DA, FCS) and the non-MCMC algorithm (EMB), where MCMC stands for Markov chain Monte Carlo. Based on simulation experiments, the current study contends that EMB is a confidence proper (confidence-supporting) multiple imputation algorithm without between-imputation iterations; thus, EMB is more user-friendly than DA and FCS.

Generally, it is quite difficult to obtain complete data in social surveys (

While the theoretical concept of multiple imputation has been around for decades, the implementation is difficult because making a random draw from the posterior distribution is a complicated matter. Accordingly, there are several computational algorithms in software (

By way of organization, Section 2 introduces the notation used in this article. Section 3 gives a motivating example of missing data analysis in the social sciences. Section 4 presents the assumptions of imputation methods. Section 5 shows the traditional methods of handling missing data. Section 6 introduces the three multiple imputation algorithms. Section 7 surveys the literature on multiple imputation. Section 8 gives the results of the Monte Carlo experiments, showing the impact of between-imputation iterations on multiple imputation. Section 9 concludes with the findings and the limitations of the current research.

_{p}(μ, Σ), where all of the variables are continuous. Let _{1}, …, _{p}_{j}_{–j}_{j}_{j}_{obs}_{mis}_{obs}, Y_{mis}

At the imputation stage, there is no concept of the dependent and independent variables, because imputation is not a causal model, but a predictive model (_{j}_{j}_{–j}_{1}, …, _{p–1}.

Let

Social scientists have long debated the determinants of economic development across countries (

Variables and Missing Rates.

Variables | Missing Rates |
---|---|
GDP per capita (purchasing power parity) | 0.0% |
Freedom House index | 15.4% |
Central bank discount rate | 32.9% |
Life expectancy at birth | 2.6% |
Unemployment rate | 10.5% |
Distribution of family income: Gini index | 37.3% |
Public debt | 22.4% |
Education expenditures | 24.6% |
Taxes and other revenues | 6.1% |
Military expenditures | 43.0% |

Data sources: CIA (

Table

Multiple Regression Analyses on GDP Per Capita.

Incomplete Data | Multiply-Imputed Data | |||
---|---|---|---|---|

Variables | Coef. | Std. Err. | Coef. | Std. Err. |

Intercept | –7.323 | 3.953 | –11.545* | 3.495 |

Freedom | –0.321* | 0.127 | –0.362* | 0.127 |

– |
–0.107 | 0.049 | ||

Life Expectancy | 3.922* | 0.794 | 4.908* | 0.655 |

Unemployment | –0.205* | 0.087 | –0.214* | 0.070 |

Gini | 0.114 | 0.253 | –0.018 | 0.363 |

– |
–0.002 | 0.093 | ||

0.035 | 0.164 | – |
||

Tax | 0.357* | 0.174 | 0.471* | 0.151 |

0.123 | 0.085 | |||

Number of obs. | 86 | 228 |

Missing data analyses always involve assumptions (

There are three common assumptions of missing data mechanisms in the literature (_{obs}_{obs}

To be strict, the missing data mechanism is ignorable if both of the following conditions are satisfied: (1) The MAR condition; and (2) the distinctness condition, which stipulates that the parameters in the missing data mechanism are independent of the parameters in the data model (

However, the MAR condition is said to be more relevant in real data applications (

Imputation is said to be Bayesianly proper if imputed values are independent realizations of _{mis}_{obs}_{mis}

van Buuren (

Congeniality means that the imputation model is equal to the substantive analysis model. It is widely known that the imputation model can be larger than the substantive analysis model, but the imputation model cannot be smaller than the substantive analysis model (

This section introduces listwise deletion, deterministic single imputation, and stochastic single imputation, which are used as baseline methods for comparisons in Section 8.

Listwise deletion (LD), also known as complete-case analysis, throws away any rows that have at least one missing value (

Deterministic single imputation (D-SI) replaces a missing value with a reasonable guess. The most straightforward version calculates predicted scores for missing values based on a regression model (

Stochastic single imputation (S-SI) also utilizes a regression model to predict missing values, but it adds to imputed values random components drawn from the residual distribution (

However, both D-SI and S-SI tend to underestimate the standard error in imputed data because imputed values are treated as if they were real (
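The three baseline methods above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names `listwise_deletion` and `regression_single_imputation` are hypothetical, missing values are assumed to be coded as `NaN`, and only one column is assumed to contain missing values.

```python
import numpy as np


def listwise_deletion(data):
    """Keep only the rows that have no missing values (NaN)."""
    return data[~np.isnan(data).any(axis=1)]


def regression_single_imputation(data, target, stochastic=False, rng=None):
    """Impute column `target` by OLS on the other columns.

    Assumption of this sketch: only column `target` has missing values.
    stochastic=False gives deterministic single imputation (D-SI);
    stochastic=True adds a normal residual draw (S-SI).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = data.astype(float).copy()
    miss = np.isnan(out[:, target])

    # Design matrix: intercept plus all remaining columns.
    predictors = np.delete(out, target, axis=1)
    X = np.column_stack([np.ones(len(out)), predictors])

    # Fit the regression on the rows where the target is observed.
    beta, *_ = np.linalg.lstsq(X[~miss], out[~miss, target], rcond=None)
    pred = X[miss] @ beta

    if stochastic:
        # Residual standard deviation estimated from the observed rows.
        resid = out[~miss, target] - X[~miss] @ beta
        dof = (~miss).sum() - X.shape[1]
        sigma = np.sqrt(resid @ resid / dof)
        pred = pred + rng.normal(0.0, sigma, size=miss.sum())

    out[miss, target] = pred
    return out
```

In both single imputation variants the completed data set is analyzed as if every value had been observed, so nothing in the downstream standard errors reflects the uncertainty of the imputed values.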

Multiple imputation was made widely known by Rubin (

However, using the analytical methods, it is not easy to randomly draw sufficient statistics from the posterior distribution (

The traditional algorithm of multiple imputation is the Data Augmentation (DA) algorithm, which is a Markov chain Monte Carlo (MCMC) technique (

The DA algorithm works as follows (

These two steps are repeated

There are two ways of generating multiple imputations by DA (_{mis}_{mis}

The software using this algorithm is R-Package NORM2, which was originally developed by Schafer (

An alternative algorithm to DA is the Fully Conditional Specification (FCS) algorithm, which specifies the multivariate distribution by way of a series of conditional densities, through which missing values are imputed given the other variables (

The FCS algorithm works as follows (

The entire process is repeated for

The software using this algorithm is R-Package MICE (

Another emerging algorithm is the Expectation-Maximization with Bootstrapping (EMB) algorithm, which combines the Expectation-Maximization (EM) algorithm with the nonparametric bootstrap to create multiple imputation (

The EMB algorithm works as follows (

These two steps are repeated until convergence is attained, where the converged value is a Maximum Likelihood Estimate (MLE) under well-behaved conditions (

The software using this algorithm is R-Package AMELIA II (
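To make the EMB idea concrete, the following Python sketch combines a nonparametric bootstrap with an EM estimate of the mean vector and covariance matrix under a multivariate normal model. It is a simplified illustration under stated assumptions, not the AMELIA II implementation: the names `conditional_moments`, `em_mvn`, and `emb_impute` are hypothetical, missing values are coded as `NaN`, and refinements that production software may add are omitted.

```python
import numpy as np


def conditional_moments(row, mu, sigma):
    """Mean and covariance of the missing entries of `row` given the observed ones."""
    m = np.isnan(row)
    o = ~m
    if not o.any():  # nothing observed: fall back to the marginal moments
        return m, mu.copy(), sigma.copy()
    coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
    mean = mu[m] + coef @ (row[o] - mu[o])
    cov = sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
    return m, mean, cov


def em_mvn(data, max_iter=500, tol=1e-6):
    """EM estimates of (mu, sigma) for multivariate normal data with NaNs."""
    n, p = data.shape
    mu = np.nanmean(data, axis=0)
    sigma = np.diag(np.nanvar(data, axis=0))
    incomplete = np.flatnonzero(np.isnan(data).any(axis=1))
    for _ in range(max_iter):
        # E-step: expected values of the missing entries and their covariance.
        filled = data.copy()
        extra = np.zeros((p, p))
        for i in incomplete:
            m, mean, cov = conditional_moments(data[i], mu, sigma)
            filled[i, m] = mean
            extra[np.ix_(m, m)] += cov
        # M-step: update the parameters from the expected sufficient statistics.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma


def emb_impute(data, n_imputations=5, rng=None):
    """Create multiply-imputed data sets: bootstrap, run EM, then impute."""
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[0]
    incomplete = np.flatnonzero(np.isnan(data).any(axis=1))
    completed_sets = []
    for _ in range(n_imputations):
        boot = data[rng.integers(0, n, size=n)]  # nonparametric bootstrap sample
        mu, sigma = em_mvn(boot)                 # EM on the bootstrap sample
        completed = data.copy()
        for i in incomplete:
            m, mean, cov = conditional_moments(data[i], mu, sigma)
            completed[i, m] = rng.multivariate_normal(mean, cov)
        completed_sets.append(completed)
    return completed_sets
```

In this sketch each imputed data set comes from its own bootstrap sample and its own EM run, so the imputations are generated independently of one another rather than as successive states of a single Markov chain.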

The three algorithms share certain characteristics but are not identical, as summarized in Table

Relations among DA, EMB, and FCS.

| | Joint Modeling | Conditional Modeling |
|---|---|---|
| MCMC | DA | FCS |
| Non-MCMC | EMB | |

DA and EMB are joint modeling while FCS is conditional modeling (

DA and FCS are different versions of MCMC techniques. On the other hand, EMB is not an MCMC technique. It is said that DA and FCS require between-imputation iterations to be confidence proper (

Table

Summary of the 20 Studies on Multiple Imputation.

Authors | MI Algorithms | Sample Size | Number of Variables | Number of Imputations | Number of Iterations | Missing Rate |
---|---|---|---|---|---|---|

Barnard and Rubin ( |
DA | 10, 20, 30 | 2 | 3, 5, 10 | Unknown | 10%, 20%, 30% |

Horton and Lipsitz ( |
DA, FCS | 10000 | 3 | 10 | 200 | 50% |

Schafer and Graham ( |
DA | 50 | 2 | 20 | Unknown | 73% |

Donders et al. ( |
FCS | 500 | 2 | 10 | Unknown | 40% |

Abe and Iwasaki ( |
DA | 100 | 4 | 5 | 100 | 20%, 30% |

133774 | 10 | 10 | 5 | 41% | ||

Stuart et al. ( |
FCS | 9186 | 400 | 10 | 10 | 18% |

Lee and Carlin ( |
DA, FCS | 1000 | 8 | 20 | 10 | 33% |

Leite and Beretvas ( |
DA | 400 | 10 | 10 | Unknown | 10%, 30%, 50% |

50, 100, 200 | 3, 13, 23, 43, 83 | 20 | 20%, 50% | |||

Lee and Carlin ( |
DA | 1000 | 8 | 20 | Unknown | 10%, 25%, 50%, 75%, 90% |

Cranmer and Gill ( |
EMB, MHD | 500 | 5 | Unknown | NA | 20%, 50%, 80% |

Cheema ( |
FCS | 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000 | 4 | Unknown | Unknown | 1%, 2%, 5%, 10%, 20% |

1000 | 8 | 5 | 25% | |||

Shara et al. ( |
Unknown | 2246 | 8 | Unknown | Unknown | 20%, 30%, 40% |

Deng et al. ( |
FCS | 100 | 200, 1000 | 10 | 20 | 40% |

von Hippel ( |
DA | 25, 100 | 2 | 5 | Unknown | 50% |

Hughes, Sterne, and Tilling ( |
Unknown | 100, 1000 | 5 | 50 | Unknown | 40%, 60% |

McNeish ( |
DA, FCS | 20, 50, 100, 250 | 4 | 5, 25, 100 | Unknown | 10%, 20%, 30%, 50% |

Four studies investigated specialized situations for multiple imputation, such as small-sample degrees of freedom in DA (

Seven studies compared different multiple imputation algorithms (

Ten studies did not explicitly state the number of iterations

Thus, no studies in Table

Section 4 introduced MAR, proper imputation, and congeniality as crucial assumptions. To make the assumptions of MAR and congeniality realistic, an inclusive analysis strategy is recommended in the literature (

When assumptions do not hold in statistical methods, analytical mathematics does not often provide answers about the properties of the methods (

The current study prepares two versions of simulation data, (1) theoretical and (2) realistic. Auxiliary variables

The first setting is theoretical. The number of observations is 1000, which is equivalent to the 75^{th} percentile of the sample sizes found in the studies listed in Table ^{th} percentile of the number of variables found in the studies listed in Table _{j}_{p}_{–1}(0, 1), where the number of auxiliary variables is _{j}_{i}_{j}_{j}_{i}_{j}_{0} and

The second setting is realistic. The number of observations is 228, which is the full sample size of the real data in Table _{j}_{j}_{i}_{j}_{j}_{0}) reflects the coefficients in multiple regression models using the empirical data and _{i}_{resid}_{resid}

In both settings, _{j}_{i}_{ij}_{ji}_{i}_{ij}_{ji}_{i}_{i}_{ij}_{ji}_{i}_{i}_{ij}_{j}_{i}_{1i} is income. The missingness of income depends on age and some random components. Income is missing if age is less than the median of age and uniform random numbers are less than 0.5. Also, income is missing if age is larger than the median of age and uniform random numbers are larger than 0.9.
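The age-and-income rule described above can be sketched in Python as follows. The function name `make_income_missing` and the use of `NaN` as the missing-value code are assumptions of this illustration, not part of the original simulation code.

```python
import numpy as np


def make_income_missing(income, age, rng=None):
    """Delete income values according to the age-dependent rule in the text.

    Income is set to NaN when
      - age is below the median and a uniform draw is below 0.5, or
      - age is above the median and a uniform draw is above 0.9.
    """
    rng = np.random.default_rng() if rng is None else rng
    age = np.asarray(age, dtype=float)
    u = rng.uniform(size=len(age))
    median_age = np.median(age)

    missing = ((age < median_age) & (u < 0.5)) | ((age > median_age) & (u > 0.9))

    out = np.asarray(income, dtype=float).copy()
    out[missing] = np.nan
    return out
```

Because the probability of missingness depends only on the fully observed age variable, the mechanism is missing at random. If roughly half of the cases fall on each side of the median, the rule implies an expected missing rate of about 0.5 × 0.5 + 0.5 × 0.1 = 0.30 for income.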

Although the literature (^{th} percentile of the number of multiply-imputed data found in the studies listed in Table

As for

The estimand in all of the simulation runs is _{1} in

Unbiasedness can be assessed by equation (9), because an estimator

Unbiasedness and efficiency can be simultaneously assessed by the Root Mean Square Error (RMSE), defined as equation (10). The RMSE measures the spread around the true value of the parameter, placing slightly more emphasis on efficiency than bias (

Confidence validity can be assessed by the coverage probability of the nominal 95% confidence interval (CI), which ‘is the proportion of simulated samples for which the estimated confidence interval includes the true parameter’ (

The standard error of the 95% CI coverage over 1000 iterations is
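The evaluation criteria in this section (bias, RMSE, coverage, and CI length) can be computed from the simulation output along the following lines. This is a hedged sketch: the function names `evaluate` and `coverage_standard_error` are hypothetical, bias is taken as the mean estimate minus the true value, and the interval uses a normal critical value of 1.96, whereas the original study may use a different reference distribution.

```python
import numpy as np


def evaluate(estimates, std_errors, truth, z=1.96):
    """Summarize simulation results for one method.

    estimates, std_errors: one entry per simulation run.
    truth: the true value of the estimand.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    bias = estimates.mean() - truth
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))

    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    coverage = np.mean((lower <= truth) & (truth <= upper))
    ci_length = np.mean(upper - lower)

    return {"bias": bias, "rmse": rmse, "coverage": coverage, "ci_length": ci_length}


def coverage_standard_error(nominal=0.95, n_sims=1000):
    """Binomial standard error of an estimated coverage proportion."""
    return np.sqrt(nominal * (1.0 - nominal) / n_sims)
```

With a nominal level of 0.95 and 1000 simulation runs, `coverage_standard_error()` evaluates to roughly 0.0069, that is, about 0.7 percentage points.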

Abbreviations in this section are explained in Table

Abbreviations and the Missing Data Methods.

Abbreviations | Missing Data Methods |
---|---|
CD | Complete data without missing values |
LD | Listwise deletion |
EMB | MI by AMELIA II |
DA1 | MI by NORM2 with no iterations |
DA2 | MI by NORM2 with 2*EM iterations |
FCS1 | MI by MICE with no iterations |
FCS2 | MI by MICE with 2*EM iterations |
D-SI | Deterministic SI by |
S-SI | Stochastic SI by |

This section presents the results of the Monte Carlo simulation for the theoretical case, where the correlation matrix and the regression coefficients are randomly generated.

Table _{1}. The Bias and RMSE values for listwise deletion and single imputation methods indicate that these methods are not recommended at all. All of the Bias and RMSE values from EMB, DA1, DA2, and FCS2 are almost identical, showing that they are generally unbiased. However, FCS1 is rather biased, quite similar to S-SI. Therefore, when between-imputation iterations are ignored, there are no discernible effects on bias and efficiency in EMB and DA, but FCS may suffer from some bias.

Bias and RMSE (Theoretical Data).

Number of Variables |
||||||||||
---|---|---|---|---|---|---|---|---|---|---|

2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ||

CD | Bias | 0.001 | 0.003 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.002 | 0.001 |

RMSE | 0.040 | 0.047 | 0.038 | 0.039 | 0.058 | 0.026 | 0.046 | 0.039 | 0.047 | |

LD | Bias | |||||||||

RMSE | 0.059 | 0.153 | 0.122 | 0.121 | 0.349 | 0.103 | 0.160 | 0.228 | 0.155 | |

EMB | Bias | 0.000 | 0.004 | 0.002 | 0.000 | 0.005 | 0.001 | 0.005 | 0.005 | 0.002 |

RMSE | 0.046 | 0.053 | 0.050 | 0.051 | 0.075 | 0.041 | 0.069 | 0.059 | 0.072 | |

DA1 | Bias | 0.001 | 0.002 | 0.003 | 0.001 | 0.001 | 0.000 | 0.003 | 0.003 | 0.002 |

RMSE | 0.046 | 0.053 | 0.050 | 0.051 | 0.074 | 0.041 | 0.069 | 0.058 | 0.072 | |

DA2 | Bias | 0.002 | 0.001 | 0.005 | 0.002 | 0.001 | 0.000 | 0.001 | 0.003 | 0.000 |

RMSE | 0.046 | 0.053 | 0.050 | 0.051 | 0.074 | 0.041 | 0.069 | 0.058 | 0.072 | |

FCS1 | Bias | 0.002 | 0.001 | |||||||

RMSE | 0.047 | 0.053 | 0.097 | 0.062 | 0.116 | 0.065 | 0.109 | 0.052 | 0.239 | |

FCS2 | Bias | 0.001 | 0.002 | 0.004 | 0.002 | 0.001 | 0.000 | 0.001 | 0.002 | 0.001 |

RMSE | 0.046 | 0.053 | 0.050 | 0.051 | 0.075 | 0.041 | 0.069 | 0.058 | 0.071 | |

D-SI | Bias | |||||||||

RMSE | 0.192 | 0.248 | 0.182 | 0.110 | 0.207 | 0.109 | 0.248 | 0.099 | 0.189 | |

S-SI | Bias | 0.002 | 0.000 | |||||||

RMSE | 0.050 | 0.057 | 0.102 | 0.066 | 0.124 | 0.076 | 0.119 | 0.062 | 0.241 |

Table _{1}. The CIs for listwise deletion and single imputation methods are not confidence valid. When the number of auxiliary variables is small (and hence the overall missing rate is small), the between-imputation iterations may be ignored, and all of the multiple imputation CIs are confidence valid. However, as the number of auxiliary variables becomes large, DA1 and FCS1 drift away from confidence validity. EMB, DA2, and FCS2 are confidence valid regardless of the number of variables and the missing rate. This shows that EMB is confidence proper even though it does not iterate. This is an important finding of the current study.

Coverage of the 95% CI (Theoretical Data).

Number of Variables |
|||||||||
---|---|---|---|---|---|---|---|---|---|

2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |

CD | 95.3 | 94.9 | 94.2 | 94.0 | 96.0 | 96.0 | 95.3 | 94.9 | 94.6 |

LD | |||||||||

EMB | 95.0 | 95.1 | 94.2 | 95.5 | 94.9 | 94.4 | 94.3 | 94.1 | 95.0 |

DA1 | 94.6 | 94.9 | 94.1 | ||||||

DA2 | 94.3 | 95.8 | 95.1 | 94.1 | 94.8 | 94.3 | 94.2 | 94.9 | |

FCS1 | 94.2 | 95.0 | 95.5 | ||||||

FCS2 | 94.7 | 95.6 | 94.4 | 93.9 | 95.4 | 94.5 | 94.2 | 95.0 | 95.0 |

D-SI | |||||||||

S-SI |

Table

Lengths of the 95% CI (Theoretical Data).

Method \ Number of Variables | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
CD | 0.157 | 0.184 | 0.144 | 0.148 | 0.236 | 0.102 | 0.184 | 0.151 | 0.180 |
LD | 0.189 | 0.259 | 0.226 | 0.235 | 0.384 | 0.213 | 0.358 | 0.339 | 0.390 |
EMB | 0.178 | 0.209 | 0.196 | 0.200 | 0.301 | 0.160 | 0.275 | 0.229 | 0.281 |
DA1 | 0.176 | 0.207 | 0.187 | 0.192 | 0.293 | 0.145 | 0.256 | 0.208 | 0.253 |
DA2 | 0.177 | 0.208 | 0.194 | 0.198 | 0.298 | 0.158 | 0.271 | 0.223 | 0.274 |
FCS1 | 0.178 | 0.209 | 0.237 | 0.211 | 0.324 | 0.248 | 0.306 | 0.223 | 0.299 |
FCS2 | 0.178 | 0.209 | 0.197 | 0.201 | 0.302 | 0.161 | 0.275 | 0.228 | 0.281 |
D-SI | 0.143 | 0.174 | 0.133 | 0.149 | 0.244 | 0.103 | 0.205 | 0.150 | 0.188 |
S-SI | 0.157 | 0.184 | 0.161 | 0.155 | 0.238 | 0.145 | 0.188 | 0.149 | 0.186 |

Table

Computational Time (Theoretical Data).

Number of Variables |
|||||||||
---|---|---|---|---|---|---|---|---|---|

2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |

EMB | 0.46 | 0.53 | 0.53 | 0.59 | 0.71 | ||||

DA2 | 1.09 | 1.39 | 2.22 | 3.63 | |||||

FCS2 | 2.47 | 5.98 | 14.48 | 21.33 | 25.40 | 54.71 | 59.14 | 85.69 | 133.17 |

This section presents the results of the Monte Carlo simulation for the realistic case, where the correlation matrix and the regression coefficients are based on the real data (

Table _{1}. The overall conclusions are similar to Table

Bias and RMSE (Realistic Data).

Number of Variables |
||||||||||
---|---|---|---|---|---|---|---|---|---|---|

2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ||

CD | Bias | 0.003 | 0.002 | 0.002 | 0.002 | 0.001 | 0.002 | 0.000 | 0.002 | 0.002 |

RMSE | 0.074 | 0.086 | 0.068 | 0.067 | 0.066 | 0.065 | 0.070 | 0.069 | 0.075 | |

LD | Bias | |||||||||

RMSE | 0.095 | 0.128 | 0.104 | 0.118 | 0.141 | 0.154 | 0.157 | 0.159 | 0.188 | |

EMB | Bias | 0.001 | 0.002 | 0.002 | 0.005 | 0.001 | 0.000 | 0.000 | 0.002 | 0.006 |

RMSE | 0.084 | 0.113 | 0.091 | 0.090 | 0.089 | 0.092 | 0.102 | 0.099 | 0.110 | |

DA1 | Bias | 0.006 | 0.001 | 0.003 | 0.003 | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 |

RMSE | 0.084 | 0.112 | 0.090 | 0.089 | 0.087 | 0.091 | 0.100 | 0.096 | 0.105 | |

DA2 | Bias | 0.009 | 0.000 | 0.002 | 0.004 | 0.002 | 0.004 | 0.000 | 0.001 | 0.001 |

RMSE | 0.084 | 0.111 | 0.089 | 0.088 | 0.086 | 0.090 | 0.098 | 0.094 | 0.102 | |

FCS1 | Bias | 0.007 | 0.006 | 0.005 | 0.002 | 0.008 | 0.006 | 0.000 | ||

RMSE | 0.084 | 0.106 | 0.081 | 0.081 | 0.080 | 0.081 | 0.086 | 0.083 | 0.088 | |

FCS2 | Bias | 0.007 | 0.001 | 0.002 | 0.002 | 0.003 | 0.005 | 0.002 | 0.003 | 0.005 |

RMSE | 0.084 | 0.112 | 0.088 | 0.088 | 0.086 | 0.090 | 0.097 | 0.093 | 0.100 | |

D-SI | Bias | |||||||||

RMSE | 0.207 | 0.163 | 0.115 | 0.118 | 0.118 | 0.123 | 0.130 | 0.127 | 0.151 | |

S-SI | Bias | 0.005 | 0.007 | 0.006 | 0.002 | 0.006 | 0.005 | 0.009 | 0.006 | |

RMSE | 0.089 | 0.116 | 0.096 | 0.095 | 0.091 | 0.094 | 0.100 | 0.102 | 0.105 |

Table _{1}. The overall conclusions are similar to Table

Coverage of the 95% CI (Realistic Data).

Number of Variables |
|||||||||
---|---|---|---|---|---|---|---|---|---|

2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |

CD | 94.6 | 95.3 | 95.8 | 94.7 | 95.2 | 96.4 | 94.6 | 95.3 | 94.8 |

LD | |||||||||

EMB | 94.3 | 94.1 | 94.7 | 93.9 | 96.1 | 94.2 | 94.0 | 94.4 | 94.7 |

DA1 | 94.1 | 94.4 | 95.7 | ||||||

DA2 | 94.0 | 94.0 | 94.8 | 94.4 | 95.9 | 94.5 | 93.8 | 95.0 | 95.0 |

FCS1 | 94.6 | 94.7 | 96.3 | ||||||

FCS2 | 94.7 | 93.8 | 95.5 | 95.7 | 96.4 | 94.3 | 94.8 | 95.2 | 96.1 |

D-SI | |||||||||

S-SI |

Table

Lengths of the 95% CI (Realistic Data).

Method \ Number of Variables | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
CD | 0.279 | 0.334 | 0.268 | 0.266 | 0.267 | 0.261 | 0.278 | 0.274 | 0.289 |
LD | 0.333 | 0.441 | 0.389 | 0.412 | 0.436 | 0.457 | 0.516 | 0.543 | 0.631 |
EMB | 0.314 | 0.429 | 0.364 | 0.356 | 0.362 | 0.359 | 0.397 | 0.396 | 0.432 |
DA1 | 0.313 | 0.414 | 0.348 | 0.342 | 0.343 | 0.337 | 0.370 | 0.364 | 0.390 |
DA2 | 0.315 | 0.423 | 0.356 | 0.351 | 0.353 | 0.351 | 0.383 | 0.380 | 0.410 |
FCS1 | 0.315 | 0.416 | 0.353 | 0.348 | 0.350 | 0.350 | 0.382 | 0.380 | 0.406 |
FCS2 | 0.316 | 0.429 | 0.359 | 0.355 | 0.358 | 0.352 | 0.389 | 0.386 | 0.413 |
D-SI | 0.288 | 0.380 | 0.292 | 0.289 | 0.291 | 0.278 | 0.302 | 0.294 | 0.315 |
S-SI | 0.281 | 0.325 | 0.262 | 0.257 | 0.259 | 0.255 | 0.269 | 0.267 | 0.277 |

Table

Computational Time (Realistic Data).

Number of Variables |
|||||||||
---|---|---|---|---|---|---|---|---|---|

2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |

EMB | 0.14 | 0.15 | 0.16 | 0.20 | 0.23 | 0.28 | 0.36 | ||

DA2 | 0.47 | 0.67 | |||||||

FCS2 | 1.05 | 2.55 | 4.22 | 8.92 | 12.02 | 15.59 | 20.82 | 26.78 | 35.95 |

This article assessed the relative performance of the three multiple imputation algorithms (DA, FCS, and EMB). In both theoretical and realistic settings (Table

DA and FCS can both be confidence valid given a large number of iterations; however, the assessment of convergence in MCMC is notoriously difficult. Furthermore, the convergence properties of FCS are currently under debate due to possible incompatibility (

No simulation studies can include all the patterns of relevant data (

The additional file for this article can be found as follows:

Political and Economic Data from CIA (

The author wishes to thank Dr. Manabu Iwasaki (Seikei University), Dr. Michiko Watanabe (Keio University), and Dr. Takayuki Abe (Keio University) for their helpful comments. The author also wishes to thank the two anonymous reviewers for their comments that improved the quality of the article. Note that part of this article in its very early version was presented at the 59th World Statistics Congress of the International Statistical Institute (

The author has no competing interests to declare.