Step 3: Imputation of missing data

The idea of imputation is both seductive and dangerous

Like most statistical series, composite indicators are plagued by problems of missing values. In many cases, data are only available for a limited number of countries or only for certain data components. Missing values can render the composite indicator less reliable for the countries for which only limited information is available and can distort the relative standing of all countries in the composite. There are a number of approaches for dealing with missing values, all of which have flaws:

  • data deletion - omitting entire records (for variables or countries) when a substantial amount of data is missing;
  • mean substitution - substituting a variable's mean value computed from available cases to fill in missing values;
  • regression - using regressions based on other variables to estimate the missing values;
  • multiple imputation - using a large number of sequential regressions with indeterminate outcomes, which are run multiple times and averaged;
  • nearest neighbour - identifying and substituting the most similar case for the one with a missing value; or
  • ignoring them - computing the index as the average of the remaining indicators.
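
To make the differences between these approaches concrete, the following is a minimal sketch of four of them (data deletion, mean substitution, nearest neighbour, and ignoring the gaps) on a toy data set. Everything in it is an assumption made for illustration: the country labels, the indicator values, the function names, the use of mean absolute difference as the similarity measure, and the equal weighting of indicators. Regression and multiple imputation are left out because they need a fitted model. The sketch is not a recommended implementation.

```python
from statistics import mean

# Hypothetical toy data: four countries, three normalised indicators (0 to 1).
# None marks a missing value. Labels and numbers are illustrative only.
DATA = {
    "A": [0.8, 0.6, 0.7],
    "B": [0.4, None, 0.5],
    "C": [0.9, 0.5, None],
    "D": [0.3, 0.2, 0.4],
}


def delete_incomplete(data):
    """Data deletion: keep only countries with no missing values."""
    return {c: v for c, v in data.items() if None not in v}


def mean_substitute(data):
    """Mean substitution: fill each gap with that indicator's mean over available cases."""
    n = len(next(iter(data.values())))
    col_means = [
        mean(v[j] for v in data.values() if v[j] is not None) for j in range(n)
    ]
    return {
        c: [col_means[j] if x is None else x for j, x in enumerate(v)]
        for c, v in data.items()
    }


def nearest_neighbour(data):
    """Nearest neighbour: copy each gap from the most similar country.

    Similarity is assumed here to be the mean absolute difference over the
    indicators both countries report; other distance measures are possible.
    Assumes at least one other country reports the missing indicator.
    """
    filled = {}
    for c, v in data.items():
        row = list(v)

        def dist(other):
            shared = [
                abs(a - b)
                for a, b in zip(v, other)
                if a is not None and b is not None
            ]
            return mean(shared) if shared else float("inf")

        for j, x in enumerate(v):
            if x is None:
                donors = [o for k, o in data.items() if k != c and o[j] is not None]
                row[j] = min(donors, key=dist)[j]
        filled[c] = row
    return filled


def composite(values):
    """Ignore gaps: equal-weight average of whichever indicators are available."""
    return mean(x for x in values if x is not None)


if __name__ == "__main__":
    print("deletion keeps:", sorted(delete_incomplete(DATA)))
    print("mean substitution:", mean_substitute(DATA)["B"])
    print("nearest neighbour:", nearest_neighbour(DATA)["B"])
    print("ignoring gaps, B scores:", composite(DATA["B"]))
```

On this toy data the choice of method matters. Deletion drops countries B and C entirely. Mean substitution gives B a second indicator of about 0.43. Nearest neighbour borrows 0.2 from country D, which B resembles most on the indicators they share. Each choice leads to a different composite score for B, which is the sense in which missing values can distort relative standings.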

“The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to real and imputed data have substantial bias.” (Dempster A.P. and Rubin D.B. (1983) Introduction pp.3-10, in Incomplete Data in Sample Surveys (vol. 2): Theory and Bibliography (W.G. Madow, I. Olkin and D.B. Rubin eds.) New York: Academic Press.)