E-book, English, 234 pages
ISBN: 978-0-387-69505-1
Publisher: Springer US
Format: PDF
Copy protection: PDF watermark
Target audience
Professional/practitioner
Authors/Editors
Further information & material
- Data Quality: What It is, Why It is Important, and How to Achieve It
- What is Data Quality and Why Should We Care?
- Examples of Entities Using Data to their Advantage/Disadvantage
- Properties of Data Quality and Metrics for Measuring It
- Basic Data Quality Tools
- Specialized Tools for Database Improvement
- Mathematical Preliminaries for Specialized Data Quality Techniques
- Automatic Editing and Imputation of Sample Survey Data
- Record Linkage – Methodology
- Estimating the Parameters of the Fellegi–Sunter Record Linkage Model
- Standardization and Parsing
- Phonetic Coding Systems for Names
- Blocking
- String Comparator Metrics for Typographical Error
- Record Linkage Case Studies
- Duplicate FHA Single-Family Mortgage Records
- Record Linkage Case Studies in the Medical, Biomedical, and Highway Safety Areas
- Constructing List Frames and Administrative Lists
- Social Security and Related Topics
- Other Topics
- Confidentiality: Maximizing Access to Micro-data while Protecting Privacy
- Review of Record Linkage Software
- Summary Chapter
7 Automatic Editing and Imputation of Sample Survey Data (p. 61)
7.1. Introduction
As discussed in Chapter 3, missing and contradictory data are endemic in computer databases. In Chapter 5, we described a number of basic data editing techniques that can be used to improve the quality of statistical data systems. By an edit we mean a rule specifying a combination of values of data elements within a database that is jointly unacceptable (or, equivalently, specifying the complementary combinations of values that are jointly acceptable). Certainly, we can use edits of the types described in Chapter 5.
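To make the definition concrete, the short sketch below (ours, with hypothetical field names and rules, not an example from the book) represents each edit as a predicate over a single record that flags a jointly unacceptable combination of values:

```python
# Minimal sketch (hypothetical fields and rules): each edit flags a jointly
# unacceptable combination of values in a single record.
def edit_age_marital(record):
    # A person under 15 cannot be reported as married.
    return record["age"] < 15 and record["marital_status"] == "married"

def edit_income_components(record):
    # Reported total income must equal the sum of its components.
    return record["total_income"] != record["wages"] + record["other_income"]

EDITS = [edit_age_marital, edit_income_components]

def failed_edits(record):
    """Return the names of the edits that the record violates."""
    return [e.__name__ for e in EDITS if e(record)]

record = {"age": 12, "marital_status": "married",
          "total_income": 50000, "wages": 30000, "other_income": 15000}
print(failed_edits(record))   # ['edit_age_marital', 'edit_income_components']
```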
In this chapter, we discuss automated procedures for editing (i.e., cleaning up erroneous values) and imputing (i.e., filling in missing values) in databases constructed from responses to sample surveys or censuses. To accomplish this task, we need efficient ways of developing statistical data edit/imputation systems that minimize development time, eliminate most errors in code development, and greatly reduce the need for human intervention.
In particular, we would like to drastically reduce, or eliminate entirely, the need for humans to change/correct data. The goal is to improve survey data so that they can be used for their intended analytic purposes.
One such important purpose is the publication of estimates of totals and subtotals that are free of self-contradictory information. We begin by discussing editing procedures, focusing on the model proposed by Fellegi and Holt [1976]. Their model was the first to provide fast, reproducible, table-driven methods that could be applied to general data, and the first to ensure that a record could be corrected in one pass through the data.
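As a rough illustration of what correcting a record "in one pass" involves (a brute-force simplification of ours, not the Fellegi–Holt algorithm itself; the fields, domains, and edits are hypothetical), error localization finds a smallest set of fields whose values can be changed so that the record satisfies every edit:

```python
from itertools import combinations, product

# Hypothetical discrete domains and edits (an edit returns True when violated).
DOMAINS = {"age_group": ["child", "adult"],
           "marital_status": ["single", "married"],
           "employed": ["yes", "no"]}
EDITS = [lambda r: r["age_group"] == "child" and r["marital_status"] == "married",
         lambda r: r["age_group"] == "child" and r["employed"] == "yes"]

def passes_all(record):
    return not any(edit(record) for edit in EDITS)

def localize_and_correct(record):
    """Find a smallest set of fields to change so that all edits are satisfied."""
    fields = list(DOMAINS)
    for k in range(len(fields) + 1):               # try changing 0, 1, 2, ... fields
        for subset in combinations(fields, k):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes_all(candidate):
                    return subset, candidate       # minimal change found
    return None

record = {"age_group": "child", "marital_status": "married", "employed": "yes"}
print(localize_and_correct(record))
# (('age_group',), {'age_group': 'adult', 'marital_status': 'married', 'employed': 'yes'})
```

The Fellegi–Holt theory provides a principled, table-driven way to do this on realistic edit systems; the sketch above simply enumerates candidate changes.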
Prior to Fellegi and Holt, records were iteratively and slowly changed with no guarantee that any final set of changes would yield a record that satisfied all edits. We then describe a number of schemes for imputing missing data elements, emphasizing the work of Rubin [1987] and Little and Rubin [1987, 2002].
Two important advantages of the Little–Rubin approach are that (1) probability distributions are preserved by the use of defensible statistical models and (2) estimated variances include a component due to the imputation. In some situations, the Little–Rubin methods may need extra information about the non-response mechanism.
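To make advantage (2) concrete, Rubin's [1987] combining rules make the imputation component of the variance explicit: with m completed datasets, the total variance of a combined estimate is the average within-imputation variance plus (1 + 1/m) times the variance of the point estimates across imputations. A minimal sketch, with purely illustrative numbers of our own:

```python
import statistics

# Illustrative numbers only: point estimates and their estimated variances
# from m completed (multiply imputed) datasets.
estimates = [101.2, 99.8, 100.5, 102.0, 100.9]
within_variances = [4.1, 3.9, 4.3, 4.0, 4.2]
m = len(estimates)

q_bar = statistics.mean(estimates)          # combined point estimate
u_bar = statistics.mean(within_variances)   # average within-imputation variance
b = statistics.variance(estimates)          # between-imputation variance
t = u_bar + (1 + 1 / m) * b                 # total variance (Rubin's combining rule)

print(q_bar, t)   # t exceeds u_bar because of the imputation component
```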
If, for instance, certain high-income individuals have a stronger tendency not to report or to misreport income, then a specific model for the income reporting of these individuals may be needed. In other situations, the missing-data imputation can be done via methods that are straightforward extensions of hot-deck. We provide details of hot-deck and its extensions later in this chapter.
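As a preview (our own minimal sketch with hypothetical fields, not the procedure described later in the chapter), a basic hot-deck imputation fills a missing item with a value reported by a donor record from the same imputation class:

```python
import random
from collections import defaultdict

def hot_deck_impute(records, target, class_vars):
    """Fill missing `target` values with values from donors in the same imputation class."""
    donors = defaultdict(list)
    for r in records:
        if r[target] is not None:
            donors[tuple(r[v] for v in class_vars)].append(r[target])
    for r in records:
        if r[target] is None:
            pool = donors.get(tuple(r[v] for v in class_vars))
            if pool:                      # leave the item missing if the class has no donor
                r[target] = random.choice(pool)
    return records

records = [
    {"region": "N", "age_group": "adult", "income": 52000},
    {"region": "N", "age_group": "adult", "income": 48000},
    {"region": "N", "age_group": "adult", "income": None},   # recipient with donors
    {"region": "S", "age_group": "child", "income": None},   # no donor in its class
]
print(hot_deck_impute(records, "income", ["region", "age_group"]))
```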
Ideally, we would like to have an all-purpose, unified edit/imputation model that incorporates the features of the Fellegi–Holt edit model and the Little–Rubin multiple imputation model. Unfortunately, we are not aware of such a model. However, Winkler [2003] provides a unified approach to edit and imputation when all of the data elements of interest can be considered to be discrete.