Statistical methods to improve estimates obtained from probability and nonprobability samples

  1. Ferri García, Ramón
Supervised by:
  1. María del Mar Rueda García Director

Defence university: Universidad de Granada

Defence date: 28 June 2021

Committee:
  1. Ana María Aguilera del Pino Chair
  2. María Jesús García Ligero Ramírez Secretary
  3. Domingo Morales González Committee member
  4. Yves Tillé Committee member
  5. María José Lombardía Committee member
Department:
  1. ESTADÍSTICA E INVESTIGACIÓN OPERATIVA

Type: Thesis

Abstract

Since their theoretical development in the first half of the 20th century, surveys have been the standard procedure for obtaining information from a population of interest. The statistical properties of the estimators of population parameters, such as totals, means or proportions, allow researchers to make inferences about a target population using only a small sample of it, and to obtain a measure of the variability of the estimates. The first surveys were administered by interviewing the respondents in person, a mode known as face-to-face surveying. This administration mode has been considered the "gold standard" practice in surveys, but its increasing costs and the advances in communication technologies favored the rise of telephone surveys and self-administered questionnaires, such as those used in mail surveys. In recent decades, these modes have also experienced rising costs and coverage problems, as well as a decline in response rates. Again, the development of new technologies has enabled a new set of questionnaire administration techniques known as online surveys. Examples include SMS surveys, e-mail surveys, smartphone surveys and, especially, Web surveys, which are administered and completed in web browsers. Online surveys offer researchers many advantages. Participants can be recruited much faster than in other survey modes, and at much lower cost. In addition, the technology allows researchers to design questionnaires with a wider range of possibilities than in face-to-face, telephone or mail surveys. On the other hand, online surveys present several relevant sources of error. By definition, such surveys can only reach online users or people with some kind of access to information and communication technology networks.
This is an important coverage issue that can lead to biased estimates if the composition of the offline population differs significantly from that of the online population, which is often the case because the differences are associated with demographics such as education level or age. In addition, the lack of a reliable sampling frame of the online population contributes to the use of self-selection procedures in online surveys. This practice is an example of nonprobability sampling, where the estimators of population parameters and their variances cannot be calculated because the inclusion probabilities do not meet the requirements of probability sampling. The main consequence of these procedures is selection bias, which can be substantial if there is any relationship between the propensity to participate (self-select) in the survey and the variables of interest of the study. Where a sampling frame is available for an online survey, and a sampling scheme can therefore be designed, non-response bias is also likely to appear. This is a particularly relevant issue in online panel surveys, and it has been associated with factors such as questionnaire length, incentives or invitation reminders. Several methods have been developed in the survey methodology literature to address these issues. Non-response error is a problem common to all probability sampling surveys, and consequently many methods have been developed to mitigate it, most notably imputation and reweighting techniques. The correction of coverage and self-selection biases depends on the auxiliary information available. If only population totals for a set of covariates are available, calibration procedures can be applied; these have been shown to reduce coverage error, but their effectiveness in correcting self-selection bias in online surveys is unclear.
In some cases, a reference probability survey, conducted in the same target population, is available. The variable of interest has not been measured in it, but if some auxiliary covariates (also measured in the online survey) are available, some adjustments can be considered. The most notable are Propensity Score Adjustment (PSA) and Statistical Matching or Mass Imputation. These adjustments focus on mitigating self-selection bias. Finally, if a population census is available for some auxiliary covariates (also measured in the online survey), methods based on superpopulation modeling can be considered, such as model-based, model-adjusted and model-calibrated estimators. These methods have mostly been considered in probability sampling contexts, although some recent works adapt some of them to nonprobability sampling problems. To contribute to the development of online surveys, we propose several methodological advances: the development of estimators of general parameters and of their variance, the study of the properties of the combination of PSA and calibration, the use of modern prediction techniques and variable selection methods in PSA, and the adaptation of all the superpopulation modeling approaches to the nonprobability sampling context, also considering modern prediction techniques. We also adapt the weight smoothing strategy, developed to increase the efficiency of the estimators in multipurpose probability surveys, to the nonprobability sampling context. Adapting the existing weighting adjustments for such samples to multipurpose surveys could be key to their adoption in the production of official statistics or their inclusion in large-scale studies. Finally, we use PSA in the study of health-related variables in healthcare professionals, using data from an online survey as the main source of information and the population census as the reference sample.
We compare the results with the unadjusted case and evaluate the performance of the adjustment. Note: This thesis is presented as a compendium of seven publications related to its contents. The full versions of the papers are included in Appendices A1 to A7.
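As an illustration of the Propensity Score Adjustment idea mentioned in the abstract, the following is a minimal sketch and not the implementation used in the thesis. It assumes a single categorical covariate, so that a saturated logistic model for the propensity reduces to weighted cell proportions and no optimiser is needed. It also assumes odds-type weights (1 - p) / p, which is only one of several weight definitions used with PSA. The function names (`psa_weights`, `weighted_mean`) and the toy data are hypothetical.

```python
from collections import defaultdict


def psa_weights(volunteer_x, reference_x, reference_d):
    """Return odds-type PSA weights (1 - p) / p for each volunteer unit.

    p(x) is the estimated propensity of belonging to the volunteer
    (nonprobability) sample, computed from the pooled data. Assumption:
    one categorical covariate, so the propensity in each cell is the
    share of volunteers among volunteers plus design-weighted reference
    units.
    """
    n_vol = defaultdict(float)  # volunteers per cell, each counted once
    n_ref = defaultdict(float)  # design-weighted reference units per cell
    for x in volunteer_x:
        n_vol[x] += 1.0
    for x, d in zip(reference_x, reference_d):
        n_ref[x] += d

    weights = []
    for x in volunteer_x:
        p = n_vol[x] / (n_vol[x] + n_ref[x])  # estimated propensity
        weights.append((1.0 - p) / p)
    return weights


def weighted_mean(y, w):
    """Weighted (Hajek-type) mean of y with weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Toy data, invented for illustration only.
# Online (volunteer) sample: young respondents are over-represented.
volunteer_age = ["young"] * 6 + ["old"] * 2
volunteer_y = [1, 1, 1, 1, 0, 0, 0, 0]  # variable of interest

# Reference probability sample: the covariate and design weights only,
# since the variable of interest is not measured in it.
reference_age = ["young", "young", "old", "old"]
reference_d = [25.0, 25.0, 25.0, 25.0]

w = psa_weights(volunteer_age, reference_age, reference_d)
unadjusted = sum(volunteer_y) / len(volunteer_y)
adjusted = weighted_mean(volunteer_y, w)
print(unadjusted, adjusted)
```

In this toy example the adjustment gives more weight to the under-represented "old" respondents, so the estimated mean moves from 0.5 to about 0.33. With several or continuous covariates the cell proportions would be replaced by a fitted model, such as logistic regression or one of the prediction techniques the abstract refers to.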