
Using linked administrative data to aid the handling of non-response and restore sample representativeness in cohort studies: the 1958 national child development study and hospital episode statistics data | BMC Medical Research Methodology


Data

1958 National child development study (NCDS)

The NCDS follows the lives of 18,558 people born in Great Britain in a single week of 1958 [8]. Since the birth sweep, NCDS cohort members have been followed up 10 times, and the eleventh sweep is currently underway, with cohort members now aged 64. The study includes information on cohort members’ physical and educational development, economic circumstances, employment, family life, health behaviours, wellbeing, social participation, biological data and attitudes. Although response rates in recent sweeps of NCDS remain relatively high considering the decades-long duration of the study, non-response is sufficiently common to require careful handling. For example, of the 15,613 NCDS cohort members remaining in the target population (still alive and living in Great Britain) at wave 9 (2013, age 55), 9,137 (58.5%) responded and 6,476 (41.5%) were not observed for one of a number of reasons (refusal, the survey team not being able to establish contact, or contact not being attempted, for example because of long-term refusal). Item non-response among respondents is lower, typically less than 10% [15].
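The wave 9 response figures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative.

```python
# Counts quoted in the text for NCDS wave 9 (2013, age 55).
target_population = 15_613
respondents = 9_137
non_respondents = 6_476

# Respondents and non-respondents should partition the target population.
assert respondents + non_respondents == target_population

response_rate = 100 * respondents / target_population
non_response_rate = 100 * non_respondents / target_population

print(f"responded: {response_rate:.1f}%, not observed: {non_response_rate:.1f}%")
# prints "responded: 58.5%, not observed: 41.5%"
```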

Hospital episode statistics (HES)

HES is a collection of databases containing details of all admissions (Admitted Patient Care (APC) and Critical Care (CC)), Accident and Emergency (A&E) attendances and Outpatient (OP) appointments at NHS hospitals in England, maintained by NHS Digital [9]. Each HES dataset provides detailed information on admission and discharge or appointment dates, diagnoses, procedures, basic patient demographics, and hospital characteristics [16]. The period of data availability differs by dataset, from 1997 for APC, from 2007 for A&E, from 2009 for CC and from 2003 for OP.

Linked NCDS-HES data

Linkage between NCDS and all four HES datasets has recently been undertaken, on the basis of consents obtained at NCDS wave 8 (2008, age 50) [10, 11]. Matching was carried out in two stages: in the first, NHS Digital used information provided by the NCDS team on the cohort members’ name, sex, date of birth and postcode to identify their NHS number; in the second, NHS Digital used the identified NHS number to extract HES data for each cohort member, with pseudo-anonymised linked HES data returned. Because HES data relate to NHS hospitals in England only, we restricted our attention to NCDS cohort members whom we considered eligible for HES linkage because they had lived in England for at least one wave between wave 6 (2000, age 42) and wave 9 (2013, age 55) (the period corresponding to HES data availability). The flow of data, from the full sample of NCDS cohort members to the linked samples for each HES dataset, is shown in the data flow diagram in Supplementary Fig. S1. Recent analyses suggest that the linkage quality of the NCDS-HES data is high and that the linked sample retains a good level of population representativeness [17].

In this study we restricted our attention to cohort members who were in the wave 9 (2013, age 55) target population (those who were alive and still living in Great Britain at this point). Individuals outside the target population would not have been in the issued sample for the wave 9 follow-up and therefore could not have responded. As our aim was to identify predictors of non-response and not of mortality or emigration, such individuals were excluded rather than being considered as non-respondents. We used linked HES data from the earliest available date until the end of 2012 to ensure that we only used HES information which pre-dated the point at which response was sought. The impact of these additional criteria on the sample is shown in the data flow diagram in Fig. 1.

Fig. 1

Flow diagram showing 1958 British National Child Development Study-Hospital Episode Statistics data linkage and data availability. APC: admitted patient care; CC: critical care; A&E: accident and emergency; OP: outpatients

Annual population survey (APS)

The Annual Population Survey (APS) is a large survey administered yearly by the Office for National Statistics (ONS) [18]. It contains approximately 320,000 respondents and covers social and economic aspects of individuals’ lives. In this study, we used the APS January-December 2013 survey [19] to derive population estimates for the variables of interest, limiting our analysis to 55-year-olds.

Variables

NCDS

In the present analysis we focus on NCDS non-response at wave 9 (age 55). This was captured as a binary variable identifying cohort members who did not take part in the survey, whether because of refusal, the survey team not being able to establish contact, or contact not being attempted (for example because of long-term refusal).

Predictors of age 55 NCDS non-response, listed in Supplementary Table S1, were previously identified using survey data from the 10 preceding sweeps (birth to age 50) of NCDS [12].

To assess how effective the identified HES predictors of NCDS age 55 non-response were at restoring sample representativeness despite selective attrition, we considered representativeness with respect to two NCDS variables observed in early life and two NCDS variables observed in later life (subsequently collectively referred to as “analysis variables”): father’s social class at birth (binary variable for father being in the professional social class), cognitive ability at age 7 (continuous principal component analysis score derived using the scores from the problem arithmetic test, copying designs test, drawing a man test and Southgate Group Reading Test), educational qualifications at age 55 (binary variable for no educational qualifications), and marital status at age 55 (binary variable for single and never married).

Linked NCDS-HES data

A total of 58 variables to be considered as potential predictors of NCDS non-response at age 55 were derived across the APC, OP and A&E HES datasets. We aimed to derive as many variables as we could using the information available, though intentionally avoided variables with low sample prevalence which would be unlikely to prove useful as auxiliary variables. We therefore derived variables relating to diagnoses and treatments at a high level (e.g. International Classification of Diseases (ICD)-10 chapters) rather than considering more granular coding. The derived variables relate to the numbers of admissions and appointments, missed appointments, investigations undertaken, diagnoses and treatments received (full details in Supplementary Table S2).
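As a rough illustration of deriving high-level variables of this kind, the sketch below counts admissions per cohort member and flags ICD-10 chapters from the first letter of a diagnosis code. The record layout, identifiers and the function name `derive_variables` are hypothetical and are not taken from the study; only three chapters that each occupy a whole letter are mapped.

```python
from datetime import date

# Hypothetical APC records: (pseudonymised id, admission date, primary ICD-10 code).
episodes = [
    ("A1", date(2005, 3, 14), "I21"),
    ("A1", date(2011, 7, 2), "J18"),
    ("A1", date(2013, 5, 9), "I25"),   # after the cut-off, so ignored
    ("B2", date(2009, 11, 30), "K35"),
]

# The text uses HES data up to the end of 2012, before wave 9 response was sought.
CUTOFF = date(2012, 12, 31)

# First letter -> ICD-10 chapter, for three chapters that each span a whole letter.
CHAPTER_BY_LETTER = {
    "I": "IX circulatory",
    "J": "X respiratory",
    "K": "XI digestive",
}

def derive_variables(episodes, cohort_ids):
    """Return {id: {"n_admissions": int, "chapters": set}} for every cohort member."""
    # Members with no linked episodes keep a count of zero rather than a missing value.
    out = {cid: {"n_admissions": 0, "chapters": set()} for cid in cohort_ids}
    for cid, admitted, code in episodes:
        if cid not in out or admitted > CUTOFF:
            continue
        out[cid]["n_admissions"] += 1
        chapter = CHAPTER_BY_LETTER.get(code[0])
        if chapter is not None:
            out[cid]["chapters"].add(chapter)
    return out

derived = derive_variables(episodes, ["A1", "B2", "C3"])
print(derived["A1"]["n_admissions"], sorted(derived["A1"]["chapters"]))
```

Coding at chapter level keeps the prevalence of each derived flag higher than it would be for individual diagnosis codes, which is the stated reason for avoiding more granular coding.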

APS

For 55-year-olds in APS, we derived the percentage of individuals who were single and had never been married and the percentage of individuals with no educational qualifications using survey information weighted to the mid-2013 population estimate using the weights provided by the ONS [19].
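A weighted percentage of this kind is the sum of the weights of records with the characteristic divided by the sum of all weights. The sketch below shows the calculation on made-up records; the values and the function name `weighted_percentage` are illustrative and do not come from APS.

```python
def weighted_percentage(flags, weights):
    """Weighted percentage of records whose flag is True."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    flagged = sum(w for f, w in zip(flags, weights) if f)
    return 100.0 * flagged / total

# Hypothetical 55-year-old respondents: no-qualifications flag and population weight.
no_quals = [True, False, False, True, False]
weights = [150.0, 300.0, 250.0, 100.0, 200.0]

pct = weighted_percentage(no_quals, weights)
print(f"{pct:.1f}% with no educational qualifications")
# prints "25.0% with no educational qualifications"
```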

Statistical analysis

HES predictors of NCDS non-response at wave 9 (age 55)

In order to identify which of the 58 derived HES variables were important predictors of non-response at age 55 in NCDS, we employed the least absolute shrinkage and selection operator (LASSO) [20]. We included all 58 HES variables in a logistic regression model for non-response and used the LASSO lambda value that minimised mean cross-validated error using 10-fold cross-validation.
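The selection step can be sketched as follows. This is a minimal NumPy illustration on simulated data, not the authors' code: it fits an L1-penalised logistic regression by proximal gradient descent and picks the penalty that minimises mean 10-fold cross-validated deviance. The lambda grid, the simulated predictors and the function names are all assumptions made for the example.

```python
import numpy as np

def fit_l1_logistic(X, y, lam, n_iter=500):
    """L1-penalised logistic regression via proximal gradient (unpenalised intercept)."""
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    step = 4.0 * n / np.linalg.norm(Xa, 2) ** 2
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Xa @ beta)))
        beta = beta - step * (Xa.T @ (prob - y) / n)
        # Soft-threshold every coefficient except the intercept.
        beta[1:] = np.sign(beta[1:]) * np.maximum(np.abs(beta[1:]) - step * lam, 0.0)
    return beta

def mean_deviance(X, y, beta):
    """Mean binomial deviance (twice the mean negative log-likelihood)."""
    eta = beta[0] + X @ beta[1:]
    return 2.0 * np.mean(np.logaddexp(0.0, eta) - y * eta)

def cv_lasso(X, y, lambdas, k=10, seed=0):
    """Return the lambda minimising mean k-fold CV deviance, and the refitted model."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % k
    err = np.zeros(len(lambdas))
    for f in range(k):
        train, test = folds != f, folds == f
        for j, lam in enumerate(lambdas):
            beta = fit_l1_logistic(X[train], y[train], lam)
            err[j] += mean_deviance(X[test], y[test], beta) / k
    best = lambdas[int(np.argmin(err))]
    return best, fit_l1_logistic(X, y, best)

# Simulated example: 8 standardised candidate predictors, only the first two matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
eta = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

lambdas = np.array([0.2, 0.1, 0.05, 0.02, 0.01, 0.005])
best_lam, beta = cv_lasso(X, y, lambdas)
selected = np.flatnonzero(beta[1:])  # indices of predictors with non-zero coefficients
print(best_lam, selected)
```

Predictors whose coefficients are shrunk exactly to zero at the chosen lambda are dropped; the remainder are the selected predictors of non-response.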

In a secondary analysis we used a multi-stage P value-based variable selection approach, similar to that employed by Mostafa et al. [12], for comparison with the primary approach using the LASSO (see Supplementary Methods S1).

Restoring sample representativeness

We undertook several analyses to assess how effective the identified HES predictors of NCDS age 55 non-response were at restoring sample representativeness despite selective attrition. The basic idea underlying each analysis is the same. Two versions of a statistic are compared with a known benchmark value: the statistic calculated using data from wave 9 respondents only (and therefore subject to non-response bias), and the same statistic estimated using predictors of non-response as auxiliary variables in multiple imputation (MI) analyses (to make estimates for the broader sample, including non-respondents). Full details of the analyses are provided in Supplementary Methods S2 but are briefly summarised here.
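The comparison described above can be illustrated with a small simulation. The sketch below is a deliberately simplified stand-in for the MI analyses, with every number made up: a single binary auxiliary variable drives both the analysis variable and non-response, and missing values are imputed by drawing from the respondents' proportion within each level of the auxiliary variable. Proper MI would also draw the imputation-model parameters, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20_000, 20  # illustrative sample size; 20 imputed datasets

# Simulated data (all probabilities are assumptions for illustration).
aux = rng.random(n) < 0.3                             # auxiliary variable, known for everyone
y = rng.random(n) < np.where(aux, 0.4, 0.1)           # binary analysis variable
responded = rng.random(n) < np.where(aux, 0.4, 0.7)   # response depends on aux only

benchmark = y.mean()                    # known here only because the data are simulated
respondents_only = y[responded].mean()  # subject to non-response bias

estimates = []
for _ in range(m):
    y_imp = y.copy()
    for level in (False, True):
        donors = y[responded & (aux == level)]
        missing = ~responded & (aux == level)
        # Impute each missing value from a Bernoulli with the respondents' proportion.
        y_imp[missing] = rng.random(missing.sum()) < donors.mean()
    estimates.append(y_imp.mean())

mi_estimate = float(np.mean(estimates))
print(f"benchmark {benchmark:.3f}, respondents only {respondents_only:.3f}, "
      f"imputed {mi_estimate:.3f}")
```

Because non-response here depends only on the auxiliary variable, imputing within its levels brings the estimate back towards the benchmark, whereas the respondents-only figure stays biased.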

We first explored the associations between the analysis variables of interest and the identified HES and survey predictors of non-response. This allowed us to assess whether the HES/survey predictors of non-response were sufficiently well associated with the analysis variables to constitute potentially useful auxiliary variables. Associations were explored using linear or logistic regression (as appropriate), with P values from Wald tests of the parameter(s) presented to summarise the strength of evidence for each association.

The first restoring sample representativeness analysis (“Analysis A”) focused on HES linkage consenters who were eligible for linkage and who were within the wave 9 target population. These individuals are non-missing for all HES variables since we assumed those with no linked HES record truly had no relevant hospital interactions. These analyses considered sample representativeness in terms of variables observed in early life (father’s social class at birth and cognitive ability at age 7). The percentage of fathers in professional social class at birth and mean cognitive ability at age 7 were calculated in several different ways: (i) using all available data from respondents at that point in time (i.e. birth and age 7 respectively); (ii) using data from wave 9 respondents only (to assess bias due to non-response at wave 9); and (iii) using HES and/or survey predictors of non-response as auxiliary variables in MI analyses (to assess to what extent sample representativeness can be restored using the selected predictors of non-response).

The second analysis (“Analysis B”) focused on all NCDS cohort members within the wave 9 target population. This includes individuals who did not consent to HES linkage (or who did consent but were ineligible for linkage) and are therefore missing for all HES variables. Analyses related to restoring sample representativeness of early life NCDS variables (father’s social class at birth and cognitive ability at age 7) involved similar comparisons to those outlined for Analysis A. Analyses related to restoring sample representativeness of later life NCDS variables (educational qualifications at age 55 and marital status at age 55) instead considered the percentage without educational qualifications by age 55 and the percentage single and never married by age 55 calculated: (i) among wave 9 respondents; and (ii) using survey or survey and HES predictors of non-response as auxiliary variables in MI analyses. These were then compared to population benchmark values derived from APS.

In each analysis we utilised MI with chained equations [21], generating 20 imputed datasets. All analyses were conducted in Stata 16 and R 4.0.3.
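The text does not state how estimates were combined across the imputed datasets; Rubin's rules are the standard choice and are assumed in the sketch below. The pooled estimate is the mean of the per-dataset estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The input values are made up, and five imputations are used instead of 20 to keep the example short.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical estimates of a proportion from five imputed datasets.
estimates = [0.18, 0.20, 0.19, 0.21, 0.17]
variances = [0.0004] * 5

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate {pooled:.3f}, standard error {total_var ** 0.5:.4f}")
```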


