ehrQL tutorial: Handling missing values🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Example dataset definition 5a: Handling missing values🔗

Learning objectives🔗

By the end of this tutorial, you should be able to:

describe how missing values are represented in ehrQL
check for missing values
replace missing values

In the tutorial examples so far, all the data rows have been fully populated with values. There have been no missing values.

This is a somewhat idealised situation. In the real world, data is often incomplete with some values missing. Missing data values in ehrQL are represented with a special "null" value.

We will explore how ehrQL's null values work in the below dataset definition and some approaches to dealing with missing values.

Full example🔗

Dataset definition: 5a_multiple3_dataset_definition.py

from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import hospitalisations, patient_address

dataset = Dataset()

most_recent_hospitalisation = hospitalisations.sort_by(
    hospitalisations.date
).last_for_patient()

lowest_imd_address = patient_address.sort_by(
    patient_address.index_of_multiple_deprivation_rounded
).first_for_patient()

population = most_recent_hospitalisation.exists_for_patient()
dataset.define_population(population)

dataset.most_recent_hospitalisation_code = most_recent_hospitalisation.code
dataset.most_recent_hospitalisation_system = (
    most_recent_hospitalisation.system.if_null_then("UnknownCodeSystem")
)
dataset.lowest_imd = (
    lowest_imd_address.index_of_multiple_deprivation_rounded.if_null_then(-1)
)
dataset.lowest_imd_is_valid = (
    lowest_imd_address.index_of_multiple_deprivation_rounded.is_not_null()
)

In this section, we will build up a dataset using data that has missing values. We will use two different tables:

hospitalisations
patient_address

Both of these tables contain missing data:

the hospitalisations table has a column called system with missing values
the patient address table has a column called index_of_multiple_deprivation_rounded We came across this column before in a previous tutorial but this time, we have included some missing data.

For brevity, the tables will not be displayed here but can be reviewed in the example-data/multiple3/ folder.

The output of the query above should generate the table below:

Output dataset: outputs/5a_multiple3_dataset_definition.csv

patient_id	most_recent_hospitalisation_code	most_recent_hospitalisation_system	lowest_imd	lowest_imd_is_valid
1	h5	TutorialCodeSystem	-1	F
2	h1	UnknownCodeSystem	29874	T
4	h10	UnknownCodeSystem	1500	T
6	h8	TutorialCodeSystem	-1	F

Line by line explanation🔗

This dataset definition:

sets the population to be those patients with a hospitalisation entry
adds the most recent hospitalisation date for a patient to the dataset
adds details of the index of multiple deprivation

We will handle the missing data as follows:

Where most_recent_hospitalisation_system in the hospitalisation table is missing, we will replace this with UnknownCodeSystem
Where imd is missing, we will replace this with a -1

Most recent hospitalisation🔗

As we have seen before, we can sort and select an entry per patient with methods like sort_by(), and first_for_patient().

This time, we are sorting and taking the last hospitalisation for the patient with last_for_patient().

Lowest IMD address🔗

This is similar to what we have done before. In this case we sort rows by index_of_multiple_deprivation_rounded, and take the first row for the patient.

Define population🔗

We are trying to capture patients who have a recent hospitalisation. For this, we check if a patient has row in the most_recent_hospitalisation subset (created above). We can use exists_for_patient() as we did in a previous tutorial.

Find code for hospitalisation🔗

We want to find the code that is associated with the hospitalisation. This is directly accessible as a column in the hospitalisation table.

Replacing nulls: null hospitalisation coding system values🔗

We now need to deal with the missing data in the hospitalisation code. We can specify a replacement value for nulls as we have done in this dataset definition.

This is via the if_null_then() method.

The result is that the dataset contains UnknownCodeSystem in the data in place of the nulls.

Replacing nulls: IMD🔗

We now need to deal with the missing data in the patient address table in the column of index_of_multiple_deprivation_rounded. You might reasonably think that, since we selected the lowest value of index of multiple deprivation, that this lowest value would be a non-null value.

However, ehrQL sorts null values before non-null values. If a patient has null and non-null values, then the first_for_patient() will be a null value.

This results in the "lowest" IMD value in some cases being null.

In the dataset definition, we replace the missing values with a negative value known to be invalid, but of the correct integer type (-1). This is via the if_null_then() method.

Check if lowest IMD is valid🔗

We want to create a variable that checks if an IMD is valid. If so, returns a True, and if not, returns a False. We use is_not_null() to handle the nulls.

Here, we checked that values were not null with the is_not_null() method. We could have also used the is_null() method to check if values are null. In both cases, the result is a Boolean True or False for each row.

Your Turn🔗

Question

Can you modify the dataset definition to eliminate the IMD value nulls? (Hint: you may find except_where() useful to filter out unwanted rows.)
Even with that modification, we still get a null IMD value in the dataset? Why? (Hint: look at the patient ID values.)
Can you further modify the dataset definition to add an extra criterion in define_population() to remove this row entirely? (Hint: you may find the table operators covered earlier useful.)