Installing Data Builder with Python🔗
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.
Warning
We recommend that you use Data Builder with the OpenSAFELY CLI as instructed in the ehrQL tutorial.
Limitations🔗
This option is a fall back if:
- you are a competent Python user,
- and you understand how to install Python packages yourself with
pip
This installation option will allow you to run ehrQL dataset definitions only.
You will not be able to run a full OpenSAFELY project via a project.yaml
pipeline.
If you are unable to run Data Builder via Docker, you can try installing Data Builder directly using Python.
As Python configurations vary between operating systems, and how users have Python configured, we will not give detailed instructions.
Warning
This option may not work on Windows currently: https://github.com/opensafely-core/databuilder/issues/790
Todo
Can we fix that issue?
Requirements🔗
You will need to:
- have a suitable Python version installed (currently Python 3.9)
- configure a suitable virtual environment to run Data Builder
for example with
conda
orvenv
- install the Data Builder package into that virtual environment;
Installation🔗
Install the latest version of Data Builder into your new virtual environment with pip
pip install git+https://github.com/opensafely-core/databuilder@main#egg=opensafely-databuilder`
Todo
It's probably better to advocate installing the same version we're using to build the definitions.
This will be a tagged version in databuilder/requirements.prod.in
.
Todo
Are we going to ever publish Data Builder to PyPI?
Checking the installation🔗
Make sure that you can run Data Builder's "help" command:
databuilder --help
If that command succeeds, you should see some help text and Data Builder should be correctly installed.
Using Data Builder's command-line interface🔗
This section explains how to load dataset definitions into Data Builder.
Each dataset definition used in this tutorial has a filename of the form:
IDENTIFIER_DATASOURCENAME_dataset_definition.py
For example, for
1a_minimal_dataset_definition.py
the identifier is 1a
and the data source name is minimal
.
The identifier associates the dataset definition with a specific tutorial page.
Todo
Check how compatible this is cross-platform.
To run this dataset definition with Data Builder,
- In a terminal, enter the
ehrql-tutorial-examples
directory that you extracted from the sample data. - Run this command:
databuilder generate-dataset "1a_minimal_dataset_definition.py" --dummy-tables "example-data/minimal/" --output "outputs.csv"
outputs.csv
file in the ehrql-tutorial-examples
directory
that you were working in.
Tip
In general, the command to run a dataset defintion looks like:
databuilder generate-dataset "IDENTIFIER_DATASOURCENAME_dataset_definition.py --dummy-tables "example-data/DATASOURCENAME/" --output "outputs.csv"
You need to substitute DATASOURCENAME
with the appropriate dataset name,
and IDENTIFIER_DATASOURCENAME_dataset_definition.py
to match the specific dataset definition
that you want to run.
Tip
The output in this example is called outputs.csv
,
but you can choose any valid filename.