Using Pathoplexus in Nextstrain
Attention
When using Pathoplexus (PPX) as a data source, please review the latest PPX Data Use Terms. This guide is intended to give recommendations for how to use PPX data in Nextstrain workflows following the Nextstrain team's interpretation of the PPX Data Use Terms as of January 26, 2026.
Please take special care to comply with the RESTRICTED Data Use Terms: if you use RESTRICTED data, you need to create and cite a DOI and may have authorship obligations.
This page is for users who are already familiar with the following:
This guide will refer to the RSV repository as the example Nextstrain repository for using PPX data.
Ingest Workflow
Data Source
Fetch data from the PPX LAPIS query engines that are available per pathogen.
Define the URLs in the workflow config, with query parameters to download the
metadata in the CSV format and only include the LATEST_VERSION of records:
ppx_fetch:
  a:
    seqs: https://lapis.pathoplexus.org/rsv-a/sample/unalignedNucleotideSequences?versionStatus=LATEST_VERSION
    meta: https://lapis.pathoplexus.org/rsv-a/sample/details?dataFormat=csv&versionStatus=LATEST_VERSION
  b:
    seqs: https://lapis.pathoplexus.org/rsv-b/sample/unalignedNucleotideSequences?versionStatus=LATEST_VERSION
    meta: https://lapis.pathoplexus.org/rsv-b/sample/details?dataFormat=csv&versionStatus=LATEST_VERSION
Use an additional config param ppx_metadata_fields to define a subset
of fields to include in the metadata, which reduces the size of the downloaded file.
General metadata fields such as sampleCollectionDate are standardized across
pathogens, but please refer to the LAPIS query engines to see pathogen-specific fields.
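As a rough illustration of how a field subset could be applied to the metadata URL, the sketch below appends a comma-separated `fields` query parameter. The helper name `build_metadata_url` is hypothetical, and the assumption that the `details` endpoint accepts `fields` in this form should be checked against the LAPIS query engine documentation for your pathogen.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_metadata_url(base_url: str, fields: list[str]) -> str:
    """Return base_url with a comma-separated `fields` query parameter added.

    Assumption: the LAPIS `details` endpoint accepts `fields=a,b,c`.
    """
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if fields:
        query.append(("fields", ",".join(fields)))
    # safe="," keeps the commas readable instead of percent-encoding them.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))


# Example subset only; the field names are the ones mentioned in this guide.
ppx_metadata_fields = ["sampleCollectionDate", "geoLocAdmin1"]

meta_url = build_metadata_url(
    "https://lapis.pathoplexus.org/rsv-a/sample/details"
    "?dataFormat=csv&versionStatus=LATEST_VERSION",
    ppx_metadata_fields,
)
print(meta_url)
```

Keeping the field list in the config rather than hard-coding it in the URL makes it easier to add pathogen-specific fields later.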
Curation
The PPX data curation steps are similar to the NCBI curation steps with the main differences below.
Add accession URLs
Add the URLs for the PPX and INSDC accessions during curation so that they are included in the final metadata TSV and the phylogenetic dataset.
| Accession field | URL field | URL |
|---|---|---|
| PPX_accession | PPX_accession__url | |
| INSDC_accession | INSDC_accession__url | |
The RSV repo uses a custom curate-urls script to add the URLs.
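The following is a minimal sketch of what such a step might do; it is not the RSV repo's curate-urls script. The function name `add_accession_urls`, the `example.org` URL templates, and the sample accession are all placeholders, so substitute the real PPX and INSDC record URLs.

```python
def add_accession_urls(record: dict, url_templates: dict) -> dict:
    """Return a copy of record with a `<field>__url` entry per accession field."""
    curated = dict(record)
    for field, template in url_templates.items():
        accession = record.get(field, "")
        # Leave the URL empty when the record has no accession for this field.
        curated[f"{field}__url"] = (
            template.format(accession=accession) if accession else ""
        )
    return curated


# Placeholder templates (assumption): replace with the real record URLs.
URL_TEMPLATES = {
    "PPX_accession": "https://example.org/ppx/{accession}",
    "INSDC_accession": "https://example.org/insdc/{accession}",
}

# Made-up record: a PPX accession without a matching INSDC accession.
record = {"PPX_accession": "PP_EXAMPLE1", "INSDC_accession": ""}
curated = add_accession_urls(record, URL_TEMPLATES)
print(curated)
```

Records that lack an INSDC accession get an empty INSDC_accession__url rather than a broken link.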
Geolocation fields
PPX follows INSDC geo_loc_name standards, which only standardize country names. Nextstrain tries to standardize region, country, division, and location, so the geolocation fields require additional curation.
Instead of using the augur curate parse-genbank-location command, RSV uses
a custom parse-ppx-division script to split the PPX geoLocAdmin1 field
into division and location fields. These can then be standardized with
the geolocation rules via augur curate apply-geolocation-rules.
Starting with Augur v31.4.0, the default geolocation rules shipped with Augur also add a region per country.
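To illustrate the splitting step, here is a small sketch that is not the RSV parse-ppx-division script. It assumes that geoLocAdmin1 holds a division optionally followed by a location after a comma; the real field layout and the script's logic may differ, and the sample values are made up.

```python
def split_admin1(geo_loc_admin1: str, separator: str = ",") -> tuple[str, str]:
    """Split a geoLocAdmin1 value into (division, location).

    Assumption: the division comes first and the location, if present,
    follows the first separator.
    """
    division, _, location = geo_loc_admin1.partition(separator)
    return division.strip(), location.strip()


def add_division_and_location(record: dict) -> dict:
    """Return a copy of record with `division` and `location` fields added."""
    curated = dict(record)
    curated["division"], curated["location"] = split_admin1(
        record.get("geoLocAdmin1", "")
    )
    return curated


with_location = add_division_and_location({"geoLocAdmin1": "Washington, Seattle"})
division_only = add_division_and_location({"geoLocAdmin1": "Washington"})
print(with_location, division_only)
```

The resulting division and location values are still unstandardized, so they would be passed through augur curate apply-geolocation-rules afterwards.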
Onward data sharing
Important
Even if you do not plan to share the ingest outputs, the metadata for a Nextstrain dataset is available for download within Auspice. This is also considered onward data sharing!
The metadata TSV must include these columns to adhere to the PPX Data Use Terms:
PPX_accession
PPX_accession__url
INSDC_accession
INSDC_accession__url
dataUseTerms
dataUseTerms__url
restrictedUntil
Nextstrain automated workflows upload the outputs to a public S3 bucket, where the default files are only OPEN data and the RESTRICTED data are in separate files. This is reflected in the inputs of the phylogenetic workflow.
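A check along these lines can catch a missing column before any files are shared. This is a sketch only: the helper names are hypothetical, and it assumes that OPEN records carry the literal value `OPEN` in dataUseTerms. Anything else is treated as RESTRICTED so that unexpected values never end up in the default files.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "PPX_accession",
    "PPX_accession__url",
    "INSDC_accession",
    "INSDC_accession__url",
    "dataUseTerms",
    "dataUseTerms__url",
    "restrictedUntil",
]


def missing_columns(fieldnames) -> list[str]:
    """Return the required columns that are absent from the TSV header."""
    present = set(fieldnames or [])
    return [column for column in REQUIRED_COLUMNS if column not in present]


def split_by_terms(rows) -> tuple[list[dict], list[dict]]:
    """Split rows into (open_rows, restricted_rows) based on dataUseTerms."""
    open_rows, restricted_rows = [], []
    for row in rows:
        # Assumption: only an exact "OPEN" value counts as OPEN data.
        target = open_rows if row.get("dataUseTerms") == "OPEN" else restricted_rows
        target.append(row)
    return open_rows, restricted_rows


# Made-up, deliberately incomplete metadata for demonstration.
sample_tsv = "PPX_accession\tdataUseTerms\nPP_EX1\tOPEN\nPP_EX2\tRESTRICTED\n"
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
missing = missing_columns(reader.fieldnames)
open_rows, restricted_rows = split_by_terms(reader)
print(missing, len(open_rows), len(restricted_rows))
```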
Phylogenetic Workflow
The phylogenetic workflow does not require significant modification to use PPX data.
The only parts that will need to be changed are the inputs and the augur export step.
Inputs
Attention
If you are using the RESTRICTED data for your own analysis, please take special care to comply with the RESTRICTED Data Use terms.
The default inputs for Nextstrain pathogens include both the OPEN and RESTRICTED data to utilize all available PPX data:
inputs:
  - name: ppx_open
    metadata: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/metadata.tsv.gz"
    sequences: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/sequences.fasta.xz"
  - name: ppx_restricted
    metadata: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/metadata_restricted.tsv.gz"
    sequences: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/sequences_restricted.fasta.xz"
Augur export
The augur export step that produces the Nextstrain dataset needs to adhere
to PPX Data Use Terms for both web display and onward data sharing since the
metadata is available for download within Auspice.
Auspice config
The Auspice config should include the proper attributions for PPX:
Include Pathoplexus as a data_provenance
Add dataUseTerms as a coloring option
Include additional metadata columns PPX_accession, INSDC_accession, restrictedUntil. Starting with Augur v31.4.0, the associated *__url columns should be automatically exported with the metadata columns.
{
  "data_provenance": [
    {
      "name": "Pathoplexus",
      "url": "https://pathoplexus.org"
    }
  ],
  "colorings": [
    {
      "key": "dataUseTerms",
      "title": "Data use terms",
      "type": "categorical"
    }
  ],
  "metadata_columns": [
    "PPX_accession",
    "INSDC_accession",
    "restrictedUntil"
  ]
}
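The three attribution requirements can also be checked automatically, for example in CI. The sketch below is one possible check, not part of Augur or Auspice. The function name `check_auspice_config` is hypothetical, and it only inspects the keys named in this guide.

```python
import json


def check_auspice_config(config: dict) -> list[str]:
    """Return a list of PPX attribution problems found in an Auspice config."""
    problems = []

    providers = {entry.get("name") for entry in config.get("data_provenance", [])}
    if "Pathoplexus" not in providers:
        problems.append("data_provenance does not list Pathoplexus")

    coloring_keys = {entry.get("key") for entry in config.get("colorings", [])}
    if "dataUseTerms" not in coloring_keys:
        problems.append("colorings does not include dataUseTerms")

    columns = set(config.get("metadata_columns", []))
    for column in ("PPX_accession", "INSDC_accession", "restrictedUntil"):
        if column not in columns:
            problems.append(f"metadata_columns is missing {column}")

    return problems


# A deliberately incomplete config to show what the check reports.
incomplete = json.loads(
    '{"data_provenance": [{"name": "Pathoplexus"}], '
    '"metadata_columns": ["PPX_accession"]}'
)
problems = check_auspice_config(incomplete)
for problem in problems:
    print(problem)
```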
Description
It is strongly encouraged to include a description.md that acknowledges
PPX as the data source and points to any provisioned data files.
Please see the example in the RSV description.
Updating existing workflows
If you are updating an existing workflow that previously used NCBI as a data source, do the following updates in addition to the changes above.
Update all NCBI accessions to PPX accessions in configuration files, e.g. ingest annotations.tsv and phylogenetic include/exclude files. This can be done programmatically with the new ingest output as documented in WNV.
If using example data for CI tests, update the data to PPX OPEN records.
To reduce confusion about the data source, remove NCBI-related config params, scripts, and Snakemake rules.
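For the accession update in the first step, a mapping can be built from the new ingest metadata, since it carries both INSDC_accession and PPX_accession. The sketch below is not the approach documented in WNV. It assumes that config files list one accession at the start of each line (optionally followed by tab-separated columns) and that INSDC accessions may carry a `.N` version suffix. All accessions shown are made up.

```python
import csv
import io


def build_accession_map(metadata_tsv) -> dict[str, str]:
    """Map INSDC accessions, with and without version suffix, to PPX accessions."""
    mapping = {}
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        insdc = row.get("INSDC_accession", "")
        ppx = row.get("PPX_accession", "")
        if insdc and ppx:
            mapping[insdc] = ppx
            # Also allow lookups by the unversioned accession.
            mapping.setdefault(insdc.split(".")[0], ppx)
    return mapping


def translate_ids(lines, mapping) -> list[str]:
    """Replace the leading accession on each line; keep unmatched lines as-is."""
    translated = []
    for line in lines:
        head, sep, rest = line.partition("\t")
        translated.append(mapping.get(head.strip(), head) + sep + rest)
    return translated


# Made-up metadata and include list for demonstration.
metadata = io.StringIO("PPX_accession\tINSDC_accession\nPP_EX1\tMN000001.1\n")
mapping = build_accession_map(metadata)
updated = translate_ids(["MN000001", "# a comment", "KX999999\tnote"], mapping)
print(updated)
```

Lines whose accession has no PPX counterpart are left untouched, which makes them easy to find and review by hand afterwards.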