Using Pathoplexus in Nextstrain

Attention

When using Pathoplexus (PPX) as a data source, please review the latest PPX Data Use Terms. This guide is intended to give recommendations for how to use PPX data in Nextstrain workflows following the Nextstrain team’s interpretation of the PPX Data Use Terms as of January 26, 2026.

Please take special care to comply with the RESTRICTED Data Use Terms: if you use RESTRICTED data, you need to create and cite a DOI and may have authorship obligations.

This page is for users who are already familiar with the following:

This guide will refer to the RSV repository as the example Nextstrain repository for using PPX data.

Ingest Workflow 

Data Source 

Fetch data from the PPX LAPIS query engines, which are available per pathogen. Define the URLs in the workflow config, with query parameters that download the metadata in CSV format and include only the LATEST_VERSION of each record:

ppx_fetch:
   a:
      seqs: https://lapis.pathoplexus.org/rsv-a/sample/unalignedNucleotideSequences?versionStatus=LATEST_VERSION
      meta: https://lapis.pathoplexus.org/rsv-a/sample/details?dataFormat=csv&versionStatus=LATEST_VERSION
   b:
      seqs: https://lapis.pathoplexus.org/rsv-b/sample/unalignedNucleotideSequences?versionStatus=LATEST_VERSION
      meta: https://lapis.pathoplexus.org/rsv-b/sample/details?dataFormat=csv&versionStatus=LATEST_VERSION

Use an additional config param, ppx_metadata_fields, to define the subset of fields to include in the metadata and so reduce the size of the downloaded file. General metadata fields such as sampleCollectionDate are standardized across pathogens, but please refer to the LAPIS query engines to see pathogen-specific fields.
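As a rough illustration of how a field subset could be applied to the metadata URL, the sketch below builds a LAPIS details URL in Python. It assumes that LAPIS accepts a comma-separated `fields` query parameter; the helper name `build_metadata_url` is hypothetical and is not part of the RSV workflow.

```python
from urllib.parse import urlencode

LAPIS_BASE = "https://lapis.pathoplexus.org"


def build_metadata_url(organism, fields):
    """Build a LAPIS details URL limited to the given metadata fields."""
    params = {
        "dataFormat": "csv",
        "versionStatus": "LATEST_VERSION",
        # Assumption: LAPIS accepts a comma-separated `fields` parameter.
        "fields": ",".join(fields),
    }
    # safe="," keeps the commas literal instead of percent-encoding them.
    query = urlencode(params, safe=",")
    return f"{LAPIS_BASE}/{organism}/sample/details?{query}"


if __name__ == "__main__":
    print(build_metadata_url("rsv-a", ["sampleCollectionDate", "geoLocAdmin1"]))
```

Check the field names against the LAPIS query engine for your pathogen before relying on a URL built this way.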

Curation 

The PPX data curation steps are similar to the NCBI curation steps with the main differences below.

Add accession URLs

Add the URLs for the PPX and INSDC accessions during curation so that they are included in the final metadata TSV and the phylogenetic dataset.

Accession field	URL field	URL
PPX_accession	PPX_accession__url	`https://pathoplexus.org/seq/<PPX_accession>`
INSDC_accession	INSDC_accession__url	`https://www.ncbi.nlm.nih.gov/nuccore/<INSDC_accession>`

The RSV repo uses a custom curate-urls script to add the URLs.
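The following is a minimal sketch of what such a URL curation step might look like for a single metadata record. It is not the RSV repo's curate-urls script; the function name `add_accession_urls` is hypothetical, and the URL patterns are taken from the table above.

```python
# URL patterns from the table above, keyed by accession field.
URL_TEMPLATES = {
    "PPX_accession": "https://pathoplexus.org/seq/{accession}",
    "INSDC_accession": "https://www.ncbi.nlm.nih.gov/nuccore/{accession}",
}


def add_accession_urls(record):
    """Return a copy of the record with a <field>__url entry per accession field."""
    curated = dict(record)
    for field, template in URL_TEMPLATES.items():
        accession = (record.get(field) or "").strip()
        # Leave the URL empty when the record has no accession for this field.
        curated[f"{field}__url"] = (
            template.format(accession=accession) if accession else ""
        )
    return curated


if __name__ == "__main__":
    # The accession values here are made-up placeholders.
    print(add_accession_urls({"PPX_accession": "PP_EXAMPLE", "INSDC_accession": ""}))
```

Records without an INSDC accession get an empty INSDC_accession__url rather than a broken link.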

Geolocation fields

PPX follows the INSDC geo_loc_name standard, which only standardizes country names. Nextstrain tries to standardize region, country, division, and location, so the geolocation fields require additional curation.

Instead of using the augur curate parse-genbank-location command, RSV uses a custom parse-ppx-division script to split the PPX geoLocAdmin1 field into division and location fields. These can then be standardized with the geolocation rules via augur curate apply-geolocation-rules.
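To make the splitting step concrete, here is a hedged sketch of how one record could be processed. It is not the RSV repo's parse-ppx-division script: the function name `split_admin1` is hypothetical, and the assumption that geoLocAdmin1 holds a division optionally followed by a comma and a location is illustrative only, so inspect your own data before choosing a separator.

```python
def split_admin1(record, separator=","):
    """Split geoLocAdmin1 into division and location on the first separator.

    Assumption: the value looks like "<division>" or "<division>, <location>".
    """
    curated = dict(record)
    raw = (record.get("geoLocAdmin1") or "").strip()
    # partition() splits on the first separator only; if the separator is
    # absent, the whole value becomes the division and location stays empty.
    division, _, location = raw.partition(separator)
    curated["division"] = division.strip()
    curated["location"] = location.strip()
    return curated


if __name__ == "__main__":
    print(split_admin1({"geoLocAdmin1": "Washington, King County"}))
```

The resulting division and location values would still need to go through augur curate apply-geolocation-rules to be standardized.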

Starting with Augur v31.4.0, the default geolocation rules shipped with Augur also add the region for each country.

Phylogenetic Workflow 

The phylogenetic workflow does not require significant modification to use PPX data. The only parts that will need to be changed are the inputs and the augur export step.

Inputs 

Attention

If you are using the RESTRICTED data for your own analysis, please take special care to comply with the RESTRICTED Data Use terms.

The default inputs for Nextstrain pathogens include both the OPEN and RESTRICTED data to utilize all available PPX data:

inputs:
   - name: ppx_open
     metadata: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/metadata.tsv.gz"
     sequences: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/sequences.fasta.xz"
   - name: ppx_restricted
     metadata: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/metadata_restricted.tsv.gz"
     sequences: "https://data.nextstrain.org/files/workflows/rsv/{a_or_b}/sequences_restricted.fasta.xz"

Augur export 

The augur export step that produces the Nextstrain dataset needs to adhere to PPX Data Use Terms for both web display and onward data sharing since the metadata is available for download within Auspice.

Auspice config

The Auspice config should include the proper attributions for PPX:

Include Pathoplexus as a data_provenance
Add dataUseTerms as a coloring option
Include additional metadata columns PPX_accession, INSDC_accession, restrictedUntil. Starting with Augur v31.4.0, the associated *__url columns should be automatically exported with the metadata columns.

{
   "data_provenance": [
      {
         "name": "Pathoplexus",
         "url": "https://pathoplexus.org"
      }
   ],
   "colorings": [
      {
         "key": "dataUseTerms",
         "title": "Data use terms",
         "type": "categorical"
      }
   ],
   "metadata_columns": [
      "PPX_accession",
      "INSDC_accession",
      "restrictedUntil"
   ]
}

Description

It is strongly encouraged to include a description.md that acknowledges PPX as the data source and points to any provisioned data files. Please see the example in the RSV description.

Updating existing workflows 

If you are updating an existing workflow that previously used NCBI as a data source, make the following updates in addition to the changes above.

Update all NCBI accessions to PPX accessions in configuration files, e.g. ingest annotations.tsv and phylogenetic include/exclude files. This can be done programmatically with the new ingest output as documented in WNV.
If using example data for CI tests, update the data to PPX OPEN records.
To avoid confusion about the data source, remove NCBI-related config params, scripts, and Snakemake rules.
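The accession update in the first item above could be sketched roughly as follows. This is not the approach documented in WNV; it is an illustrative example that assumes the ingest metadata TSV has PPX_accession and INSDC_accession columns, that the config files list one accession per line with optional `#` comments, and that accessions are written in the same form (with or without version suffix) in both places. The function names and example accessions are hypothetical.

```python
import csv
import io


def build_accession_map(metadata_tsv):
    """Map INSDC accessions to PPX accessions from metadata TSV text."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return {
        row["INSDC_accession"]: row["PPX_accession"]
        for row in reader
        if row.get("INSDC_accession")
    }


def translate_accessions(lines, accession_map):
    """Replace the leading accession on each line, keeping comments intact."""
    translated = []
    for line in lines:
        stripped = line.strip()
        # Blank lines and full-line comments pass through unchanged.
        if not stripped or stripped.startswith("#"):
            translated.append(line)
            continue
        accession = stripped.split()[0]
        # Accessions without a PPX match are kept so they can be reviewed.
        new_accession = accession_map.get(accession, accession)
        translated.append(line.replace(accession, new_accession, 1))
    return translated


if __name__ == "__main__":
    # Made-up accessions for illustration only.
    metadata = "PPX_accession\tINSDC_accession\nPP_EXAMPLE\tAB000001\n"
    mapping = build_accession_map(metadata)
    print(translate_accessions(["AB000001 # bad date", "ZZ999999"], mapping))
```

Lines whose accession has no match are left untouched, which makes it easy to spot records that are missing from the PPX ingest output.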