Filtering and Subsampling

Below are some examples of using augur filter and augur subsample to sample data.

Overview 

Augur provides two tools to choose different subsets of input data for various types of analysis.

augur filter: Tool with command-line configuration options
augur subsample: Tool with file-based configuration options

This guide contains examples for both tools. For augur filter examples, the options should be added to the following call:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  <OPTIONS>
  --output-sequences filtered_sequences.fasta \
  --output-metadata filtered_metadata.tsv

For augur subsample examples, the options should be added to a YAML file. If it is named samples.yaml, the following call should be used:

augur subsample \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --config samples.yaml \
  --output-sequences filtered_sequences.fasta \
  --output-metadata filtered_metadata.tsv

Tip

Use augur filter when:

You need simple, single-step filtering and subsampling
You are comfortable with command-line flags

Use augur subsample when:

You need complex, multi-tier subsampling (e.g., geographic tiers)
You prefer configuration files over long command lines

Options can be categorized based on the selection method.

Preliminary options work by selecting or dropping sequences that match certain criteria. Criteria is one of either
1. metadata-based: works with information available from --metadata
2. sequence-based: works with information available from --sequences/--sequence-index.
Subsampling options work by selecting sequences using rules based on final output size. These are applied after all preliminary options and before any force-inclusive options.
Force-inclusive options work by ensuring sequences that match certain criteria are always included in the output, ignoring all other options.

Categories for sampling options
	`augur filter`	`augur subsample`
Preliminary (metadata-based)	`--min-date` `--max-date` `--exclude-ambiguous-dates-by` `--exclude` `--exclude-where` `--query`	`min_date` `max_date` `exclude_ambiguous_dates_by` `exclude` `exclude_where` `query`
Preliminary (sequence-based)	`--min-length` `--max-length` `--non-nucleotide`	`min_length` `max_length` `non_nucleotide`
Subsampling	`--subsample-max-sequences` `--group-by` `--group-by-weights` `--sequences-per-group` `--probabilistic-sampling` `--no-probabilistic-sampling` `--priority`	`max_sequences` `group_by` `group_by_weights` `sequences_per_group` `probabilistic_sampling`
Force-inclusive	`--include` `--include-where`	`include` `include_where`

Preliminary & force-inclusive selection 

A common sampling operation is to select sequences according to rules on individual sequence attributes. Examples:

Select all sequences with a collection date in 2012 or later using the minimum date option.
augur filter options

augur subsample config
--min-date 2012 \
samples: my_sample: min_date: 2012
Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation). First, create a text file exclude.txt with one line per sequence ID.
```
BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...
```
The name of the file is given to the file-based exclusion option.
augur filter options

augur subsample config
--min-date 2012 \ --exclude exclude.txt \
samples: my_sample: min_date: 2012 exclude: exclude.txt

Include sequences from a specific region using the query option.

`augur filter` options	`augur subsample` config
--min-date 2012 \ --exclude exclude.txt \ --query 'region=="Asia"' \	samples: my_sample: min_date: 2012 exclude: exclude.txt query: region=="Asia"

Tip

The query region=="Asia" is functionally equivalent to the column-based exclusion region!=Asia. However, the query option allows for more complex expressions such as (region in {"Asia", "Europe"}) & (coverage >= 0.95).

The query region=="Asia" is not equivalent to a column-based force-inclusion region=Asia since force-inclusive options ignore other options (i.e. minimum date and file-based exclusion in the examples above).

Force-inclusive options work similarly, and override all other options. Example:

Include specific sequences (e.g. root sequence). First, create a text file include.txt with one line per sequence ID.
```
Wuhan/Hu-1/2019
...
```
The name of the file is given to the file-based force-inclusion option.
augur filter options

augur subsample config
--min-date 2020 \ --include include.txt \
samples: my_sample: min_date: 2020 include: include.txt
Wuhan/Hu-1/2019 will still be included even if it does not pass the date filter.

Subsampling 

Another common operation is subsampling: selection of data using rules based on output size rather than individual sequence attributes. These are the sampling methods supported by Augur:

Random sampling 

The simplest scenario is a reduction of dataset size to more manageable numbers. For example, limit the output to 100 sequences:

`augur filter` options	`augur subsample` config
--min-date 2012 \ --exclude exclude.txt \ --subsample-max-sequences 100 \	samples: my_sample: min_date: 2012 exclude: exclude.txt max_sequences: 100

Random sampling is easy to define but can expose sampling bias in some datasets. Consider using grouped sampling to reduce sampling bias.

Grouped sampling 

Grouping columns specified by --group-by (for augur filter) and group_by (for augur subsample) allow you to partition the data into groups based on column values and sample a number of sequences per group.

Grouped sampling can be further divided into two types with a final section for caveats:

Uniform sampling 

By default (i.e. without weights), Augur will sample uniformly across groups. For example, sample evenly across regions over time:

augur filter options

augur subsample config

--min-date 2012 \
--exclude exclude.txt \
--group-by region year month \
--subsample-max-sequences 100 \

samples:
  my_sample:
    min_date: 2012
    exclude: exclude.txt
    group_by:
      - region
      - year
      - month
    max_sequences: 100

An alternative to targeting a total sample size is to target a fixed number of sequences per group. For example, target one sequence per month from each region:

augur filter options

augur subsample config

--min-date 2012 \
--exclude exclude.txt \
--group-by region year month \
--sequences-per-group 1 \

samples:
  my_sample:
    min_date: 2012
    exclude: exclude.txt
    group_by:
      - region
      - year
      - month
    sequences_per_group: 1

Weighted sampling 

Weights can be specified in addition to grouping columns to allow different target sizes per group. For example, target twice the amount of sequences from Asia compared to other regions using this weights.tsv file:

region	weight
Asia	2
default	1

The name of the file is given to the grouping column weights option.

augur filter options

augur subsample config

--min-date 2012 \
--exclude exclude.txt \
--group-by region year month \
--group-by-weights weights.tsv \
--subsample-max-sequences 100 \

samples:
  my_sample:
    min_date: 2012
    exclude: exclude.txt
    group_by:
      - region
      - year
      - month
    group_by_weights: weights.tsv
    max_sequences: 100

Multiple grouping columns are supported in the weights file, as well as a default weight for unspecified groups. A weight of 0 can be used to exclude all matching sequences.

region	month	weight
Asia	2024-01	2
Asia	2024-02	3
Africa	2024-01	2
default	default	0

More information can be found in the documentation page for each tool.

Caveats 

Probabilistic sampling 

It is possible to encounter situations where the number of groups exceeds the target sample size. For example, consider a command with groups defined by grouping columns region, year, month and target sample size of 100 sequences. If the input contains data from 5 regions over a span of 24 months, that could result in 120 groups.

The only way to target 100 sequences from 120 groups is to apply probabilistic sampling which randomly determines a whole number of sequences per group. This is noted in the output:

WARNING: Asked to provide at most 100 sequences, but there are 120 groups.
Sampling probabilistically at 0.83 sequences per group, meaning it is
possible to have more than the requested maximum of 100 sequences after
filtering.

This is automatically enabled. For augur filter, --no-probabilistic-sampling can be used with uniform sampling to force the command to exit with an error in these situations. For augur subsample, probabilistic_sampling: False can be used. It must be enabled for weighted sampling.

Undersampling 

For these sampling methods, the number of targeted sequences per group does not take into account the actual number of sequences available in the input data. For example, consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. When grouping by year, targeting 300 total sequences is equivalent to targeting 150 sequences per group. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.

Complex subsampling strategies 

There are some subsampling strategies in which a single call to augur filter does not suffice or is difficult to create. One such strategy is “tiered subsampling”. In this strategy, mutually exclusive sets of filters, each representing a “tier”, are sampled with different subsampling rules. This is commonly used to create geographic tiers. In most situations it is recommended to use augur subsample.

Using `augur subsample`…

Note

augur subsample is only available in Augur version 31.5.0 and later. If you are using an older version of Augur, refer to the augur filter examples.

… with a dedicated config file 

Consider the following task:

Sample 200 sequences from Washington state and 100 sequences from the rest of the United States.

This can be represented in an augur subsample config file:

samples:
  state:
    query: state == 'WA'
    max_sequences: 200
  country:
    query: state != 'WA' & country == 'USA'
    max_sequences: 100

… within Snakemake workflow config 

Snakemake is a workflow manager where configuration is often written in YAML files. A clean pattern is to keep your augur subsample config inside your workflow config under a dedicated section, write the section from the config variable to a YAML file at run time, and instruct augur subsample to read from the file.

Consider the following workflow config:

builds:
  build1:
    subsample:
      samples:
        state:
          query: state == 'WA'
          max_sequences: 200
        country:
          query: state != 'WA' & country == 'USA'
          max_sequences: 100

Below is an example of how it can be used in Snakemake.

import yaml
from augur.subsample import get_referenced_files, get_parallelism

with open("results/subsample_config.yaml", "w") as f:
    yaml.dump(config["builds"]["build1"]["subsample"], f, sort_keys=False)

rule subsample:
    input:
        metadata = "data/metadata.tsv",
        config = "results/subsample_config.yaml",
        referenced_files = get_referenced_files("results/subsample_config.yaml"),
    output:
        metadata = "results/subsampled.tsv",
    threads: get_parallelism("results/subsample_config.yaml", limit=workflow.cores)
    shell:
        """
        augur subsample \
          --metadata {input.metadata} \
          --config {input.config} \
          --nthreads {threads} \
          --output-metadata {output.metadata}
        """

Tip

augur.subsample.get_referenced_files() is a helper function that returns the file paths referenced by the augur subsample config. While optional, using it ensures that changes to those files will trigger this rule to re-run.

augur.subsample.get_parallelism() is a helper function that returns the degree of parallelism based on the augur subsample config. While optional, using it allows Snakemake to allocate an optimal amount of threads.

Other options (less ideal):

Maintain a separate YAML dedicated to augur subsample. This is not ideal because it splits workflow configuration across files.
Instruct augur subsample to read directly from a Snakemake config file. This is not ideal because values may be overridden at run time via --configfile or --config. The in-memory config variable is the ultimate source of truth, so writing it out as recommended is more robust.

Using `augur filter`…

Tip

Using augur subsample is recommended for Augur version 31.5.0 and later.

… with weighted sampling 

Consider the following task:

Sample 200 sequences from Washington state and 100 sequences from the rest of the United States.

This can be approximated by first selecting all sequences from the United States then sampling with these weights:

state	weight
WA	200
default	2.04

augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --query "country == 'USA'" \
  --group-by state \
  --group-by-weights weights.tsv \
  --subsample-max-sequences 300 \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv

This approach has some caveats:

It relies on a calculation to determine weights, making it less intuitive:

\[{n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}} = 100 * \frac{1}{49} \approx 2.04\]
Achieving a full 100 sequences from the rest of the United States requires at least 2 sequences from each of the remaining states. This may not be possible if some states are under-sampled.

Intuitiveness for caveat (1) can be improved by adding a comment to the weights file. However, caveat (2) is an inherent limitation of what is effectively uniform sampling across all other states. The only way to get around this in augur filter is random sampling across states, but that is not possible when state is used as a grouping column.

An alternative approach is to decompose this into multiple schemes, each handled by a single call to augur filter. Additionally, there is an extra step to combine the intermediate samples.

Sample 200 sequences from Washington state.

Sample 100 sequences from the rest of the United States.

Combine the samples.

… with multiple calls 

A basic approach is to run the augur filter commands directly. This works well for ad-hoc analyses.

# 1. Sample 200 sequences from Washington state
augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --query "state == 'WA'" \
  --subsample-max-sequences 200 \
  --output-strains sample_strains_state.txt

# 2. Sample 100 sequences from the rest of the United States
augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --query "state != 'WA' & country == 'USA'" \
  --subsample-max-sequences 100 \
  --output-strains sample_strains_country.txt

# 3. Combine using augur filter
augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --exclude-all \
  --include sample_strains_state.txt \
            sample_strains_country.txt \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv

Each intermediate sample is represented by a strain list file obtained from --output-strains. The final step uses augur filter with --exclude-all and --include to sample the data based on the intermediate strain list files. If the same strain appears in both files, augur filter will only write it once in each of the final outputs.

Note

The 2nd sample does not use --group-by, implying random sampling across states. This differs from previous approach that used a single augur filter command with weighted sampling.

… generalized within a workflow 

The approach above can be cumbersome with more intermediate samples. To generalize this process and allow for more flexibility, a workflow management system can be used. The following examples use Snakemake.

Add a section in the config file.

subsampling:
  state: --query "state == 'WA'" --subsample-max-sequences 200
  country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 100

Add two rules in a Snakefile. If you are building a standard Nextstrain workflow, the output files should be used as input to sequence alignment. See Parts of a whole to learn more about the placement of this step within a workflow.

# 1. Sample 200 sequences from Washington state
# 2. Sample 100 sequences from the rest of the United States
rule intermediate_sample:
    input:
        metadata = "data/metadata.tsv",
    output:
        strains = "results/sample_strains_{sample_name}.txt",
    params:
        augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
    shell:
        """
        augur filter \
            --metadata {input.metadata} \
            {params.augur_filter_args} \
            --output-strains {output.strains}
        """
# 3. Combine using augur filter
rule combine_intermediate_samples:
    input:
        sequences = "data/sequences.fasta",
        metadata = "data/metadata.tsv",
        intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
    output:
        sequences = "results/subsampled_sequences.fasta",
        metadata = "results/subsampled_metadata.tsv",
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --metadata {input.metadata} \
            --exclude-all \
            --include {input.intermediate_sample_strains} \
            --output-sequences {output.sequences} \
            --output-metadata {output.metadata}
        """

Run Snakemake targeting the second rule.
```
snakemake combine_intermediate_samples
```

Explanation:

The configuration section consists of one entry per intermediate sample in the format sample_name: <augur filter arguments>.
The first rule is run once per intermediate sample using wildcards and an input function. The output of each run is the sampled strain list.
The second rule uses expand() to define input as all the intermediate sampled strain lists, which are passed directly to --include as done in the previous example.

It is easy to add or remove intermediate samples. The configuration above can be updated to add another tier in between state and country:

subsampling:
  state: --query "state == 'WA'" --subsample-max-sequences 100
  neighboring_states: --query "state in {'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 75
  country: --query "country == 'USA' & state not in {'WA', 'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 50