Data formats

TSV

Nextstrain strongly prefers TSV files for metadata, even though Augur commands accept other delimiters as input. If your metadata is in another format, we recommend converting it to TSV with augur curate passthru.

Nextstrain tools and workflows produce TSVs that follow the CSV conventions of RFC 4180 (for example, its quoting rules), with tabs as the delimiter.

When using csvtk

  • the --lazy (-l) option should not be necessary

  • the fix-quotes/del-quotes commands should not be necessary

When using tsv-utils

  • pass the inputs through csv2tsv --csv-delim $'\t'

  • pass the final tsv-utils outputs through csvtk fix-quotes --tabs

csv2tsv --csv-delim $'\t' metadata.tsv \
  | tsv-select -H -f strain,date \
  | tsv-uniq -H -f strain \
  | csvtk fix-quotes --tabs > output.tsv

If you are writing Python scripts that process TSV files, we recommend using the csv module for file I/O.

Note

Be sure to follow the csv module’s recommendation to open files with newline=''.

Reading a TSV file:

import csv

# newline='' lets the csv module handle line endings itself
with open(input_file, 'r', newline='') as handle:
  reader = csv.reader(handle, delimiter='\t')
  for row in reader:
    ...

Writing a TSV file:

import csv

# header is a list of column names; records is an iterable of rows
with open(output_file, 'w', newline='') as output_handle:
  tsv_writer = csv.writer(output_handle, delimiter='\t')
  tsv_writer.writerow(header)
  for record in records:
    tsv_writer.writerow(record)
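As a further illustration, the select-and-deduplicate step from the tsv-utils pipeline above can be sketched with the csv module alone. This is a minimal sketch, not a Nextstrain tool: the function name select_unique, the sample rows, and the choice of lineterminator='\n' (the csv module defaults to '\r\n') are assumptions made for the example.

```python
import csv
import io


def select_unique(in_handle, out_handle, columns, key):
    """Copy `columns` from a TSV, keeping the first row seen for each `key` value.

    Hypothetical helper for illustration; not part of Augur.
    """
    reader = csv.DictReader(in_handle, delimiter='\t')
    writer = csv.DictWriter(
        out_handle,
        fieldnames=columns,
        delimiter='\t',
        extrasaction='ignore',   # silently drop columns we did not ask for
        lineterminator='\n',     # assumption: Unix line endings are wanted
    )
    writer.writeheader()
    seen = set()
    for row in reader:
        if row[key] in seen:
            continue
        seen.add(row[key])
        writer.writerow(row)


# In-memory demo with made-up rows; with real files, open them with newline=''.
demo_in = io.StringIO(
    "strain\tdate\tregion\n"
    "A\t2020-01-01\tAsia\n"
    "B\t2020-02-01\tEurope\n"
    "A\t2020-03-01\tAsia\n",
    newline='',
)
demo_out = io.StringIO(newline='')
select_unique(demo_in, demo_out, ['strain', 'date'], 'strain')
print(demo_out.getvalue(), end='')
```

Because both reading and writing go through the csv module, any quoted fields in the input are parsed and re-quoted consistently, so no separate quote-fixing pass is needed.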

See our internal discussion on TSV standardization for more details.

JSON

Nextstrain uses a few different kinds of JSON files at various stages in a typical build.

The primary JSON files used by Nextstrain are those consumed by Auspice to display a dataset. Without these dataset files, Auspice has nothing to display. These files are typically the final output of a build and are produced by the Augur command augur export. They come in two versions:

v2

Newer format, with a filename of your choosing like ${name}.json. This is often referred to as the “main” file.

v1

Original format, with filenames like ${name}_tree.json and ${name}_meta.json, often referred to as the “tree” and “meta” files.

Secondary JSON files used by Nextstrain come in two flavors: sidecar files and node data files.

Sidecar files are produced by Augur for direct consumption by Auspice, alongside the primary JSON files described above. They come in three types with filenames enforced by convention:

root-sequence

Filenames like ${name}_root-sequence.json, produced by augur export v2’s --include-root-sequence option.

tip-frequencies

Filenames like ${name}_tip-frequencies.json, produced by augur frequencies with the --output-format auspice --output options.

measurements

Filenames like ${name}_measurements.json, produced by one of the augur measurements subcommands, export or concat.
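The naming conventions above can be summarized in a short sketch. The helper names primary_filenames and sidecar_filenames and the dataset name "zika" are hypothetical, chosen only for illustration; the filename patterns themselves are the ones listed above.

```python
def primary_filenames(name, version="v2"):
    """Return the primary dataset filename(s) for a dataset name (hypothetical helper)."""
    if version == "v2":
        return [f"{name}.json"]
    if version == "v1":
        return [f"{name}_tree.json", f"{name}_meta.json"]
    raise ValueError(f"unknown dataset version: {version!r}")


def sidecar_filenames(name):
    """Return the conventional sidecar filenames for a dataset name (hypothetical helper)."""
    kinds = ("root-sequence", "tip-frequencies", "measurements")
    return [f"{name}_{kind}.json" for kind in kinds]


# "zika" is just an example dataset name.
print(primary_filenames("zika"))
print(primary_filenames("zika", version="v1"))
print(sidecar_filenames("zika"))
```

A listing like this can be handy when checking that every file belonging to a dataset sits in the same directory before sharing it.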

Node data files are typically produced by various Augur commands such as augur traits or augur ancestral and are then fed into augur export to be merged together into a final output for Auspice. Node data files can have any filename you want but some common names are:

  • nt_muts.json

  • aa_muts.json

  • traits.json

  • branch_lengths.json

  • ${name}_aa-mutation-frequencies.json

  • ${name}_entropy.json

  • ${name}_frequencies.json

  • ${name}_sequences.json

  • ${name}_titers.json

Node data files have a generic structure to allow them to contain all kinds of data about your tree.

In advanced builds, custom node data files are often produced by build-specific scripts in addition to the ones produced by Augur commands. For example, our ncov build produces a custom epiweeks.json node data file using this workflow step and this script.
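To make the idea of a custom node data file concrete, here is a minimal sketch of a script that builds one. It assumes the usual node data layout, a top-level "nodes" object keyed by node (strain) name. The attribute name "week", the sample strains and dates, and the use of ISO week numbers are all assumptions for illustration; this is not the method used by the ncov script mentioned above.

```python
import json
from datetime import date


def iso_week_label(d):
    """Format a date as ISO year plus zero-padded ISO week, e.g. '202003'."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}{iso_week:02d}"


# Made-up sample dates keyed by strain name; a real script would read these
# from the build's metadata TSV.
sample_dates = {
    "strain_A": date(2020, 1, 15),
    "strain_B": date(2020, 3, 2),
}

# Node data layout: {"nodes": {<node name>: {<attribute>: <value>}}}
node_data = {
    "nodes": {
        name: {"week": iso_week_label(d)}
        for name, d in sample_dates.items()
    }
}

print(json.dumps(node_data, indent=2))
```

A file written this way could then be passed to augur export alongside the node data files produced by Augur commands.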

Similarly, other bioinformatics software can produce compatible dataset JSONs (primary or sidecars) for use by Auspice. These files do not have to be generated by Augur, although that is the most common way. Augur’s validation command can check that dataset JSONs conform to the required schema.

Once you have Nextstrain JSON files, you can visualize and share them in a variety of ways. See our guide to sharing your results to find a way that meets your needs for privacy and collaboration.