Data formats
TSV
Nextstrain strongly prefers using TSV files for metadata even though Augur commands support other delimiters as inputs. If you are using other formats, we recommend using augur curate passthru to convert them to TSV.
Nextstrain tools and workflows produce CSV-like TSVs: tab-delimited files that otherwise follow the RFC 4180 conventions for CSV, including its quoting rules.
When using csvtk:
- the --lazy (-l) option should not be necessary
- the fix-quotes/del-quotes commands should not be necessary
When using tsv-utils:
- pass the inputs through csv2tsv --csv-delim $'\t'
- pass the final tsv-utils outputs through csvtk fix-quotes --tabs
csv2tsv --csv-delim $'\t' metadata.tsv \
| tsv-select -H -f strain,date \
| tsv-uniq -H -f strain \
| csvtk fix-quotes --tabs > output.tsv
If you are writing Python scripts that process TSV files, we recommend using the csv module for file I/O.
Note
Be sure to follow the csv module’s recommendation to open files with newline=''.
Reading a TSV file:
import csv

# newline='' lets the csv module handle line endings itself.
with open(input_file, 'r', newline='') as handle:
    reader = csv.reader(handle, delimiter='\t')
    for row in reader:
        ...
Writing a TSV file:
import csv

# header is a list of column names; records is an iterable of rows.
with open(output_file, 'w', newline='') as output_handle:
    tsv_writer = csv.writer(output_handle, delimiter='\t')
    tsv_writer.writerow(header)
    for record in records:
        tsv_writer.writerow(record)
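As a rough Python counterpart to the tsv-utils pipeline shown earlier, the sketch below keeps only the strain and date columns and drops repeated strains, using csv.DictReader and csv.DictWriter. The function name select_unique and its parameters are hypothetical, and the choice of Unix line endings is an assumption rather than a documented Nextstrain requirement.

```python
import csv


def select_unique(input_path, output_path, columns=("strain", "date"), key="strain"):
    """Copy selected columns to a new TSV, keeping the first row seen per key."""
    seen = set()
    with open(input_path, "r", newline="") as in_handle, \
         open(output_path, "w", newline="") as out_handle:
        reader = csv.DictReader(in_handle, delimiter="\t")
        writer = csv.DictWriter(
            out_handle,
            fieldnames=list(columns),
            delimiter="\t",
            extrasaction="ignore",  # drop columns that were not requested
            lineterminator="\n",    # assumption: Unix line endings (csv defaults to \r\n)
        )
        writer.writeheader()
        for row in reader:
            if row[key] in seen:
                continue  # a row for this key was already written
            seen.add(row[key])
            writer.writerow(row)
```

Calling select_unique("metadata.tsv", "output.tsv") would write a two-column TSV with one row per strain. Quoting is left to the csv module's defaults.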
See our internal discussion on TSV standardization for more details.
JSON
Nextstrain uses a few different kinds of JSON files at various stages in a typical build.
The primary JSON files used by Nextstrain are those consumed by Auspice to display a dataset. Without these dataset files, Auspice has nothing to display. These files are typically the final output of a build and produced by the Augur command augur export. They come in two versions:
- v2
  Newer format, with a filename of your choosing like ${name}.json. This is often referred to as the “main” file.
- v1
  Original format, with filenames like ${name}_tree.json and ${name}_meta.json, often referred to as the “tree” and “meta” files.
Secondary JSON files used by Nextstrain come in two flavors: sidecar files and node data files.
Sidecar files are produced by Augur for direct consumption by Auspice, alongside the primary JSON files described above. They come in three types with filenames enforced by convention:
- root-sequence
  Filenames like ${name}_root-sequence.json, produced by augur export v2’s --include-root-sequence option.
- tip-frequencies
  Filenames like ${name}_tip-frequencies.json, produced by augur frequencies with the --output-format auspice --output … options.
- measurements
  Filenames like ${name}_measurements.json, produced by one of the augur measurements subcommands, export or concat.
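The naming convention above can be expressed as a small helper. This is a minimal sketch that derives the conventional sidecar paths from the path of a v2 main file. The names SIDECAR_SUFFIXES and sidecar_paths are hypothetical and are not part of Augur or Auspice.

```python
from pathlib import Path

# Sidecar types listed above; each is appended to the dataset name.
SIDECAR_SUFFIXES = ("root-sequence", "tip-frequencies", "measurements")


def sidecar_paths(main_json):
    """Return the conventional sidecar paths for a v2 main dataset file."""
    main = Path(main_json)
    return {
        suffix: main.with_name(f"{main.stem}_{suffix}.json")
        for suffix in SIDECAR_SUFFIXES
    }
```

For a hypothetical dataset at auspice/example.json, sidecar_paths("auspice/example.json")["measurements"] is the path auspice/example_measurements.json. The helper only builds names and does not check whether the files exist.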
Node data files are typically produced by various Augur commands such as augur traits or augur ancestral and are then fed into augur export to be merged together into a final output for Auspice. Node data files can have any filename you want but some common names are:
nt_muts.json
aa_muts.json
traits.json
branch_lengths.json
${name}_aa-mutation-frequencies.json
${name}_entropy.json
${name}_frequencies.json
${name}_sequences.json
${name}_titers.json
Node data files have a generic structure to allow them to contain all kinds of data about your tree.
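To illustrate the structure, node data files produced by Augur typically hold a top-level "nodes" object keyed by node name, with each node mapping attribute names to values. The sketch below combines the "nodes" sections of several such files. It is a simplified illustration and not Augur's own merging logic; the function name merge_node_data is hypothetical, and any other top-level keys are ignored here.

```python
import json


def merge_node_data(paths):
    """Naively combine the per-node attributes of several node data files."""
    merged = {}
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            node_data = json.load(handle)
        for node_name, attrs in node_data.get("nodes", {}).items():
            # Later files overwrite earlier ones when attribute names collide.
            merged.setdefault(node_name, {}).update(attrs)
    return {"nodes": merged}
```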
In advanced builds, custom node data files are often produced by build-specific
scripts in addition to the ones produced by Augur commands. For example, our
ncov build produces a custom
epiweeks.json node data file using this workflow step
and this script.
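A build-specific script of this kind can be quite small. The following sketch is not the ncov epiweeks script. It writes a node data file that assigns a hypothetical "year" attribute to each strain, assuming a metadata TSV with strain and date columns and the top-level "nodes" layout used by Augur's node data files. The function name write_year_node_data is also hypothetical.

```python
import csv
import json


def write_year_node_data(metadata_path, output_path):
    """Write a node data JSON assigning a 'year' attribute to each strain."""
    nodes = {}
    with open(metadata_path, "r", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumes ISO-like dates (YYYY-MM-DD). Ambiguous years such as
            # "20XX" and empty dates are skipped rather than guessed.
            year = row["date"][:4]
            if year.isdigit():
                nodes[row["strain"]] = {"year": int(year)}
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump({"nodes": nodes}, handle, indent=2)
```

The resulting file could then be passed to augur export alongside the node data files produced by Augur commands.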
Similarly, it’s possible for other bioinformatics software to produce compatible dataset JSONs (primary or sidecars) for use by Auspice; they aren’t required to be generated by Augur, although that is the most common way. Augur’s validation command can check that dataset JSONs have the required schema.
Once you have Nextstrain JSON files, you can visualize and share them in a variety of ways. See our guide to sharing your results to find a way that meets your needs for privacy and collaboration.