============
Data formats
============

.. contents:: Table of Contents
   :local:

TSV
===

Nextstrain strongly prefers TSV files for metadata, even though Augur commands also accept other delimiters as input.
If you are using another format, we recommend converting it to TSV with :doc:`augur curate passthru <augur:usage/cli/curate/passthru>`.

Nextstrain tools and workflows produce `RFC 4180 CSV-like TSVs <https://datatracker.ietf.org/doc/html/rfc4180>`__.

When using `csvtk <https://bioinf.shenwei.me/csvtk/>`__:

* the ``--lazy`` (``-l``) option should not be necessary
* the ``fix-quotes``/``del-quotes`` commands should not be necessary

When using `tsv-utils <https://opensource.ebay.com/tsv-utils/>`__:

* pass the inputs through ``csv2tsv --csv-delim $'\t'``
* pass the final ``tsv-utils`` outputs through ``csvtk fix-quotes --tabs``

.. code-block:: bash

  csv2tsv --csv-delim $'\t' metadata.tsv \
    | tsv-select -H -f strain,date \
    | tsv-uniq -H -f strain \
    | csvtk fix-quotes --tabs > output.tsv

If you are writing Python scripts that process TSV files, we recommend using the
`csv module <https://docs.python.org/3/library/csv.html>`__ for file I/O.

.. note::

  Be sure to follow the `csv module's recommendation <https://docs.python.org/3/library/csv.html#id4>`__
  to open files with ``newline=''``.

Reading a TSV file:

.. code-block:: Python

  import csv

  # newline='' leaves line-ending handling to the csv module
  with open(input_file, 'r', newline='') as handle:
      reader = csv.reader(handle, delimiter='\t')
      for row in reader:
          ...

Writing a TSV file:

.. code-block:: Python

  import csv

  # newline='' leaves line-ending handling to the csv module
  with open(output_file, 'w', newline='') as output_handle:
      tsv_writer = csv.writer(output_handle, delimiter='\t')
      tsv_writer.writerow(header)
      for record in records:
          tsv_writer.writerow(record)
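
When columns are addressed by name rather than by position,
``csv.DictReader`` and ``csv.DictWriter`` follow the same pattern.  The
following is a minimal sketch, not an official Nextstrain helper: the function
name ``select_columns`` is hypothetical, and ``lineterminator='\n'`` is a
choice made here to override the csv module's default of ``'\r\n'``.

.. code-block:: Python

  import csv


  def select_columns(input_file, output_file, columns):
      """Copy only the named columns from one TSV file to another."""
      with open(input_file, 'r', newline='') as in_handle, \
           open(output_file, 'w', newline='') as out_handle:
          reader = csv.DictReader(in_handle, delimiter='\t')
          writer = csv.DictWriter(
              out_handle,
              fieldnames=columns,
              delimiter='\t',
              lineterminator='\n',    # assumption: Unix line endings are wanted
              extrasaction='ignore',  # drop columns not listed in `columns`
          )
          writer.writeheader()
          for row in reader:
              writer.writerow(row)

For example, ``select_columns('metadata.tsv', 'output.tsv', ['strain', 'date'])``
would keep only the ``strain`` and ``date`` columns, assuming both exist in the
input header.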


See our internal `discussion on TSV standardization <https://github.com/nextstrain/augur/issues/1566>`__ for more details.

.. _data-formats-json:

JSON
====

Nextstrain uses a few different kinds of `JSON
<https://en.wikipedia.org/wiki/JSON>`__ files at various stages in a typical
build.

The primary JSON files used by Nextstrain are those consumed by Auspice to
display a dataset.  Without these **dataset files**, Auspice has nothing to
display.  These files are typically the final output of a build and produced by
the Augur command :doc:`augur export <augur:usage/cli/export>`.  They come in
two versions:

v2
  Newer format, with a filename of your choosing like ``${name}.json``.  This
  is often referred to as the **"main" file**.

v1
  Original format, with filenames like ``${name}_tree.json`` and
  ``${name}_meta.json``, often referred to as the **"tree" and "meta" files**.

Secondary JSON files used by Nextstrain come in two flavors: **sidecar** files
and **node data** files.

**Sidecar files** are produced by Augur for direct consumption by Auspice,
alongside the primary JSON files described above.  They come in three types with
filenames enforced by convention:

.. _data-formats-root-sequence:

root-sequence
  Filenames like ``${name}_root-sequence.json``, produced by ``augur export
  v2``'s ``--include-root-sequence`` option.

tip-frequencies
  Filenames like ``${name}_tip-frequencies.json``, produced by :doc:`augur
  frequencies <augur:usage/cli/frequencies>` with the ``--output-format auspice
  --output …`` options.

measurements
  Filenames like ``${name}_measurements.json``, produced by one of the :doc:`augur
  measurements <augur:usage/cli/measurements>` subcommands, ``export`` or ``concat``.
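
The filename conventions above can be sketched as a small lookup.  This is an
illustration only: the function name ``classify_dataset_file`` and the labels
it returns are hypothetical, and because a v2 "main" file can have any
filename, the result is a guess that only makes sense for files already known
to be dataset files.

.. code-block:: Python

  # Suffixes taken from the filename conventions described above.
  SUFFIX_TO_KIND = {
      '_tree.json': 'v1 tree',
      '_meta.json': 'v1 meta',
      '_root-sequence.json': 'root-sequence sidecar',
      '_tip-frequencies.json': 'tip-frequencies sidecar',
      '_measurements.json': 'measurements sidecar',
  }


  def classify_dataset_file(filename):
      """Guess the role of a dataset JSON file from its name alone."""
      for suffix, kind in SUFFIX_TO_KIND.items():
          if filename.endswith(suffix):
              return kind
      if filename.endswith('.json'):
          # No recognized suffix: assume a v2 "main" file.
          return 'v2 main'
      return 'unknown'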

**Node data** files are typically produced by various Augur commands such as
:doc:`augur traits <augur:usage/cli/traits>` or :doc:`augur ancestral
<augur:usage/cli/ancestral>` and are then fed into :doc:`augur export
<augur:usage/cli/export>` to be merged together into a final output for
Auspice.  Node data files can have any filename you want, but some common names
are:

  - ``nt_muts.json``
  - ``aa_muts.json``
  - ``traits.json``
  - ``branch_lengths.json``
  - ``${name}_aa-mutation-frequencies.json``
  - ``${name}_entropy.json``
  - ``${name}_frequencies.json``
  - ``${name}_sequences.json``
  - ``${name}_titers.json``

Node data files have a :doc:`generic structure <augur:usage/json_format>` to
allow them to contain all kinds of data about your tree.

In advanced builds, custom node data files are often produced by build-specific
scripts in addition to the ones produced by Augur commands.  For example, our
`ncov build <https://github.com/nextstrain/ncov>`__ produces a custom
``epiweeks.json`` node data file using `this workflow step
<https://github.com/nextstrain/ncov/blob/cee806f/workflow/snakemake_rules/main_workflow.smk#L1127-L1143>`__
and `this script
<https://github.com/nextstrain/ncov/blob/cee806f/scripts/calculate_epiweek.py>`__.
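
A build-specific script of this kind can be quite small.  The sketch below is
not the ncov script linked above; the function name ``write_node_data`` and
the ``sampling_year`` attribute are hypothetical.  It assumes the node data
layout of a top-level ``nodes`` key that maps each node name to a dictionary
of attributes; the generic structure documentation linked above is the
authoritative reference.

.. code-block:: Python

  import json


  def write_node_data(output_file, values_by_node, attribute):
      """Write one attribute per node in the node data JSON layout."""
      node_data = {
          # Assumption: node data lives under a top-level "nodes" key.
          'nodes': {
              node_name: {attribute: value}
              for node_name, value in values_by_node.items()
          }
      }
      with open(output_file, 'w') as handle:
          json.dump(node_data, handle, indent=2)
      return node_data


  # Hypothetical usage: annotate two tips with their sampling year.
  write_node_data(
      'sampling_year.json',
      {'strain_A': 2020, 'strain_B': 2021},
      'sampling_year',
  )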

Similarly, it's possible for other bioinformatics software to produce
compatible dataset JSONs (primary or sidecars) for use by Auspice; they aren't
required to be generated by Augur, although that is the most common way.
Augur's :doc:`validation command <augur:usage/cli/validate>` can check that
dataset JSONs conform to the required schema.
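
For a quick look before running full validation, a script that writes dataset
JSONs might check its own output for the expected top-level keys.  The sketch
below is not a substitute for the validation command: the function name
``missing_v2_keys`` is hypothetical, and the key set ``version``, ``meta`` and
``tree`` is an assumption about v2 "main" files that should be confirmed
against the schema Augur validates.

.. code-block:: Python

  import json

  # Assumption: top-level keys expected in a v2 "main" dataset file.
  EXPECTED_V2_KEYS = {'version', 'meta', 'tree'}


  def missing_v2_keys(dataset_file):
      """Return the expected top-level keys absent from a dataset JSON file."""
      with open(dataset_file) as handle:
          dataset = json.load(handle)
      if not isinstance(dataset, dict):
          # Not a JSON object at all, so every expected key is missing.
          return sorted(EXPECTED_V2_KEYS)
      return sorted(EXPECTED_V2_KEYS - set(dataset))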

Once you have Nextstrain JSON files, you can visualize and share them in a
variety of ways.  See :doc:`our guide to sharing your results
</guides/share/index>` to find a way that meets your needs for privacy and
collaboration.