========================== Running an ingest workflow ========================== This tutorial uses the :term:`Nextstrain CLI` to help you get started running :term:`ingest workflows`. Ingest workflows download public data from NCBI and output :term:`ingest datasets `, which include curated metadata and sequences that can be used as input for :term:`phylogenetic ` or :term:`Nextclade workflows `. .. note:: You only need to run an ingest workflow if you do not want to use the data files already publicly hosted by Nextstrain. Individual pathogen repositories include documentation that links to their data files. In this tutorial, you will run the ingest workflow of our `Zika repository `_ and view outputs on your computer. You will have a basic understanding of how to run ingest workflows for other pathogens and a foundation for understanding how to customize ingest workflows. .. contents:: Table of Contents :local: Prerequisites ============= 1. :doc:`Install Nextstrain `. These instructions will install all of the software you need to complete this tutorial. Download the Zika repository ============================ All pathogen ingest workflows are stored in :term:`pathogen repositories` (version-controlled folders) to track changes over time. Download the `Zika repository `_. .. code-block:: console $ git clone https://github.com/nextstrain/zika Cloning into 'zika'... [...more output...] When it's done, you'll have a new directory called ``zika/``. Run the default workflow ======================== The zika :term:`ingest workflow` uses the `NCBI Datasets CLI tools `_ to download public data and uses a combination of :doc:`augur curate ` and other data manipulation tools to curate the downloaded data into a format suitable for :term:`phylogenetic workflows `. 1. Change directory to the Zika pathogen repository downloaded in the previous step .. code-block:: console $ cd zika 2. Run the default ingest workflow with the :term:`Nextstrain CLI`. .. code-block:: console $ nextstrain build ingest Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments. Building DAG of jobs… [...a lot of output...] This should take just a few minutes to complete. There should be two final output files: * ``ingest/results/metadata.tsv`` * ``ingest/results/sequences.fasta`` The output files should have the same data formats as the public data files hosted by Nextstrain, available at: * https://data.nextstrain.org/files/workflows/zika/metadata.tsv.zst * https://data.nextstrain.org/files/workflows/zika/sequences.fasta.zst Your results may have additional records depending on whether new data has been released since the public data files were last uploaded. Configuring the ingest workflow =============================== Now that you've seen the default outputs of the ingest workflow, you can try configuring the ingest workflow to change the outputs. Inspecting the uncurated metadata --------------------------------- If you want to see the uncurated NCBI Datasets data to decide what changes you would like to make to the workflow, you can download the uncurated NCBI data. .. hint:: These commands are very similar to the commands run by the ingest workflow with some minor differences. The ingest workflow restricts the columns to those defined in ``config["ncbi_datasets_fields"]`` and keeps the header names as the more computer friendly "Mnemonic" of the `NCBI Datasets' available fields `_. 1. Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately. .. code-block:: console $ nextstrain shell . 2. Create the ``ingest/data`` directory if it doesn't already exist. .. code-block:: console $ mkdir -p ingest/data 3. Download the dataset with the pathogen NCBI taxonomy ID. .. code-block:: console $ datasets download virus genome taxon \ --filename ingest/data/ncbi_dataset.zip 4. Extract and format the metadata as a TSV file for easy inspection .. code-block:: console $ dataformat tsv virus-genome \ --package ingest/data/ncbi_dataset.zip \ > ingest/data/raw_metadata.tsv 5. Exit the Nextstrain shell to return to your usual shell environment. .. code-block:: console $ exit The produced ``ingest/data/raw_metadata.tsv`` will contain all of the fields available from NCBI Datasets. Updating the workflow config ---------------------------- We'll walk through an example custom config to include an additional column in the curated output. For example, examining the raw NCBI metadata shows us that ``virus-name`` is a NCBI Datasets field that is not currently downloaded by the default Zika ingest workflow. If you wanted this field to be included in your outputs, you could perform the following steps. 1. Create a new build config directory ``ingest/build-configs/tutorial/`` .. code-block:: console $ mkdir ingest/build-configs/tutorial 2. Copy the default config to ``ingest/build-configs/tutorial/config.yaml`` .. code-block:: console $ cp ingest/defaults/config.yaml ingest/build-configs/tutorial/config.yaml 3. Modify the config parameters within your new custom config ``ingest/build-configs/tutorial/config.yaml``. * Add ``virus-name`` to the ``ncbi_datasets_fields`` to make the workflow parse the column from the downloaded NCBI data. * Update the ``curate.field_map`` with an entry for the new field to match the underscore naming scheme of column names. .. code-block:: yaml curate: field_map: virus-name: virus_name * Add ``virus_name`` to the ``curate.metadata_columns`` to configure the workflow to include the new column in the final output file. * (Optional) Remove any other config parameters that you are not modifying .. note:: Config parameters that are dictionaries will merge with the parameters defined in ``ingest/defaults/config.yaml`` while all other types will overwrite the default. See `Snakemake documentation `_ for more details on how configuration files work. All config parameters available are listed in the ``ingest/defaults/config.yaml`` file. Any of the config parameters can be overridden in a custom config file. 4. Run the ingest workflow again with the custom config file. .. code-block:: console $ nextstrain build ingest --configfile build-configs/tutorial/config.yaml --forceall Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments. Config file defaults/config.yaml is extended by additional config specified via the command line. Building DAG of jobs… [...a lot of output...] 5. Inspect the new ``ingest/results/metadata.tsv`` to see that it now includes the additional ``virus_name`` column. Advanced usage: Customizing the ingest workflow =============================================== .. note:: This section of the tutorial requires an understanding of `Snakemake `_ workflows. In addition to configuring the ingest workflow, it is also possible to extend the ingest workflow with your own custom steps. We'll walk through an example customization that joins additional metadata to the public data that you've curated in the previous steps. 1. Create an additional metadata file ``ingest/build-configs/tutorial/additional-metadata.tsv`` .. code-block:: none genbank_accession column_A column_B column_C AF013415 AAAAA BBBBB CCCCC AF372422 AAAAA BBBBB CCCCC AY326412 AAAAA BBBBB CCCCC AY632535 AAAAA BBBBB CCCCC EU303241 AAAAA BBBBB CCCCC EU074027 AAAAA BBBBB CCCCC EU545988 AAAAA BBBBB CCCCC NC_012532 AAAAA BBBBB CCCCC DQ859059 AAAAA BBBBB CCCCC JN860885 AAAAA BBBBB CCCCC 2. Create a new rules file ``ingest/build-configs/tutorial/merge-metadata.smk`` .. code-block:: python rule merge_metadata: input: metadata="results/metadata.tsv", additional_metadata="build-configs/tutorial/additional-metadata.tsv", output: merged_metadata="results/merged-metadata.tsv" shell: """ tsv-join -H \ --filter-file {input.additional_metadata} \ --key-fields "genbank_accession" \ --append-fields "*" \ --write-all "?" \ {input.metadata} > {output.merged_metadata} """ This rule uses `tsv-join `_ to merge the additional metadata with the metadata output from the ingest workflow. The records will be merged using the ``genbank_accession`` column and all fields from the ``additional-metadata.tsv`` file will be appended to the metadata. Any record in the ``metadata.tsv`` that does not have a matching record in the ``additional-metadata.tsv`` will have a default ``?`` value in the new columns. 3. Add the following to the custom config file ``ingest/build-configs/tutorial/config.yaml`` .. code-block:: yaml custom_rules: - build-configs/tutorial/merge-metadata.smk The ``custom_rules`` config tells the ingest workflow to include your custom rules so that you can run them as part of the workflow. 4. Run the ingest workflow again with the customized rule. .. code-block:: console $ nextstrain build ingest merge_metadata --configfile build-configs/tutorial/config.yaml Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments. Config file config/defaults.yaml is extended by additional config specified via the command line. Building DAG of jobs... [...a lot of output...] 5. Inspect the ``ingest/results/merged-metadata.tsv`` file to see that it includes the additional columns ``column_A``, ``column_B``, and ``column_C``. The records with the ``genbank_accession`` listed in the ``additional-metadata.tsv`` file should have the placeholder data in the new columns, while other records should have the default ``?`` value. Next steps ========== * Run the `zika phylogenetic workflow `_ with new ingested data as input by running .. code-block:: console $ mv ingest/results/* phylogenetic/data/ $ nextstrain build phylogenetic If you've customized the ingest workflow then you may need to modify the phylogenetic workflow to use the new ingested data file. We are planning to write another tutorial to cover other modifications to your phylogenetic workflow. * :doc:`Learn how to create an ingest workflow `