Molecular markers - computer practicals

Basic analysis of NGS data



Presentation is here.

Training dataset A is here. It is a subset of Illumina paired-end reads of Amomum subulatum (Zingiberaceae) enriched for family-specific loci.
Training dataset B is here. It is an Illumina paired-end dataset of Ensete superbum (Musaceae), which we use to assemble the complete plastome.

A ZIP file with the necessary software for Windows is here. Download and unzip it; no installation is necessary. All programs run under Windows (they were compiled under Cygwin and can be run as they are from the command line).
Unzip all files into a single folder together with the data and do all the work in this folder.
FastQC - program for assessing the quality of NGS reads, run run_fastqc.bat (requires Java)
Trimmomatic - software for adapter and quality trimming, run the JAR file with parameters from the command line (requires Java)
fastq-dump - command used to download NGS reads from the NCBI Sequence Read Archive (part of the SRA Toolkit), run with parameters from the command line
fastuniq - software for removing duplicate reads
BWA - Burrows-Wheeler Aligner for mapping reads to a reference
samtools - a suite of programs for working with high-throughput sequencing data, run with parameters from the command line
bedtools - utilities for genomic analysis tasks, run with parameters from the command line
Velvet - de novo assembler, run with parameters from the command line
many other GNU tools (head, gzip, ls, sort, uniq, wc, cat, cut, sed, grep etc.) - run with parameters from the command line

Tablet, a graphical viewer for NGS alignments and assemblies, can be downloaded here.

Other software recommended for installation under Windows

Under Linux you need to download and install all the tools yourself. Trimmomatic and FastQC are Java programs.
The remaining software should be either installed from repositories (depending on your distribution) or compiled using these instructions.


Tasks A (to work with the Amomum dataset)
NOTE: anything in curly brackets - {} - is a placeholder; type the appropriate filename instead (without the brackets!)
NOTE2: these commands are for illustration only and should be tuned for real data analysis!

The full set of commands (fully working and ready to copy and paste into the terminal) can be downloaded here (for Windows) or here (for Linux).

1. Looking at the sequences

2. Checking the quality of sequences using FastQC

3. Trimming sequences with Trimmomatic

4. Duplicate removal using fastuniq

5. Read mapping with BWA

6. View BAM file with Tablet

7. Variant calling using SAMtools/BCFtools
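
Before running FastQC it can help to see what step 1 (looking at the sequences) actually inspects. The following is a minimal Python sketch, not part of the course command set: it assumes four-line FASTQ records and Phred+33 quality encoding, and the function names (read_fastq, summarize, open_fastq) and the two example reads are made up for illustration.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


def summarize(handle):
    """Return (number of reads, mean read length, mean Phred quality)."""
    n_reads = 0
    n_bases = 0
    quality_sum = 0
    for _name, sequence, quality in read_fastq(handle):
        n_reads += 1
        n_bases += len(sequence)
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        quality_sum += sum(ord(ch) - 33 for ch in quality)
    if n_bases == 0:
        return 0, 0.0, 0.0
    return n_reads, n_bases / n_reads, quality_sum / n_bases


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Two made-up reads, used here instead of a real file.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\n######\n"
reads, mean_length, mean_quality = summarize(io.StringIO(EXAMPLE))
print(reads, mean_length, mean_quality)  # prints: 2 5.0 17.2
```

To try it on one of the training files, pass open_fastq("{your_file}.fastq.gz") to summarize instead of the in-memory example. FastQC reports far more than these three numbers; the sketch only shows how the record structure is read.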


Tasks B (to work with the Ensete dataset)
Full set of commands can be downloaded here (for Windows) or here (for Linux).

1. Downloading data from SRA (Sequence Read Archive) - this step is optional; you can work with training dataset B instead

2. Repeat the steps from Tasks A for Ensete

3. De novo assembly with Velvet

4. Comparing assemblies using QUAST

5. Check the assembly in Geneious

6. Annotating the resulting contigs with GeSeq
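
One of the statistics commonly used when comparing assemblies (step 4) is N50: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length. The Python sketch below shows one way to compute it from a FASTA file of contigs. It is only an illustration under that definition, not the code QUAST itself uses, and the function names and example sequences are hypothetical.

```python
def fasta_lengths(lines):
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the sequence being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Made-up contigs, used here instead of a real assembly file.
EXAMPLE = ">contig1\nACGT\nAC\n>contig2\nAAA\n"
print(fasta_lengths(EXAMPLE.splitlines()))  # prints: [6, 3]
print(n50([100, 80, 50, 30, 20]))           # prints: 80
```

For a real assembly, open the contig FASTA file and pass the file object to fasta_lengths, then pass the result to n50. A higher N50 suggests a more contiguous assembly, but it says nothing about correctness, so it is best read alongside the other metrics in the report.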

 

Good luck and thank you for joining the practical course...