Molecular markers - computer practicals

Basic analysis of NGS data



Presentation is here.

Training dataset A is here. It is a subset of Illumina paired-end reads of Amomum subulatum (Zingiberaceae) enriched for family-specific loci.
Training dataset B is here. It is an Illumina paired-end dataset of Ensete superbum (Musaceae), which we use to assemble the complete plastome.

A ZIP file with the necessary software is here. Download and unzip it; no installation is necessary. All the programs run under Windows (they were compiled under Cygwin and can be run as they are from the command line).
Unzip all files into a single folder together with the data and do all the work in this folder.
FastQC - a program for assessing the quality of NGS reads; run run_fastqc.bat (requires Java)
Trimmomatic - software for adapter and quality trimming; run the jar file with parameters from the command line (requires Java)
fastq-dump - a command for downloading NGS reads from the NCBI Sequence Read Archive (part of the SRA Toolkit); run with parameters from the command line
fastuniq - software for removing duplicate reads
BWA - Burrows-Wheeler Aligner for mapping reads to a reference
samtools - a suite of programs for interacting with high-throughput sequencing data; run with parameters from the command line
bedtools - utilities for genomic analysis tasks; run with parameters from the command line
Velvet - a de novo assembler; run with parameters from the command line
many other GNU tools (head, gzip, ls, sort, uniq, wc, cat, cut, sed, grep etc.) - run with parameters from the command line

Download the Tablet software here. It is a graphical viewer for NGS alignments and assemblies.

Other software we recommend installing


Tasks A (to work with the Amomum dataset)
Full set of commands can be downloaded here.
NOTE: anything in curly brackets - {} - is a placeholder; type the proper filename instead (without the brackets!)
NOTE2: these commands are for illustration only and should be tuned for real data analysis!

1. Downloading data from SRA (Sequence Read Archive)

2. Looking at the sequences
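
To make the FASTQ layout concrete, here is a minimal Python sketch that reads records and decodes quality scores. It is an illustration only, not part of the course commands: it assumes the common four-lines-per-record layout and Phred+33 quality encoding, and the function names (read_fastq, phred33) and the two toy reads are made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], seq, qual


def phred33(qual):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(c) - 33 for c in qual]


# Toy data invented for this sketch (not taken from the training datasets).
toy = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nII##II\n"

records = list(read_fastq(io.StringIO(toy)))
n_reads = len(records)
mean_length = sum(len(seq) for _, seq, _ in records) / n_reads

print("reads:", n_reads)
print("mean length:", mean_length)
print("qualities of read2:", phred33(records[1][2]))
```

For a real file, the same generator could be fed with open("{file}.fastq") or gzip.open("{file}.fastq.gz", "rt") in place of the in-memory string.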

3. Checking the quality of sequences using FastQC

4. Trimming sequences with Trimmomatic
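
The idea behind sliding-window quality trimming can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not Trimmomatic's actual implementation: the function name, the default window size and threshold, and the toy read are assumptions chosen for this example.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=15):
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean; reads shorter than the window are left unchanged."""
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read invented for this sketch: good quality at the 5' end, poor at the 3' end.
toy_seq = "ACGTACGTAC"
toy_quals = [40, 40, 40, 40, 40, 10, 5, 2, 2, 2]

trimmed_seq, trimmed_quals = sliding_window_trim(toy_seq, toy_quals)
print(trimmed_seq, trimmed_quals)
```

In this toy read the first window with a mean below 15 starts at the fifth base, so only the first four bases are kept.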

5. Duplicate removal using fastuniq
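
Duplicate removal for paired-end data can be understood as keeping one copy of every read pair whose forward and reverse sequences are both identical. The Python sketch below shows this general idea only; it is not fastuniq's own algorithm, and the function name and toy pairs are made up for this example.

```python
def dedup_pairs(pairs):
    """Keep the first occurrence of each (forward, reverse) sequence pair."""
    seen = set()
    unique = []
    for fwd, rev in pairs:
        key = (fwd, rev)
        if key not in seen:
            seen.add(key)
            unique.append((fwd, rev))
    return unique


# Toy read pairs invented for this sketch.
toy_pairs = [
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),  # exact duplicate of the first pair, will be dropped
    ("ACGT", "CCCC"),  # same forward read, different reverse read: kept
    ("GGGG", "TTGA"),
]

unique_pairs = dedup_pairs(toy_pairs)
print(len(toy_pairs), "pairs in,", len(unique_pairs), "pairs out")
```

Note that a pair counts as a duplicate only if both mates match, which is why the third toy pair survives.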

6. Read mapping with BWA

7. View BAM file with Tablet

8. Variant calling using SAMtools/BCFtools


Tasks B (to work with the Ensete dataset)
Full set of commands can be downloaded here.

1. Downloading data from SRA (Sequence Read Archive)

2. Repeat the steps from Tasks A for Ensete

3. De novo assembly with Velvet

4. Comparing assemblies using Quast
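
One of the summary statistics commonly used to compare assemblies is N50: the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length. The Python sketch below computes it from a list of contig lengths; the function name and the toy lengths are assumptions for illustration and do not come from the Ensete data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the cumulative sum to >= 50 % of the total.
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths invented for this sketch (total = 280).
toy_lengths = [100, 80, 50, 30, 20]
toy_n50 = n50(toy_lengths)
print("N50:", toy_n50)
```

Here the two longest contigs (100 + 80 = 180) are the first to cover at least half of 280, so the N50 is 80.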

5. Check the assembly in Geneious

6. Annotating the resulting contigs with GeSeq

 

Good luck and thank you for joining the practical course...