Molecular markers - computer practicals
Basic analysis of NGS data

Presentation is here.
Training dataset A is here. It is a subset of Illumina paired-end reads of
Amomum subulatum (Zingiberaceae) enriched for family-specific loci.
Training dataset B is here. It is an Illumina paired-end dataset of Ensete
superbum (Musaceae) which we will use to assemble the whole plastome.
A ZIP file with the necessary software for Windows is here.
Download and unzip it; no installation is necessary. All the programs run
under Windows (they were compiled under Cygwin and can be run as they are from
the command line).
Unzip all files into a single folder together with the data and do all the work
in this folder.
FastQC - program for assessing the quality of NGS reads, run run_fastqc.bat
(requires Java)
Trimmomatic - software for adapter and quality trimming, run the JAR file with
parameters from the command line (requires Java)
fastq-dump - command used to download NGS sequences from the NCBI Sequence Read
Archive (part of the SRA toolkit), run with parameters from the command line
fastuniq - software for duplicate read removal
BWA - Burrows-Wheeler Aligner for read mapping to a reference
samtools - a suite of programs for interacting with high-throughput sequencing
data, run with parameters from the command line
bedtools - utilities for genomic analysis tasks, run with parameters from the
command line
Velvet - de novo assembler, run with parameters from the command line
many other GNU tools (head, gzip, ls, sort, uniq, wc, cat, cut, sed, grep,
etc.) - run with parameters from the command line
The software Tablet, a graphical viewer for NGS alignments and assemblies,
should be downloaded here. It is the other software recommended to install
under Windows.
Under Linux you need to install/download all the tools yourself. Trimmomatic
and FastQC are Java programs. The remaining software should either be installed
from repositories (depending on your distribution) or compiled using
these instructions.
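On many distributions most of the tools are packaged; for example, on
Debian/Ubuntu-based systems (the package names are an assumption and may differ
in your distribution):
sudo apt install fastqc trimmomatic bwa samtools bcftools bedtools velvet fastuniq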
Tasks A
(to work with the Amomum dataset)
NOTE: anything in curly brackets - {} - means that you should type a proper
filename instead (without the brackets!)
NOTE 2: these commands are for illustration only and should be tuned for a real
data analysis!
The full set of commands (fully working, ready for copy&paste into the
terminal) can be downloaded here (for Windows) or here (for Linux).
1. Looking at the sequences
- the sequences are usually gzipped (they have the suffix *.gz); they can be
opened either in Total Commander or you need to unzip them first, e.g. with
the command
gzip -d {filename}
- open the FASTQ file, e.g. in Total Commander by selecting the file and
pressing F3; do not open huge files in Notepad++!
- you can also look at the structure of the file using the command
head {filename}
- what do you see in the file? How many lines does a single read occupy?
(see the illustrative record after this task)
- if you unzipped the file(s) using the 'gzip -d' command, you should gzip
them back
gzip {filename}
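For reference, each read occupies four lines of a FASTQ file; the record below
is a made-up read for illustration only:
@READ_ID_001
GATTTGGGGTTCAAAGCAGTATCG
+
IIIIHHHGGGFFFEEEDDDCCCBB
Line 1 is the read identifier (starting with '@'), line 2 the nucleotide
sequence, line 3 a separator (starting with '+'), and line 4 the ASCII-encoded
Phred quality scores, one character per base.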
2. Checking the quality of sequences using FastQC
- open the software FastQC by double-clicking on run_fastqc.bat
- use File - Open to select the gzipped file(s) (you can select multiple files)
and look at the various graphs and summaries
- what type of information is displayed and how can it be interpreted?
Are there any differences between forward (R1) and reverse (R2) reads? Is
there any problem?
3. Trimming sequences with Trimmomatic
- the software Trimmomatic is used to remove low-quality sequence as well
as Illumina-specific adapters
- run the following command (do not forget to change the file and sample
names in brackets!) (this is for Windows; on Linux change the backslash '\'
to a forward slash '/')
java -jar trimmomatic-0.38.jar PE -phred33 {read1.fastq.gz} {read2.fastq.gz} {sample}-1P.fastq.gz {sample}-1U.fastq.gz {sample}-2P.fastq.gz {sample}-2U.fastq.gz ILLUMINACLIP:adapters\TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36
- check the Trimmomatic documentation and get familiar with the settings
provided on the command line
- what files have been created? (a concrete example follows this task)
- check the resulting 1P and 2P files with FastQC and compare the results
with the original unfiltered sequences - did anything change?
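For illustration, with hypothetical inputs 'Amomum_R1.fastq.gz' and
'Amomum_R2.fastq.gz' and the sample name 'Amomum' (these file names are
assumptions, not files from the dataset), the command would read:
java -jar trimmomatic-0.38.jar PE -phred33 Amomum_R1.fastq.gz Amomum_R2.fastq.gz Amomum-1P.fastq.gz Amomum-1U.fastq.gz Amomum-2P.fastq.gz Amomum-2U.fastq.gz ILLUMINACLIP:adapters\TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36
The 1P/2P files hold reads whose mate also survived trimming (paired), while
1U/2U hold reads whose mate was discarded (unpaired).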
4. Duplicate removal using fastuniq
- first, the sequences have to be unzipped
gzip -d {sample}-1P.fastq.gz
gzip -d {sample}-2P.fastq.gz
- prepare a list of sequences (the command ls returns file names matching
the pattern ('?' matches any single character, i.e., both 1 and 2), and '>'
redirects the output of the command to a file)
ls {sample}-?P.fastq > fastqlist.txt
- run duplicate removal
fastuniq -i fastqlist.txt -t q -o {sample}-1P_nodup.fastq -p {sample}-2P_nodup.fastq
- count the number of reads in the FASTQ files before and after quality
trimming and after duplicate removal (divide the resulting numbers by 4;
a one-line shortcut follows this task)
gzip -d {read1.fastq.gz}
#the original file must be unzipped first
cat {read1.fastq} | wc -l
cat {sample}-1P.fastq | wc -l
cat {sample}-2P_nodup.fastq | wc -l
- gzip the deduplicated samples (again, '?' allows you to run the command
just once)
gzip {sample}-?P_nodup.fastq
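As a shortcut, awk can do the division by four for you (a sketch giving the
same result as wc -l divided by 4):
awk "END{print NR/4}" {sample}-1P.fastq
for Linux:
awk 'END{print NR/4}' {sample}-1P.fastq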
5. Read mapping with BWA
- first, the reference must be indexed, use
'curcuma_HybSeqProbes_test_with400Ns_beginend.fas' as a reference
bwa index {reference.fasta}
- map reads to the reference - look at the resulting SAM (Sequence
Alignment/Map) file and get familiar with its structure (look at the
appropriate slide in the presentation or the documentation)
bwa mem {reference.fasta} {sample}-1P_nodup.fastq {sample}-2P_nodup.fastq > {output.sam}
- convert to BAM (the compressed binary version of a SAM file, much
smaller but not human-readable) using samtools
samtools view -bS {output.sam} > {output.bam}
- count the number of all reads in the BAM file (this is a series of commands
chained by so-called pipes (|), i.e., the output of one command is taken as
the input of the following one)
cut - takes the 1st column/field, i.e., the read name
sort - sorts lines alphabetically (necessary before applying the next
command)
uniq - reports unique names only, i.e. removes duplicated names (reads can
map to multiple locations)
wc - 'word count', counts the number of lines with the parameter '-l'
samtools view {output.bam} | cut -f1 | sort | uniq | wc -l
- count the number of alignments in the BAM file (reads mapped to multiple
locations count multiple times) - the option '-c' outputs the count instead
of the reads
samtools view -c {output.bam}
- get familiar with FLAGs and their meanings (look at the appropriate
slide in the presentation, the documentation, p. 6-7, or this page)
- count the number of mapped alignments (flag 'unmapped' absent, i.e. the 3rd
bit - binary = 4, hexadecimal = 0x4 - is not set - '-F')
samtools view -F 0x04 -c {output.bam}
- count the number of unmapped alignments (flag 'unmapped' present - '-f')
samtools view -f 0x04 -c {output.bam}
- count the number of supplementary alignments (flag 'supplementary alignment'
present, i.e. the 12th bit - binary = 2048, hexadecimal = 0x800 - is set)
samtools view -f 0x800 -c {output.bam}
- count the number of mapped reads (alignments flagged neither as unmapped
(0x4) nor as supplementary (0x800))
samtools view -F 0x804 -c {output.bam}
- sort the BAM file
samtools sort {output.bam} {output_sorted}
for Linux:
samtools sort {output.bam} -o {output_sorted.bam}
- index the sorted BAM file - what file is created?
samtools index {output_sorted.bam}
- extract only mapped reads (reads must be flagged as mapped in a proper
pair (0x2)); '-b' makes the output BAM rather than SAM
samtools view -b -f 0x2 {output_sorted.bam} > {output_mapped.bam}
- sort the BAM file with the extracted reads by read name
samtools sort -n {output_mapped.bam} {output_mapped.sort}
for Linux:
samtools sort -n {output_mapped.bam} -o {output_mapped.sort.bam}
- extract the mapped reads as paired-end FASTQ using bedtools
bedtools bamtofastq -i {output_mapped.sort.bam} -fq {read1_mapped.fastq} -fq2 {read2_mapped.fastq}
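As a quick cross-check of the FLAG-based counts above, samtools can summarize
all the categories in one go; its output lists (among others) total, mapped,
properly paired and supplementary records, so it should agree with the
individual '-c' counts:
samtools flagstat {output.bam}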
6. View BAM file with Tablet
- open Tablet and click on ‘Open Assembly’
- select the sorted/indexed BAM file and an appropriate reference
- click on the contig in the table on the left side
7. Variant calling using SAMtools/BCFtools
- create 'pileup' for the sorted BAM file (u = uncompressed output)
samtools mpileup -uf {reference.fasta} {output_sorted.bam} > {output.pileup}
- call variants/genotypes using bcftools (v = variant sites only, c = SNP
calling, g = call genotypes) [this is the command for the old version; the
new version uses 'bcftools call' and different parameters, see the Linux
variant below]
bcftools view -vcg {output}.pileup > {output}.vcf
for Linux:
bcftools call --skip-variants indels --multiallelic-caller --variants-only -O v -o {output}.vcf {output}.pileup
- look at the resulting VCF file - how many variants were called? You can
check it with the following command (grep -v = prints all lines not containing
the pattern, "^#" = lines starting with #, wc -l = counts the number of lines)
grep -v "^#" {output}.vcf | wc -l
- look at the last column of the VCF - how many homozygotes (1/1) and
heterozygotes (0/1) are present? (homozygote = no variability within the
sample; heterozygote = variability within the sample)
grep "1/1" {output}.vcf | wc -l
grep "0/1" {output}.vcf | wc -l
- in the 6th column there are variant call qualities (QUAL); keep only
variants with a quality of at least 36 (awk deals with columns, i.e. if the
value in the 6th column ($6) is >= 36 it prints the whole line - $0) -
how many heterozygotes remained?
awk "{if ($6 >= 36) print $0}" {output}.vcf | grep "0/1" | wc -l
for Linux:
awk '{if ($6 >= 36) print $0}' {output}.vcf | grep "0/1" | wc -l
- keep only variants covered by at least 15 reads - DP in the INFO column
is the read depth (awk -F"=|;" sets the field separator to both '=' and ';',
which makes the DP value the 2nd column)
awk -F"=|;" "{if ($2 >= 15) print $0}" {output}.vcf | grep "0/1" | wc -l
for Linux:
awk -F"=|;" '{if ($2 >= 15) print $0}'
{output}.vcf | grep "0/1" | wc -l
- how many heterozygous sites are left?
- however, for reliable variant filtering it is recommended to use a
dedicated tool such as VCFtools (but it doesn't run under Windows...)
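With a recent BCFtools (version 1.x), the pileup and calling steps above can
be written as a single pipeline; a sketch of the modern equivalent (-m =
multiallelic caller, -v = variant sites only, -Ov = uncompressed VCF output):
bcftools mpileup -f {reference.fasta} {output_sorted.bam} | bcftools call -mv -Ov -o {output}.vcf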
Tasks B
(to work with the Ensete dataset)
Full set of commands can be downloaded
here (for Windows) or
here (for Linux).
1. Downloading data from SRA (Sequence Read Archive) - this
is optional, you can work with training dataset B instead
- download the paired-end Illumina reads for Ensete superbum using
the following command. This sample has the SRR (= sequence read run) number
SRR7058210
fastq-dump --split-3 --gzip SRR7058210
- run this command at the beginning of the course because the download can
take a while
- if you do not succeed, use training dataset B, it contains the same data
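Newer releases of the SRA toolkit also provide prefetch and fasterq-dump,
which are usually faster (note that fasterq-dump does not gzip its output,
so compress the files afterwards); a sketch:
prefetch SRR7058210
fasterq-dump --split-3 SRR7058210
gzip SRR7058210_1.fastq SRR7058210_2.fastq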
2. Repeat the steps from Tasks A for Ensete
- trim the sequences with Trimmomatic, remove duplicates with fastuniq
- count the number of reads in the FASTQ files before and after quality
trimming and after duplicate removal
- map the reads to the reference, use 'HF677508_Musa_acuminata_cpDNA.fas' as
the reference
- count the following numbers of reads in the resulting BAM file
- number of alignments
- number of mapped alignments
- number of unmapped alignments
- number of supplementary alignments
- number of mapped reads
- sort and index the BAM file
- extract the mapped reads as FASTQ files - how many pairs were extracted?
- submit the counts you obtained using the Google Form
3. De novo assembly with Velvet
- run velveth on the obtained cpDNA reads with k-mer = 101 (if you were unable
to generate the FASTQ files with chloroplast reads you can download them
here)
velveth {outputdir} {k-mer} -fastq -shortPaired -separate {read1_mapped.fastq.gz} {read2_mapped.fastq.gz}
- run velvetg with the same output directory name (cov_cutoff = minimum
coverage limit, min_contig_lgth = minimum contig length, exp_cov = expected
coverage)
velvetg {outputdir} -cov_cutoff auto -min_contig_lgth 1000 -exp_cov auto
- check the number of contigs (nodes), N50, maximum length and total length;
also check the statistics in 'stats.txt'
- experiment with different k-mer values (both higher and lower) - the k-mer
must be an odd number (to avoid palindromes) and smaller than the read
length - and rerun the analysis; for velvetg you can also lower the coverage
cutoff (a loop for running several k-mer values follows this task)
- submit the values you obtained using this
Google Form
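On Linux, a small bash loop can run the whole k-mer series in one go (a
sketch; the assembly_$k directory names are arbitrary assumptions):
for k in 81 91 101 111 121; do
  velveth assembly_$k $k -fastq -shortPaired -separate {read1_mapped.fastq.gz} {read2_mapped.fastq.gz}
  velvetg assembly_$k -cov_cutoff auto -min_contig_lgth 1000 -exp_cov auto
done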
4. Comparing assemblies using Quast
- create a new folder and name it, e.g., 'comparison'
- copy the 'contigs.fa' files from the folders created by the different Velvet
runs and rename them (one by one) to, e.g., 91.fa or 121.fa, where the name
corresponds to the k-mer (command-line equivalents follow this task)
- go to the Quast webserver and upload all the assembly files
- fill in something as the 'Caption' and click on 'Evaluate'; refresh the
webpage after some time
- click on the particular report on the right side to see the summary
- observe the summary results and all three available plots (Cumulative
length, Nx, GC content) - what do the plots show? Which k-mer setting
gave the best assembly?
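The copying and renaming for Quast can also be done on the command line (a
sketch, assuming the assembly_{k} directories from the loop above):
mkdir comparison
cp assembly_91/contigs.fa comparison/91.fa
cp assembly_101/contigs.fa comparison/101.fa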
5. Check the assembly in Geneious
- import the reference as well as 'contigs.fa' from your best assembly into
a new folder in Geneious
- select all sequences and do Tools -> Align/Assemble -> Map to Reference
- select the correct reference and change sensitivity to 'High
Sensitivity/Medium'
- in the resulting file you should see how de novo contigs matched the
reference
6. Annotating the resulting contigs with GeSeq
- go to GeSeq and upload the contigs assembled by Velvet
- set 'Linear' and then click on 'ADD NCBI RefSeq(s)'
- type the name of the reference (e.g. Musa) and select one
- click 'OK' and then 'Submit'
- wait until the job is finished and click on 'OGDRAW' at a particular contig
to see how it was annotated
Good luck and thank you for joining the practical course...