Molecular markers - computer practicals
Basic analysis of NGS data

Presentation is here.
Training dataset A is here. It is a subset of Illumina paired-end reads of
Amomum subulatum (Zingiberaceae) enriched for family-specific loci.
Training dataset B is here. It is an Illumina paired-end dataset of Ensete
superbum (Musaceae) which we will use to assemble the whole plastome.
A ZIP file with the necessary software for Windows is here.
Download and unzip it; no installation is necessary. All the programs run
under Windows (they were compiled under Cygwin and can be run as they are from
the command line).
Unzip all files into a single folder together with the data and do all the work
in this folder.
FastQC - program for assessing the quality of NGS reads, run run_fastqc.bat
(requires Java)
Trimmomatic - software for adapter and quality trimming, run the JAR file with
parameters from the command line (requires Java)
fastq-dump - command used to download NGS sequences from the NCBI Sequence Read
Archive (part of the SRA toolkit), run with parameters from the command line
fastuniq - software for duplicate read removal
BWA - Burrows-Wheeler Aligner for read mapping to a reference
samtools - a suite of programs for interacting with high-throughput sequencing
data, run with parameters from the command line
bedtools - utilities for genomic analysis tasks, run with parameters from the
command line
Velvet - de novo assembler, run with parameters from the command line
many other GNU tools (head, gzip, ls, sort, uniq, wc, cat, cut, sed, grep,
etc.) - run with parameters from the command line
The software Tablet, a graphical viewer for NGS alignments and assemblies,
should be downloaded here. It is the other software recommended to install
under Windows.
Under Linux you need to install/download all the tools yourself. Trimmomatic
and FastQC are Java programs. The remaining software should either be installed
from repositories (depending on your distribution) or compiled using
these instructions.
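On many distributions most of the tools are packaged; for example, on
Debian/Ubuntu-based systems (the package names are an assumption and may differ
in your distribution):
sudo apt install fastqc trimmomatic bwa samtools bcftools bedtools velvet fastuniq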
Tasks A
(to work with the Amomum dataset)
NOTE: anything in curly brackets - {} - means that you should type a proper
filename instead (without the brackets!)
NOTE 2: these commands are for illustration only and should be tuned for a real
data analysis!
The full set of commands (fully working, ready for copy&paste into the
terminal) can be downloaded here (for Windows) or here (for Linux).
1. Looking at the sequences
- the sequences are usually gzipped (they have the suffix *.gz); they can be
opened either in Total Commander or you need to unzip them first, e.g. with
the command
gzip -d {filename}
- open the FASTQ file, e.g. in Total Commander by selecting the file and
pressing F3; do not open huge files in Notepad++!
- you can also look at the structure of the file using the command
head {filename}
- what do you see in the file? How many lines does a single read occupy?
(see the illustrative record after this task)
- if you unzipped the file(s) using the 'gzip -d' command, you should gzip
them back
gzip {filename}
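For reference, each read occupies four lines of a FASTQ file; the record below
is a made-up read for illustration only:
@READ_ID_001
GATTTGGGGTTCAAAGCAGTATCG
+
IIIIHHHGGGFFFEEEDDDCCCBB
Line 1 is the read identifier (starting with '@'), line 2 the nucleotide
sequence, line 3 a separator (starting with '+'), and line 4 the ASCII-encoded
Phred quality scores, one character per base.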
2. Checking the quality of sequences using FastQC
- open the software FastQC by double-clicking on run_fastqc.bat
- use File - Open to select the gzipped file(s) (you can select multiple files)
and look at the various graphs and summaries
- what type of information is displayed and how can it be interpreted?
Are there any differences between forward (R1) and reverse (R2) reads? Is
there any problem?
3. Trimming sequences with Trimmomatic
- the software Trimmomatic is used to remove low-quality sequence as well
as Illumina-specific adapters
- run the following command (do not forget to change the file and sample
names in brackets!) (this is for Windows; on Linux change the backslash '\'
to a forward slash '/')
java -jar trimmomatic-0.38.jar PE -phred33 {read1.fastq.gz} {read2.fastq.gz} {sample}-1P.fastq.gz {sample}-1U.fastq.gz {sample}-2P.fastq.gz {sample}-2U.fastq.gz ILLUMINACLIP:adapters\TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36
- check the Trimmomatic documentation and get familiar with the settings
provided on the command line
- what files have been created? (a concrete example follows this task)
- check the resulting 1P and 2P files with FastQC and compare the results
with the original unfiltered sequences - did anything change?
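For illustration, with hypothetical inputs 'Amomum_R1.fastq.gz' and
'Amomum_R2.fastq.gz' and the sample name 'Amomum' (these file names are
assumptions, not files from the dataset), the command would read:
java -jar trimmomatic-0.38.jar PE -phred33 Amomum_R1.fastq.gz Amomum_R2.fastq.gz Amomum-1P.fastq.gz Amomum-1U.fastq.gz Amomum-2P.fastq.gz Amomum-2U.fastq.gz ILLUMINACLIP:adapters\TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36
The 1P/2P files hold reads whose mate also survived trimming (paired), while
1U/2U hold reads whose mate was discarded (unpaired).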
4. Duplicate removal using fastuniq
- first, the sequences have to be unzipped
gzip -d {sample}-1P.fastq.gz
gzip -d {sample}-2P.fastq.gz
- prepare a list of sequences (the command ls returns file names matching
the pattern ('?' matches any single character, i.e., both 1 and 2), and '>'
redirects the output of the command to a file)
ls {sample}-?P.fastq > fastqlist.txt
- run duplicate removal
fastuniq -i fastqlist.txt -t q -o {sample}-1P_nodup.fastq -p {sample}-2P_nodup.fastq
- count the number of reads in the FASTQ files before and after quality
trimming and after duplicate removal (divide the resulting numbers by 4;
a one-line shortcut follows this task)
gzip -d {read1.fastq.gz}
#the original file must be unzipped first
cat {read1.fastq} | wc -l
cat {sample}-1P.fastq | wc -l
cat {sample}-2P_nodup.fastq | wc -l
- gzip the deduplicated samples (again, '?' allows you to run the command
just once)
gzip {sample}-?P_nodup.fastq
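As a shortcut, awk can do the division by four for you (a sketch giving the
same result as wc -l divided by 4):
awk "END{print NR/4}" {sample}-1P.fastq
for Linux:
awk 'END{print NR/4}' {sample}-1P.fastq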
5. Read mapping with BWA
- first, the reference must be indexed, use
'curcuma_HybSeqProbes_test_with400Ns_beginend.fas' as a reference
bwa index {reference.fasta}
- map reads to the reference - look at the resulting SAM (Sequence
Alignment/Map) file and get familiar with its structure (look at the
appropriate slide in the presentation or the documentation)
bwa mem {reference.fasta} {sample}-1P_nodup.fastq {sample}-2P_nodup.fastq > {output.sam}
- convert to BAM (the compressed binary version of a SAM file, much
smaller but not human-readable) using samtools
samtools view -bS {output.sam} > {output.bam}
- count the number of all reads in the BAM file (this is a series of commands
chained by so-called pipes (|), i.e., the output of one command is taken as
the input of the following one)
cut - takes the 1st column/field, i.e., the read name
sort - sorts lines alphabetically (necessary before applying the next
command)
uniq - reports unique names only, i.e. removes duplicated names (reads can
map to multiple locations)
wc - 'word count', counts the number of lines with the parameter '-l'
samtools view {output.bam} | cut -f1 | sort | uniq | wc -l
- count the number of alignments in the BAM file (reads mapped to multiple
locations count multiple times) - the option '-c' outputs the count instead
of the reads
samtools view -c {output.bam}
- get familiar with FLAGs and their meanings (look at the appropriate
slide in the presentation, the documentation, p. 6-7, or this page)
- count the number of mapped alignments (flag 'unmapped' absent, i.e. the 3rd
bit - binary = 4, hexadecimal = 0x4 - is not set - '-F')
samtools view -F 0x04 -c {output.bam}
- count the number of unmapped alignments (flag 'unmapped' present - '-f')
samtools view -f 0x04 -c {output.bam}
- count the number of supplementary alignments (flag 'supplementary alignment'
present, i.e. the 12th bit - binary = 2048, hexadecimal = 0x800 - is set)
samtools view -f 0x800 -c {output.bam}
- count the number of mapped reads (alignments flagged neither as unmapped
(0x4) nor as supplementary (0x800))
samtools view -F 0x804 -c {output.bam}
- sort the BAM file
samtools sort {output.bam} {output_sorted}
for Linux:
samtools sort {output.bam} -o {output_sorted.bam}
- index the sorted BAM file - what file is created?
samtools index {output_sorted.bam}
- extract only mapped reads (reads must be flagged as mapped in a proper
pair (0x2)); '-b' makes the output BAM rather than SAM
samtools view -b -f 0x2 {output_sorted.bam} > {output_mapped.bam}
- sort the BAM file with the extracted reads by read name
samtools sort -n {output_mapped.bam} {output_mapped.sort}
for Linux:
samtools sort -n {output_mapped.bam} -o {output_mapped.sort.bam}
- extract the mapped reads as paired-end FASTQ using bedtools
bedtools bamtofastq -i {output_mapped.sort.bam} -fq {read1_mapped.fastq} -fq2 {read2_mapped.fastq}
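As a quick cross-check of the FLAG-based counts above, samtools can summarize
all the categories in one go; its output lists (among others) total, mapped,
properly paired and supplementary records, so it should agree with the
individual '-c' counts:
samtools flagstat {output.bam}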
6. View BAM file with Tablet
- open Tablet and click on ‘Open Assembly’
- select the sorted/indexed BAM file and an appropriate reference
- click on the contig in the table on the left side
7. Variant calling using SAMtools/BCFtools
- create 'pileup' for the sorted BAM file (u = uncompressed output)
samtools mpileup -uf {reference.fasta} {output_sorted.bam} > {output.pileup}
- call variants/genotypes using bcftools (v = variant sites only, c = SNP
calling, g = call genotypes) [this is the command for the old version; the
new version uses 'bcftools call' and different parameters, see the Linux
variant below]
bcftools view -vcg {output}.pileup > {output}.vcf
for Linux:
bcftools call --skip-variants indels --multiallelic-caller --variants-only -O v -o {output}.vcf {output}.pileup
- look at the resulting VCF file - how many variants were called? You can
check it with the following command (grep -v = prints all lines not containing
the pattern, "^#" = lines starting with #, wc -l = counts the number of lines)
grep -v "^#" {output}.vcf | wc -l
- look at the last column of the VCF - how many homozygotes (1/1) and
heterozygotes (0/1) are present? (homozygote = no variability within the
sample; heterozygote = variability within the sample)
grep "1/1" {output}.vcf | wc -l
grep "0/1" {output}.vcf | wc -l
- in the 6th column there are variant call qualities (QUAL); keep only
variants with a quality of at least 36 (awk deals with columns, i.e. if the
value in the 6th column ($6) is >= 36 it prints the whole line - $0) -
how many heterozygotes remained?
awk "{if ($6 >= 36) print $0}" {output}.vcf | grep "0/1" | wc -l
for Linux:
awk '{if ($6 >= 36) print $0}' {output}.vcf | grep "0/1" | wc -l
- keep only variants covered by at least 15 reads - DP in the INFO column
is the read depth (awk -F"=|;" sets the field separator to both '=' and ';',
which makes the DP value the 2nd column)
awk -F"=|;" "{if ($2 >= 15) print $0}" {output}.vcf | grep "0/1" | wc -l
for Linux:
awk -F"=|;" '{if ($2 >= 15) print $0}'
{output}.vcf | grep "0/1" | wc -l
- how many heterozygous sites are left?
- however, for reliable variant filtering it is recommended to use a
dedicated tool such as VCFtools (but it doesn't run under Windows...)
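With a recent BCFtools (version 1.x), the pileup and calling steps above can
be written as a single pipeline; a sketch of the modern equivalent (-m =
multiallelic caller, -v = variant sites only, -Ov = uncompressed VCF output):
bcftools mpileup -f {reference.fasta} {output_sorted.bam} | bcftools call -mv -Ov -o {output}.vcf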
Tasks B
(to work with the Ensete dataset)
Full set of commands can be downloaded
here (for Windows) or
here (for Linux).
1. Downloading data from SRA (Sequence Read Archive) - this
is optional, you can work with training dataset B instead
- download the paired-end Illumina reads for Ensete superbum using
the following command. This sample has the SRR (= sequence read run) number
SRR7058210
fastq-dump --split-3 --gzip SRR7058210
- run this command at the beginning of the course because the download can
take a while
- if you do not succeed, use training dataset B, it contains the same data
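Newer releases of the SRA toolkit also provide prefetch and fasterq-dump,
which are usually faster (note that fasterq-dump does not gzip its output,
so compress the files afterwards); a sketch:
prefetch SRR7058210
fasterq-dump --split-3 SRR7058210
gzip SRR7058210_1.fastq SRR7058210_2.fastq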
2. Repeat the steps from Tasks A for Ensete
- trim the sequences with Trimmomatic, remove duplicates with fastuniq
- count the number of reads in the FASTQ files before and after quality
trimming and after duplicate removal
- map the reads to the reference, use 'HF677508_Musa_acuminata_cpDNA.fas' as
the reference
- count the following numbers of reads in the resulting BAM file
- number of alignments
- number of mapped alignments
- number of unmapped alignments
- number of supplementary alignments
- number of mapped reads
- sort and index the BAM file
- extract the mapped reads as FASTQ files - how many pairs were extracted?
- submit the counts you obtained using the Google Form
3. De novo assembly with Velvet
- run velveth on the obtained cpDNA reads with k-mer = 101 (if you were unable
to generate the FASTQ files with chloroplast reads you can download them
here)
velveth {outputdir} {k-mer} -fastq -shortPaired -separate {read1_mapped.fastq.gz} {read2_mapped.fastq.gz}
- run velvetg with the same output directory name (cov_cutoff = minimum
coverage limit, min_contig_lgth = minimum contig length, exp_cov = expected
coverage)
velvetg {outputdir} -cov_cutoff auto -min_contig_lgth 1000 -exp_cov auto
- check the number of contigs (nodes), N50, maximum length and total length;
also check the statistics in 'stats.txt'
- experiment with different k-mer values (both higher and lower) - the k-mer
must be an odd number (to avoid palindromes) and smaller than the read
length - and rerun the analysis; for velvetg you can also lower the coverage
cutoff (a loop for running several k-mer values follows this task)
- submit the values you obtained using this
Google Form
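On Linux, a small bash loop can run the whole k-mer series in one go (a
sketch; the assembly_$k directory names are arbitrary assumptions):
for k in 81 91 101 111 121; do
  velveth assembly_$k $k -fastq -shortPaired -separate {read1_mapped.fastq.gz} {read2_mapped.fastq.gz}
  velvetg assembly_$k -cov_cutoff auto -min_contig_lgth 1000 -exp_cov auto
done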
4. Comparing assemblies using Quast
- create a new folder and name it, e.g., 'comparison'
- copy the 'contigs.fa' files from the folders created by the different Velvet
runs and rename them (one by one) to, e.g., 91.fa or 121.fa, where the name
corresponds to the k-mer (command-line equivalents follow this task)
- go to the Quast webserver and upload all the assembly files
- fill in something as the 'Caption' and click on 'Evaluate'; refresh the
webpage after some time
- click on the particular report on the right side to see the summary
- observe the summary results and all three available plots (Cumulative
length, Nx, GC content) - what do the plots show? Which k-mer setting
gave the best assembly?
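The copying and renaming for Quast can also be done on the command line (a
sketch, assuming the assembly_{k} directories from the loop above):
mkdir comparison
cp assembly_91/contigs.fa comparison/91.fa
cp assembly_101/contigs.fa comparison/101.fa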
5. Check the assembly in Geneious
- import the reference as well as 'contigs.fa' from your best assembly into
a new folder in Geneious
- select all sequences and do Tools -> Align/Assemble -> Map to Reference
- select the correct reference and change sensitivity to 'High
Sensitivity/Medium'
- in the resulting file you should see how de novo contigs matched the
reference
6. Annotating the resulting contigs with GeSeq
- go to GeSeq and upload the contigs assembled by Velvet
- set 'Linear' and then click on 'ADD NCBI RefSeq(s)'
- type the name of the reference (e.g. Musa) and select one
- click 'OK' and then 'Submit'
- wait until the job is finished and click on 'OGDRAW' at a particular contig
to see how it was annotated
Good luck and thank you for joining the practical course...