Analysis of DNA sequences

Molecular markers - computer practicals


Pericallis

Presentation is here.

Training dataset is here. It consists of ten samples from the genus Pericallis sequenced for the trnL-trnF chloroplast region.
The sequences we generated during the lab part of the course will be sent to you by e-mail.

A ZIP file with the necessary software is here. Download it, unzip it, and install the programs marked 'installation necessary':
FinchTV - ABI format sequence viewer, installation necessary
ClustalX - sequence alignment software, installation necessary
BioEdit - simple alignment viewer/editor with many other functions, installation necessary
RC - reverse-complement maker, just run the exe file
SeqState - insertion/deletion coder, just run the jar file (requires Java)
Gblocks - alignment trimming, just run the exe file
trimAl - alignment trimming, command line use
TCS - statistical parsimony network, just run the jar file (requires Java)

The software Geneious should be downloaded here (14-day free trial version, registration necessary). For later use, our faculty has a network licence (ask me how to set it up).

Other software recommended to install

Tasks (to work with Pericallis sequences)

1. Look at the sequences in AB1 format using FinchTV

open the file and scroll through the whole sequence - What is the last position you would consider reliable and why?
look at the chromatogram info ('i' icon) - What is the instrument name? When was the sample analyzed?
look at the raw data (View - Raw Data) - What is the maximum value on the vertical axis (these are rfu - relative fluorescence units)? Are there any changes towards the end of the sequence?
BLAST the sequence against the on-line NCBI database (Edit - BLAST sequence) - What is the best hit? How similar is your sequence to it?
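
One way to make the 'last reliable position' question less subjective is to look at per-base quality values (Phred scores) in addition to the shape of the peaks. The sketch below is only an illustration, assuming the quality values have already been exported as a plain list of integers. The function name, the threshold of Q20 and the required run of 10 good bases are illustrative assumptions, not settings taken from FinchTV.

```python
def last_reliable_position(qualities, min_q=20, min_run=10):
    """Return the 1-based position of the last base that ends a run of at
    least `min_run` consecutive bases with quality >= `min_q` (0 if none)."""
    run = 0    # length of the current run of good bases
    last = 0   # last position found so far that closes a long enough run
    for pos, q in enumerate(qualities, start=1):
        run = run + 1 if q >= min_q else 0
        if run >= min_run:
            last = pos
    return last


if __name__ == "__main__":
    # Hypothetical read: 50 good bases followed by 20 poor ones.
    example = [40] * 50 + [10] * 20
    print(last_reliable_position(example))
```

Visual inspection of the chromatogram remains the final check; a rule like this only suggests where to look.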

2. Prepare contigs using either Geneious or SeqMan

align the sequences, trim the unreliable ends - How long is the overlap in the middle?
do the necessary edits (delete gaps, put Ns where the base call is not reliable) and copy the resulting contig to a text editor, e.g. Notepad++
submit the sequence using the Google Form (we will divide the work among us: everybody submits 1-2 sequences, and all collected sequences appear here)
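
Assembly programs usually reverse-complement one of the two reads automatically before looking for the overlap. The RC tool from the software list does the same thing by hand. The following minimal sketch shows the operation itself, assuming the two reads of a sample come from opposite strands. The function name is illustrative and the table covers the standard IUPAC ambiguity codes.

```python
# Translation table for IUPAC nucleotide codes (upper and lower case).
COMPLEMENT = str.maketrans(
    "ACGTRYKMSWBDHVNacgtrykmswbdhvn",
    "TGCAYRMKSWVHDBNtgcayrmkswvhdbn",
)


def reverse_complement(seq):
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCN"))
```

Characters that are not in the table (for example gaps) are left unchanged and only moved by the reversal.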

3. Align sequences using standalone ClustalX and the on-line MAFFT service

align sequences in FASTA format using ClustalX - What files are created? Can you also create other alignment formats?
align sequences using MAFFT - Did you get the same alignment?
save alignments as FASTA (gapped FASTA, i.e., aligned sequences)
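
To answer the 'same alignment?' question without comparing the files by eye, the two gapped FASTA files can be compared sequence by sequence. The sketch below is a minimal example, assuming both files use identical sequence names. The function names are illustrative and the tiny alignments in the example are made up.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; gapped (aligned) FASTA works too."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts).upper() for n, parts in records.items()}


def differing_sequences(aln_a, aln_b):
    """Names whose aligned sequence differs (or is missing) between two alignments."""
    names = set(aln_a) | set(aln_b)
    return sorted(n for n in names if aln_a.get(n) != aln_b.get(n))


if __name__ == "__main__":
    clustal = parse_fasta(">s1\nAC-GT\n>s2\nACGGT\n")
    mafft = parse_fasta(">s1\nACG-T\n>s2\nACGGT\n")
    print(differing_sequences(clustal, mafft))
```

For real files, read the content first, e.g. with open("alignment.fas").read(). An empty result means the two programs placed every gap in the same position.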

4. Trim the alignment using Gblocks / trimAl

open Gblocks and trim the alignment using default settings - How many positions were eliminated and why?
copy your alignment to the trimAl/bin folder, open a command line there (e.g., using Total Commander) and run the following commands (the text after # is only an explanatory comment and need not be typed):
- .\trimal -in alignment.fas -out out_nogaps.fas -htmlout out_nogaps.html -nogaps #removes all columns with at least one gap
- .\trimal -in alignment.fas -out out_gappyout.fas -htmlout out_gappyout.html -gappyout #removes gappy regions
- .\trimal -in alignment.fas -out out_strict.fas -htmlout out_strict.html -strict #more strict filtering
check the output files (in the html file you can see what was removed) - How many positions are retained in each of these three trimming cases?
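
The idea behind the -nogaps option can be written down in a few lines. The sketch below is not trimAl's code, only a minimal illustration of 'remove every column with at least one gap', assuming the alignment is already loaded as a dictionary of equally long sequences. The -gappyout and -strict options rely on automated heuristics that are not reproduced here.

```python
def remove_gap_columns(alignment, gap_chars="-"):
    """Drop every alignment column in which at least one sequence has a gap."""
    if not alignment:
        return {}
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # Indices of columns where no sequence carries a gap character.
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] not in gap_chars for s in seqs)]
    return {n: "".join(s[i] for i in keep) for n, s in zip(names, seqs)}


if __name__ == "__main__":
    toy = {"a": "AC-GT", "b": "ACGGT", "c": "A-GGT"}  # made-up alignment
    trimmed = remove_gap_columns(toy)
    print(trimmed, len(next(iter(trimmed.values()))))
```

The length of any trimmed sequence is the number of retained positions, which can be compared with the numbers reported in the html output.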

5. Convert FASTA files to PHYLIP and NEXUS format

use an on-line service, e.g. EMBOSS Seqret or Format Converter
or use the command-line option in the trimAl/bin folder
- .\readal -in out_gappyout.fas -out out_gappyout.phy -phylip3.2
- .\readal -in out_gappyout.fas -out out_gappyout.nex -nexus
What are the advantages of these formats?
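
To see what the converters actually change, it helps to write the same alignment in both formats by hand. The sketch below produces a sequential, relaxed PHYLIP file and a simple NEXUS DATA block. It assumes sequence names without whitespace and equally long sequences, and its layout is not guaranteed to match the output of readal or the on-line services character for character. The function names are illustrative.

```python
def _name_width(alignment):
    """Column width for names: longest name plus two spaces."""
    return max(len(name) for name in alignment) + 2


def to_phylip(alignment):
    """Sequential relaxed PHYLIP: a header with counts, then 'name sequence' lines."""
    n_chars = len(next(iter(alignment.values())))
    width = _name_width(alignment)
    lines = [f" {len(alignment)} {n_chars}"]
    lines += [name.ljust(width) + seq for name, seq in alignment.items()]
    return "\n".join(lines) + "\n"


def to_nexus(alignment):
    """Minimal NEXUS file with one DATA block for DNA."""
    n_chars = len(next(iter(alignment.values())))
    width = _name_width(alignment)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(alignment)} NCHAR={n_chars};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    lines += ["  " + name.ljust(width) + seq for name, seq in alignment.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"s1": "AC-GT", "s2": "ACGGT"}  # made-up alignment
    print(to_phylip(toy))
    print(to_nexus(toy))
```

Both formats state the number of sequences and the alignment length explicitly, which plain FASTA does not.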

6. Simple indel coding (SIC) using SeqState

run SeqState, open the alignment and select IndelCoder and desired type of indel coding
open the resulting file - What format is it? How many indels were coded and how?
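
The logic of simple indel coding can be illustrated with a short script. This is a simplified sketch and not SeqState's implementation. It assumes that every internal gap with a distinct start and end becomes one presence/absence character, that a sequence with a longer gap covering the coded one is scored '?', and that leading and trailing gaps are treated as missing data. All names are illustrative.

```python
import re


def internal_gaps(seq):
    """Return (first, last, gaps): bounds of the non-terminal region and the
    internal gap runs as (start, end) pairs, 0-based with end exclusive."""
    first = len(seq) - len(seq.lstrip("-"))
    last = len(seq.rstrip("-"))
    gaps = [(m.start() + first, m.end() + first)
            for m in re.finditer(r"-+", seq[first:last])]
    return first, last, gaps


def simple_indel_coding(alignment):
    """Code each distinct internal gap as one character: 1, 0 or ?."""
    info = {name: internal_gaps(seq) for name, seq in alignment.items()}
    indels = sorted({g for _, _, gaps in info.values() for g in gaps})
    coded = {}
    for name, (first, last, gaps) in info.items():
        states = []
        for start, end in indels:
            if (start, end) in gaps:
                states.append("1")            # exactly this gap is present
            elif start < first or end > last:
                states.append("?")            # region lies in terminal missing data
            elif any(s <= start and e >= end for s, e in gaps):
                states.append("?")            # covered by a longer gap
            else:
                states.append("0")            # this gap is absent
        coded[name] = "".join(states)
    return indels, coded


if __name__ == "__main__":
    toy = {"a": "ACGTACGT", "b": "AC--ACGT", "c": "A----CGT", "d": "--GTACGT"}
    print(simple_indel_coding(toy))
```

Comparing such a hand-made matrix with the block that SeqState appends to the alignment is a quick way to check how the program treated overlapping gaps.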

7. Working with FaBox (on-line fasta sequence toolbox)

trim the alignment to the shortest sequence using 'Alignment trimmer' - What does it do? What is the length of the resulting alignment?
extract variable sites only using 'Show variable sites only' - How many variable sites are there?
generate an input file for TCS using 'Create TCS input file from fasta (fasta2tcs)' - Open the resulting file in, e.g., Notepad++ and check the 'end-of-line' characters. Are they different from those in the input file? How?
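
Both checks in this task can also be done with a short script. The sketch below lists the variable alignment columns and counts the line-ending styles in a file. It assumes that gaps, N and ? are ignored when deciding whether a column is variable, which may differ from what FaBox does. The function names are illustrative.

```python
def variable_sites(alignment, ignore="-N?"):
    """Return 1-based positions of columns with more than one nucleotide state."""
    seqs = [s.upper() for s in alignment.values()]
    sites = []
    for i in range(len(seqs[0])):
        states = {s[i] for s in seqs} - set(ignore)
        if len(states) > 1:
            sites.append(i + 1)
    return sites


def line_endings(data):
    """Count Windows (CRLF), Unix (LF) and old Mac (CR) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }


if __name__ == "__main__":
    toy = {"a": "ACGT-A", "b": "ACCTAA", "c": "ATGTAA"}  # made-up alignment
    print(variable_sites(toy))
    print(line_endings(b"line1\r\nline2\nline3\r"))
```

For a real file, read the bytes with open(path, "rb").read() so that Python does not convert the line endings before they are counted.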

8. Generating statistical parsimony network using TCS

open the JAR file, start a new analysis, select the PHYLIP file generated in the previous step, and change gaps to missing
look at the resulting graph - What are the small dots between haplotypes? What do you think the network shows?
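
Before interpreting the graph, it can help to see the raw numbers a haplotype network is built from. The sketch below does not build a statistical parsimony network and is not what TCS runs internally. It only groups identical sequences into haplotypes and counts the differences between each pair, assuming gaps, N and ? are treated as missing. Sequences that differ only by missing data end up as separate haplotypes here, which TCS may handle differently. All names are illustrative.

```python
from itertools import combinations


def collapse_haplotypes(alignment):
    """Group sample names by identical aligned sequence."""
    groups = {}
    for name, seq in alignment.items():
        groups.setdefault(seq.upper(), []).append(name)
    return groups


def count_differences(seq_a, seq_b, missing="-N?"):
    """Number of columns where both sequences have a base and the bases differ."""
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in missing and b not in missing
    )


def pairwise_steps(alignment):
    """Differences between every pair of haplotypes, keyed by haplotype index."""
    haps = list(collapse_haplotypes(alignment))
    return {
        (i, j): count_differences(haps[i], haps[j])
        for i, j in combinations(range(len(haps)), 2)
    }


if __name__ == "__main__":
    toy = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGA", "s4": "TC-A"}  # made-up data
    print(collapse_haplotypes(toy))
    print(pairwise_steps(toy))
```

Comparing these pairwise counts with the number of steps drawn between the same haplotypes in the TCS graph is a useful sanity check.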

Good luck and thank you for joining the practical course...