Exercise 2: Data conversion/translation

One inconvenience of having a number of different DNA sequence analysis packages available is that they use different formats for storing more-or-less the same information. Further, most packages refuse to accept even the simplest files from one another. And finally, the DNA sequence databases - EMBL, Genbank and DDBJ - each has its own sequence file format.

<br /> Readseq is a program for sequence conversion, it was written originally around 1989 as a component of a sequence analysis program, it took on a life of its own as a conversion program for bioinformatics. Readseq is particularly useful as it automatically detects many sequence formats, and interconverts among them. Most sequence file formats allow more than one sequence per file, but some do not.

*Sequence formats that allow one or more sequences:*
* <literal>GenBank/GB, ge</literal>nbank flatfile format
* EMBL, EMBL flatfile format
* DNAStrider, for common Mac program
* Pearson/Fasta, a common format used by Fasta programs and others
* Phylip3.2, sequential format for Phylip programs

*Sequence formats that only allow one sequence.*
* GCG, single sequence format of GCG software
* Plain/Raw, sequence data only (no name, document, numbering)

* Have a look at <a rel="nofollow" target="_blank" href="http://emboss.sourceforge.net/docs/themes/SequenceFormats.html" title="http://emboss.sourceforge.net/docs/themes/SequenceFormats.html">Sequence formats</a>
* Now open the <a rel="nofollow" target="_blank" href="https://www.ebi.ac.uk/Tools/sfc/readseq/" title="https://www.ebi.ac.uk/Tools/sfc/readseq/">Readseq website</a>
* The file NM_004014.txt (Right-click > open in new window) contains a sequence in GCG format (Dystrophin transcript variant Dp116).
* Copy and paste the sequence, select "Fasta (Pearson)" as the output format, View in browser and press the submit button.
* Change the output format and look at the different file formats
* The file Sequences.txt (Right-click > open in new window) contains more Dystrophin protein sequences in Fasta format.
* Convert to GCG format (View in browser)
* *What happens if you convert the sequences in GCG format back to fasta format?*
* *What would you have do to overcome this error?*

Exercise 4: Retrieve the DNA sequence for HBB, mask repeats, and use Primer3 for picking primers to screen patients for the HBS mutation

  • Go to the GeneCards database and retrieve the genecard for the HBB gene. Genecards is a cross-reference database with links to major information sources. Examine the HBB card. What kind of information is available?

The UCSC genome browser contains the DNA sequence of more than ten organisms, including human and mouse. Besides the sequence it contains several annotation tracks, like repetitive sequences, genes and conservation. It is also possible to send a custom made track to the browser.
  • Follow the link from the GeneCards database to the "Genomic location: UCSC golden path with GeneCards custom track". The "UCSC genes" track is a curated track where information of UniProt, RefSeq and mRNA sequences from GenBank are combined. Click on this track to expand it. Compare the GeneCards annotation to the "Known genes" annotation. Is the GeneCards annotation of HBB correct?
  • Click on "HBB" in the "UCSC genes" track. This leads to a summary about the HBB gene. Check which information is available here. Follow the "Genome browser" link (in the box: Sequence and Links to Tools and Databases) The boundaries of the browser are now reset to the boundaries of the HBB gene according to the Known genes track. The thick blocks indicate exons of the particular gene, the thin lines indicate introns. The arrows give information about the orientation. >>> means that the gene is located on the positive strand (read from p to q arm of the chromosome), <<< means that the sequence is located on the opposite (reverse complement) strand.
  • To pick primers for HBB the DNA sequence is needed as input for the Primer3 program. This sequence can be downloaded from the UCSC database. Select "View", "DNA" in the top menu. Add 100 bp extra up- and downstream of the gene to broaden your search for primer sequences. Make sure you'll get the "reverse complement". Leave the other options unchanged and click on "get DNA". Save the sequence on the desktop. You can copy/paste the sequence in a new text file or choose File->Save as... in the menu of the webbrowser.
  • Before the sequence can be analysed by Primer3, repetitive sequences have to be removed. This can be done with the program RepeatMasker. Upload the DNA sequence (from the desktop) to the RepeatMasker program. Select "html" as "return format" and leave the other options unchanged. Why do you need to mask repeats? Which repeats are present in the sequence and how much is masked by RepeatMasker?
  • Copy/paste the Masked Sequence into Primer3. Keep the default settings and choose "Pick primers" Does the primerset cover the region of interest (i.e. the region with the mutation)?

Exercise 5: Pick primers using UCSC

Pick primers for the first exon of HBB / HBS
  • Go back to the UCSC genome browser and search for HBB. Select HBB from the Known Genes track and click on HBB in the browser itself. This time retrieve the DNA sequence by selecting "Genomic Sequence (chr11:5,246,696-5,248,301)" from the "Sequence and Links to Tools and Databases" box. Uncheck the "Introns" box and select "One fasta record per region" including 100 bp up- and downstream. Select "Mask repeats" to N. This option does the same as the RepeatMasker program.
  • Copy the sequence of the first exon into Primer3. Does the resulting primerset include the region with the mutation?
  • Check if the primerset is unique. Go to the UCSC website and start the "In-silico PCR" (Tools > In-silico PCR). Choose the most recent human genome assembly and copy/paste the forward and reverse primer from the previous exercise. Follow the link to go to the graphical result and inspect if the primerset contains the correct region.
  • Inspect the dbSNP track (one of the lower tracks in the genome browser. If you don't see the dbSNP track activate it in the menu below the graph (Variation, Common SNPs). What is the rs-id of the variant corresponding with the HBS phenotype? Click on the rs-id in the genome browser, and then once more on the rs-id to go to the dbSNP database and get more information about the variant.

Exercise 6: Similarity search with BLAST

When considering a new DNA sequence, the first question will probably be: "What do I have in my sequence?" which relates to the question "Is my sequence similar to previously known sequences?" To find out, you may compare your sequence with the public databases, using one or more computer programs that have been written for searching the databases for similarities to a given query sequence. The fastest way to search for similar sequences in the databases is BLAST (Basic Local Alignment Search Tool). Blast is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. Blast has several variants:
Sequence type nucleotide database amino acid database
nucleotide query blastn/tblastx blastx
amino acid query tblastn blastp
  • blastn is good for finding nucleotide sequences similar to yours.
  • blastp is good for finding amino acid sequences similar to yours.
  • blastx is good for finding amino acid sequences similar to any translation of your nucleotide sequence - e.g. if you could not recognise an ORF.
  • tblastn is good for finding nucleotide sequences that can be translated into something similar to your amino acid sequence - e.g. unannotated pseudogenes.
  • tblastx is good for keeping computers busy (or for very specific applications).

The BLAST 'Search' box accepts a number of different types of input and automatically determines the format. Accepted input types are: FASTA file, bare sequence or database identifiers (e.g. genbank i.d. M18533, Swissprot i.d. P11532)
  • Open BLAST en go to blastp
  • Copy-paste Virtual.prot (Right-click > open in new window, this sequence is translated from an mRNA sequence) in the sequence input field, leave the other options unchanged and press BLAST
  • Which sequence in the database is most similar to the one we have submitted? Are the sequences identical?

A Graphical overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, with the most similar hits uppermost and appearing in red. Pink, green, blue and black bars follow, representing proteins in decreasing order of similarity. Hatched areas would indicate a gap in similarity i.e., two or more distinct regions of similarity were found within the same sequence hit. Detached bars on the same line correspond to unrelated hits. Mousing over a hit sequence causes the definition and score to be shown in the window at the top, clicking on a hit sequence takes the user to the associated alignments.
  • On the Blast result page click on Taxonomy reports and look at the hits for different organisms
  • Go back to the Blast result page. You can access different databases by clicking on the accession code. Explore some links.
  • Find the original protein, transcript variant Dp116 (NP_004005). From which gene does this protein originate? Find the RefSeq accession code for the corresponding mRNA sequence.
  • In the database record for NM_004014, explore the links in the menu on the right. You can do further analysis with this sequence and link out to other databases.
  • "LinkOut" to the Genome browser, explore DMD. You can switch annotation tracks on and off below. Select "Common SNPs" in the "Variation" section to explore the known SNPs.

Exercise 7 Pairwise sequence alignment

Pairwise sequence alignment methods are concerned, in contrast to BLAST, with finding the best-matching piecewise local or global alignments of protein and DNA sequences.

Local alignment (Smith-Waterman algorithm)

Local alignment methods find related regions within sequences - in other words they can consist of a subset of the characters within each sequence (e.g. positions 20-40 of sequence A might align with positions 50-70 of sequence B).This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related.

Global alignment (Needleman-Wunsch algorithm)

A global alignment between two sequences is an alignment in which all of the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences.


The EMBOSS-Align tool contains two programs each using a different algorithm:
  • When you want an alignment that covers the whole length of both sequences, use the needle program
  • When you are trying to find the best region of similarity between two sequences, use the water program

Open EMBOSS or EMBOSS-Explorer

Note that matching amino acids are connected with a "|" symbol. Mismatches would be connected with a space. A gap would be represented with a "-" symbol. Similar amino acids (e.g. threonine vs methionine) are connected via a "." symbol. Thus a sequence alignment can be represented in the format...
 |    ||||  ..      

Exercise 8: GeneCards database for finding information about a gene

In agreement with evolutionary principles, scientific research to date has shown that all genes share common elements. For many genetic elements, it has been possible to construct consensus sequences, those sequences best representing the norm for a given class of organisms (e.g, bacteria, eukaroytes). Common genetic elements include promoters, enhancers, polyadenylation signal sequences and protein binding sites.

A mRNA can be divided into three parts: a 5′ untranslated region (5′ UTR), the polypeptide coding region, sometimes called the open reading frame (ORF), and the 3′ translated region (3′ UTR).

The first codon in a messenger RNA sequence is almost always AUG. While this reduces the number of candidate codons, the reading frame of the sequence must also be taken into consideration.

There are six reading frames possible for a given DNA sequence, three on each strand, that must be considered, unless further information is available. Since genes are transcribed away from their promoters, the definitive location of this element can reduce the number of possible frames to three. The location of the appropriate start codon will include a frame in which there are no apparent abrupt stop codons. Incorrect reading frames usually predict relatively short peptide sequences. Therefore, it might seem deceptively simple to ascertain the correct frame. In bacteria, such is frequently the case. However, eukaryotes add a new obstacle to this process, introns.

Intron/exon splice sites can be predicted on the basis of their common features. Most introns begin with the nucleotides GT and end with the nucleotides AG.


  • Have a look at the organisation of introns/exons on the DMD homepage

GeneCards is a database of human genes, their products and their involvement in diseases and there are several ways to search:
  • Go to the GeneCards website
    • Select disease genes and select the GeneCard for gene DMD.
      • This card shows all information available on the gene, with links to corresponding databases, e.g.
      • Disorders/Diseases:
        • Symbol Name DMD, explore the information
        • Search OMIM for DMD


Exercise 4: Pick primers to screen patients for the HBS mutation

Is the GeneCards annotation of HBB correct?

No, it is longer than the curated "UCSC genes" annotation. When you inspect the human mRNAs and the Ensemble gene predictions you see that there is one mRNA sequence that covers the longer region. This is either a real antisense transcript, a transcript resulting from a mutation, or just an artifact. If you really want to know you can click on the mRNA to get more information about that particular sequence, but it is out of scope of these exercises.

Why do you need to mask repeats?

You need unique primers for sequencing. If you pick primers in a repetitive region your PCR products are poluted with sequences from elsewhere in the genome.

Which repeats are present in the sequence and how much is masked by RepeatMasker?

96 Base pairs are masked. It is a LINE1 repeat.

Does the primerset cover the region of interest (i.e. the region with the mutation)?

No, the primer set is located at the end of the sequence, while our region of interest is at the start of the sequence, i.e. the 6th codon of the HBB gene.

Exercise 6: Similarity search with BLAST

Which sequence in the database is most similar to the one we have submitted? Are the sequences identical?

It is most similar to "PREDICTED: dystrophin isoform X10 [Macaca fascicularis]". The sequences are 99% the same (3 mismatching amino acids). You can see this information in the overview right below the image. The top sequence is most similar. Clicking on the name gives you more details about the sequence alignment. Note that there is a second best hit which is also very close to the query sequence: "PREDICTED: dystrophin isoform X5 [Chinchilla lanigera]".

Find the RefSeq accession code for the corresponding mRNA sequence

You will find the mRNA RefSeq accession code when you go via the accession code in the Blast report (NP_004005.1) to the protein database. It is accession code NM_004014. The RefSeq database contains curated sequences on DNA, RNA and protein level. The accession codes of DNA sequences start with NC_, of RNA sequences with NM_, and the protein sequences start with NP_.

Exercise 7 Pairwise sequence alignment

Which alignment is better?

Only a part of the sequences is conserved. The Needle program (global alignment) tries to align the sequences over their entire length and therefore introduces too many gaps. The Water program (local alignment) only focuses on the region with high similarity and therefore does a better job in finding the optimal alignment.

Exercise 8: GeneCards database for finding information about a gene

Topic revision: r1 - 26 Feb 2015, BarberaVanSchaik

http://wiki.bioinformaticslaboratory.nl/foswiki/bin/view/BioLab/WebHome Search
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback