# Bioinformatics Questions and Answers – Statistical Methods for Aiding Alignment

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Statistical Methods for Aiding Alignment”.

1. The Expectation Maximization algorithm has been used to identify conserved domains in unaligned proteins only.
a) True
b) False

Explanation: This algorithm has been used to identify both conserved domains in unaligned proteins and protein-binding sites in unaligned DNA sequences (Lawrence and Reilly 1990), including sites that may include gaps (Cardon and Stormo 1992). The input is a set of sequences that are expected to share a common sequence pattern, which may not be easily recognizable by eye.

2. Which of the following is untrue regarding the Expectation Maximization algorithm?
a) An initial guess is made as to the location and size of the site of interest in each of the sequences, and these parts of the sequence are aligned
b) The alignment provides an estimate of the base or amino acid composition of each column in the site
c) The column-by-column composition of the site already available is used to estimate the probability of finding the site at any position in each of the sequences
d) The row-by-column composition of the site already available is used to estimate the probability

Explanation: The EM algorithm consists of two steps, which are repeated in alternation. In step 1, the expectation step, the column-by-column (not row-by-column) composition of the site already available is used to estimate the probability of finding the site at any position in each of the sequences. These probabilities are used in turn to provide new information as to the expected base or amino acid distribution for each column in the site.

3. Of the two repeated steps in the EM algorithm, step 2 is ________
a) the maximization step
b) the minimization step
c) the optimization step
d) the normalization step

Explanation: In step 2, the maximization step, the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set. Step 1 is then repeated using these new counts. The cycle is repeated until the algorithm converges on a solution and does not change with further cycles. At that time, the best location of the site in each sequence and the best estimate of the residue composition of each column in the site will be available.
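The two-step cycle can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any published program: it assumes exactly one site per sequence, a fixed background taken from the overall base composition, and a small pseudocount to avoid zero frequencies. The names `em_motif`, `init_starts`, and `pseudo` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def em_motif(seqs, width, init_starts, n_iter=50, pseudo=0.5, tol=1e-6):
    """Sketch of EM motif finding, assuming one site per sequence."""
    n_pos = [len(s) - width + 1 for s in seqs]
    total = sum(len(s) for s in seqs)
    # Assumption: background is the overall base composition, held fixed.
    background = {b: sum(s.count(b) for s in seqs) / total for b in ALPHABET}

    def column_model(weights):
        # Weighted base counts per column, converted to fractions.
        counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
        for seq, w in zip(seqs, weights):
            for start, p in enumerate(w):
                if p > 0.0:
                    for col in range(width):
                        counts[col][seq[start + col]] += p
        return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

    # Initial guess: all weight on the guessed start in each sequence.
    weights = [[1.0 if s == g else 0.0 for s in range(n)]
               for n, g in zip(n_pos, init_starts)]
    model = column_model(weights)

    for _ in range(n_iter):
        # Step 1 (expectation): probability of the site at each position.
        new_weights = []
        for seq, n in zip(seqs, n_pos):
            odds = [math.prod(model[c][seq[s + c]] / background[seq[s + c]]
                              for c in range(width)) for s in range(n)]
            z = sum(odds)
            new_weights.append([o / z for o in odds])
        # Step 2 (maximization): replace the column counts with the new ones.
        model = column_model(new_weights)
        change = max(abs(a - b) for w_old, w_new in zip(weights, new_weights)
                     for a, b in zip(w_old, w_new))
        weights = new_weights
        if change < tol:  # no further change: converged
            break

    starts = [max(range(len(w)), key=w.__getitem__) for w in weights]
    return starts, model

# Toy data: the 6-residue pattern TTGACA is planted in each sequence.
seqs = ["ACGTACTTGACAGTCA", "TTGACACGTAGCATGC",
        "GCATCGATTGACAACG", "CATGTTGACAGGCTAC"]
starts, model = em_motif(seqs, width=6, init_starts=[5, 0, 7, 4])
print(starts)
```

In this toy run the first guess is deliberately off by one position, so the expectation step has something to correct. Like any EM procedure, the sketch can converge to a local optimum when the initial guess is poor.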

4. As an example of the EM algorithm, suppose that there are 10 DNA sequences having very little similarity with each other, each about 100 nucleotides long and thought to contain a binding site near the middle 20 residues, based on biochemical and genetic evidence. The EM algorithm would then be used to find the most probable location of the binding site in each of the ______ sequences.
a) 30
b) 10
c) 25
d) 20

Explanation: In the EM program MEME, the size and number of binding sites, their location in each sequence, and whether or not a site is present in each sequence do not necessarily have to be known in advance. For the present example, the EM algorithm is used to find the most probable location of the binding site in each of the 10 sequences.

5. In the initial step of the EM algorithm, the 20-residue-long binding motif patterns in each sequence are aligned as an initial guess of the motif.
a) True
b) False

Explanation: The base composition of each column in the aligned patterns is then determined. The composition of the flanking sequence on each side of the site provides the surrounding base or amino acid composition for comparison. Each sequence is assumed to be the same length and to be aligned by the ends.

6. In the intermediate steps of the EM algorithm, the number of each base in each column is determined and then converted to fractions.
a) True
b) False

Explanation: For example, if there are four Gs in the first column of the 10 sequences, then the frequency of G in the first column of the site is fSG = 4/10 = 0.4. This procedure is repeated for each base and each column.
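The count-to-fraction conversion can be illustrated with a short Python sketch. The function name `column_frequencies` and the ten 3-residue sites are made up for illustration; four of the sites begin with G, matching the 4/10 example.

```python
from collections import Counter

def column_frequencies(aligned_sites, alphabet="ACGT"):
    """Return one dict per column mapping each base to its fraction."""
    n = len(aligned_sites)
    width = len(aligned_sites[0])
    freqs = []
    for col in range(width):
        counts = Counter(site[col] for site in aligned_sites)
        freqs.append({base: counts[base] / n for base in alphabet})
    return freqs

# Hypothetical aligned sites from 10 sequences; four start with G.
sites = ["GAT", "GCT", "GTT", "GAA", "AAT",
         "CAT", "TAT", "AAT", "CCT", "TAT"]
freqs = column_frequencies(sites)
print(freqs[0]["G"])  # 0.4
```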

7. For the example of 10 DNA sequences, each 100 residues long, there are _______ possible starting sites for a 20-residue-long site in each sequence.
a) 30
b) 21
c) 81
d) 60

Explanation: For each 100-residue sequence in the example, there are 100 - 20 + 1 = 81 possible starting sites for a 20-residue-long site. The first site begins at position 1 and ends at position 20, and the last begins at position 81 and ends at position 100 (there is not enough sequence for a 20-residue-long site beyond position 81).
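The window count can be checked with a small Python sketch. The helper name `site_windows` is hypothetical; it enumerates every 1-based start and end position at which a full-length site fits in a sequence of 100 residues.

```python
def site_windows(seq_len, site_len):
    """Return 1-based (start, end) pairs for every full-length window."""
    return [(s, s + site_len - 1) for s in range(1, seq_len - site_len + 2)]

windows = site_windows(100, 20)
print(len(windows), windows[0], windows[-1])  # 81 (1, 20) (81, 100)
```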

8. An alternative method is to produce an odds scoring matrix calculated by dividing each base frequency by the background frequency of that base.
a) True
b) False

Explanation: In this method, the probability of each location is then found by multiplying the odds scores from each column. An even simpler method is to use log odds scores in the matrix. The column scores are then simply added. In this case, the log odds scores must be converted to odds scores before position probabilities are calculated.
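As a rough illustration of the two scoring variants, the Python sketch below builds an odds matrix, scores a window by multiplying odds or by adding log odds, and converts the log odds back to odds before normalizing them into position probabilities. The function names, the uniform background, and the 2-column frequencies are assumptions chosen for the example, and all frequencies are non-zero so the logarithm is defined.

```python
import math

def odds_matrix(col_freqs, background):
    """Divide each column frequency by the background frequency of that base."""
    return [{b: f[b] / background[b] for b in f} for f in col_freqs]

def odds_score(window, odds):
    """Multiply the odds scores from each column."""
    score = 1.0
    for col, base in enumerate(window):
        score *= odds[col][base]
    return score

def log_odds_score(window, odds):
    """Add the log odds scores from each column."""
    return sum(math.log2(odds[col][base]) for col, base in enumerate(window))

def position_probabilities(seq, odds):
    """Convert log odds back to odds, then normalize over all positions."""
    width = len(odds)
    scores = [2 ** log_odds_score(seq[s:s + width], odds)
              for s in range(len(seq) - width + 1)]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical 2-column site and uniform background.
background = {b: 0.25 for b in "ACGT"}
col_freqs = [
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
odds = odds_matrix(col_freqs, background)
probs = position_probabilities("TAGC", odds)
print([round(p, 4) for p in probs])  # [0.0625, 0.875, 0.0625]
```

The window `AG` has an odds score of 1.6 × 2.8 = 4.48, against 0.32 for each of the other two windows, which is why it receives most of the probability.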

9. Which of the following about MEME is untrue?
a) It is a Web resource for performing local MSAs (Multiple Sequence Alignments) by the expectation maximization method
b) It stands for Multiple EM for Motif Elicitation
c) It was developed at the University of California at San Diego Supercomputing Center
d) The Web page has multiple versions for searching blocks by an EM algorithm

Explanation: The Web page has two versions of MEME: ParaMEME, a Web program that searches for blocks by an EM algorithm, and a similar program, MetaMEME, which searches for profiles using HMMs. The Motif Alignment and Search Tool (MAST) searches through databases for matches to motifs.

10. Which of the following about the Gibbs sampler is untrue?
a) It is a statistical method for finding motifs in sequences
b) It is dissimilar to the principle of the EM method
c) It searches for the statistically most probable motifs
d) It can find the optimal width and number of given motifs in each sequence 