This set of Bioinformatics Questions and Answers for Freshers focuses on “Motif and Domain Databases Using Statistical Models”.
1. Which of the following is not an advantage of Statistical models’ methods in analyzing protein motifs?
a) Sequence information is preserved from a multiple sequence alignment and expresses it with probabilistic models
b) Statistical models allow partial matches and compensate for unobserved sequence patterns using pseudo-counts
c) Statistical models have stronger predictive power than the regular expression based approach, even when they are derived from a limited set of sequences
d) The comparative flexibility is less in case of these methods when compared to regular expressions methods
Explanation: The major limitation of regular expressions is that this method does not take into account sequence probability information about the multiple alignment from which it is modeled making them less flexible. If a regular expression is derived from an incomplete sequence set, it has less predictive power because many more sequences with the same type of motifs are not represented. Unlike regular expressions, position-specific scoring matrices (PSSMs), profiles, and HMMs preserve the sequence information from a multiple sequence alignment and express it with probabilistic models.
2. For motif scanning which of the following programs or databases is for regulated sites curated from scientific literature?
Explanation: Clover identifies overrepresented motifs in protein sequences whereas; MAST allows users to scan different databases for matches to motifs. ENSEMBL is another online genomic sequence repository which also includes online tools for data mining as well as BLAST searches.
3. Which of the following is not an advantageous feature or algorithm of the database PRINTS?
a) This program breaks down a motif into even smaller non-overlapping units called ‘fingerprints’, which are represented by unweighted PSSMs
b) To define a motif, at least a majority of fingerprints are required to match with a query sequence
c) A query that has simultaneous high-scoring matches to a majority of fingerprints belonging to a motif is a good indication of containing the functional motif
d) The difficulty to recognize short motifs when they reach the size of single fingerprints
Explanation: PRINTS is a protein fingerprint database containing ungapped, manually curated alignments corresponding to the most conserved regions among related sequences. The drawbacks of PRINTS are–the difficulty to recognize short motifs when they reach the size of single fingerprints and a relatively small database, which restricts detection of many motifs.
4. In which of the following multipurpose packages Gibbs sampling algorithm is used?
Explanation: The Gibbs sampling algorithm can identify multiple motifs in a sequence in a sequence set using iterative masking procedure. It is used in AlignACE whereas BEST is a suite of four motif discovery tools integrated in a graphical user interface. Also, Consensus program finds motifs in a set of unaligned sequences and PhyloCon builds on this framework by modeling conservation across orthologous genes from multiple species.
5. Which of the following is untrue in case of the database BLOCKS?
a) The alignments are automatically generated using the same data sets used for deriving the BLOSUM matrices
b) The derived ungapped alignments are called ‘blocks’, which are usually longer than motifs, are subsequently converted to PSSMs
c) A weighting scheme and pseudo counts are subsequently applied to the PSSMs to account for underrepresented and unobserved residues in alignments
d) The functional annotation of blocks is not consistent with that for the motifs
Explanation: BLOCKS is a database that uses multiple alignments derived from the most conserved, ungapped regions of homologous protein sequences. Because blocks often encompass motifs, the functional annotation of blocks is thus consistent with that for the motifs. A query sequence can be used to align with pre-computed profiles in the database to select the highest scored matches. Because of the use of the weighting scheme, the signal-to-noise ratio is improved relative to PRINTS.
6. Which of the following is false in case of the database Pfam and its algorithm?
a) Each motif or domain is represented by an HMM profile generated from the seed alignment of a number of conserved homologous proteins
b) Since the probability scoring mechanism is more complex in HMM than in a profile-based approach the use of HMM yields further increases in sensitivity of the database matches
c) Pfam-B only contains sequence families not covered in Pfam
d) The functional annotation of motifs in Pfam-A is often related to that in UNIPROT
Explanation: Pfam is a database with protein domain alignments derived from sequences in SWISSPROT and TrEMBL. The Pfam database is composed of two parts, Pfam-A and Pfam-B. Pfam-A involves manual alignments and Pfam-B, automatic alignment in a way similar to ProDom. The functional annotation of motifs in Pfam-A is often related to that in PROSITE. Because of the automatic nature, Pfam-B has a much larger coverage but is also more error prone because some HMMs are generated from unrelated sequences.
7. Which of the following is false in case of the database SMART and its algorithm?
a) Contains HMM profiles constructed from manually refined protein domain alignments
b) Alignments in the database are built based on tertiary structures whenever available or based on PSI-BLAST profiles
c) Alignments are further checked but not refined by human annotators before HMM profile construction
d) SMART stands for Simple Modular Architecture Research Tool
Explanation: Alignments are further checked and refined by human annotators before HMM profile construction. Protein functions are also manually curated. Thus, the database may be of better quality than Pfam with more extensive functional annotations. Compared to Pfam,
The SMART database contains an independent collection of HMMs, with emphasis on signaling, extracellular, and chromatin-associated motifs and domains. Sequence searching in this database produces a graphical output of domains with well-annotated information with respect to cellular localization, functional sites, super-family, and tertiary structure.
8. Which of the following is false in case of the database InterPro and its algorithm?
a) InterPro is an integrated pattern database designed to unify multiple databases for protein domains and functional sites
b) This database integrates information from PROSITE, Pfam, PRINTS, ProDom, and SMART databases
c) Only overlapping motifs and domains in a protein sequence derived by all five databases are included
d) All the motifs and domains in a protein sequence derived by all five databases are included
Explanation: The only overlapping motifs and domains in a protein sequence derived by all five databases are included in the database. The InterPro entries use a combination of regular expressions, fingerprints, profiles, and HMMs in pattern matching. However, an InterPro search does not obviate the need to search other databases because of its unique criteria of motif inclusion and thus may have lower sensitivity than exhaustive searches in individual databases. A popular feature of this database is a graphical output that summarizes motif matches and has links to more detailed information.
9. Which of the following is false in case of the CDART and its algorithm?
a) CDART is a domain search program that combines the results from RPS-BLAST, SMART, and Pfam
b) The program is now an integral part of the regular BLAST search function
c) CDART is substitute for individual database searches
d) It stands for Conserved Domain Architecture
Explanation: CDART is a domain search program that combines the results from various database searches. As with InterPro, CDART is not a substitute for individual database searches because it often misses certain features that can be found in SMART and Pfam.
10. Point out the wrong or irrelevant mathematical method in motif analysis.
b) Probabilistic Optimization
c) Deterministic Optimization
d) Literature mining
Explanation: All the rest of the options are indeed valid and proven mathematical methods that contain efficient algorithms in finding motifs in protein sequences. Literature mining is not a mathematical algorithm or tool as such to be used in identifying motifs. But it is definitely a part of research when it comes to find a function of various protein sequences.
Sanfoundry Global Education & Learning Series – Bioinformatics.
To practice all areas of Bioinformatics for Freshers, here is complete set of 1000+ Multiple Choice Questions and Answers.