This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Sequence Formats & Computer Storage of Sequences”.
1. Which of the following is wrong about GenBank DNA Sequence Entry?
a) The information is organized into fields, each with an identifier, shown as the first text on each line
b) In some entries, these identifiers may be abbreviated to two letters, e.g., RF for reference
c) Some identifiers may have additional subfields
d) The CDS subfield in the field FEATURES does not offer the amino acid sequence
Explanation: The CDS subfield in the field FEATURES gives the amino acid sequence obtained by translation of known and potential open reading frames. The format of a database entry in GenBank, the NCBI nucleic acid and protein sequence database, is as follows: Information describing each sequence entry is given, including literature references, information about the function of the sequence, locations of mRNAs and coding regions, and positions of important mutations.
2. A consecutive set of three-letter words that could be codons specifying the amino acid sequence of a protein. The sequence entry is assumed by computer programs to lie between the identifiers “ORIGIN” and “//”.
Explanation: The sequence includes numbers on each line so that sequence positions can be located by eye. Because the sequence count or a sequence checksum value may be used by the computer program to verify the sequence composition, the sequence should not be modified except by programs that also update the count. The GenBank sequence format often has to be changed for use with sequence analysis software.
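As a rough illustration of how a program might pull the sequence out from between “ORIGIN” and “//”, here is a minimal Python sketch. The function name and the toy entry are made up for this example, and real GenBank parsers handle many more cases.

```python
def extract_origin_sequence(entry_text):
    """Return the bare sequence found between ORIGIN and // in a GenBank-style entry."""
    in_sequence = False
    parts = []
    for line in entry_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            # Drop the position numbers and spaces; keep only letters.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts)


# Toy entry, invented for illustration only.
example_entry = """LOCUS       DEMO
ORIGIN
        1 gatcctccat atacaacggt
       21 atctccacct
//
"""

sequence = extract_origin_sequence(example_entry)
print(len(sequence), sequence)
```

The numbers at the start of each sequence line are discarded, which is the kind of conversion usually needed before passing the sequence to analysis software.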
3. In the organization of the GenBank database and the search procedure used by ENTREZ, each row is another sequence entry and each column another GenBank field.
Explanation: When one sequence entry is retrieved, all of these fields will be displayed. A search for the term “SOS regulon and coli” in all fields will find two matching sequences. Finding these sequences is simple because indexes have been made listing all of the sequences that contain any given term, one index for each field. Similarly, a search for “transcriptional regulator” will find three sequences.
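The idea of one index per field can be sketched in Python as follows. This is only a toy model under stated assumptions: the entry identifiers, field names and texts are invented, and it is not how ENTREZ is actually implemented.

```python
from collections import defaultdict


def build_field_indexes(entries):
    """Build one term index per field: field -> term -> set of entry ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for entry_id, fields in entries.items():
        for field_name, text in fields.items():
            for term in text.lower().split():
                indexes[field_name][term].add(entry_id)
    return indexes


def search_all_fields(indexes, terms):
    """Return the entries that contain every term in at least one field."""
    result = None
    for term in terms:
        hits = set()
        for field_index in indexes.values():
            hits |= field_index.get(term.lower(), set())
        result = hits if result is None else result & hits
    return result or set()


# Hypothetical entries, invented for illustration only.
toy_entries = {
    "demo1": {"DEFINITION": "SOS regulon transcriptional regulator",
              "ORGANISM": "Escherichia coli"},
    "demo2": {"DEFINITION": "SOS regulon repair protein",
              "ORGANISM": "Escherichia coli"},
    "demo3": {"DEFINITION": "transcriptional regulator",
              "ORGANISM": "Bacillus subtilis"},
}

toy_indexes = build_field_indexes(toy_entries)
print(sorted(search_all_fields(toy_indexes, ["SOS", "regulon", "coli"])))
```

Because each term is looked up directly in the indexes, the search does not have to scan every entry.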
4. Which of the following is wrong about European Molecular Biology Laboratory Data Library Format?
a) EMBL maintains DNA and protein sequence databases
b) As with GenBank entries, a large amount of information describing each sequence entry is given
c) Sequence entry includes literature references and information about the function of the sequence, but not locations of mRNAs and coding regions
d) Information is organized into fields, each with an identifier, shown as the first text on each line
Explanation: Sequence entry includes literature references and information about the function of the sequence, locations of mRNAs and coding regions and positions of important mutations. The sequence count or a checksum value for the sequence may be used by computer programs to make sure that the sequence is complete and accurate. For this reason, the sequence part of the entry should usually not be modified except with programs that also modify this count.
5. The format of an entry in the SwissProt protein sequence database is very similar to the EMBL format.
Explanation: The format is quite similar to the EMBL format, except that considerably more information about the physical and biochemical properties of the protein is provided. Also, the output of a DDBJ DNA sequence entry is almost identical to that of GenBank.
6. Which of the following is wrong about FASTA Sequence Format?
a) The FASTA sequence format includes a comment line identified by a “>” character in the first column followed by the name and origin of the sequence
b) The FASTA sequence format includes the sequence in standard one-letter symbols
c) This format provides a very convenient way to copy just the sequence part from one window to another because there are no numbers or other nonsequence characters within the sequence
d) The presence of ‘*’ is not quite essential for reading the sequence correctly by some sequence analysis programs
Explanation: The FASTA sequence format includes an optional ‘*’ that indicates the end of the sequence. It may or may not be present, and some sequence analysis programs require it in order to read the sequence correctly. The FASTA sequence format is similar to the Protein Information Resource (NBRF) format, except that the NBRF format includes a first line with a “>” character in the first column followed by information about the sequence, a second line containing an identification name for the sequence, and the third to last lines containing the sequence.
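A minimal Python sketch of reading this format is shown below. It assumes plain text input and treats a trailing ‘*’ as an optional end-of-sequence marker; the function name and sample records are invented for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks).rstrip("*")))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # rstrip("*") removes the optional end-of-sequence marker.
        records.append((header, "".join(chunks).rstrip("*")))
    return records


# Toy records, invented for illustration only.
sample = ">demo1 toy protein\nMKT\nAYI*\n>demo2\nACGT\n"
for name, seq in parse_fasta(sample):
    print(name, seq)
```

The parser accepts records with or without the ‘*’, so the same code works for both styles of file.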
7. Which of the following is wrong about National Biomedical Research Foundation/Protein Information Resource Sequence Format?
a) Sequences retrieved from the PIR database are not in this compact format, but in an expanded format with much more information about the sequence
b) The NBRF format is similar to the FASTA sequence format but with significant differences
c) This is different than PIR format
d) The first line includes an initial “>” character followed by a two-letter code such as P for complete sequence or F for fragment, followed by a 1 or 2 to indicate type of sequence, then a semicolon, then a four- to six-character unique name for the entry
Explanation: This sequence format is sometimes also called the PIR format. It has been used by the National Biomedical Research Foundation/Protein Information Resource (NBRF) and also by other sequence analysis programs.
8. In the Stanford University/Intelligenetics sequence format, a 1 is placed at the end of the sequence if the sequence is linear, and a 2 if the sequence is circular.
Explanation: The IG format was started by a molecular genetics group at Stanford University and subsequently continued by a company, Intelligenetics. It is similar to the PIR format, except that a semicolon is usually placed before the comment line. The identifier on the second line is also present.
9. Which of the following is wrong about Genetics Computer Group Sequence Format?
a) Earlier versions of the Genetics Computer Group (GCG) programs require a unique sequence format and include programs that convert other sequence formats into GCG format
b) Information about the sequence in the GenBank entry is not included but the line information is carried out
c) If one or more sequence characters become changed through error, a program reading the sequence will be able to determine that the change has occurred because the checksum value in the sequence entry will no longer be correct
d) Lines of information are terminated by two periods, which mark the end of information and the start of the sequence on the next line
Explanation: Information about the sequence in the GenBank entry is first included, followed by a line of information about the sequence and a checksum value. This value is provided as a check on the accuracy of the sequence and is obtained by adding the ASCII values of the sequence characters. If the sequence has not been changed, this value should stay the same.
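The checksum idea described here can be sketched in Python as a plain sum of ASCII values. This follows the description in the explanation only; it is an assumption that it resembles the actual GCG calculation, which may differ in detail.

```python
def simple_checksum(sequence):
    """Sum the ASCII values of the (upper-cased) sequence characters."""
    return sum(ord(ch) for ch in sequence.upper())


def is_unchanged(sequence, stored_checksum):
    """Report whether the sequence still matches a previously stored checksum."""
    return simple_checksum(sequence) == stored_checksum


stored = simple_checksum("ACGT")
print(stored, is_unchanged("ACGT", stored), is_unchanged("ACGA", stored))
```

A plain sum detects a changed character but not two characters that have swapped places, so a practical checksum would normally also take position into account.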
10. Which of the following is wrong about Abstract Syntax Notation Sequence Format?
a) The information is much more difficult to read by eye than a GenBank formatted sequence
b) Abstract Syntax Notation (ASN.1) is a formal data description language that has been developed by the computer industry
c) All the information found in other forms of sequence storage, e.g., the GenBank format, is present. For example, sequences can be retrieved in this format by ENTREZ
d) Taxonomic information and bibliographic information cannot be encoded with this format
Explanation: ASN.1 has been adopted by the National Center for Biotechnology Information (NCBI) to encode data such as sequences, maps, taxonomic information, molecular structures, and bibliographic information. These data sets may then be easily connected and accessed by computers. The ASN.1 sequence format is a highly structured and detailed format especially designed for computer access to the data.
11. Which of the given statements is incorrect?
a) Before using a sequence file in a sequence analysis program, it is important to ensure that computer sequence files contain only sequence characters and not special characters used by text editors
b) Computer sequence files might contain special characters used by text editors
c) Editing a sequence file with a word processor can introduce such changes if one is not careful to work only with text or so-called ASCII files
d) Most text editors normally create text files that include control characters in addition to standard ASCII characters
Explanation: Options a and b contradict each other, and option a is right: one should check for special characters. The control characters will only be recognized correctly by the text editor program. Sequence files that contain such control characters may not be analyzed correctly, depending on whether or not the sequence analysis program filters them out. Editors usually provide a way to save files with only standard ASCII characters, and these files will be suitable for most sequence analysis programs.
12. Which of the given statements is incorrect about ASCII and Hexadecimal?
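One way to check a sequence file for such characters before analysis is sketched below in Python. The function name is invented, and the choice of which characters to allow (printable ASCII plus newline, carriage return and tab) is an assumption for this example.

```python
def find_non_ascii_or_control(text):
    """Return (position, code) pairs for characters that are not plain printable ASCII."""
    allowed_whitespace = {"\n", "\r", "\t"}
    problems = []
    for position, ch in enumerate(text):
        code = ord(ch)
        is_control = code < 32 and ch not in allowed_whitespace
        # Codes above 126 cover DEL and every non-ASCII character.
        if is_control or code > 126:
            problems.append((position, code))
    return problems


clean = "ACGT\nTTGA\n"
# A vertical tab and a typographic apostrophe, as a word processor might insert.
dirty = "AC\x0bGT\u2019"
print(find_non_ascii_or_control(clean))
print(find_non_ascii_or_control(dirty))
```

An empty list means the text contains only characters that most sequence analysis programs should accept.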
a) Computers store sequence information as simple rows of sequence characters called strings, which are similar to the sequences shown on the computer terminal
b) Each character is stored in binary code in the smallest unit of memory, called a byte
c) Each character is stored in binary code in the smallest unit of memory, called a bit
d) By convention, many of these combinations have a specific definition, called their ASCII equivalent
Explanation: Each byte comprises 8 bits, with each bit having a possible value of 0 or 1, producing 256 possible combinations (values 0 through 255). Some ASCII values are defined as keyboard characters, others as special control characters, such as signaling the end of a line (a line feed and a carriage return), or the end of a file full of text (end-of-file character). A file with only ASCII characters is called an ASCII file.
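The relationship between a character, its ASCII value, its bits and its hexadecimal form can be seen in a short Python sketch. The function name is made up for this example.

```python
BITS_PER_BYTE = 8
# Each bit is 0 or 1, so one byte can hold 2 ** 8 distinct patterns (values 0 through 255).
PATTERNS_PER_BYTE = 2 ** BITS_PER_BYTE


def describe_character(ch):
    """Return the ASCII code, 8-bit binary string and 2-digit hexadecimal string of ch."""
    code = ord(ch)
    return code, format(code, "08b"), format(code, "02X")


print(PATTERNS_PER_BYTE)
print(describe_character("A"))
print(describe_character("\n"))  # line feed, a control character
```

The sequence character A and the line feed control character are both stored as a single byte; only their bit patterns differ.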
13. Which of the given statements is untrue?
a) Sequence and other data files that contain non-ASCII characters also may not be transferred correctly from one machine to another and may cause unpredictable behavior of the communications software
b) The ASCII mode is useful for transferring text files, and the binary mode is useful for transferring compressed data files, which also contain non-ASCII characters
c) ASCII and binary modes cannot be set by the user
d) Most sequence analysis programs also require not only that a DNA or protein sequence file be a standard ASCII file, but also that the file be in a particular format such as the FASTA format
Explanation: The file transfer program (FTP) has ASCII and binary modes, which may be set by the user. Some communications software can be set to ignore such control characters. The use of windows on a computer has simplified such problems, since one merely has to copy a sequence from one window, for example, a window that is running a Web browser on the ENTREZ Web site, and paste it into another, for example, that of a translation program.
14. According to the standard amino acid code letters, which of the given pairs is not right?
a) K- lysine
b) Y- tyrosine
c) Q- glutamine
d) R- serine
Explanation: In the standard single-letter amino acid code, R represents arginine; serine is represented by S. Separately, in addition to the standard four base symbols, A, T, G, and C, the Nomenclature Committee of the International Union of Biochemistry has established a standard code to represent bases in a nucleic acid sequence that are uncertain or ambiguous.
15. For computer analysis of proteins, it is more convenient to use single-letter than three-letter amino acid codes.
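The nucleic acid ambiguity code mentioned in this explanation can be written as a small lookup table. The Python sketch below uses the standard IUB/IUPAC symbols; the function name and test sequences are invented for the example. Note that in this nucleotide code R means A or G, which is unrelated to R for arginine in the amino acid code.

```python
# Each symbol maps to the set of bases it may stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern, sequence):
    """Check whether a concrete sequence is compatible with an ambiguous pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_BASES.get(symbol, "")
        for symbol, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("ACRT", "ACGT"), matches("ACRT", "ACAT"), matches("ACRT", "ACCT"))
```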
Explanation: For example, GenBank DNA sequence entries contain a translated sequence in single-letter code. The standard, single-letter amino acid code was established by a joint international committee.
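The convenience of the single-letter code is easy to see when converting between the two notations. The Python sketch below uses the standard 20 amino acid codes; the function name and the hyphen-separated input style are assumptions made for this example.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence, separator="-"):
    """Convert e.g. 'Lys-Tyr-Gln-Arg' into its single-letter string."""
    residues = three_letter_sequence.split(separator)
    return "".join(THREE_TO_ONE[residue.capitalize()] for residue in residues)


print(to_one_letter("Lys-Tyr-Gln-Arg"))
```

In single-letter form each residue occupies exactly one character, so sequence positions correspond directly to string positions.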
Sanfoundry Global Education & Learning Series – Bioinformatics.