Comparative evolutionary study: is amino acid or nucleotide comparison more useful?

I am a high school student and am currently learning about evolutionary relationship study in biology.

My teacher said that a comparative study of amino acid sequences is more useful than a comparative study of nucleotide sequences, because the genetic code is degenerate in nature - several codons may code for the same amino acid.

However, I just do not understand the logic.

Since several codons may code for the same amino acid, I (as a math person) consider the conversion of a nucleotide sequence to an amino acid sequence as a non-injective function, and thus is information-losing.

(Analogy: consider the function $f(x)=x^2$. Imagine that you have a number, and you plug it into $f(x)$ to get $1$ as the output. You would never know if the original number is $1$ or $-1$.)

Therefore I arrive at the exact opposite conclusion. Is my conclusion correct or not, and why?

Each has its own utility, depending on the time-frame you are looking at. For evolutionary studies, you need variation, but not so much variation that one substitution at the same position overwrites a previous substitution. So if you are looking at deep splits, over hundreds of millions of years, it may be that amino acids are more reliable. But you are correct in that since they are functionally more important than silent substitutions (nucleotide changes that don't change the amino acid), it is possible to have amino acids converge on the same state, independently. Nucleotides do this too. Both require statistical models (maximum likelihood) that accommodate the possibility of multiple changes at the same site. If you are looking at recent evolutionary splits, there may not be enough (or any) amino acid changes to compare, so in this case, nucleotides would be better. You would not measure continental drift with a stopwatch, or a 100-meter dash with radiometric dating.

The Answer

It is correct that the conceptual translation of a nucleotide sequence into an amino acid sequence loses certain information present in the former. An obvious example is that the amino acid sequences of the same protein in two individuals may be identical even though their DNA differs by silent mutations, and these can be useful in tracing ancestry. The one part of Crick's Central Dogma about which there can be no argument is that you cannot go from protein to DNA, because the information for the nucleotide sequence is not present in the protein, with or without the genetic code.
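A minimal sketch of the silent-mutation point, assuming the standard genetic code and two made-up alleles (`allele_1` and `allele_2` are hypothetical, not real gene sequences). The alleles differ at two third-codon positions yet translate to the same peptide, so the difference is visible only at the nucleotide level.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding sequence, one letter per amino acid."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_differences(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical alleles that differ only at third codon positions.
allele_1 = "ATGCTGGGCAAA"
allele_2 = "ATGCTAGGTAAA"

protein_1 = translate(allele_1)
protein_2 = translate(allele_2)

print(protein_1, protein_2)
print("nucleotide differences:", count_differences(allele_1, allele_2))
print("amino acid differences:", count_differences(protein_1, protein_2))
```

Because several codons map to one amino acid, `translate` cannot be inverted: the peptide alone does not say which allele produced it.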


However, an amino acid sequence contains information that is not present in the gene from which it originates, if we just consider the sequence as a mathematical sequence of symbols. And, with 20 letters instead of 4, this new information has a different (and greater) complexity. The mistake is an unspoken assumption that the information of the genetic code is inherent in the nucleotide sequence. It is not. Yes, if we have the information of the genetic code, then the nucleotide sequence also has the information of the amino acid sequence, but that is not the practical question at issue.

So (addressing the poster) in the majority of practical instances your school teacher is correct. I am not a mathematician, so I cannot be sure what the flaw in your argument is. Perhaps it is the fact that only a subsection of the information can be used in the sequence comparison, perhaps the fact that your non-injective function takes three symbols from a set of four to produce one symbol from a set of 20, or perhaps it is the biology. That's something for you to work out. But if your conclusions are wrong (which they are), there must be a flaw in your logic.

The Question at Issue

The practical question at issue is:

Which is more suitable for determining the evolutionary relatedness of two organisms - a pairwise comparison of the amino acid sequences of a functionally similar protein (e.g. cytochrome c) or of the nucleotide sequences of the corresponding gene?

The general answer is:

It depends on the relatedness of the organisms, but, except for very close kinship (e.g. humans and Neanderthals) or certain specialized problems, the answer is likely to be the amino acid sequences.

How can this be?

In relation to the evolutionary distance between organisms, it is necessary to consider the different rates at which nucleotides and amino acids mutate, and the constraints on what mutations are likely to occur. If the rate of mutation is too fast, there will be a divergence time beyond which it is difficult or impossible to calculate the evolutionary divergence of two sequences accurately, and ultimately even to detect any relationship between them.

Nucleotides mutate more rapidly than amino acids, and in practice comparison of nucleotide sequences is less useful than comparison of amino acid sequences for longer timespans.
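One way to see why fast-changing sequences stop being informative is the classic Jukes-Cantor correction, which I am bringing in purely as an illustration (it is not named in the text above, and it assumes every substitution is equally likely). The function `corrected_distance` is a hypothetical helper: it converts the observed fraction of differing sites into an estimated number of substitutions per site, and it blows up as the observed difference approaches what unrelated sequences would show.

```python
import math


def corrected_distance(p: float, alphabet_size: int = 4) -> float:
    """Jukes-Cantor-style distance from the observed fraction p of differing sites.

    Assumes all substitutions are equally likely (a simplification).
    """
    k = alphabet_size
    saturation = (k - 1) / k  # expected mismatch fraction for unrelated sequences
    if p >= saturation:
        return math.inf  # the observed difference no longer constrains the distance
    return -saturation * math.log(1 - p / saturation)


for observed in (0.10, 0.30, 0.60, 0.74):
    estimate = corrected_distance(observed)
    print(f"observed {observed:.2f} -> estimated substitutions per site {estimate:.2f}")
```

With four letters the estimate runs away as the observed difference nears 75%. With `alphabet_size=20` the ceiling is 95%, which is one reason a 20-letter alphabet leaves more room before the signal is lost.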

  1. Because of the degeneracy of the genetic code (the fact that an amino acid can be encoded by more than one triplet of nucleotides) it is possible for one or even two nucleotides to mutate without affecting the amino acid sequence. (And the similarity between sequences is computed from a letter-by-letter comparison.)

  2. Statistics is not my forte, but in a general sense, because there are only four bases, 25% identity between two nucleotide sequences would be expected to occur by chance, whereas two amino acid sequences that are 25% identical would be statistically significantly similar because there are 20 amino acids. (Only 5% identity would arise by chance.)
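The 25% and 5% figures in point 2 follow from assuming that every letter is equally frequent. A small sketch of that arithmetic (`chance_identity` is a hypothetical helper name): the chance that two independently drawn letters match is the sum of the squared letter frequencies.

```python
def chance_identity(frequencies):
    """Probability that two independent random letters match: sum of p_i squared."""
    return sum(p * p for p in frequencies)


nucleotide_chance = chance_identity([1 / 4] * 4)    # four equally frequent bases
amino_acid_chance = chance_identity([1 / 20] * 20)  # twenty equally frequent amino acids

print(f"nucleotides: {nucleotide_chance:.2%}")
print(f"amino acids: {amino_acid_chance:.2%}")

# Unequal frequencies (made-up values) only raise the chance level.
skewed_chance = chance_identity([0.4, 0.1, 0.1, 0.4])
print(f"skewed base composition: {skewed_chance:.2%}")
```

Real sequences do not have uniform composition, so these are lower bounds on the chance level rather than exact values.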

There is a further aspect of divergence of amino acid sequence that is useful for evolutionary comparison, and this is that the nature of the mutation of amino acids is far more constrained than that of nucleotides. Admittedly purine-to-purine or pyrimidine-to-pyrimidine mutations are more frequent than purine-to-pyrimidine mutations, but amino acid mutations are often constrained by the role the amino acid plays in a protein. However, one can construct empirical matrices of the likelihood of different amino acid mutations to obtain a more subtle and accurate estimate of relatedness.

What this means in practice is that instead of having to use a scoring system for comparisons of amino acid sequences that is either 1 for identity or 0 for non-identity, one can use a scoring system that gives 'half marks' (as it were) for structural/functional similarity. Thus, two amino acid sequences having 5% identity in pairwise comparison could be shown to be related because of an overall higher 'similarity' score.
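A toy sketch of the 'half marks' idea. The grouping in `SIMILAR_GROUPS` is a made-up classification by rough side-chain chemistry, not BLOSUM or PAM (those are empirical matrices with a separate score for every pair), and the two sequences are invented.

```python
# Hypothetical grouping of amino acids by rough side-chain chemistry.
SIMILAR_GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(SIMILAR_GROUPS) for aa in group}


def pair_score(a: str, b: str) -> float:
    """1 for identity, 0.5 for two different residues in the same group, else 0."""
    if a == b:
        return 1.0
    if GROUP_OF[a] == GROUP_OF[b]:
        return 0.5
    return 0.0


def identity_fraction(s1: str, s2: str) -> float:
    """Fraction of aligned positions that are identical."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


def similarity_fraction(s1: str, s2: str) -> float:
    """Average pair score over the aligned positions."""
    return sum(pair_score(a, b) for a, b in zip(s1, s2)) / len(s1)


seq_a = "LKDEFS"
seq_b = "IRDQYT"

print(f"identity:   {identity_fraction(seq_a, seq_b):.2f}")
print(f"similarity: {similarity_fraction(seq_a, seq_b):.2f}")
```

Only one of the six positions is identical, but four more are conservative replacements, so the similarity score is much higher than the identity score.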

Appendix 1: Sequence Comparison

It is important to realize that however much information resides in nucleotide or amino acid sequences, only the information that is actually used in the practical methods of determining evolutionary differences is relevant. These methods involve computer programs that compare sequences according to mathematical algorithms to answer the question of how similar two (or more) sequences are. So, regardless of the fact that the amino acid sequence is generally computed from the gene sequence, the question is “should I input nucleotide or amino acid sequences into the program to get the best comparison?” It is in this context that the remarks above about rate of change and likelihood of interconversions should be taken.

To quote from an article by one of the pioneers in sequence comparison, W. R. Pearson:

“Protein (and translated-DNA) similarity searches are much more sensitive than DNA:DNA searches. DNA:DNA alignments have between 5–10-fold shorter evolutionary look-back time than protein:protein or translated DNA:protein alignments. DNA:DNA alignments rarely detect homology after more than 200–400 million years of divergence; protein:protein alignments routinely detect homology in sequences that last shared a common ancestor more than 2.5 billion years ago (e.g. humans to bacteria). Moreover, DNA:DNA alignment statistics are less accurate than protein:protein statistics; while protein:protein alignments with expectation values < 0.001 can reliably be used to infer homology, DNA:DNA expectation values < $10^{-6}$ often occur by chance, and $10^{-10}$ is a more widely accepted threshold for homology based on DNA:DNA searches.”

There are Wikipedia articles about sequence alignment, and about the use of BLOSUM and PAM matrices. The section on sequence alignment in Berg et al. online - which involves amino acid, rather than nucleotide sequences - may also be of interest.

Appendix 2: Terminology and Definitions

As the term, Genetic Code, was misused in the unedited version of the question - and is widely misused in the press - I thought that a glossary of terms might be helpful.

DNA (from which the genome and its constituent genes are constructed) is a linear polymer of 4 nucleotides. The order of these is called the nucleotide sequence, or, because only the purine or pyrimidine base varies between nucleotides, the base sequence.

Proteins are linear polymers of 20* amino acids. The order of these is called the amino acid sequence.

The Genetic Code is a cipher - and can be represented as a table showing the correspondence between the 64 nucleotide triplets (codons) and the 20 amino acids and three stop signals when these nucleotides are part of the translatable part of a gene. The genetic code is highly - but not absolutely - conserved between organisms (and differs for proteins encoded by mitochondrial DNA).
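The table can be written down compactly. The sketch below builds the standard code (the variable names are mine) and counts what it contains: 64 triplets, 20 amino acids and three stop signals, with very unequal numbers of triplets per amino acid.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1); '*' marks a stop signal.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_symbol = Counter(GENETIC_CODE.values())
stop_codons = sorted(codon for codon, aa in GENETIC_CODE.items() if aa == "*")
amino_acid_count = len(codons_per_symbol) - 1  # leave out the '*' stop symbol

print("triplets:", len(GENETIC_CODE))
print("amino acids:", amino_acid_count)
print("stop signals:", stop_codons)
print("triplets for Leu, Met, Trp:",
      codons_per_symbol["L"], codons_per_symbol["M"], codons_per_symbol["W"])
```

The dictionary is the cipher; a genome is a text written with it, which is why the two terms are not interchangeable.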

In NO circumstances can the term Genetic Code be used as a synonym of Genome, although it is misused in this way even by the scientific press, and the distinction is difficult for computer programmers to come to terms with, working as they do in a field where the noun 'code' is used for the product of encoding instructions.

*The genetic code has a certain plasticity and two additional amino acids can be encoded by termination codons in certain circumstances.

Comparative feeding ecology of abyssal and hadal fishes through stomach content and amino acid isotope analysis

Describes trophic ecology of two hadal fishes from the Mariana and Kermadec trenches (Liparidae).

Predatory fishes may have advantage in trenches due to increased biomass of small crustaceans.

Lower δ 15 N values of source amino acids in abyssal macrourids show upper ocean-derived food web.

Clarifies role of trophic ecology in fish community structure at the abyssal-hadal boundary.


Amino acids are special organic molecules used by living organisms to make proteins. The main elements in amino acids are carbon, hydrogen, oxygen, and nitrogen. There are twenty different kinds of amino acids that combine to make proteins in our bodies. Our bodies can actually make some amino acids, but the rest we must get from our food.

Proteins are long chains of amino acids. There are thousands of different proteins in the human body. They provide all sorts of functions to help us survive.

Why are they important?

Proteins are essential for life. Around 20% of our body is made up of proteins. Every cell in our body uses proteins to perform functions.

Proteins are made inside cells. When a cell makes a protein it is called protein synthesis. The instructions for how to make a protein are held in DNA molecules inside the cell nucleus. The two major stages in making a protein are called transcription and translation.

The first step in making a protein is called transcription. This is when the cell makes a copy (or "transcript") of the DNA. The copy of DNA is called RNA because it uses a different type of nucleic acid called ribonucleic acid. The RNA is used in the next step, which is called translation.

The next step in making a protein is called translation. This is when the RNA is converted (or "translated") into a sequence of amino acids that makes up the protein.

  • The RNA moves to the ribosome. This type of RNA is called the "messenger" RNA. It is abbreviated as mRNA where the "m" is for messenger.
  • The mRNA attaches itself to the ribosome.
  • The ribosome figures out where to start on the mRNA by finding a special three-letter "begin" sequence called a start codon.
  • The ribosome then moves down the strand of mRNA. Every three letters represents another amino acid molecule. The ribosome builds a string of amino acids based on the codes in the mRNA.
  • When the ribosome sees the "stop" code, it ends the translation and the protein is complete.
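The steps in the list above can be sketched in a few lines. This is a simplified model, assuming the standard genetic code and that translation starts at the first AUG in the message (real start-site selection is more involved). The function name `translate_mrna` and the example message are made up.

```python
from itertools import product

# Standard genetic code written with RNA letters; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_mrna(mrna: str) -> str:
    """Find the first AUG, then read three letters at a time until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is made
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the protein is complete
        protein.append(amino_acid)
    return "".join(protein)


message = "GGAUGGCCAAGUAACC"
print(translate_mrna(message))
```

Here the letters before AUG and after UAA are ignored, just as the ribosome skips the parts of the mRNA outside the start and stop signals.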


Tissue-specific genes in cultivated peanut

A total of 3,191 tissue-specific genes were identified from 22 RNA-seq datasets for cultivated peanut (Table S1). The largest number of tissue-specific genes were expressed in gynoecium tissue, while the fewest tissue-specific genes were expressed in seedling leaf tissue (Fig. 1). The descending order of tissues ranked by number of tissue-specific genes expressed in them was gynoecium, root, nodule, Pattee 5 seed, reproductive shoot, Pattee 6 seed, main stem leaf, later leaf, Pattee 8 seed, perianth, stalk, Pattee 7 seed, Pattee 3 pod, aerial gynophores, Pattee 5 pericarp, vegetative shoot, androecium, subterranean gynophore, Pattee 6 pericarp, Pattee 1 pod, Pattee 10 seed, and seedling leaf tissue (Fig. 1). RNA-seq data for the leaf, shoot, gynophore, pod, pericarp, and seed can be classified into three, two, two, three, two, and five developmental stages, respectively (Fig. 1 and Table S1). If different developmental stages were considered as individual tissues, we could obtain nine leaf-specific, seventeen shoot-specific, two gynophore-specific, four pod-specific, three pericarp-specific, and twenty-five seed-specific genes, respectively (Table S2). In this study, we used different developmental stages of tissues as the level of analysis because genes can vary in spatial and temporal expression. Sex-specific genes can be expressed at a particular developmental stage without being expressed in later stages [1]. (These genes are available as supplemental material in Table S2 and may be helpful for research on spatial and temporal gene expression patterns in cultivated peanut.)

The number of tissue-specific genes in cultivated peanut.

In contrast, we found 38,745 genes expressed simultaneously among 22 tissues, hereafter referred to as common genes. The cultivated peanut has about 78,574 coding sequences (CDSs) based on the number of genes of its two ancestors, Arachis duranensis (36,734 genes) and Arachis ipaënsis (41,840 genes) [11]. Therefore, tissue-specific genes account for 4.06% of the total number of genes (3,191 out of 78,574), and commonly expressed genes account for 49.31% of the total number of genes (38,745 out of 78,574) in cultivated peanut. Further, 1,357 tissue-specific genes and 18,627 common genes were derived from A. duranensis, accounting for 1.73% (1,357 out of 78,574) and 23.71% (18,627 out of 78,574) of cultivated peanut genes. Similarly, 1,834 tissue-specific genes and 20,117 common genes were derived from A. ipaënsis, accounting for 2.33% (1,834 out of 78,574) and 25.60% (20,117 out of 78,574) of cultivated peanut genes. The tissue-specific and common genes from A. ipaënsis outnumbered those from A. duranensis. This is consistent with more gene duplication events having occurred in A. ipaënsis than in A. duranensis [11]. Tissue-specific genes were further classified into sex-specific and somatic tissue-specific genes. In this study, sex-specific genes were expressed specifically in gynoecium and androecium tissues, while somatic tissue-specific genes were expressed specifically in one of the other 20 tissues. Sex-specific and somatic tissue-specific genes accounted for 0.66% (522 out of 78,574) and 3.40% (2,669 out of 78,574) of cultivated peanut genes. The sex-specific and somatic-specific genes from A. duranensis accounted for 0.28% (218 out of 78,574) and 1.45% (1,139 out of 78,574) of cultivated peanut genes, respectively. The sex-specific and somatic-specific genes from A. ipaënsis accounted for 0.39% (304 out of 78,574) and 1.95% (1,530 out of 78,574) of cultivated peanut genes, respectively.
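The shares in this paragraph are simple ratios against the combined gene count of the two ancestors. The sketch below recomputes a selection of them from the counts quoted above; the dictionary labels and the helper `share` are illustrative names, not part of the study.

```python
# Gene counts of the two ancestral genomes, as quoted in the text.
ANCESTOR_GENES = {"A. duranensis": 36_734, "A. ipaensis": 41_840}
TOTAL_GENES = sum(ANCESTOR_GENES.values())

# Category counts quoted in the text.
COUNTS = {
    "tissue-specific (all)": 3_191,
    "common (all)": 38_745,
    "tissue-specific (A. duranensis)": 1_357,
    "common (A. duranensis)": 18_627,
    "tissue-specific (A. ipaensis)": 1_834,
    "common (A. ipaensis)": 20_117,
    "sex-specific (all)": 522,
    "somatic tissue-specific (all)": 2_669,
}


def share(count: int) -> float:
    """Percentage of all cultivated peanut genes, rounded to two decimals."""
    return round(100 * count / TOTAL_GENES, 2)


for label, count in COUNTS.items():
    print(f"{label}: {count:,} of {TOTAL_GENES:,} = {share(count):.2f}%")
```

Any last-digit disagreement between this output and a quoted percentage would point to rounding in the source rather than to a different denominator.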

Gene expression levels of tissue-specific genes were significantly lower than those of common genes (Mann–Whitney U test, P < 0.01). The gene expression levels differed significantly among the 22 tissues (Kruskal–Wallis test, Chi-square = 486.63, P < 0.05). It should also be noted that the sex-specific gene expression levels were significantly higher than those of somatic tissue-specific genes (Mann–Whitney U test, P < 0.01). Gynoecium-specific gene expression levels were higher than those of androecium-specific genes (Mann–Whitney U test, P < 0.01). Tissue-specific genes also overlapped among annotations for different tissues (Fig. S1). These analyses revealed a lack of function-specific genes among tissue-specific genes. Further, gene ontology (GO) analyses revealed that although one tissue may exhibit gene expression across different developmental stages for genes involved in different biological processes, the same biological processes may be shared among different tissues (Fig. S1). We found the most common GO categories to be 0008270 (zinc ion binding), 0006355 (regulation of transcription), 0016021 (transmembrane), 0003676 (nucleic acid binding), 0005524 (ATP binding), 0055114 (oxidation-reduction process), 0005515 (protein binding), and 0006508 (proteolysis; Fig. 2). The detailed GO annotation is listed in Table S3.

Identification of GO items in tissue-specific genes. The detailed GO annotation is listed in Table S3.

Evolutionary divergence between tissue-specific duplicated genes

A total of 274 full-length duplicated gene pairs were detected from the cultivated peanut RNA-seq data. Ka, Ks, and Ka/Ks values were calculated for 232 duplicated gene pairs; the other 42 duplicated gene pairs were removed because their Ks values were less than 0.01 or larger than 0.30. The average values of Ka, Ks, and Ka/Ks were 0.08, 0.21, and 0.56, respectively. Purifying selection dominated the molecular evolution of 207 duplicate gene pairs with Ka/Ks values less than 1. In contrast, positive selection was a major force in 25 duplicate gene pairs with Ka/Ks values larger than 1. It should be noted that these duplicate gene pairs possibly underwent adaptive evolution, as suggested by their higher average Ka/Ks values. Similarly, adaptive evolution was detected among sex-biased genes in Ectocarpus spp. because their corresponding average Ka/Ks value exceeded 0.5 [6, 8].
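The filtering and classification rule described above can be written as a short function. This is a sketch of the stated thresholds only (Ks between 0.01 and 0.30; Ka/Ks below or above 1); the function name and the example values are hypothetical, and the Ka and Ks estimates themselves would come from a dedicated program.

```python
def classify_pair(ka: float, ks: float, ks_min: float = 0.01, ks_max: float = 0.30) -> str:
    """Apply the Ks filter from the text, then label the pair by its Ka/Ks ratio."""
    if ks < ks_min or ks > ks_max:
        return "excluded"  # Ks outside the accepted range
    ratio = ka / ks
    if ratio < 1:
        return "purifying"
    if ratio > 1:
        return "positive"
    return "neutral"


# Hypothetical (Ka, Ks) estimates for four duplicated gene pairs.
example_pairs = [(0.08, 0.21), (0.30, 0.20), (0.001, 0.005), (0.10, 0.50)]

for ka, ks in example_pairs:
    print(f"Ka={ka}, Ks={ks}: {classify_pair(ka, ks)}")
```

Note that the average of the per-pair ratios need not equal the ratio of the averages, so a mean Ka/Ks of 0.56 alongside mean Ka of 0.08 and mean Ks of 0.21 is not a contradiction.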

Among duplicate gene pairs that underwent purifying selection, 167 and 40 were heterogeneous gene pairs and homogeneous gene pairs, respectively. Among duplicate gene pairs predominantly shaped by positive selection, 16 and 9 were heterogeneous gene pairs and homogeneous gene pairs, respectively. The average Ka and Ks values for homogeneous gene pairs were lower than those for heterogeneous gene pairs (Mann–Whitney U test, P < 0.05), indicating that the heterogeneous gene pairs evolved more rapidly than homogeneous gene pairs. The average Ka/Ks values of homogeneous gene pairs were larger than those of heterogeneous gene pairs, but this difference was not statistically significant (Mann–Whitney U test, P > 0.05). Further, 176 and 19 duplicate gene pairs consisted of somatic tissue-specific genes and sex-specific genes, respectively, while 37 duplicate gene pairs consisted of one somatic tissue-specific gene and one sex-specific gene (somatic-sex-specific gene pair). The Ka and Ks values of somatic-sex-specific duplicate genes exceeded those of both somatic tissue-specific genes and sex-specific genes (Fig. 3). This again indicates that heterogeneous gene pairs appear to evolve more rapidly than homogeneous gene pairs. In addition, the average Ks value was similar between sex-specific and somatic tissue-specific genes, but the average Ka value of somatic tissue-specific genes exceeded that of sex-specific genes (Fig. 3). That is, the synonymous substitution rate was similar between sex-specific and somatic tissue-specific genes, while the nonsynonymous substitution rate of somatic tissue-specific genes was higher than that of sex-specific genes. The average Ka/Ks values of somatic tissue-specific genes and somatic-sex-specific genes exceeded that of sex-specific genes, although this difference was not statistically significant (Fig. 3; Mann–Whitney U test, P > 0.05). Nevertheless, the average Ka/Ks values of somatic tissue-specific genes, somatic-sex-specific genes, and sex-specific genes were 0.59, 0.50, and 0.46, respectively. Overall, somatic tissue-specific genes and somatic-sex-specific genes mainly underwent relaxed selection, while sex-specific genes experienced stronger selective constraint.

Comparison of Ks, Ka, and Ka/Ks of duplicated gene pairs in sex-specific, somatic-specific, and somatic-sex genes.

Codon usage bias in tissue-specific genes

After filtering criteria were applied, a total of 2,756 sequences were used to analyze codon usage bias. Although frequency of optimal codons (Fop) was not significantly different among different tissue types (Kruskal–Wallis test, Chi-square = 22.68, P > 0.05), the Fop value of somatic tissue-specific genes was significantly higher than that of sex-specific genes (Mann–Whitney U test, P < 0.05). Moreover, Fop values of gynoecium-specific genes were slightly but not significantly higher than those of androecium-specific genes (Mann–Whitney U test, P > 0.05). These results indicated that codon usage bias in somatic tissue-specific genes was higher than that in sex-specific genes. In addition, amino acid sequence length was significantly different across the various tissues (Kruskal–Wallis test, Chi-square = 36.62, P < 0.05). The amino acid sequences of sex-specific genes were longer than those of somatic tissue-specific genes (Mann–Whitney U test, P < 0.05), while the amino acid sequences of gynoecium-specific genes were non-significantly longer than those of androecium-specific genes (Mann–Whitney U test, P > 0.05).
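For orientation, the frequency of optimal codons is commonly defined as the number of optimal codons divided by the number of codons that have synonymous alternatives. The sketch below follows that common definition; the text does not say how Fop was computed here or which codons were treated as optimal, so the set `OPTIMAL_CODONS` and the example sequence are purely hypothetical.

```python
# Codons with no synonymous alternative (Met, Trp) and stop codons carry no
# information about codon preference, so they are left out of the count.
UNINFORMATIVE = {"ATG", "TGG", "TAA", "TAG", "TGA"}

# Hypothetical set of optimal codons, for illustration only.
OPTIMAL_CODONS = {"CTG", "GCC", "AAG"}


def frequency_of_optimal_codons(cds: str, optimal_codons: set) -> float:
    """Fraction of informative codons in an in-frame CDS that are optimal."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    informative = [codon for codon in codons if codon not in UNINFORMATIVE]
    if not informative:
        return 0.0
    return sum(codon in optimal_codons for codon in informative) / len(informative)


example_cds = "ATGCTGCTAGCCGCAAAGTAA"
fop = frequency_of_optimal_codons(example_cds, OPTIMAL_CODONS)
print(f"Fop = {fop:.2f}")
```

In the example, five codons are informative and three of them are in the optimal set, giving an Fop of 0.60.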

In-silico Characterization and Comparative Analysis of BLB Disease Resistance Xa genes in Oryza sativa

How to cite: Ramzan, M.A. Asghar, H. Rehman, A. Rashid, M. Jankuloski, L. In-silico Characterization and Comparative Analysis of BLB Disease Resistance Xa genes in Oryza sativa. Preprints 2020, 2020100472 (doi: 10.20944/preprints202010.0472.v1).


Supporting Information

Supplementary File S1.

Supplemental Information for the Maximum-likelihood Analyses

Supplementary File S2.

Supplemental Information for the Bayesian Analyses

Supplementary File S3.

Zip file containing all multiple sequence alignments and phylogenetic trees used in this study

Supplementary File S4.

Zip file containing Consurf scores for both the IR and the IGF1R ectodomains

Supplementary File S5.

Zip file containing Evolutionary Trace results for the IR

Supplementary File S6.

Zip file containing PDB structures obtained through normal modes calculations.


Noroviruses are the causative agents of the majority of viral gastroenteritis outbreaks in humans. During the past 15 years, noroviruses of genotype GGII.4 have caused four epidemic seasons of viral gastroenteritis, during which four novel variants (termed epidemic variants) emerged and displaced the resident viruses. In order to understand the mechanisms and biological advantages of these epidemic variants, we studied the genetic changes in the capsid proteins of GGII.4 strains over this period. A representative sample was drawn from 574 GGII.4 outbreak strains collected over 15 years of systematic surveillance in The Netherlands, and capsid genes were sequenced for a total of 26 strains. The three-dimensional structure was predicted by homology modeling, using the Norwalk virus (Hu/NoV/GGI.1/Norwalk/1968/US) capsid as a reference. The highly significant preferential accumulation and fixation of mutations (nucleotide and amino acid) in the protruding part of the capsid protein provided strong evidence for the occurrence of genetic drift and selection. Although subsequent new epidemic variants differed by up to 25 amino acid mutations, consistent changes were observed in only five positions. Phylogenetic analyses showed that each variant descended from its chronologic predecessor, with the exception of the 2006b variant, which is more closely related to the 2002 variant than to the 2004 variant. The consistent association between the observed genetic findings and changes in epidemiology leads to the conclusion that population immunity plays a role in the epochal evolution of GGII.4 norovirus strains.

Since the beginning of viral gastroenteritis outbreak surveillance in the early 1990s, noroviruses have become recognized as the major cause of reported outbreaks of acute viral gastroenteritis worldwide. Noroviruses form a genus within the family Caliciviridae and are genetically and antigenically highly variable. Currently, five distinct genogroups (GGs) are recognized. Strains belonging to GGI, GGII, and GGIV are known to cause infections in humans. The GGs have been subdivided further into genotypes, defined by a minimum amino acid sequence identity over the complete capsid sequence of 80% (1).
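The 80% criterion is a statement about pairwise amino acid identity over aligned capsid sequences. A minimal sketch of such a check follows, assuming the two sequences are already aligned to equal length and that positions with a gap in either sequence are skipped (the gap handling, the function names, and the short example sequences are assumptions, not taken from the text).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over aligned positions without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100 * matches / len(compared)


def meets_genotype_threshold(aligned_a: str, aligned_b: str, threshold: float = 80.0) -> bool:
    """True if the pair reaches the minimum identity quoted for one genotype."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Made-up ten-residue fragments; real comparisons use the complete capsid sequence.
print(percent_identity("MKVLAAGIVG", "MKVLSAGIVG"))
print(meets_genotype_threshold("MKVLAAGIVG", "MKVLSAGIVG"))
```

A genotype is a cluster of strains, so in practice the threshold applies to every pair within the cluster rather than to a single comparison.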

The strains most commonly identified as the cause of outbreaks belong to genotype GGII.4. In The Netherlands, this was the case for 68% of all norovirus outbreaks that were characterized during 12 years of surveillance and for up to 81% of all health care-related outbreaks. Since their first detection in The Netherlands in January 1995, the GGII.4 strains have consistently been present in the Dutch population (46). These observations are in agreement with those of other surveillance studies worldwide (3, 4, 15, 17, 29, 36, 55).

During the past 15 years, four epidemic norovirus seasons have occurred, in the winters of 1995-1996, 2002-2003, 2004-2005, and 2006-2007. These worldwide epidemics were invariantly caused by the predominant genotype, GGII.4, and were attributed to the emergence of new variant lineages of this genotype (4, 31, 35, 52, 53). These genetic variants, which have been identified previously by partial sequencing of either the RNA-dependent RNA polymerase (RdRp) or the capsid gene, have been given several names across the world. Here they are referred to by using the first year of their detection, supplemented where necessary with an extra suffix. The following variants have been identified: <1996, 1996, 2002, 2004, 2006a, and 2006b.

The pattern of emergence of new lineages followed by large-scale epidemics suggests that new variants obtained one or more decisive advantages over the previously circulating predominant variant. It is unknown what the nature of this advantage is, but its basis is likely to be found in VP1, since this protein is needed for essential properties and functions in the viral life cycle, such as antigenicity, host specificity, host cell binding and virus entry properties, and assembly of new particles.

Noroviruses have a positive-strand RNA genome of ∼7.6 kb, which is subdivided into three open reading frames (ORFs). ORF1 encodes a polyprotein which is posttranslationally processed into the nonstructural proteins, including the RdRp. Conserved regions within the RdRp are commonly used as targets for diagnostic PCR assays. At the National Institute for Public Health and the Environment in The Netherlands (RIVM), region A (nucleotides 4279 to 4604; Lordsdale genome numbering [GenBank accession no. X86557]) is commonly used for genotyping outbreak strains. The second ORF (ORF2) encodes the major structural protein VP1. Ninety dimers of this capsid protein form a T=3 icosahedral shell (41). In the virion, a small number of copies of the protein encoded by ORF3 are present. The precise role of this protein is not clear, although it has been suggested that it functions both in upregulation of VP1 expression and as a histone-like protein in stabilizing the capsid-RNA complex (2, 19, 22).

The understanding of immunity against noroviruses remains limited. Between the different GGs and genotypes, antigenic differences as well as cross-reactivities have been demonstrated using virus-like particles and polyclonal antisera (20). Short-term immunity was reported, but preexisting antibodies were not protective against reinfection with the same genotype (25, 39, 56). Studies looking at neutralizing antibodies have not been possible due to the lack of cell culture or small-animal model systems (13). The high level of genetic diversity between different GGs and even between genotypes within the same GG resulting from the high mutation rate and from recombination events contributes to a large degree of antigenic diversity.

Host genetic factors determining the presence or absence of virus receptors also play an important role in susceptibility (21, 23). These receptors, the histo-blood group antigens, show virus strain-specific binding patterns, determining the ability of virus to infect potential host cells. Because noroviruses belonging to GGII.4 have the broadest range of binding to the histo-blood group antigens of all genotypes assayed to date, this may explain part of the relative success of these viruses (24). Other success factors may include a higher stability of the viral particles outside the host, a higher replication rate, or other factors that need to be investigated more thoroughly.

To obtain more insight into the genetic and structural bases of the selective advantage of new GGII.4 variants over the old GGII.4 variants, we determined the complete capsid sequences of a systematic sample of GGII.4 norovirus outbreak strains found in The Netherlands during 13 years of surveillance of viral gastroenteritis and studied their genetic diversity and predicted structure (46). Because a high-resolution three-dimensional (3D) model of GGII noroviruses was lacking at the time this study was initiated, a homology model of the capsid protein was made in silico based on the known 3D structure of the Norwalk virus (NV GGI.1) capsid protein.

Materials and Methods

Data Sources

Experimentally Characterized Proteins.

The disordered protein sequences were taken from a curated database of experimentally determined disordered proteins, DisProt 3.6 (Vucetic et al. 2005). There were 287 disordered sequences with a total of 40,770 residues. Each disordered sequence was ≥30 residues in length. The disordered sequences had a mean length of 142 residues and a median of 86 residues. The longest disordered sequence was 2,174 residues. The ordered protein sequences were taken from PDB Select 25, a nonredundant subset of the Protein Data Bank (PDB). This data set was chosen because all proteins share ≤25% sequence identity (Boberg et al. 1992; Berman et al. 2000). The sequences were selected from structures that were determined by X-ray crystallography and had strong indications of order, with a resolution ≤2 Å, an R factor ≤20%, and no missing backbone or side chain atoms (Smith et al. 2003). The proteins in this data set are ≥80 residues in length and contain no nonstandard residues. There were 289 ordered sequences with a total of 67,548 residues. The ordered sequences had a mean length of 289 residues and a median of 193 residues. The longest ordered sequence was 907 residues. The proteins are listed in supplementary table S1 (Supplementary Material online).

Families of Related Sequences.

Putative homologs of the experimentally characterized disordered and ordered proteins were identified by performing a basic local alignment search tool (BLAST) search with each ordered and disordered sequence against GenBank release 159 (Altschul et al. 1997; Benson et al. 2008). To ensure quality matches, the maximum allowed E value was 0.0001, and the minimum match length was 35% of the length of the query sequence. Match sequences were cropped to the region corresponding to the start and end of the query. Sequences identified as hypothetical, patented, or predicted were removed from the alignments. Only one sequence in a group of sequences with 100% identity was retained so that all sequences in a family were unique.
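
The filtering rules in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record and `filter_hits` function are hypothetical names, hits are assumed to be already cropped to the query region, annotation filtering is approximated by a keyword search in the description, and "100% identity" is simplified to exact string equality.

```python
from dataclasses import dataclass

MAX_E_VALUE = 1e-4          # maximum allowed E value (from the text)
MIN_QUERY_FRACTION = 0.35   # match must cover at least 35% of the query length
EXCLUDED_WORDS = ("hypothetical", "patented", "predicted")


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST match, cropped to the query region."""
    description: str
    sequence: str
    e_value: float


def filter_hits(hits, query_length):
    """Keep quality matches and retain one copy of each distinct sequence."""
    kept, seen = [], set()
    for hit in hits:
        if hit.e_value > MAX_E_VALUE:
            continue  # match is not significant enough
        if len(hit.sequence) < MIN_QUERY_FRACTION * query_length:
            continue  # match is too short relative to the query
        if any(word in hit.description.lower() for word in EXCLUDED_WORDS):
            continue  # annotation marks the sequence as unreliable
        if hit.sequence in seen:
            continue  # assumption: "100% identity" means an identical string
        seen.add(hit.sequence)
        kept.append(hit)
    return kept
```

When several identical sequences pass the filters, the sketch keeps the first one it encounters; the text does not say which copy was retained.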

During this analysis, it was determined that families of proteins from the Human Immunodeficiency Virus, and some other viruses, contained large numbers of similar sequences having a disproportionate effect on the results. Many papers submitting sequences of these viruses obtained them from an individual organism (see, for instance, Huet et al. 1989; Herring et al. 2001). In order to reduce any undue influence from these families, only one randomly chosen sequence from each referenced paper was included. Unreferenced sequences were not included. The sequences whose families were culled in this way included DP00048, DP00148, DP00160, and DP00424 for the disordered set and 1mml, 1idaa, and 1svb for the ordered set.

Procedure for Developing Matrices

To demonstrate different levels of evolutionary divergence, substitution matrices were developed for three percent identity levels, defined as 85% to <100%, 60–85%, and 40–60% identity (table 1). The number of gaps of any length in the alignments was minimized to reduce ambiguity while still maintaining enough data for meaningful comparisons. This was achieved by specifying no gaps for matrices with 85% minimum percent identity and no more than four gaps for the 60% and 40% matrices. The maximum number of gaps was set to 4 because it was the lowest number that included the majority of alignments in the 60% and 40% percent identity levels.

Criteria Used to Develop Matrices.

Matrix Label (D/O) | Minimum % Identity | Maximum % Identity | Maximum No. of Gaps | Starting Matrix | No. of Realignments (D/O)
D85/O85 | 85 | <100 | 0 | BLOSUM62 | 3/3
D60/O60 | 60 | 85 | 4 | First 85%, zero gaps | 4/3
D40/O40 | 40 | 60 | 4 | First 60%, four gaps | 3/3
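
The criteria in table 1 amount to a simple lookup from an alignment's percent identity and gap count to a matrix class. The sketch below is one way to express that; the function name is hypothetical, and the treatment of boundary values (an alignment at exactly 85% or 60% identity goes to the higher class) is an assumption, since the text only states that the top class excludes 100%.

```python
# (label, minimum % identity, maximum % identity, maximum gaps), from table 1
IDENTITY_CLASSES = [
    ("85", 85.0, 100.0, 0),
    ("60", 60.0, 85.0, 4),
    ("40", 40.0, 60.0, 4),
]


def identity_class(percent_identity, n_gaps):
    """Return "85", "60", or "40", or None if the alignment fits no class."""
    for label, low, high, max_gaps in IDENTITY_CLASSES:
        # Assumption: intervals are half-open, [low, high).
        if low <= percent_identity < high and n_gaps <= max_gaps:
            return label
    return None
```

For example, `identity_class(70.0, 4)` returns `"60"`, while an alignment at 92% identity with one gap fits no class and returns `None`.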

Alignments for Counting Substitutions.

Amino acid substitution frequencies were inferred from sequence alignments. Sets of pairwise alignments were created (fig. 1) such that each sequence of a family was aligned with every other sequence in that family using the Needleman–Wunsch algorithm as implemented in needle from the European Molecular Biology Open Software Suite (EMBOSS), modified to perform pairwise comparisons on a group of sequences loaded from a single file (Needleman and Wunsch 1970; Rice et al. 2000). The gap-opening penalty was 10 and the gap-extension penalty was 0.5. The substitution matrix that was used to initially align the sequences is shown in table 1. The substitution matrix inferred from these alignments was then used to realign the sequences (fig. 1). This realignment cycle was repeated for each matrix class and percent identity level until, between successive matrices, no individual log odds value changed by more than 1 and fewer than 10 log odds values differed. Table 1 shows the number of cycles required for each matrix.
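
The stopping rule for the realignment cycle can be written as a small predicate. This is a sketch under assumptions: matrices are represented as dictionaries keyed by residue pair, each unordered pair is assumed to appear once, and the function and parameter names are hypothetical.

```python
def matrices_converged(previous, current, max_change=1, max_changed_cells=10):
    """Check the stopping rule between two successive substitution matrices.

    previous, current: dicts mapping (residue_a, residue_b) to a log odds value.
    Returns True when no value changed by more than max_change and fewer than
    max_changed_cells values differ at all.
    """
    diffs = [abs(current[pair] - previous[pair]) for pair in current]
    largest = max(diffs, default=0)
    n_changed = sum(1 for d in diffs if d != 0)
    return largest <= max_change and n_changed < max_changed_cells
```

In a realignment loop, the sequences would be realigned with the newest matrix until this predicate returns `True`.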

Iterative procedure used for constructing substitution matrices.

Pairwise alignments were included in counts for a substitution matrix based on two criteria, the percent identity and the number of gaps in the alignment. The process of including an alignment has three steps: 1) Pairwise alignments were performed between a putative family member and a sequence from the experimentally characterized set. If this alignment met the criteria for minimum percent identity and maximum number of gaps, then it was included in the count for a substitution matrix. 2) A family member included at this level was then used to recruit new family members based on pairwise alignments that met the criteria for minimum percent identity. Alignments among these new recruits were included in the count for a substitution matrix when their pairwise alignments with other recruits at the same level also met the criteria for minimum percent identity. 3) New family members identified in step 2 were then used to recruit the next level of family members based on pairwise alignments that met the criteria for minimum percent identity. This last step was repeated until no more alignments were added. At each new level, pairwise alignments between recruits that met the criteria for minimum percent identity were not included if their pairwise alignment with at least one established family member did not meet the criteria for minimum percent identity. Otherwise, sequences with very low percent identities in alignments with the sequence from the experimentally characterized set would be included. Alignments that did not meet the criteria for minimum percent identity were not included, even if these alignments were between established family members.
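
The level-by-level recruitment described above resembles a breadth-first expansion from the experimentally characterized sequence. The following is a deliberately simplified sketch, not the authors' implementation: it tracks only which sequences are recruited at each level, ignores the gap criterion and the bookkeeping of which alignments are counted, and relies on a hypothetical `percent_identity(a, b)` callable supplied by the caller.

```python
def recruit_family(seed, candidates, percent_identity, min_identity):
    """Return the recruits found at each level, starting from the seed.

    seed: the experimentally characterized sequence (any hashable, sortable id).
    candidates: putative family members.
    percent_identity: hypothetical callable giving pairwise % identity.
    """
    remaining = set(candidates)
    frontier = [seed]   # members recruited at the previous level
    levels = []
    while frontier and remaining:
        # A candidate joins if it meets the threshold against any member
        # recruited at the previous level.
        recruits = sorted(
            c for c in remaining
            if any(percent_identity(m, c) >= min_identity for m in frontier)
        )
        if not recruits:
            break       # no more sequences can be added
        levels.append(recruits)
        remaining.difference_update(recruits)
        frontier = recruits
    return levels
```

Each inner list holds the sequences added at one level, so a chain of progressively more distant homologs appears as successive single-element levels.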

Calculating Substitution Matrices

Scaling by Family Size.

The amino acid substitutions and matches of all included alignments from each family were tallied and scaled according to family size. Large families have a disproportionate influence on substitution matrices because they increase the number of alignments, and thus the number of counted substitutions, at a rate of $n(n-1)/2$. Ideally, we would like to offset this effect by scaling the increase in number of alignments from a quadratic to a linear function. This was not possible because the system was developed such that the number of sequences did not directly determine the number of alignments. Therefore, the total number of substitutions each family contributed was scaled instead. In the scaling, it is assumed that the substitutions increase quadratically, and they are then mapped to a linear function. Let $y$ be the total number of substitutions for a family; the scaled number of substitutions, $x$, is obtained by solving the equation $y = x(x-1)/2$. The matrix of scaled substitution counts for that family can then be calculated by multiplying the matrix of raw substitution counts by $x/y$.
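
Solving y = x(x − 1)/2 for its positive root gives x = (1 + sqrt(1 + 8y))/2, which makes the scaling easy to compute. The sketch below applies it to one family's counts; the function names and the dictionary representation of the count matrix are assumptions made for illustration.

```python
import math


def scaled_total(y):
    """Positive root x of y = x * (x - 1) / 2."""
    return (1.0 + math.sqrt(1.0 + 8.0 * y)) / 2.0


def scale_counts(raw_counts):
    """Scale one family's raw substitution counts by x / y.

    raw_counts: dict mapping a residue pair to its raw count.
    """
    y = sum(raw_counts.values())
    if y == 0:
        return dict(raw_counts)  # nothing to scale
    factor = scaled_total(y) / y
    return {pair: count * factor for pair, count in raw_counts.items()}
```

For example, a family contributing 45 substitutions in total is scaled to 10, because 10 × 9 / 2 = 45, so every count in its matrix is multiplied by 10/45.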

Calculating the Log Odds.

The log odds for the substitution matrices were calculated using the matrix of scaled substitution counts, $C$. To calculate amino acid frequencies, $C$ was mirrored and values off of the diagonal were halved. Then, the sum of substitution counts of each column was divided by the total substitution counts in $C$ to get the amino acid frequency $p_i$. To calculate the substitution frequencies $q_{ij}$, each value of $C$ was divided by the total number of substitutions. The observed frequency of substitution $q_{ij}$ is divided by the expected frequency $p_i p_j$ to get the odds ratio of that substitution. The log odds value is $s_{ij} = 2 \log_2 \left( q_{ij} / (p_i p_j) \right)$. In the 85% matrices, some of the amino acid substitutions had no counts. This prevented us from calculating their true log odds values, as the log of 0 is undefined (it tends to negative infinity). In order to approximate the values for these substitutions, a value that was half of the lowest existing count was used instead. This approximation gave an appropriately lower frequency for that substitution and worked well for scaled substitution counts.
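
The calculation described above can be sketched as follows. Several details are assumptions rather than facts from the text: the input is taken to hold counts in the upper triangle only, the zero-count replacement is applied after mirroring and before the frequencies are computed, and the values are left unrounded. The function name is hypothetical.

```python
import math


def log_odds_matrix(upper_counts):
    """Compute s_ij = 2 * log2(q_ij / (p_i * p_j)) from scaled counts.

    upper_counts: square list of lists with counts on and above the diagonal;
    entries below the diagonal are ignored.
    """
    n = len(upper_counts)
    # Mirror across the diagonal and halve the off-diagonal values.
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        c[i][i] = float(upper_counts[i][i])
        for j in range(i + 1, n):
            c[i][j] = c[j][i] = upper_counts[i][j] / 2.0
    # Replace zero counts with half of the lowest existing count.
    lowest = min(v for row in c for v in row if v > 0)
    c = [[v if v > 0 else lowest / 2.0 for v in row] for row in c]
    total = sum(sum(row) for row in c)
    # Amino acid frequency p_j: column sum divided by the total count.
    p = [sum(c[i][j] for i in range(n)) / total for j in range(n)]
    # Observed frequency q_ij over expected frequency p_i * p_j.
    return [
        [2.0 * math.log2((c[i][j] / total) / (p[i] * p[j])) for j in range(n)]
        for i in range(n)
    ]
```

With a two-letter toy alphabet and counts `[[8, 4], [0, 4]]`, the diagonal scores come out positive and the off-diagonal score negative, as expected when matches are more frequent than chance.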

Special treatment was also given to the X (any residue), B (N or D), and Z (Q or E) ambiguity codes. These ambiguity codes are present in a few of the sequences and are included in many substitution matrices. Substitution values between standard residues and the ambiguity codes B and Z were an average of the values for substitutions between their constituent residues and that standard residue. Values of X in the 85%, 60%, and 40% identity class matrices were replaced by the X values in the EMBOSS substitution matrices, EBLOSUM85, EBLOSUM60, and EBLOSUM40, respectively (Rice et al. 2000).
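
The averaging rule for B and Z is short enough to show directly. In this sketch the function name and the dictionary layout of the score table are hypothetical, and any scores passed in are the caller's own values.

```python
# Constituent residues of the two ambiguity codes handled by averaging.
AMBIGUITY = {"B": ("N", "D"), "Z": ("Q", "E")}


def ambiguity_score(scores, code, residue):
    """Average the scores between `residue` and each constituent of `code`.

    scores: dict mapping (constituent, residue) to a substitution value.
    """
    constituents = AMBIGUITY[code]
    return sum(scores[(c, residue)] for c in constituents) / len(constituents)
```

For instance, if N/A scored −2 and D/A scored −3 in some matrix, the B/A entry would be −2.5 under this rule.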

Comparing Matrices Using the Sum of Off-Diagonal Matrix Values

In order to compare the disordered and ordered matrices calculated at a similar percent identity level, the sum of the off-diagonal values in the substitution matrix was computed. The off-diagonal sum of a substitution matrix's log odds values gives an idea of how unlikely substitutions are overall, separated from the context of the amino acid frequencies. More negative sums indicate substitutions are more unlikely overall for that matrix. A jackknife procedure was used to estimate the variance of this statistic: substitution matrices were calculated leaving out the substitution counts for one family at a time. The statistical difference between the off-diagonal values for disorder and order was then determined using Welch's t-test.
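
The comparison statistic and its variance estimate can be sketched as below. The text does not spell out how the jackknife estimates were fed into Welch's t-test, so this sketch makes explicit assumptions: the off-diagonal sum runs over both triangles of the matrix, the jackknife variance is treated as the variance of the statistic itself, and the degrees of freedom follow the Welch–Satterthwaite formula with the number of families as the sample size. All function names are hypothetical.

```python
import math


def off_diagonal_sum(matrix):
    """Sum of all entries of a square matrix that are not on the diagonal."""
    n = len(matrix)
    return sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)


def jackknife_variance(leave_one_out):
    """Jackknife variance from the statistic recomputed with one family left out."""
    n = len(leave_one_out)
    mean = sum(leave_one_out) / n
    return (n - 1) / n * sum((v - mean) ** 2 for v in leave_one_out)


def welch_t(est_a, var_a, n_a, est_b, var_b, n_b):
    """Welch's t statistic and approximate degrees of freedom.

    var_a and var_b are variances of the two estimates (assumption: the
    jackknife variances), and n_a and n_b are the numbers of families.
    """
    t = (est_a - est_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df
```

In use, `off_diagonal_sum` would be evaluated once per leave-one-family-out matrix, the results passed to `jackknife_variance`, and the disordered and ordered estimates compared with `welch_t`.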


Human UCHL1 is known to play an important role in ubiquitin stability within neurons, which is critical for the ubiquitin–proteasome system and neuronal survival. Mutations in the human UCHL1 gene have been associated with various neurodegenerative disorders, including Parkinson's disease (PD), recessive hereditary spastic paraplegia (SPG79), Alzheimer's disease (AD), and Huntington's disease. Considering the indispensable role of the UCHL1 gene product in neuronal physiology and pathophysiology, the current study investigates the sequence evolutionary pattern and structural dynamics of UCHL1. Phylogenetic data suggest an ancient origin of UCHL1 at the root of gnathostome (jawed vertebrate) history. Furthermore, molecular sequence evolutionary analysis reveals that UCHL1 has remained under strong functional constraints throughout gnathostome history, which might have discouraged the duplication of this gene in any of the animal lineages analyzed in the present study. Comparative structural analysis of UCHL1 pinpointed a critical protein segment (amino acids 32 to 39 within the secretion site) with crucial implications for evolution and PD pathogenesis through the well-known phenomenon of intraprotein conformational epistasis. This critical protein segment of UCHL1 could be targeted in drug design and investigated for the treatment of PD in the future.

Author information

Michael R. Garvin and Erica T. Prates contributed equally to this work.


Oak Ridge National Laboratory, Biosciences Division, Oak Ridge, TN, USA

Michael R. Garvin, Erica T. Prates, Mirko Pavicic, Piet Jones, B. Kirtley Amos, Armin Geiger, Manesh B. Shah, Jared Streich, Joao Gabriel Felipe Machado Gazolla, David Kainer, Ashley Cliff, Jonathon Romero & Daniel Jacobson

The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA

Piet Jones, Armin Geiger, Ashley Cliff, Jonathon Romero & Daniel Jacobson

Department of Horticulture, N-318 Ag Sciences Center, University of Kentucky, Lexington, KY, USA

Lawrence Berkeley National Laboratory, Environmental Genomics & Systems Biology, Berkeley, CA, USA

Nathan Keith & James B. Brown

Department of Psychology, University of Tennessee Knoxville, Knoxville, TN, USA

Supplementary Materials

The following are available online:
Figure S1: Chromosomal distribution of GhHH3 genes on different chromosomes of G. hirsutum. A02 to A13 and D02 to D13 represent the At and Dt sub-genomes of G. hirsutum, respectively.
Figure S2: Gene structure and domain architecture of GhHH3 genes along with a phylogenetic tree constructed by the NJ method. (a) Gene structure of all GhHH3 genes with phylogenetic analysis. (b) Domain architecture of GhHH3 genes depicting protein motif distribution.
Table S1: List of all qPCR primers used in this study.
Table S2: Gene ID and proposed names for all 19 observed plant species: A. thaliana, B. napus, G. arboreum, G. hirsutum, G. max, G. raimondii, M. truncatula, O. sativa, P. trichocarpa, S. bicolor, S. tuberosum, T. cacao, V. vinifera, Z. mays, A. comosus, P. taeda, C. reinhardtii, P. patens, and S. moellendorffii.
Table S3: Biophysical properties of GhHH3 genes, including locus ID, start and end point, strand, CDS (coding sequence), protein length, MW (molecular weight), pI (isoelectric point), GRAVY values, and predicted subcellular localization.
Table S4: Orthologous/paralogous genes in the At and Dt sub-genomes of G. hirsutum, G. arboreum (A genome), and G. raimondii (D genome). A total of 81 orthologous/paralogous gene pairs were identified as the result of segmental and whole genome duplication. Further, the Ka/Ks (non-synonymous/synonymous) ratio of all identified orthologous/paralogous gene pairs was calculated.
Table S5: Promoter cis-element analysis of 34 GhHH3 genes. Predicted cis-elements in the promoters of GhHH3 genes were characterized according to their relevance to growth and development, light, and stress responses.
Table S6: RNA-seq data analysis of 34 GhHH3 genes in two fuzzless/lintless mutants (M1l and M2l). Further, genes were categorized on the basis of their up- or downregulated expression in these two mutants.