12.2: Codons - Biology

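Given the different numbers of “letters” in the mRNA and protein “alphabets,” scientists theorized that combinations of nucleotides corresponded to single amino acids. Nucleotide doublets would not suffice, because four nucleotides yield only 16 two-letter combinations (4 × 4) for 20 amino acids, whereas triplets yield 64 (4 × 4 × 4). Scientists therefore theorized that amino acids were encoded by nucleotide triplets and that the genetic code was degenerate, meaning that a given amino acid could be encoded by more than one triplet. These nucleotide triplets are called codons. Insertion experiments supported the triplet model: inserting one or two nucleotides shifted the reading frame and altered every subsequent amino acid, whereas inserting three nucleotides added one extra amino acid during translation but left the rest of the protein intact.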

Scientists painstakingly solved the genetic code by translating synthetic mRNAs in vitro and sequencing the proteins they specified (Figure 2).

In addition to instructing the addition of a specific amino acid to a polypeptide chain, three of the 64 codons terminate protein synthesis and release the polypeptide from the translation machinery. These triplets are called nonsense codons, or stop codons. Another codon, AUG, also has a special function. In addition to specifying the amino acid methionine, it also serves as the start codon to initiate translation. The reading frame for translation is set by the AUG start codon near the 5′ end of the mRNA.

The genetic code is universal. With a few exceptions, virtually all species use the same genetic code for protein synthesis. Conservation of codons means that a purified mRNA encoding the globin protein in horses could be transferred to a tulip cell, and the tulip would synthesize horse globin. That there is only one genetic code is powerful evidence that all of life on Earth shares a common origin, especially considering that there are about 10^84 possible combinations of 20 amino acids and 64 triplet codons.


Degeneracy is believed to be a cellular mechanism to reduce the negative impact of random mutations. Codons that specify the same amino acid typically differ by only one nucleotide. In addition, amino acids with chemically similar side chains are encoded by similar codons. This nuance of the genetic code means that a single-nucleotide substitution mutation might either specify the same amino acid and have no effect, or specify a similar amino acid, preventing the protein from being rendered completely nonfunctional.

DNA and Protein Synthesis

Messenger RNA Carries the Instructions for Making Proteins

mRNA is “messenger” RNA. It is synthesized in the nucleus using the nucleotide sequence of DNA as a template, in a process called transcription. Transcription requires nucleoside triphosphates as substrates and is catalyzed by the enzyme RNA polymerase II. The mRNA is then transported out of the nucleus and into the cytoplasm, where it attaches to ribosomes. Proteins are assembled on the ribosomes using the mRNA nucleotide sequence as a guide; this process is called translation. Thus mRNA carries a “message” from the nucleus to the cytoplasm. The message is encoded in the nucleotide sequence of the mRNA, which is complementary to the nucleotide sequence of the DNA that served as its template.
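The complementarity between template DNA and mRNA can be illustrated with a minimal Python sketch. The template sequence and the function name `transcribe` are hypothetical, chosen only for illustration, and the sketch ignores promoters, RNA processing, and everything else a real cell does.

```python
# Base-pairing rules for copying a DNA template into RNA
# (adenine in the template pairs with uracil in the RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (read 5'->3') for a DNA template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

template = "TACGGCATT"  # hypothetical template strand, written 3'->5'
print(transcribe(template))
```

Because the template is written 3'→5', the resulting mRNA string reads 5'→3' without any reversal step.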

Alternative start codons are different from the standard AUG codon and are found in both prokaryotes (bacteria and archaea) and eukaryotes. Alternate start codons are still translated as Met when they are at the start of a protein (even if the codon encodes a different amino acid otherwise). This is because a separate transfer RNA (tRNA) is used for initiation. [1]

Eukaryotes

Alternate start codons (non-AUG) are very rare in eukaryotic genomes. However, naturally occurring non-AUG start codons have been reported for some cellular mRNAs. [2] Seven out of the nine possible single-nucleotide substitutions at the AUG start codon of dihydrofolate reductase were functional as translation start sites in mammalian cells. [3] In addition to the canonical Met-tRNA(Met) and AUG codon pathway, mammalian cells can initiate translation with leucine using a specific leucyl-tRNA that decodes the codon CUG. [4] [5]

Candida albicans uses a CAG start codon. [6]

Prokaryotes

Prokaryotes use alternate start codons significantly, mainly GUG and UUG. [7]

E. coli uses 83% AUG (3542/4284), 14% (612) GUG, 3% (103) UUG [8] and one or two others (e.g., an AUU and possibly a CUG). [9] [10]

Well-known coding regions that do not have AUG initiation codons are those of lacI (GUG) [11] [12] and lacA (UUG) [13] in the E. coli lac operon. Two more recent studies have independently shown that 17 or more non-AUG start codons may initiate translation in E. coli. [14] [15]

Mitochondria

Mitochondrial genomes use alternate start codons more significantly (AUA and AUU in humans). [7] Many such examples, with codons, systematic range, and citations, are given in the NCBI list of translation tables. [16]

Standard genetic code
(Rows are grouped by first base; the four middle columns give the second base; the last column gives the third base.)

1st  U                     C                     A                     G                     3rd
U    UUU Phe/F             UCU Ser/S             UAU Tyr/Y             UGU Cys/C             U
     UUC Phe/F             UCC Ser/S             UAC Tyr/Y             UGC Cys/C             C
     UUA Leu/L             UCA Ser/S             UAA Stop (Ochre) [B]  UGA Stop (Opal) [B]   A
     UUG Leu/L [A]         UCG Ser/S             UAG Stop (Amber) [B]  UGG Trp/W             G
C    CUU Leu/L             CCU Pro/P             CAU His/H             CGU Arg/R             U
     CUC Leu/L             CCC Pro/P             CAC His/H             CGC Arg/R             C
     CUA Leu/L             CCA Pro/P             CAA Gln/Q             CGA Arg/R             A
     CUG Leu/L             CCG Pro/P             CAG Gln/Q             CGG Arg/R             G
A    AUU Ile/I             ACU Thr/T             AAU Asn/N             AGU Ser/S             U
     AUC Ile/I             ACC Thr/T             AAC Asn/N             AGC Ser/S             C
     AUA Ile/I             ACA Thr/T             AAA Lys/K             AGA Arg/R             A
     AUG Met/M [A]         ACG Thr/T             AAG Lys/K             AGG Arg/R             G
G    GUU Val/V             GCU Ala/A             GAU Asp/D             GGU Gly/G             U
     GUC Val/V             GCC Ala/A             GAC Asp/D             GGC Gly/G             C
     GUA Val/V             GCA Ala/A             GAA Glu/E             GGA Gly/G             A
     GUG Val/V             GCG Ala/A             GAG Glu/E             GGG Gly/G             G

Abbreviations: Ala/A alanine; Arg/R arginine; Asn/N asparagine; Asp/D aspartic acid; Cys/C cysteine; Gln/Q glutamine; Glu/E glutamic acid; Gly/G glycine; His/H histidine; Ile/I isoleucine; Leu/L leucine; Lys/K lysine; Met/M methionine; Phe/F phenylalanine; Pro/P proline; Ser/S serine; Thr/T threonine; Trp/W tryptophan; Tyr/Y tyrosine; Val/V valine.

[A] The codon AUG both codes for methionine and serves as an initiation site: the first AUG in an mRNA's coding region is where translation into protein begins. [17] The other start codons listed by GenBank are rare in eukaryotes and generally code for Met/fMet. [18]
[B] The historical basis for designating the stop codons as amber, ochre and opal is described in an autobiography by Sydney Brenner [19] and in a historical article by Bob Edgar. [20]

Engineered initiator tRNAs (tRNA fMet2 with CUA anticodon) have been used to initiate translation at the amber stop codon UAG. [21] This type of engineered tRNA is called a nonsense suppressor tRNA because it suppresses the translation stop signal that normally occurs at UAG codons. One study has shown that the amber initiator tRNA does not initiate translation to any measurable degree from genomically-encoded UAG codons, only plasmid-borne reporters with strong upstream Shine-Dalgarno sites. [22]


Efforts to understand how proteins are encoded began after DNA's structure was discovered in 1953. George Gamow postulated that sets of three bases must be employed to encode the 20 standard amino acids used by living cells to build proteins, which would allow a maximum of 4^3 = 64 amino acids. [3]
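The counting argument behind the triplet hypothesis can be checked directly. The following is a minimal Python sketch (the helper name `count_words` is an assumption made for illustration): it enumerates every word of a given length over the four RNA bases and compares the count with the 20 standard amino acids.

```python
from itertools import product

BASES = "UCAG"          # the four RNA nucleotides
AMINO_ACIDS = 20        # standard amino acids that must be encoded

def count_words(length: int) -> int:
    """Number of distinct nucleotide words of the given length (4 ** length)."""
    return sum(1 for _ in product(BASES, repeat=length))

for length in (1, 2, 3):
    n = count_words(length)
    verdict = "enough" if n >= AMINO_ACIDS else "too few"
    print(f"{length} base(s): {n} words, {verdict} for {AMINO_ACIDS} amino acids")
```

Only a word length of three provides at least 20 distinct words, and the surplus (64 versus 20) is what leaves room for redundancy and stop signals.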

Codons

The Crick, Brenner, Barnett and Watts-Tobin experiment first demonstrated that codons consist of three DNA bases. Marshall Nirenberg and Heinrich J. Matthaei were the first to reveal the nature of a codon in 1961. [4]

They used a cell-free system to translate a poly-uracil RNA sequence (i.e., UUUUU...) and discovered that the polypeptide that they had synthesized consisted of only the amino acid phenylalanine. [5] They thereby deduced that the codon UUU specified the amino acid phenylalanine.

This was followed by experiments in Severo Ochoa's laboratory that demonstrated that the poly-adenine RNA sequence (AAAAA...) coded for the polypeptide poly-lysine [6] and that the poly-cytosine RNA sequence (CCCCC...) coded for the polypeptide poly-proline. [7] Therefore, the codon AAA specified the amino acid lysine, and the codon CCC specified the amino acid proline. Using various copolymers, most of the remaining codons were then determined.

Subsequent work by Har Gobind Khorana identified the rest of the genetic code. Shortly thereafter, Robert W. Holley determined the structure of transfer RNA (tRNA), the adapter molecule that facilitates the process of translating RNA into protein. This work was based upon Ochoa's earlier studies, yielding the latter the Nobel Prize in Physiology or Medicine in 1959 for work on the enzymology of RNA synthesis. [8]

Extending this work, Nirenberg and Philip Leder revealed the code's triplet nature and deciphered its codons. In these experiments, various combinations of mRNA were passed through a filter that contained ribosomes, the components of cells that translate RNA into protein. Unique triplets promoted the binding of specific tRNAs to the ribosome. Leder and Nirenberg were able to determine the sequences of 54 out of 64 codons in their experiments. [9] Khorana, Holley and Nirenberg received the 1968 Nobel for their work. [10]

The three stop codons were named by discoverers Richard Epstein and Charles Steinberg. "Amber" was named after their friend Harris Bernstein, whose last name means "amber" in German. [11] The other two stop codons were named "ochre" and "opal" in order to keep the "color names" theme.

Expanded genetic codes (synthetic biology)

The concept that the genetic code evolved from an original, ambiguous code into a well-defined ("frozen") code with a repertoire of 20 (+2) canonical amino acids is widely accepted among academics. [12] However, opinions, concepts, and approaches differ on the best way to change the code experimentally. Models have even been proposed that predict "entry points" for synthetic amino acid invasion of the genetic code. [13]

Since 2001, 40 non-natural amino acids have been added into proteins by creating a unique codon (recoding) and a corresponding transfer-RNA:aminoacyl-tRNA-synthetase pair to encode each of them. These amino acids have diverse physicochemical and biological properties and serve as tools for exploring protein structure and function or for creating novel or enhanced proteins. [14] [15]

H. Murakami and M. Sisido extended some codons to have four and five bases. Steven A. Benner constructed a functional 65th (in vivo) codon. [16]

In 2015 N. Budisa, D. Söll and co-workers reported the full substitution of all 20,899 tryptophan residues (UGG codons) with unnatural thienopyrrole-alanine in the genetic code of the bacterium Escherichia coli. [17]

In 2016 the first stable semisynthetic organism was created. It was a (single cell) bacterium with two synthetic bases (called X and Y). The bases survived cell division. [18] [19]

In 2017, researchers in South Korea reported that they had engineered a mouse with an extended genetic code that can produce proteins with unnatural amino acids. [20]

In May 2019, researchers, in a milestone effort, reported the creation of a new synthetic (possibly artificial) form of viable life, a variant of the bacterium Escherichia coli, by reducing the natural number of 64 codons in the bacterial genome to 59 codons while still encoding 20 amino acids. [21] [22]

Reading frame

A reading frame is defined by the initial triplet of nucleotides from which translation starts. It sets the frame for a run of successive, non-overlapping codons, which is known as an "open reading frame" (ORF). For example, the string 5'-AAATGAACG-3' (see figure), if read from the first position, contains the codons AAA, TGA, and ACG; if read from the second position, it contains the codons AAT and GAA; and if read from the third position, it contains the codons ATG and AAC. Every sequence can, thus, be read in its 5' → 3' direction in three reading frames, each producing a possibly distinct amino acid sequence: in the given example, Lys (K)-Trp (W)-Thr (T), Asn (N)-Glu (E), or Met (M)-Asn (N), respectively (when translating with the vertebrate mitochondrial code). When DNA is double-stranded, six possible reading frames are defined, three in the forward orientation on one strand and three reverse on the opposite strand. [24] : 330 Protein-coding frames are defined by a start codon, usually the first AUG (ATG) codon in the RNA (DNA) sequence.
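The three forward reading frames of the example string can be listed with a short Python sketch. The function name `reading_frames` is an assumption made for illustration; the sketch only splits the sequence into codons and does not translate them or handle the reverse strand.

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Split a sequence into complete codons for each of the three forward frames."""
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time, dropping any
        # incomplete codon left over at the 3' end.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames

for number, codons in enumerate(reading_frames("AAATGAACG"), start=1):
    print(f"frame {number}: {' '.join(codons)}")
```

Applying the same function to the reverse complement of the sequence would give the other three frames of double-stranded DNA.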

In eukaryotes, ORFs in exons are often interrupted by introns.

Start and stop codons

Translation starts with a chain-initiation codon or start codon. The start codon alone is not sufficient to begin the process. Nearby sequences such as the Shine-Dalgarno sequence in E. coli and initiation factors are also required to start translation. The most common start codon is AUG, which is read as methionine or, in bacteria, as formylmethionine. Alternative start codons, depending on the organism, include "GUG" or "UUG"; these codons normally represent valine and leucine, respectively, but as start codons they are translated as methionine or formylmethionine. [25]

The three stop codons have names: UAG is amber, UGA is opal (sometimes also called umber), and UAA is ochre. Stop codons are also called "termination" or "nonsense" codons. They signal release of the nascent polypeptide from the ribosome because no cognate tRNA has anticodons complementary to these stop signals, allowing a release factor to bind to the ribosome instead. [26]
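The roles of the start and stop codons can be sketched in Python. This is a deliberately simplified illustration, not a model of initiation: it assumes translation begins at the first AUG in the string and ignores the Shine-Dalgarno sequence, initiation factors, and alternative start codons mentioned above. The example mRNA and the function name are hypothetical.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def translate_from_first_aug(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODE[mrna[i:i + 3]]
        if amino_acid == "*":          # UAA, UAG, or UGA releases the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_from_first_aug("GGAUGGCCAAAUGACC"))  # hypothetical mRNA
```

In the example, the two leading bases are skipped, the AUG sets the frame, and the UGA codon ends the chain, so the trailing bases are never read.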

Effect of mutations

During the process of DNA replication, errors occasionally occur in the polymerization of the second strand. These errors, mutations, can affect an organism's phenotype, especially if they occur within the protein coding sequence of a gene. Error rates are typically 1 error in every 10–100 million bases—due to the "proofreading" ability of DNA polymerases. [28] [29]

Missense mutations and nonsense mutations are examples of point mutations that can cause genetic diseases such as sickle-cell disease and thalassemia respectively. [30] [31] [32] Clinically important missense mutations generally change the properties of the coded amino acid residue among basic, acidic, polar or non-polar states, whereas nonsense mutations result in a stop codon. [24]

Mutations that disrupt the reading frame sequence by indels (insertions or deletions) of a non-multiple of 3 nucleotide bases are known as frameshift mutations. These mutations usually result in a completely different translation from the original, and likely cause a stop codon to be read, which truncates the protein. [33] These mutations may impair the protein's function and are thus rare in in vivo protein-coding sequences. One reason inheritance of frameshift mutations is rare is that, if the protein being translated is essential for growth under the selective pressures the organism faces, absence of a functional protein may cause death before the organism becomes viable. [34] Frameshift mutations may result in severe genetic diseases such as Tay–Sachs disease. [35]

Although most mutations that change protein sequences are harmful or neutral, some mutations have benefits. [36] These mutations may enable the mutant organism to withstand particular environmental stresses better than wild type organisms, or reproduce more quickly. In these cases a mutation will tend to become more common in a population through natural selection. [37] Viruses that use RNA as their genetic material have rapid mutation rates, [38] which can be an advantage, since these viruses thereby evolve rapidly, and thus evade the immune system defensive responses. [39] In large populations of asexually reproducing organisms, for example, E. coli, multiple beneficial mutations may co-occur. This phenomenon is called clonal interference and causes competition among the mutations. [40]

Degeneracy

Degeneracy is the redundancy of the genetic code. This term was given by Bernfield and Nirenberg. The genetic code has redundancy but no ambiguity (see the codon table for the full correlation). For example, although codons GAA and GAG both specify glutamic acid (redundancy), neither specifies another amino acid (no ambiguity). The codons encoding one amino acid may differ in any of their three positions. For example, the amino acid leucine is specified by YUR or CUN (UUA, UUG, CUU, CUC, CUA, or CUG) codons (difference in the first or third position indicated using IUPAC notation), while the amino acid serine is specified by UCN or AGY (UCA, UCG, UCC, UCU, AGU, or AGC) codons (difference in the first, second, or third position). [41]

A practical consequence of redundancy is that errors in the third position of the triplet codon often cause only a silent mutation or an error that does not affect the protein, because hydrophilicity or hydrophobicity is maintained by equivalent substitution of amino acids. For example, a codon of NUN (where N = any nucleotide) tends to code for hydrophobic amino acids, NCN yields amino acid residues that are small in size and moderate in hydropathicity, and NAN encodes average-size hydrophilic residues. The genetic code is so well-structured for hydropathicity that a mathematical analysis (Singular Value Decomposition) of 12 variables (4 nucleotides x 3 positions) yields a remarkable correlation (C = 0.95) for predicting the hydropathicity of the encoded amino acid directly from the triplet nucleotide sequence, without translation. [42] [43] Note that eight amino acids are not affected at all by mutations at the third position of the codon, whereas a mutation at the second position is likely to cause a radical change in the physicochemical properties of the encoded amino acid.
Nevertheless, changes in the first position of the codons are more important than changes in the second position on a global scale. [44] The reason may be that charge reversal (from a positive to a negative charge or vice versa) can only occur upon mutations in the first position of certain codons, but not upon changes in the second position of any codon. Such charge reversal may have dramatic consequences for the structure or function of a protein. This aspect may have been largely underestimated by previous studies. [44]
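The redundancy described in this section can be tallied from the standard code itself. The Python sketch below is illustrative only: it builds the standard table from its compact one-letter form, counts how many codons specify each amino acid, and lists the codon "boxes" in which all four third-position bases give the same amino acid. Variable names are assumptions made for the example.

```python
from collections import Counter

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Number of synonymous codons per amino acid (and per stop signal).
codons_per_aa = Counter(CODE.values())

# First-two-base prefixes for which the third base never changes the amino acid.
fourfold = [first + second
            for first in BASES for second in BASES
            if len({CODE[first + second + third] for third in BASES}) == 1]

print("codons per amino acid:", dict(codons_per_aa))
print("third-position-insensitive boxes:", fourfold)
```

The list of insensitive boxes corresponds to the eight amino acids noted above whose identity does not depend on the third base of those codons.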

Codon usage bias

The frequency of codons, also known as codon usage bias, can vary from species to species with functional implications for the control of translation.
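Codon frequencies of the kind compared across species can be computed with a few lines of Python. This is a minimal sketch: the coding sequence is a made-up toy example, the function name `codon_usage` is hypothetical, and real analyses aggregate over many genes and usually normalize per amino acid.

```python
from collections import Counter

def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy sequence: both GAA and GAG encode glutamic acid, but GAA is used more often.
usage = codon_usage("AUGGAAGAGGAAUAA")
for codon, fraction in sorted(usage.items()):
    print(codon, round(fraction, 2))
```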

Non-standard amino acids

In some proteins, non-standard amino acids are substituted for standard stop codons, depending on associated signal sequences in the messenger RNA. For example, UGA can code for selenocysteine and UAG can code for pyrrolysine. Selenocysteine came to be seen as the 21st amino acid, and pyrrolysine as the 22nd. [46] Unlike selenocysteine, pyrrolysine-encoded UAG is translated with the participation of a dedicated aminoacyl-tRNA synthetase. [47] Both selenocysteine and pyrrolysine may be present in the same organism. [46] Although the genetic code is normally fixed in an organism, the prokaryote Acetohalobium arabaticum can expand its genetic code from 20 to 21 amino acids (by including pyrrolysine) under different conditions of growth. [48]

Variations

Variations on the standard code were predicted in the 1970s. [49] The first was discovered in 1979, by researchers studying human mitochondrial genes. [50] Many slight variants were discovered thereafter, [51] including various alternative mitochondrial codes. [52] These minor variants for example involve translation of the codon UGA as tryptophan in Mycoplasma species, and translation of CUG as a serine rather than leucine in yeasts of the "CTG clade" (such as Candida albicans). [53] [54] [55] Because viruses must use the same genetic code as their hosts, modifications to the standard genetic code could interfere with viral protein synthesis or functioning. However, viruses such as totiviruses have adapted to the host's genetic code modification. [56] In bacteria and archaea, GUG and UUG are common start codons. In rare cases, certain proteins may use alternative start codons. [51] Surprisingly, variations in the interpretation of the genetic code exist also in human nuclear-encoded genes: In 2016, researchers studying the translation of malate dehydrogenase found that in about 4% of the mRNAs encoding this enzyme the stop codon is naturally used to encode the amino acids tryptophan and arginine. [57] This type of recoding is induced by a high-readthrough stop codon context [58] and it is referred to as functional translational readthrough. [59]

Variant genetic codes used by an organism can be inferred by identifying highly conserved genes encoded in that genome, and comparing its codon usage to the amino acids in homologous proteins of other organisms. For example, the program FACIL [60] infers a genetic code by searching which amino acids in homologous protein domains are most often aligned to every codon. The resulting amino acid probabilities for each codon are displayed in a genetic code logo, that also shows the support for a stop codon.

Despite these differences, all known naturally occurring codes are very similar. The coding mechanism is the same for all organisms: three-base codons, tRNA, ribosomes, single direction reading and translating single codons into single amino acids. [61] The most extreme variations occur in certain ciliates where the meaning of stop codons depends on their position within mRNA. When close to the 3’ end they act as terminators while in internal positions they either code for amino acids as in Condylostoma magnum [62] or trigger ribosomal frameshifting as in Euplotes. [63]

The genetic code is a key part of the history of life, according to one version of which self-replicating RNA molecules preceded life as we know it. This is the RNA world hypothesis. Under this hypothesis, any model for the emergence of the genetic code is intimately related to a model of the transfer from ribozymes (RNA enzymes) to proteins as the principal enzymes in cells. In line with the RNA world hypothesis, transfer RNA molecules appear to have evolved before modern aminoacyl-tRNA synthetases, so the latter cannot be part of the explanation of its patterns. [64]

A hypothetical randomly evolved genetic code further motivates a biochemical or evolutionary model for its origin. If amino acids were randomly assigned to triplet codons, there would be 1.5 × 10^84 possible genetic codes. [65] : 163 This number is found by calculating the number of ways that 21 items (20 amino acids plus one stop) can be placed in 64 bins, wherein each item is used at least once. [66] However, the distribution of codon assignments in the genetic code is nonrandom. [67] In particular, the genetic code clusters certain amino acid assignments.
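The count of possible codes can be reproduced with a short Python sketch. It assumes the counting model stated above (each of the 64 codons is assigned one of 21 meanings, and every meaning is used at least once) and evaluates it with the standard inclusion-exclusion formula for surjections; the function name is hypothetical.

```python
from math import comb

def surjections(n_codons: int, n_meanings: int) -> int:
    """Ways to assign a meaning to every codon so that each meaning is used at least once."""
    # Inclusion-exclusion: subtract assignments that miss at least one meaning.
    return sum((-1) ** k * comb(n_meanings, k) * (n_meanings - k) ** n_codons
               for k in range(n_meanings + 1))

possible_codes = surjections(64, 21)
print(f"{possible_codes:.2e}")
```

Python integers are exact, so the result is the full count; formatting it in scientific notation gives a value on the order of 10^84, in line with the figure quoted in the text.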

Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons. This could be an evolutionary relic of an early, simpler genetic code with fewer amino acids that later evolved to code a larger set of amino acids. [68] It could also reflect steric and chemical properties that had another effect on the codon during its evolution. Amino acids with similar physical properties also tend to have similar codons, [69] [70] reducing the problems caused by point mutations and mistranslations. [67]

Given the non-random genetic triplet coding scheme, a tenable hypothesis for the origin of genetic code could address multiple aspects of the codon table, such as absence of codons for D-amino acids, secondary codon patterns for some amino acids, confinement of synonymous positions to third position, the small set of only 20 amino acids (instead of a number approaching 64), and the relation of stop codon patterns to amino acid coding patterns. [71]

Three main hypotheses address the origin of the genetic code. Many models belong to one of them or to a hybrid: [72]


All of the alleles that exist within a species are known as the gene pool. When mutations or genetic leakage occur, new genes are introduced into the gene pool. Genetic variability is essential for the survival of a species because it allows it to evolve to adapt to changing environmental stresses. Certain traits may be more desirable than others and confer a selective advantage, that is, an advantage that allows the individual to produce more viable, fertile offspring. In this section, we will consider genetic diversity and mutations, leakage, and genetic drift, which cause changes to the alleles present in the gene pool.

A mutation is a change in DNA sequence. New mutations may be introduced in a variety of ways. Radiation, such as ultraviolet rays from the sun, and chemical exposures can damage DNA; substances that can cause mutations are called mutagens. DNA polymerase is subject to making mistakes during DNA replication, albeit at a very low rate; proofreading mechanisms also help prevent mutations from occurring through this mechanism. Elements known as transposons can insert and remove themselves from the genome. If a transposon inserts in the middle of a coding sequence, the mutation will disrupt the gene.

Flawed proteins can arise in other ways without an underlying change in DNA sequence, as well. Incorrect pairing of nucleotides during transcription or translation, or a tRNA molecule charged with the incorrect amino acid for its anticodon, can result in derangements of the normal amino acid sequence.

The major types of nucleotide-level mutations are discussed in great detail in Chapter 7 of MCAT Biochemistry Review, so we offer just a brief overview here of each type.

Nucleotide-Level Mutations

Many mutations occur at the level of a single nucleotide (or a very small number of nucleotides). These mutations are shown in Figure 12.3 and are summarized below.

Figure 12.3. Common Nucleotide-Level Mutations

Point mutations occur when one nucleotide in DNA (A, C, T, or G) is swapped for another. These can be subcategorized as silent, missense, or nonsense mutations:

· Silent mutations occur when the change in nucleotide has no effect on the final protein synthesized from the gene. This most commonly occurs when the changed nucleotide is transcribed to be the third nucleotide in a codon because there is degeneracy (wobble) in the genetic code.

· Missense mutations occur when the change in nucleotide results in substituting one amino acid for another in the final protein.

· Nonsense mutations occur when the change in nucleotide results in substituting a stop codon for an amino acid in the final protein.
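The three categories of point mutation can be told apart by comparing the meaning of a codon before and after the change. The Python sketch below is illustrative: the function name `classify` is hypothetical, it works at the mRNA level with the standard genetic code, and it treats a change from a stop codon to an amino acid loosely as missense.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def classify(original: str, mutant: str) -> str:
    """Classify a codon change as silent, missense, or nonsense."""
    before, after = CODE[original], CODE[mutant]
    if before == after:
        return "silent"      # same amino acid, protein unchanged
    if after == "*":
        return "nonsense"    # amino acid replaced by a stop codon
    return "missense"        # one amino acid replaced by another

print(classify("GAA", "GAG"))  # third-position change, Glu stays Glu
print(classify("GAG", "GUG"))  # Glu becomes Val
print(classify("UAU", "UAA"))  # Tyr becomes a stop codon
```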

Frameshift mutations occur when nucleotides are inserted into or deleted from the genome. Because mRNA transcribed from DNA is always read in three-letter sequences called codons, insertion or deletion of nucleotides can shift the reading frame, usually resulting in either changes in the amino acid sequence or premature truncation of the protein (due to the generation of a nonsense mutation). These can be subcategorized as insertion or deletion mutations.
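The effect of an insertion on the reading frame can be seen by translating a short message before and after the change. This Python sketch uses a made-up mRNA and hypothetical helper names; it reads from the first base and stops at the first stop codon.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def translate(mrna: str) -> str:
    """Translate in frame from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def insert(mrna: str, position: int, bases: str) -> str:
    """Return the sequence with extra bases inserted at the given index."""
    return mrna[:position] + bases + mrna[position:]

original = "AUGGCCAAAUGA"                     # hypothetical mRNA
print(translate(original))                    # unchanged message
print(translate(insert(original, 3, "G")))    # one base inserted: frame shifted
print(translate(insert(original, 3, "GGG")))  # three bases inserted: frame kept
```

In this toy case, the single-base insertion changes every downstream codon and removes the in-frame stop codon, whereas the three-base insertion adds one amino acid and leaves the rest of the message as it was.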

Chromosomal Mutations

Chromosomal mutations are larger-scale mutations in which large segments of DNA are affected, as demonstrated in Figure 12.4 and summarized below.

Figure 12.4. Common Chromosomal Mutations

· Deletion mutations occur when a large segment of DNA is lost from a chromosome. Small deletion mutations are considered frameshift mutations, as described previously.

· Duplication mutations occur when a segment of DNA is copied multiple times in the genome.

· Inversion mutations occur when a segment of DNA is reversed within the chromosome.

· Insertion mutations occur when a segment of DNA is moved from one chromosome to another. Small insertion mutations (including those where the inserted DNA is not from another chromosome) are considered frameshift mutations, as described previously.

· Translocation mutations occur when a segment of DNA from one chromosome is swapped with a segment of DNA from another chromosome.

Consequences of Mutations

Mutations can have many different consequences. Some mutations can be advantageous, conferring a positive selective advantage that may allow the organism to produce more offspring. For example, sickle cell disease is a single nucleotide mutation that causes sickled hemoglobin. While the disease itself is detrimental to life, heterozygotes for sickle cell disease usually have minor symptoms, if any, and have natural resistance to malaria because their red blood cells have a slightly shorter lifespan, just short enough that the parasitic Plasmodium species that cause malaria cannot reproduce. Thus, heterozygotes for sickle cell disease actually have a selective advantage because they are less likely to die from malaria.

On the other hand, some mutations can be detrimental or deleterious. For example, xeroderma pigmentosum (XP) is an inherited defect in the nucleotide excision repair mechanism. In patients with XP, DNA that has been damaged by ultraviolet radiation cannot be repaired appropriately. Ultraviolet radiation can introduce cancer-causing mutations; without a repair mechanism, patients with XP are frequently diagnosed with malignancies, especially of the skin.

One important class of deleterious mutations is known as inborn errors of metabolism. These are defects in genes required for metabolism. Children born with these defects often require very early intervention in order to prevent permanent damage from the buildup of metabolites in various pathways. For example, in phenylketonuria (PKU), the enzyme phenylalanine hydroxylase, which completes the metabolism of the amino acid phenylalanine, is defective. In the absence of this enzyme, toxic metabolites of phenylalanine accumulate, causing seizures, impairment of cerebral function, and learning disabilities, as well as a musty odor to bodily secretions. However, if the disease is discovered shortly after birth, then dietary phenylalanine can be eliminated and treatments can be administered to aid in metabolizing any additional phenylalanine.

Genetic leakage is a flow of genes between species. In some cases, individuals from different (but closely related) species can mate to produce hybrid offspring. Many hybrid offspring, such as the mule (hybrid of a male donkey and a female horse), are not able to reproduce because they have odd numbers of chromosomes: horses have 64 chromosomes and donkeys have 62, so mules, with 63 chromosomes, cannot undergo normal homologous pairing in meiosis and cannot form gametes. In some cases, however, a hybrid can reproduce with members of one species or the other, such as the beefalo (a cross between cattle and American bison). The hybrid carries genes from both parent species, so this results in a net flow of genes from one species to the other.

Genetic drift refers to changes in the composition of the gene pool due to chance. Genetic drift tends to be more pronounced in small populations. The founder effect is a more extreme case of genetic drift in which a small population of a species finds itself in reproductive isolation from other populations as a result of natural barriers, catastrophic events, or other bottlenecks that drastically and suddenly reduce the size of the population available for breeding. Because the breeding group is small, inbreeding, or mating between two genetically related individuals, may occur in later generations. Inbreeding encourages homozygosity, which increases the prevalence of both homozygous dominant and recessive genotypes. Ultimately, genetic drift, the founder effect, and inbreeding cause a reduction in genetic diversity, which is often the reason why a small population may have increased prevalence of certain traits and diseases. For example, branched-chain ketoacid dehydrogenase deficiency (also called maple syrup urine disease) is especially common in Mennonite communities; this implies a common origin of the mutation, which may be a very small original population.

This loss of genetic variation may cause reduced fitness of the population, a condition known as inbreeding depression. On the opposite end of the spectrum, outbreeding, or outcrossing, is the introduction of unrelated individuals into a breeding group. Theoretically, this could result in increased variation within a gene pool and increased fitness of the population.

MCAT Concept Check 12.2:

Before you move on, assess your understanding of the material with these questions.

1. What are the three main types of point mutations? What change occurs in each?

2. What are the two main types of frameshift mutations?

3. What are the three main types of chromosomal mutations that do NOT share their name with a type of frameshift mutation? What change occurs in each?

4. Why would genetic leakage in animals be rare prior to the last century?

5. Why is genetic drift more common in small populations? What relationship does this have to the founder effect?

The codon sequences predict protein lifetimes and other parameters of the protein life cycle in the mouse brain

The homeostasis of the proteome depends on the tight regulation of the mRNA and protein abundances, of the translation rates, and of the protein lifetimes. Results from several studies on prokaryotes or eukaryotic cell cultures have suggested that protein homeostasis is connected to, and perhaps regulated by, the protein and the codon sequences. However, this has been little investigated for mammals in vivo. Moreover, the link between the coding sequences and one critical parameter, the protein lifetime, has remained largely unexplored, both in vivo and in vitro. We tested this in the mouse brain, and found that the percentages of amino acids and codons in the sequences could predict all of the homeostasis parameters with a precision approaching experimental measurements. A key predictive element was the wobble nucleotide. G-/C-ending codons correlated with higher protein lifetimes, protein abundances, mRNA abundances and translation rates than A-/U-ending codons. Modifying the proportions of G-/C-ending codons could tune these parameters in cell cultures, in a proof-of-principle experiment. We suggest that the coding sequences are strongly linked to protein homeostasis in vivo, although it remains to be determined whether this relation is causal in nature.

Conflict of interest statement

The authors declare no competing interests.


Protein lifetimes correlate to the sequence composition. ( a ) The graphs plot…

The amino acid and codon sequences can be used to reliably predict protein…

The protein and mRNA abundances, the ribosome density and the protein length can…

70% of the maximum expected (the reproducibility of the data between different data sets, from different laboratories).

The nature of the third nucleotide coordinates a number of parameters linked to…

The nature of the third nucleotide influences protein lifetimes. ( a–d ) We…

Hypothetical scenarios linking the G/C contents at the third position of codons to…

Phylogeny, rates of evolution, and patterns of codon usage among sea urchin retroviral-like elements, with implications for the recognition of horizontal transfer

Phylogenetic relationships, rates of evolution, and codon usage were investigated in a family of retrotransposons (SURL elements) found in echinoids. The phylogeny of SURL element reverse transcriptase sequences from 10 echinoid species clearly shows the phylogenetic signature of the host taxa as well as paralogous sequences that diverged prior to speciation events. Two subfamilies (1 and 5) of SURL element reverse transcriptase sequences are recognized that diverged prior to the radiation of the Echinometridae. Comparisons of synonymous versus nonsynonymous substitutions indicate that SURL elements have been active in echinoid genomes and have evolved under purifying selection for millions of years. Rates of synonymous substitution for reverse transcriptase are similar to rates of single-copy DNA evolution and to rates of synonymous substitution for the H3 and H4 histone genes, contradicting the assumption that rates of evolution are accelerated in retrotransposons. Finally, codon usage in SURL elements is biased for codons ending in A or U relative to 42 sea urchin nuclear genes. Biased codon usage is sometimes cited as evidence for horizontal transfer, but in the case of SURL elements this bias occurs in spite of a long history of vertical transmission rather than because of horizontal transfer.

The Central Dogma: DNA Encodes RNA; RNA Encodes Protein

The flow of genetic information in cells from DNA to mRNA to protein is described by the central dogma (Figure), which states that genes specify the sequence of mRNAs, which in turn specify the sequence of amino acids making up all proteins. The decoding of one molecule to another is performed by specific proteins and RNAs. Because the information stored in DNA is so central to cellular function, it makes intuitive sense that the cell would make mRNA copies of this information for protein synthesis, while keeping the DNA itself intact and protected. The copying of DNA to RNA is relatively straightforward, with one nucleotide being added to the mRNA strand for every nucleotide read in the DNA strand. The translation to protein is a bit more complex because three mRNA nucleotides correspond to one amino acid in the polypeptide sequence. However, the translation to protein is still systematic and colinear, such that nucleotides 1 to 3 correspond to amino acid 1, nucleotides 4 to 6 correspond to amino acid 2, and so on.
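The colinear mapping described above can be written down directly. The following is a minimal sketch; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
def codon_positions(n_amino_acids):
    """Map each amino acid number to the 1-based mRNA nucleotide positions that encode it."""
    return {aa: (3 * aa - 2, 3 * aa - 1, 3 * aa) for aa in range(1, n_amino_acids + 1)}

def split_into_codons(mrna):
    """Read an mRNA string in consecutive groups of three, ignoring any incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

# Amino acid 1 comes from nucleotides 1 to 3, amino acid 2 from nucleotides 4 to 6.
print(codon_positions(2))
print(split_into_codons("AUGGCCUAA"))
```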

The Genetic Code Is Degenerate and Universal

Each amino acid is defined by a three-nucleotide sequence called the triplet codon. Given the different numbers of “letters” in the mRNA and protein “alphabets,” scientists theorized that single amino acids must be represented by combinations of nucleotides. Nucleotide doublets would not be sufficient to specify every amino acid because there are only 16 possible two-nucleotide combinations (4²). In contrast, there are 64 possible nucleotide triplets (4³), which is far more than the number of amino acids. Scientists theorized that amino acids were encoded by nucleotide triplets and that the genetic code was “degenerate.” In other words, a given amino acid could be encoded by more than one nucleotide triplet. This was later confirmed experimentally: Francis Crick and Sydney Brenner used the chemical mutagen proflavin to insert one, two, or three nucleotides into the gene of a virus. When one or two nucleotides were inserted, the normal proteins were not produced. When three nucleotides were inserted, the protein was synthesized and functional. This demonstrated that the amino acids must be specified by groups of three nucleotides. These nucleotide triplets are called codons. The insertion of one or two nucleotides completely changed the triplet reading frame, thereby altering the message for every subsequent amino acid (Figure). Though insertion of three nucleotides caused an extra amino acid to be inserted during translation, the integrity of the rest of the protein was maintained.
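Both the counting argument and the reading-frame argument can be checked with a few lines of code. This is a sketch only: the message below is a made-up sequence, and the helper names are assumptions introduced for illustration.

```python
from itertools import product

BASES = "ACGU"
doublets = ["".join(p) for p in product(BASES, repeat=2)]
triplets = ["".join(p) for p in product(BASES, repeat=3)]
print(len(doublets), len(triplets))  # two-letter versus three-letter combinations

def read_frame(seq):
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def insert_bases(seq, position, bases):
    """Return seq with extra bases inserted before the given 0-based position."""
    return seq[:position] + bases + seq[position:]

original = "GCUGAUCCAAAGUGG"  # hypothetical message, not taken from the experiment
one_inserted = insert_bases(original, 3, "A")
three_inserted = insert_bases(original, 3, "AAA")

print(read_frame(original))
print(read_frame(one_inserted))    # every triplet after the insertion changes
print(read_frame(three_inserted))  # one extra triplet, the rest unchanged
```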

Scientists painstakingly solved the genetic code by translating synthetic mRNAs in vitro and sequencing the proteins they specified (Figure).

In addition to codons that instruct the addition of a specific amino acid to a polypeptide chain, three of the 64 codons terminate protein synthesis and release the polypeptide from the translation machinery. These triplets are called nonsense codons, or stop codons. Another codon, AUG, also has a special function. In addition to specifying the amino acid methionine, it also serves as the start codon to initiate translation. The reading frame for translation is set by the AUG start codon near the 5′ end of the mRNA. Following the start codon, the mRNA is read in groups of three until a stop codon is encountered.
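The start-and-stop logic can be sketched as a small translation routine. The table below is the standard genetic code written in one-letter amino acid symbols, with `*` marking stop codons; the function name and the example mRNA are hypothetical, and using the first AUG in the string is a simplifying assumption.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

STOP_CODONS = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")

def translate(mrna):
    """Translate from the first AUG, reading triplets until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no reading frame is set
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the polypeptide
        protein.append(amino_acid)
    return "".join(protein)

print(STOP_CODONS)
print(translate("GGAUGGCCAAAUAAGG"))  # hypothetical mRNA with a short 5' leader
```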

The arrangement of the coding table reveals the structure of the code. There are sixteen “blocks” of codons, each specified by the first and second nucleotides of the codons within the block, e.g., the “AC*” block that corresponds to the amino acid threonine (Thr). Some blocks are divided into a pyrimidine half, in which the codon ends with U or C, and a purine half, in which the codon ends with A or G. Some amino acids get a whole block of four codons, like alanine (Ala), threonine (Thr) and proline (Pro). Some get the pyrimidine half of their block, like histidine (His) and asparagine (Asn). Others get the purine half of their block, like glutamate (Glu) and lysine (Lys). Note that some amino acids get a block and a half-block for a total of six codons.
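The block structure can be inspected programmatically. In this sketch the standard code is again written in one-letter amino acid symbols (T = Thr, H = His, Q = Gln, N = Asn, K = Lys, D = Asp, E = Glu); `block_summary` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def block_summary(prefix):
    """Return the amino acids in the pyrimidine (U, C) and purine (A, G) halves of a block."""
    pyrimidine_half = {CODON_TABLE[prefix + base] for base in "UC"}
    purine_half = {CODON_TABLE[prefix + base] for base in "AG"}
    return pyrimidine_half, purine_half

blocks = {"".join(p) for p in product(BASES, repeat=2)}
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(blocks))           # number of first-two-nucleotide blocks
print(block_summary("AC"))   # a whole block for one amino acid
print(block_summary("CA"))   # a block split into two halves
print([aa for aa, n in codons_per_amino_acid.items() if n == 6])  # block plus half-block
```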

The specification of a single amino acid by multiple similar codons is called “degeneracy.” Degeneracy is believed to be a cellular mechanism to reduce the negative impact of random mutations. Codons that specify the same amino acid typically only differ by one nucleotide. In addition, amino acids with chemically similar side chains are encoded by similar codons. For example, aspartate (Asp) and glutamate (Glu), which occupy the GA* block, are both negatively charged. This nuance of the genetic code means that a single-nucleotide substitution mutation might either specify the same amino acid and have no effect, or specify a similar amino acid, preventing the protein from being rendered completely nonfunctional.
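One way to see this buffering effect is to count, for each codon position, how many single-nucleotide substitutions leave the amino acid unchanged. The sketch below uses the standard code and considers only substitutions in codons that specify an amino acid; the function name is hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def synonymous_by_position():
    """Count substitutions at each codon position that keep the same amino acid."""
    counts = [0, 0, 0]
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # skip stop codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == amino_acid:
                    counts[pos] += 1
    return counts

first, second, third = synonymous_by_position()
print("synonymous substitutions at positions 1, 2, 3:", first, second, third)
```

Most silent substitutions fall at the third position, which is consistent with codons for the same amino acid usually differing by one nucleotide.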

The genetic code is nearly universal. With a few minor exceptions, virtually all species use the same genetic code for protein synthesis. Conservation of codons means that a purified mRNA encoding the globin protein in horses could be transferred to a tulip cell, and the tulip would synthesize horse globin. That there is only one genetic code is powerful evidence that all of life on Earth shares a common origin, especially considering that there are about 10⁸⁴ possible combinations of 20 amino acids and 64 triplet codons.
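The text does not say how a figure of this magnitude is obtained. One way to arrive at a number of the same order, under the assumption (not stated in the source) that each of the 64 codons is independently assigned one of 21 meanings (20 amino acids or "stop"), is the following arithmetic sketch.

```python
import math

CODONS = 64
MEANINGS = 21  # assumption: 20 amino acids plus a stop signal

possible_codes = MEANINGS ** CODONS
print(f"21^64 is roughly 10^{math.log10(possible_codes):.1f}")

# Without the stop signal the count is somewhat smaller.
print(f"20^64 is roughly 10^{math.log10(20 ** CODONS):.1f}")
```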


Pseudogenes are usually characterized by a combination of homology to a known gene and loss of some functionality. That is, although every pseudogene has a DNA sequence that is similar to some functional gene, they are usually unable to produce functional final protein products. [1] Pseudogenes are sometimes difficult to identify and characterize in genomes, because the two requirements of homology and loss of functionality are usually implied through sequence alignments rather than biologically proven.

  1. Homology is implied by sequence identity between the DNA sequences of the pseudogene and parent gene. After aligning the two sequences, the percentage of identical base pairs is computed. A high sequence identity means that it is highly likely that these two sequences diverged from a common ancestral sequence (are homologous), and highly unlikely that these two sequences have evolved independently (see Convergent evolution).
  2. Nonfunctionality can manifest itself in many ways. Normally, a gene must go through several steps to a fully functional protein: Transcription, pre-mRNA processing, translation, and protein folding are all required parts of this process. If any of these steps fails, then the sequence may be considered nonfunctional. In high-throughput pseudogene identification, the most commonly identified disablements are premature stop codons and frameshifts, which almost universally prevent the translation of a functional protein product.
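The two criteria above can be sketched in code. This is a deliberately crude illustration, not a real pseudogene-annotation pipeline: it assumes the two sequences are already aligned (gaps written as `-`), the function names are hypothetical, the example sequences are made up, and the frameshift check looks only at the net length change rather than at each indel.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned, ungapped columns in which the two bases are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(a == b for a, b in columns) / len(columns)

def has_net_frameshift(aligned_parent, aligned_pseudo):
    """True if the net inserted or deleted length is not a multiple of three."""
    net = aligned_parent.count("-") - aligned_pseudo.count("-")
    return net % 3 != 0

def premature_stops(coding_seq):
    """Indices of in-frame stop codons that occur before the final codon."""
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

parent = "ATGGCCAAATGGTAA"  # hypothetical functional coding sequence
pseudo = "ATGGCCTAATGGTAA"  # hypothetical copy with one substitution

print(percent_identity(parent, pseudo))
print(premature_stops(pseudo))
print(has_net_frameshift("ATGGCC", "ATG-CC"))
```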

Pseudogenes for RNA genes are usually more difficult to discover as they do not need to be translated and thus do not have "reading frames".

Pseudogenes can complicate molecular genetic studies. For example, amplification of a gene by PCR may simultaneously amplify a pseudogene that shares similar sequences. This is known as PCR bias or amplification bias. Similarly, pseudogenes are sometimes annotated as genes in genome sequences.

Processed pseudogenes often pose a problem for gene prediction programs, often being misidentified as real genes or exons. It has been proposed that identification of processed pseudogenes can help improve the accuracy of gene prediction methods. [2]

Recently 140 human pseudogenes have been shown to be translated. [3] However, the function, if any, of the protein products is unknown.

There are four main types of pseudogenes, all with distinct mechanisms of origin and characteristic features. The classifications of pseudogenes are as follows:

Processed

In higher eukaryotes, particularly mammals, retrotransposition is a fairly common event that has had a huge impact on the composition of the genome. For example, somewhere between 30–44% of the human genome consists of repetitive elements such as SINEs and LINEs (see retrotransposons). [6] [7] In the process of retrotransposition, a portion of the mRNA or hnRNA transcript of a gene is spontaneously reverse transcribed back into DNA and inserted into chromosomal DNA. Although retrotransposons usually create copies of themselves, it has been shown in an in vitro system that they can create retrotransposed copies of random genes, too. [8] Once these pseudogenes are inserted back into the genome, they usually contain a poly-A tail, and usually have had their introns spliced out; these are both hallmark features of cDNAs. However, because they are derived from an RNA product, processed pseudogenes also lack the upstream promoters of normal genes; thus, they are considered "dead on arrival", becoming non-functional pseudogenes immediately upon the retrotransposition event. [9] However, these insertions occasionally contribute exons to existing genes, usually via alternatively spliced transcripts. [10] A further characteristic of processed pseudogenes is common truncation of the 5' end relative to the parent sequence, which is a result of the relatively non-processive retrotransposition mechanism that creates processed pseudogenes. [11] Processed pseudogenes are continually being created in primates. [12] Human populations, for example, have distinct sets of processed pseudogenes across their individuals. [13]

Non-processed

Non-processed (or duplicated) pseudogenes. Gene duplication is another common and important process in the evolution of genomes. A copy of a functional gene may arise as a result of a gene duplication event caused by homologous recombination at, for example, repetitive SINE sequences on misaligned chromosomes, and subsequently acquire mutations that cause the copy to lose the original gene's function. Duplicated pseudogenes usually have all the same characteristics as genes, including an intact exon-intron structure and regulatory sequences. The loss of a duplicated gene's functionality usually has little effect on an organism's fitness, since an intact functional copy still exists. According to some evolutionary models, shared duplicated pseudogenes indicate the evolutionary relatedness of humans and the other primates. [14] If pseudogenization is due to gene duplication, it usually occurs in the first few million years after the gene duplication, provided the gene has not been subjected to any selection pressure. [15] Gene duplication generates functional redundancy, and it is not normally advantageous to carry two identical genes. Mutations that disrupt either the structure or the function of either of the two genes are not deleterious and will not be removed through the selection process. As a result, the gene that has been mutated gradually becomes a pseudogene and will be either unexpressed or functionless. This kind of evolutionary fate is shown by population genetic modeling [16] [17] and also by genome analysis. [15] [18] Depending on the evolutionary context, these pseudogenes will either be deleted or become so distinct from the parental genes that they are no longer identifiable. Relatively young pseudogenes can be recognized due to their sequence similarity. [19]

Unitary pseudogenes

Various mutations (such as indels and nonsense mutations) can prevent a gene from being normally transcribed or translated, and thus the gene may become less- or non-functional or "deactivated". These are the same mechanisms by which non-processed genes become pseudogenes, but the difference in this case is that the gene was not duplicated before pseudogenization. Normally, such a pseudogene would be unlikely to become fixed in a population, but various population effects, such as genetic drift, a population bottleneck, or, in some cases, natural selection, can lead to fixation. The classic example of a unitary pseudogene is the gene that presumably coded the enzyme L-gulono-γ-lactone oxidase (GULO) in primates. In all mammals studied besides primates (except guinea pigs), GULO aids in the biosynthesis of ascorbic acid (vitamin C), but it exists as a disabled gene (GULOP) in humans and other primates. [20] [21] Another more recent example of a disabled gene links the deactivation of the caspase 12 gene (through a nonsense mutation) to positive selection in humans. [22]

It has been shown that processed pseudogenes accumulate mutations faster than non-processed pseudogenes. [23]

Pseudo-pseudogenes

The rapid proliferation of DNA sequencing technologies has led to the identification of many apparent pseudogenes using gene prediction techniques. Pseudogenes are often identified by the appearance of a premature stop codon in a predicted mRNA sequence, which would, in theory, prevent synthesis (translation) of the normal protein product of the original gene. There have been some reports of translational readthrough of such premature stop codons in mammals. As alluded to in the figure above, a small amount of the protein product of such readthrough may still be recognizable and function at some level. If so, the pseudogene can be subject to natural selection. That appears to have happened during the evolution of Drosophila species.

In 2016 it was reported that 4 predicted pseudogenes in multiple Drosophila species actually encode proteins with biologically important functions, [24] "suggesting that such 'pseudo-pseudogenes' could represent a widespread phenomenon". For example, the functional protein (an olfactory receptor) is found only in neurons. This finding of tissue-specific biologically-functional genes that could have been classified as pseudogenes by in silico analysis complicates the analysis of sequence data. In the human genome, a number of examples have been identified that were originally classified as pseudogenes but later discovered to have a functional, although not necessarily protein-coding, role. [25] [26] As of 2012, it appeared that there are approximately 12,000–14,000 pseudogenes in the human genome. [27] A 2016 proteogenomics analysis using mass spectrometry of peptides identified at least 19,262 human proteins produced from 16,271 genes or clusters of genes, with 8 new protein-coding genes identified that were previously considered pseudogenes. [28]

Drosophila glutamate receptor. The term "pseudo-pseudogene" was coined for the gene encoding the chemosensory ionotropic glutamate receptor Ir75a of Drosophila sechellia, which bears a premature termination codon (PTC) and was thus classified as a pseudogene. However, in vivo the D. sechellia Ir75a locus produces a functional receptor, owing to translational read-through of the PTC. Read-through is detected only in neurons and depends on the nucleotide sequence downstream of the PTC. [24]

siRNAs. Some endogenous siRNAs appear to be derived from pseudogenes, and thus some pseudogenes play a role in regulating protein-coding transcripts, as reviewed. [29] One of the many examples is psiPPM1K. Processing of RNAs transcribed from psiPPM1K yields siRNAs that can act to suppress the most common type of liver cancer, hepatocellular carcinoma. [30] This and much other research has led to considerable excitement about the possibility of targeting pseudogenes with therapeutic agents, or using them as therapeutic agents. [31]

piRNAs. Some piRNAs are derived from pseudogenes located in piRNA clusters. [32] Those piRNAs regulate genes via the piRNA pathway in mammalian testes and are crucial for limiting transposable element damage to the genome. [33]

microRNAs. There are many reports of pseudogene transcripts acting as microRNA decoys. Perhaps the earliest definitive example of such a pseudogene involved in cancer is the pseudogene of BRAF. The BRAF gene is a proto-oncogene that, when mutated, is associated with many cancers. Normally, the amount of BRAF protein is kept under control in cells through the action of miRNA. In normal situations, the amount of RNA from BRAF and the pseudogene BRAFP1 compete for miRNA, but the balance of the 2 RNAs is such that cells grow normally. However, when BRAFP1 RNA expression is increased (either experimentally or by natural mutations), less miRNA is available to control the expression of BRAF, and the increased amount of BRAF protein causes cancer. [34] This sort of competition for regulatory elements by RNAs that are endogenous to the genome has given rise to the term ceRNA.

PTEN. The PTEN gene is a known tumor suppressor gene. The PTEN pseudogene, PTENP1, is a processed pseudogene that is very similar in its genetic sequence to the wild-type gene. However, PTENP1 has a missense mutation which eliminates the codon for the initiating methionine and thus prevents translation of the normal PTEN protein. [35] In spite of that, PTENP1 appears to play a role in oncogenesis. The 3' UTR of PTENP1 mRNA functions as a decoy of PTEN mRNA by targeting microRNAs due to its similarity to the PTEN gene, and overexpression of the 3' UTR resulted in an increase of PTEN protein level. [36] That is, overexpression of the PTENP1 3' UTR leads to increased regulation and suppression of cancerous tumors. The biology of this system is basically the inverse of the BRAF system described above.

Potogenes. Pseudogenes can, over evolutionary time scales, participate in gene conversion and other mutational events that may give rise to new or newly functional genes. This has led to the concept that pseudogenes could be viewed as potogenes: potential genes for evolutionary diversification. [37]

Sometimes genes are thought to be pseudogenes, usually based on bioinformatic analysis, but then turn out to be functional genes. Examples include the Drosophila jingwei gene [38] [39] which encodes a functional alcohol dehydrogenase enzyme in vivo. [40]

Another example is the human gene encoding phosphoglycerate mutase [41] which was thought to be a pseudogene but which turned out to be a functional gene, [42] now named PGAM4. Mutations in it cause infertility. [43]

Pseudogenes are found in bacteria. [44] Most are found in bacteria that are not free-living; that is, they are either symbionts or obligate intracellular parasites. Thus, they do not require many genes that are needed by free-living bacteria, such as genes associated with metabolism and DNA repair. However, there is no fixed order in which functional genes are lost. For example, the oldest pseudogenes in Mycobacterium leprae are in RNA polymerases and the biosynthesis of secondary metabolites while the oldest ones in Shigella flexneri and Shigella typhi are in DNA replication, recombination, and repair. [45]

Since most bacteria that carry pseudogenes are either symbionts or obligate intracellular parasites, genome size eventually reduces. An extreme example is the genome of Mycobacterium leprae, an obligate parasite and the causative agent of leprosy. It has been reported to have 1,133 pseudogenes which give rise to approximately 50% of its transcriptome. [45] The effect of pseudogenes and genome reduction can be further seen when compared to Mycobacterium marinum, a pathogen from the same family. Mycobacterium marinum has a larger genome than Mycobacterium leprae because it can survive outside the host; therefore, its genome must contain the genes needed to do so. [46]

Although genome reduction removes genes that are not needed by eliminating pseudogenes, selective pressures from the host can sway what is kept. In the case of a symbiont from the Verrucomicrobia phylum, there are seven additional copies of the gene coding the mandelalide pathway. [47] The host, a species of Lissoclinum, uses mandelalides as part of its defense mechanism. [47]

The relationship between epistasis and the domino theory of gene loss was observed in Buchnera aphidicola. The domino theory suggests that if one gene of a cellular process becomes inactivated, then selection in other genes involved relaxes, leading to gene loss. [48] When comparing Buchnera aphidicola and Escherichia coli, it was found that positive epistasis furthers gene loss while negative epistasis hinders it.
