5.1: Binding - The First Step Toward Protein Function - Biology

Reversible Binding of a Ligand to a Macromolecule

Reversible, noncovalent binding of two or more molecules is the first step in the expression of the biological properties of almost all biomacromolecules. If one of the molecules is small, it's often called a ligand. Ions (simple, like Ca2+, or molecular, like CH3CO2-) are also considered ligands when bound to proteins or nucleic acids.

You might be more familiar with the term ligand when it's applied to the coordination of a transition metal ion by electron pair donors (Lewis bases) on monodentate or multidentate molecules, which for transition metal complexes are called ligands. An example is a cobalt ion bound to EDTA, a multidentate ligand.

The cobalt ion (dark grey ball) is octahedrally coordinated to the multidentate ligand EDTA.

Whether a macromolecule M and a ligand L bind to each other depends on their relative concentrations and how tightly they bind. Compare this to an acid. Its pKa and the pH of the medium determine if it deprotonates.

Biochemists rarely use equilibrium (association) constants to describe the strength of a binding interaction; they use the reciprocals instead, the dissociation constants, \(K_D\). For the reaction \(M + L \leftrightarrow ML\), where M is free macromolecule, L is free ligand, and ML is the macromolecule-ligand complex (which is held together by intermolecular forces, not covalent bonds), the KD is given by

\[K_D = \frac{[M][L]}{[ML]}\]

The cartoon below shows free and bound M and L.

Notice the unit of KD is molarity, M.

  • The lower the KD (i.e. the higher the [ML] at any given M and L), the tighter the binding.
  • The higher the KD, the looser the binding. KDs for biological molecules are finely tuned to their environments.

KD values vary from about 1 mM (weak interactions) for some enzyme-substrate complexes down to pM or even fM levels. Examples of very tight, noncovalent interactions include the avidin (an egg protein)-biotin (a vitamin) and thrombin (enzyme initiating clotting)-hirudin (a leech salivary protein) complexes. The values are "tuned" so that the relative concentrations of free and bound M and L are appropriate for a biological setting.

To understand binding, it is important not only to know the noncovalent, intermolecular forces (IMFs) that lead to binding, but also to ask the simple question: are the macromolecule and ligand bound, and to what extent? To know if M or L is bound, we must use simple mathematics that you would have learned in Introductory or Analytical Chemistry courses. We'll start with the mathematical description, which students often find harder to understand than the IMFs.

We will start with three basic equations:

For the dissociation constant:

\[K_D = \frac{[M]_{eq}[L]_{eq}}{[ML]_{eq}} = \frac{[M][L]}{[ML]}\]

(note that KD has units of molarity);

For mass balance of M: \[M_0 = M + ML\] where M0 is the total amount of macromolecule (note: brackets and the eq subscript will be left off if the resulting equation is unambiguous).

For mass balance of L: \[L_0 = L + ML\] where L0 is the total amount of ligand.

We would like to derive equations that give ML as a function of known or measurable values. The KD equation (5.1.1) shows that ML depends on free M and free L. From the equations above we can derive two fundamental and equally valid equations, which are useful under different experimental conditions.

Case 1:

This applies when you can readily measure free L OR when experimental conditions are such that L0 >> M0 (so L ≈ L0), which is often encountered in a lab setting. In that case you don't have to measure free L, since it is approximately equal to the total ligand added to the system.

Substituting 5.1.3 into 5.1.1 gives

\[K_D = \frac{[M][L]}{[ML]} = \frac{[M_0 - ML][L]}{[ML]}\]

\[(ML)K_D = (M_0)L - (ML)L\]

\[(ML)K_D + (ML)L = (M_0)L\]

\[(ML)(K_D + L) = (M_0)L\]

\[(ML) = \dfrac{(M_0)L}{K_D + L}\]

This equation is ALWAYS TRUE for the chemical equation written above. L is the free ligand concentration at equilibrium.

A plot of the concentration of the ML complex (ML) vs free L (L) is shown below.

If L0 >> M0, then free L ≈ L0 and the equation simplifies to:

\[ML = \dfrac{(M_0)(L_0)}{K_D + L_0}\]

Dividing this equation by M0 gives the fractional saturation Y of the macromolecule M.

\[Y = \frac{[ML]}{M_0} = \dfrac{L}{K_D + L}\]

where Y can vary from 0 (when L = 0) to 1 (when L >> KD).

Note that the graph above and graphs of ML vs L (equation 5.1.10) and Y vs L (equation 5.1.11) are all hyperbolas.

To get a "gut" level understanding of the graphs of \((ML) = (M_0)(L)/(K_D + L)\) and \(Y = L/(K_D+L)\), let's consider 3 different values or sets of values of free ligand:

  1. L = 0: This obviously gives ML = 0
  2. L = KD: \((ML) = (M_0)(L)/(L + L) = (M_0)(L)/(2L) = M_0/2\), which indicates that M is half saturated. In fact, the operational definition of KD is the ligand concentration at which M is half saturated.
  3. L >> KD: ML = M0 and the macromolecule is saturated with ligand.
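The three limiting cases above can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `fractional_saturation` and the KD value of 1.0 (arbitrary concentration units) are assumptions chosen only for illustration.

```python
def fractional_saturation(free_ligand, k_d):
    """Y = L / (K_D + L) for single-site binding.

    free_ligand is the FREE ligand concentration at equilibrium,
    in the same units as k_d.
    """
    return free_ligand / (k_d + free_ligand)


k_d = 1.0  # hypothetical K_D, arbitrary concentration units

# Walk free L from 0 up to 100 x K_D and watch Y climb toward 1.
for multiple in (0, 0.1, 1, 10, 100):
    y = fractional_saturation(multiple * k_d, k_d)
    print(f"L = {multiple:>5} x K_D  ->  Y = {y:.3f}")
```

Note that Y reaches 0.5 exactly at L = KD, but needs L = 10 x KD to reach about 0.91 and L = 100 x KD to reach about 0.99, which is why the hyperbola approaches saturation so slowly.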

Case 2 (more general):

This applies when you know KD, but don't know free L or haven't measured it, and you just wish to calculate how much ML is present at equilibrium, given a KD value. In this case, L0 does not have to be much greater than M0. If it were, as is often the case in an experimental system, you would know that free L ≈ L0 and you could use Case 1.

In this case, we will substitute the mass balance equations for both M0 (Eq 5.1.2) and L0 (Eq 5.1.3) into the equation for KD (Eq. 5.1.1). This gives:

\[K_D = \frac{[M][L]}{[ML]} = \frac{[M_0 - ML][L_0 - ML]}{[ML]}\]

\[(ML)K_D = (M_0 - ML)(L_0 - ML)\]

\[(ML)K_D = (M_0)(L_0) - (ML)(L_0) - (ML)(M_0) + (ML)^2\] or

\[(ML)^2 - (L_0 + M_0 + K_D)(ML) + (M_0)(L_0) = 0\]

This is of the form \(ax^2 + bx + c = 0\) where

  • a = 1
  • b = - (L0 + M0 +KD)
  • c = (M0)(L0)

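with the well known solution \(x = [-b - (b^2 - 4ac)^{1/2}]/2a\). (The root with the minus sign is the physically meaningful one, since the other root would make ML larger than the smaller of M0 and L0.) Therefore,

\[(ML) = \frac{(L_0+M_0+K_D) - \left((L_0+M_0+K_D)^2 - 4(M_0)(L_0)\right)^{1/2}}{2}\]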
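As a rough numerical check, the quadratic solution can be compared against the Case 1 equation. The sketch below is illustrative only: the function names are hypothetical, M0 = 4 µM and KD = 0.19 µM are borrowed from the example discussed later in this section, and L0 = 6 µM is an assumed value.

```python
import math


def complex_from_totals(m_total, l_total, k_d):
    """[ML] at equilibrium from total M, total L and K_D (Case 2).

    Takes the smaller root of ML^2 - (L0 + M0 + K_D) ML + M0 L0 = 0,
    the only root that does not exceed both M0 and L0.
    """
    s = l_total + m_total + k_d
    return (s - math.sqrt(s * s - 4.0 * m_total * l_total)) / 2.0


def complex_from_free_ligand(m_total, l_free, k_d):
    """[ML] from FREE ligand (Case 1): ML = M0 L / (K_D + L)."""
    return m_total * l_free / (k_d + l_free)


m0, l0, kd = 4.0, 6.0, 0.19  # all in µM; l0 is an assumed value

ml = complex_from_totals(m0, l0, kd)
l_free = l0 - ml  # mass balance of L

print(f"ML from totals     : {ml:.4f} µM")
print(f"free L             : {l_free:.4f} µM")
print(f"ML from free L     : {complex_from_free_ligand(m0, l_free, kd):.4f} µM")
```

Both routes give the same ML, as they must, but free L here is far from L0, so using L0 in place of L in the Case 1 equation would overestimate the complex.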

A plot of Y, the fractional saturation, vs total L (L0) is shown below.

In the derivations, we came up with two equations for ML, Eq 5.1.10 which gives ML vs L and Eq 5.1.16 which gives ML vs L0.

Both equations are valid. In the first you must know free L, which is approximately L0 if M0 << L0. In the second, you don't need to know free M or L at all. At a given L0, M0, and KD, you can calculate ML, which should be the same ML you get from the first equation if you know free L.

Equations 5.1.10 and 5.1.16 are useful in several circumstances. They can be used to

  • calculate the concentration of ML if KD, M0, and L (for Eq. 5.1.10) or if KD, M0, and L0 (for Eq. 5.1.16) are known. This is analogous to the use of the Henderson-Hasselbalch equation to calculate the protonation state (HA) and hence charge state of an acid at various pH values. In the binding case we are calculating the concentration of a reversibly bound ligand (ML), and in the acid case, the concentration of covalently bound protons (HA).
  • calculate KD if ML, M0, and L (for Eq. 5.1.10) or if ML, M0, and L0 (for Eq. 5.1.16) are known. Techniques to extract the KD from binding data will be discussed in a separate chapter section.

Interpretation of Binding Analyses

It is important to get a mathematical understanding of the binding equations and graphs. It is equally important to get an intuitive understanding of their properties. Just as we used the +/- 2 pH rule in determining at a glance the charge state of an acid, you need to be able to determine the extent of binding (how much of M is bound with L) given their relative concentrations and the KD. The usual situation is that [M0] << [L0]. What happens to the binding curves for M + L ↔ ML if the KD gets progressively lower? Intuitively, you should expect that binding will increase, especially as L gets greater. The curves below should help you develop the intuition you need with respect to binding equilibria.


The figures below show Y vs L0 at varying KDs.

The next figure shows Y vs L0 at a very low KD (0.001 µM = 1 nM), resulting in a sharp "titration" curve. Any increment of L added is bound, so effectively none is present free. The line abruptly changes to a horizontal line when all the macromolecule is bound. This curve could be used to determine [M0]!

Note that in the last graph, given the same M0 and L0 concentrations, the "titration curves" for a binding equilibrium characterized by even tighter binding (for example, a KD = 0.1 pM or 0.01 pM) would be indistinguishable from the graph when KD = 1 pM. It should be apparent that for all of these KD values, all of the added ligand is bound until [L0] > [M0]. To differentiate these cases, much lower ligand concentrations would be required such that on addition of ligand, all is not bound. Also note that this curve is NOT hyperbolic, which makes sense since the graph is of Y vs L0, not Y vs L, and since L0 is not >> M0.

The graph below shows fractional saturation Y vs L at two different KD values.

It is quite interesting to compare graphs of Y (fractional saturation) vs L (free) and Y vs L0 (total L) in the special case when L0 is not >> M0. Examples are shown below for M0 = 4 μM and KD = 0.19 μM. At the ligand concentrations used, it should be apparent that L can't be approximated by L0.

Two points should be evident from these graphs when L is not approximated by Lo:

  • a graph of Y vs L0 is not truly hyperbolic, but it does saturate
  • a KD value (ligand concentration at half-maximal binding) can not be estimated by inspection from the Y vs L0, but it can be from the Y vs L graph.

The figure below shows a comparison of the extent of covalent binding of a proton to an acid at pH values around the pKa and by analogy the extent of noncovalent binding of a ligand at log[L] values around the log KD.

Different Graphical Analyses of Binding

In addition to the hyperbolic plots of [ML] vs [L] and fractional saturation Y vs [L], a variety of derivative plots are often encountered. The equations and their graphs (for two different KD values) are shown below. The graphs are in the form of Y vs L0, when L0 is approximately equal to free L.

\[\text{hyperbolic saturation plot: } \quad Y = \frac{L}{K_D + L}\]

\[\text{double reciprocal plot: } \quad \frac{1}{Y} = \frac{K_D + L}{L} = \frac{K_D}{L} + 1 = K_D\left(\frac{1}{L}\right) + 1\]

A plot of 1/Y vs 1/L has a slope of KD and a y intercept of 1 (which is the number of binding sites for this simple mechanism)

\[\begin{aligned}
Y(K_D + L) &= L \\
Y K_D + YL &= L \\
Y K_D &= L - YL = L(1 - Y) \\
\text{Scatchard plot: } \quad \frac{Y}{L} &= \frac{1 - Y}{K_D} = -\frac{Y}{K_D} + \frac{1}{K_D}
\end{aligned}\]

A plot of Y/L vs Y has a slope of -1/KD and a y intercept of 1/KD.

Straight line transformations of the hyperbolic binding equations are useful to get approximate values of KD, but linear regression analysis to get slopes and intercepts is not statistically optimal as the errors in the y variable (Y) and in the y and x variables in the Scatchard plot are not identical across values. To determine KD, it is best to fit experimental data to the nonlinear function for the hyperbola.
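The slopes and intercepts quoted above can be verified on synthetic, noise-free data. The sketch below is only an illustration under assumed values (KD = 2.0 in arbitrary units and a hypothetical ligand series); it uses a hand-written least-squares helper, `linear_fit`, rather than any particular library. With real, noisy data the transformations distort the errors, which is the reason the text recommends nonlinear fitting instead.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


k_d_true = 2.0  # hypothetical K_D, arbitrary concentration units
ligand = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]  # free L, same units
y_vals = [l / (k_d_true + l) for l in ligand]    # noise-free Y = L/(K_D + L)

# Double reciprocal plot: 1/Y vs 1/L has slope K_D and intercept 1.
slope_dr, intercept_dr = linear_fit([1 / l for l in ligand],
                                    [1 / y for y in y_vals])

# Scatchard plot: Y/L vs Y has slope -1/K_D and intercept 1/K_D.
slope_sc, intercept_sc = linear_fit(y_vals,
                                    [y / l for y, l in zip(y_vals, ligand)])

print(f"double reciprocal: K_D = {slope_dr:.3f}, intercept = {intercept_dr:.3f}")
print(f"Scatchard        : K_D = {-1 / slope_sc:.3f}, intercept = {intercept_sc:.3f}")
```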

Dimerization and Multiple Binding Sites

In the previous examples, we considered the case of a macromolecule M binding a ligand L at a single site, as described in the equation below:

M + L ↔ ML

where KD = [M][L]/[ML]

We saw that the binding curves (ML vs L or Y vs L) are hyperbolic, with KD = L at half-maximal binding. But there are many other chemical equilibria that can mechanistically explain binding data. We'll consider just two cases here.


A special, yet common example of this equilibrium occurs when a macromolecule binds itself to form a dimer (D), as shown below:

M + M ↔ M2 or D

where D is the dimer, and where

\[K_D = \frac{[M][M]}{[D]} = \frac{[M]^2}{[D]}\]

At first glance you would expect a graph of [D] vs total [M0] to be hyperbolic, with the KD again equaling the [M0] at half-maximal dimer concentration. The half-maximal point does turn out to fall at M0 = KD, but a simple derivation is in order. In the case of dimer formation, M plays the role of both M and L in the earlier derived expression, and its free concentration changes as dimer forms. So we have to invoke mass balance of M again: \([M_0] = [M] + 2[D]\), where the coefficient 2 is necessary since there are 2 M in each dimer.

More generally, for the formation of trimers (Tri), tetramers (Tetra), and other oligomers, \([M_0] = [M] + 2[D] + 3[Tri] + 4[Tetra] + \ldots\)

Rearranging the mass balance and solving for D gives \(D = ([M_0] - [M])/2\). Substituting this into the KD expression gives

\[K_D = \frac{M^2}{(M_0 - M)/2} = \frac{2M^2}{M_0 - M}\]

This can be rearranged into quadratic form for M (not D):

\[2M^2 + K_D(M) - K_D(M_0) = 0\]

which is of the form \(ax^2 + bx + c = 0\).

Solving the quadratic equation gives [M], and with M0, D can be calculated from \(D = ([M_0] - [M])/2\).

A value Y, similar to fractional saturation, can be calculated, where Y is the fraction of total possible D, which can vary from 0 to 1: \(Y = 2D/M_0\)

A graph of Y vs Mo with a dimerization dissociation constant KD = 25 uM, is shown below.

Note that the curve appears somewhat hyperbolic. Half-maximal dimer formation does occur at a total M concentration M0 = KD. Also note, however, that even at M0 = 1000 µM, which is 40x KD, only 90% of the total possible D is formed (Y = 0.90). For the simple M + L ↔ ML equilibrium, if L0 = 40x the KD and M0 << L0, \(Y = L/(K_D+L) = L/[(L/40)+L] = 0.976\).
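The two numbers quoted above (Y = 0.5 at M0 = KD, and Y of roughly 0.9 at M0 = 40 x KD) can be reproduced from the quadratic. This is a minimal sketch; the function name `dimer_fraction` is hypothetical, and KD = 25 µM is the value used for the graph in the text.

```python
import math


def dimer_fraction(m_total, k_d):
    """Fraction of total possible dimer, Y = 2D/M0, for M + M <-> D.

    Solves 2M^2 + K_D*M - K_D*M0 = 0 for free monomer M (positive root),
    then applies the mass balance D = (M0 - M)/2.
    """
    m_free = (-k_d + math.sqrt(k_d * k_d + 8.0 * k_d * m_total)) / 4.0
    dimer = (m_total - m_free) / 2.0
    return 2.0 * dimer / m_total


k_d = 25.0  # µM, dimerization dissociation constant from the text

for m0 in (25.0, 250.0, 1000.0):
    print(f"M0 = {m0:6.0f} µM  ->  Y = {dimer_fraction(m0, k_d):.3f}")

# Single-site comparison from the text: L = 40 x K_D with M0 << L0
print(f"M + L <-> ML at L = 40 x K_D: Y = {40.0 / 41.0:.3f}")
```

The comparison shows how much more slowly dimerization approaches completion than single-site ligand binding at the same multiple of KD.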



The aggregation state of a protein monomer is closely linked with its biological activity. For proteins that can form dimers, some are active in the monomeric state, while others are active as a dimer. High concentrations, such as those found when proteins are crystallized for x-ray structure analysis, can drive proteins into the dimeric state, which may lead to the false conclusion that the active protein is a dimer. Determining the actual physiological concentration [M0] and the KD gives investigators the Y value, which can be correlated with biological activity. For example, interleukin 8, a chemokine which binds certain immune cells, exists as a dimer in x-ray and NMR structural determinations, but as a monomer at physiological concentrations. Hence the monomer, not the dimer, binds its receptors on immune cells. Viral proteases (herpes viral protease, HIV protease) are active in dimeric form, in which the active site is formed at the dimer interface.

Binding of a ligand to two independent sites

What if a ligand L binds to two different sites on the same biomacromolecule? The graph below shows such binding to two independent sites with different KDs. We'll assume the binding of one ligand does NOT influence the binding of the other.
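Under the stated assumption of independence, each site follows its own single-site hyperbola, so the average number of ligands bound per macromolecule is simply the sum of the two. The sketch below illustrates this; the function name and the two KD values (1 and 100, arbitrary units) are assumptions chosen to make the two phases of binding visible.

```python
def ligands_bound_per_macromolecule(free_ligand, k_d1, k_d2):
    """Average number of L bound per M for two independent sites.

    Each site contributes its own hyperbolic term L / (K_D + L);
    independence means the terms simply add (range 0 to 2).
    """
    site1 = free_ligand / (k_d1 + free_ligand)
    site2 = free_ligand / (k_d2 + free_ligand)
    return site1 + site2


k_d_tight, k_d_weak = 1.0, 100.0  # hypothetical values, arbitrary units

for free_l in (0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0):
    bound = ligands_bound_per_macromolecule(free_l, k_d_tight, k_d_weak)
    print(f"L = {free_l:8.1f}  ->  bound per M = {bound:.3f}")
```

With KDs this far apart, the tight site is nearly full before the weak site begins to fill, so the curve rises in two stages rather than as a single hyperbola.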

The Binding Continuum

Binding affinities give us a way to measure the relative strength of binding between two substances. But how "tight" is tight binding? Weak binding? Let us examine that issue by considering a binding continuum. Consider two substances, A and B, that might interact. Over what range of strengths can they actually bind to each other? It would be helpful to set up the extremes of the binding continuum. At one end is no binding at all. At the other end, consider two things that bind covalently. We have discussed how KD reflects binding strength. Remember, KD = 1/Keq. Also, we know that Keq is related to ΔG0 by the equations:

\[\Delta G^0 = -RT\ln K_{eq} = RT\ln K_D\]

Given these simple equations, you should be able to interconvert between Keq, KD, and ΔG0. (Keep your units straight.)

No interaction: One end of the binding continuum represents no interaction. Let's assume that Keq is tiny (KD large), for example Keq ~ 2.4 × 10⁻⁷². Plugging this into the equation \(\Delta G^0 = -RT\ln K_{eq}\), where R ≈ 2.00 cal/(mol·K) and T is about 300 K, gives ΔG0 ~ +100 kcal/mol. That is, if we add A + B, there is no drive to form AB. If AB did form, it would immediately fall apart.

Covalent interaction: At the other end of the continuum, consider the interaction of one H atom with another to form H2. From a general chemistry book we can get ΔG0form. Using simple thermodynamics, we can calculate ΔG0 for H-H formation (ΔG0 = ΣΔG0form prod. - ΣΔG0form react.). Doing this gives a value of -97 kcal/mol.
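The interconversion between Keq, KD, and ΔG0 at the two ends of the continuum can be sketched in a few lines. The function names are hypothetical; R = 2.00 cal/(mol·K) and T = 300 K are the rounded values used in the text, so the results are approximate.

```python
import math

R_CAL = 2.00  # cal/(mol K), rounded value used in the text
T = 300.0     # K


def delta_g0_from_keq(k_eq):
    """Standard free energy in kcal/mol from Keq: dG0 = -RT ln Keq."""
    return -R_CAL * T * math.log(k_eq) / 1000.0


def kd_from_delta_g0(delta_g0_kcal):
    """K_D in M from dG0 in kcal/mol, using dG0 = RT ln K_D."""
    return math.exp(delta_g0_kcal * 1000.0 / (R_CAL * T))


# "No interaction" end of the continuum: tiny Keq, large positive dG0
print(f"Keq = 2.4e-72  ->  dG0 = {delta_g0_from_keq(2.4e-72):+.0f} kcal/mol")

# Covalent end: dG0 = -97 kcal/mol corresponds to a vanishingly small K_D
print(f"dG0 = -97 kcal/mol  ->  K_D = {kd_from_delta_g0(-97.0):.1e} M")
```

The two extremes are therefore separated by roughly 200 kcal/mol in ΔG0, or about 140 orders of magnitude in KD.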

Specific and Nonspecific Binding: Consider the interaction of a protein, the lambda repressor (R), with a small oligonucleotide to which it binds tightly (called the operator DNA, O). This is an example of a biologically tight, but reversible interaction. R can bind to many short oligonucleotides due to electrostatic interactions and H bonds between the positively charged protein and the negatively charged nucleic acid backbone. The tight binding interaction, however, involves oligonucleotides of specific base sequence. Hence we can distinguish between tight binding, which usually involves a specific DNA sequence, and weak binding, which involves nonspecific sequences. Likewise, we will speak of specific and nonspecific binding. R and O, which bind with a KD of 1 pM, represent an example of specific binding, while R and nonspecific DNA (D), which bind mostly through electrostatic interactions with a KD of 1 mM, are an example of nonspecific binding. You might expect that any positively charged protein, like mitochondrial cytochrome C, would bind negatively charged DNA. This nonspecific interaction would presumably have no biological significance since the two are localized in different compartments of the cell. In contrast, the interaction between positively charged histone proteins and the DNA to which they are bound in the nucleus would be specific.

Rate constants for association and dissociation: When the reaction M + L ↔ ML is at equilibrium, the rate of the forward reaction is equal to the rate of the reverse reaction. From General Chemistry, the forward reaction is bimolecular and second order. Hence vf, the rate in the forward direction, is proportional to [M][L], or \(v_f = k_f[M][L]\), where kf is the rate constant in the forward direction. The rate of the reverse reaction, vr, is first order, proportional to [ML], and is given by \(v_r = k_r[ML]\), where kr is the rate constant for the reverse reaction. Notice that the units of kf are M⁻¹s⁻¹, while the units of kr are s⁻¹. At equilibrium, \(v_f = v_r\), or

\[k_f[M][L] = k_r[ML]\]

Rearranging the equation gives

\[\frac{[ML]}{[M][L]} = \frac{k_f}{k_r} = K_{eq}\]

Hence Keq is given by the ratio of rate constants. For tight binding interactions, Keq >> 1, KD << 1, and kf is very large (on the order of 10⁸ to 10⁹ M⁻¹s⁻¹) and kr must be very small (10⁻² to 10⁻⁴ s⁻¹).

To get a more intuitive understanding of KDs, it is often easier to think about the rate constants which contribute to binding and dissociation. The rate constant kr describes the dissociation reaction and is often called the off rate (koff). Likewise, kf is often called the on rate (kon). It can be shown mathematically that the rate at which two simple molecules associate depends on their radius and effective molecular weight. The maximal rate at which they will associate is the maximal rate at which diffusion will lead them together. Let us assume that the rate at which M and L associate is diffusion limited. The theoretical kon is about 10⁸ M⁻¹s⁻¹. Knowing this, the KD, and the fact that kon/koff = Keq = 1/KD, we can calculate koff, which, remember, is a first order rate constant.

We can also determine koff experimentally. Imagine the following example. Adjust the concentrations of M and L such that M0 << L0 and L0 >> KD. Under these conditions of ligand excess, M is entirely in the bound form, ML. Now at t = 0, dilute the solution so that L0 << KD. The only process that will occur here is dissociation, since negligible association can occur under the new conditions. If you can measure the biological activity of ML, then you could measure the rate of disappearance of ML with time, and get koff. Alternatively, if you could measure the biological activity of M, the rate at which activity returns will give you koff.

Now you will remember from Introductory Chemistry that for a first order reaction, the half-life (t1/2) can be calculated from the rate constant by the expression k = 0.693/t1/2. Hence, given koff, you can determine the t1/2 of the associated species. That is, how long will a complex of ML last before it dissociates? Given ΔG0 or KD, and assuming a kon (10⁸ M⁻¹s⁻¹), you should be able to calculate koff and t1/2. Or, you could determine koff experimentally, and then calculate t1/2. Applying these principles, you can calculate the parameters below.

Calculated koff and t1/2 for binary complexes assuming diffusion-controlled kon.

| Complex | KD (M) | koff (s⁻¹) | t1/2 |
|---|---|---|---|
| | 1 × 10⁻⁷¹ | 1 × 10⁻⁶³ | 2 × 10⁵⁵ yr |
| RtV3 : Rt'L3 (a) | | 1 × 10⁻⁹ | 2 yr |
| | | 1 × 10⁻⁷ | 80 days |
| | 5 × 10⁻¹⁴ | 5 × 10⁻⁶ | 2 days |
| | 1 × 10⁻¹³ | 1 × 10⁻⁵ | 0.8 days |
| | | 1 × 10⁻³ | 700 s |
| | | | 7 s |
| | 2 × 10⁻⁹ | 2 × 10⁻¹ | 3 s |
| | 4 × 10⁻⁹ | 4 × 10⁻¹ | 2 s |
| LDH (pig) : NADH (g) | | 7.1 × 10¹ | 10 ms |
| profilin : CaATP-G-actin | 1.2 × 10⁻⁶ | 1.2 × 10² | 6 ms |
| TBP : DNAnonspec (h) | 5 × 10⁻⁶ | 5 × 10² | 1 ms |
| TCR (i) : cyto C peptide | | | 100 µs |
| | 1 × 10⁻⁴ | 1 × 10⁴ | 70 µs |
| uridine-3P : RNase | 1.4 × 10⁻⁴ (j) | | 50 µs |
| Creatine Kinase : ADP | 8.2 × 10⁻⁴ (j) | | 10 µs |
| | 1.2 × 10⁻³ | 1.2 × 10⁵ | 6 µs |
| no interaction | 4 × 10⁷³ | 4 × 10⁸¹ | |


  a. Trivalent vancomycin derivative RtV3 + trivalent D-Ala-D-Ala derivative Rt'L3'
  b. Hirudin is a potent thrombin inhibitor from leech saliva
  c. lac rep is the E. coli lac operon repressor protein, and DNAoper is the specific DNA binding region in the E. coli genome that binds to the repressor
  d. Zif268 is a mouse zinc-finger binding protein
  e. GroEL is a chaperone protein; r-lactalbumin is the reduced form of lactalbumin
  f. TBP is the TATA Binding Protein which binds to the TATA box consensus sequence
  g. LDH is lactate dehydrogenase
  h. DNAnonspec is DNA which does not contain the specific DNA sequence region involved in specific binding to a DNA binding protein
  i. TCR is the T-cell receptor
  j. calculated from equation: KD = koff/kon.
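The koff and t1/2 entries in the table above follow directly from KD under the stated assumption of a diffusion-limited kon. The sketch below reproduces two rows; the function names are hypothetical, and kon = 10⁸ M⁻¹s⁻¹ is the assumed value from the text, not a measured one.

```python
import math

K_ON = 1e8  # M^-1 s^-1, assumed diffusion-limited on rate from the text


def k_off(k_d, k_on=K_ON):
    """Off rate in s^-1 from K_D = koff/kon."""
    return k_d * k_on


def half_life(k_off_value):
    """Half-life in seconds of a first-order decay: t1/2 = ln 2 / koff."""
    return math.log(2) / k_off_value


for k_d in (5e-14, 1e-4):  # two K_D values (M) taken from the table
    off = k_off(k_d)
    t_half = half_life(off)
    print(f"K_D = {k_d:.0e} M  ->  koff = {off:.0e} s^-1, t1/2 = {t_half:.3g} s")
```

A KD of 5 × 10⁻¹⁴ M gives a half-life of about 1.4 × 10⁵ s (roughly 2 days), while a KD of 1 × 10⁻⁴ M gives about 70 µs, matching the tabulated values to the precision shown there.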

What is usually measured is KD and/or koff (if the koff is reasonable). This analysis is very simplified. Electrostatic forces and other orientation factors may significantly change kon, while conformational changes in the complex may prevent ready unbinding of the bound ligand, dramatically altering koff.

The structure of one of the tightest binding complexes, avidin and biotin, is shown below.

It is important to note that even reactions characterized by high KD can be specific. Specificity is ultimately defined as a binding interaction between a macromolecule and ligand that can be co-localized in the same environment and for which a biological function is elaborated upon binding.

Transcription factor

In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. [1] [2] The function of TFs is to regulate (turn on and off) genes in order to make sure that they are expressed in the right cell at the right time and in the right amount throughout the life of the cell and the organism. Groups of TFs function in a coordinated fashion to direct cell division, cell growth, and cell death throughout life; cell migration and organization (body plan) during embryonic development; and intermittently in response to signals from outside the cell, such as a hormone. There are up to 1600 TFs in the human genome. [3] Transcription factors are members of the proteome as well as the regulome.

  • gene expression – the process by which information from a gene is used in the synthesis of a functional gene product such as a protein
  • transcription – the process of making messenger RNA (mRNA) from a DNA template by RNA polymerase
  • transcription factor – a protein that binds to DNA and regulates gene expression by promoting or suppressing transcription
  • transcriptional regulation – controlling the rate of gene transcription, for example by helping or hindering RNA polymerase binding to DNA
  • upregulation, activation, or promotion – increase the rate of gene transcription
  • downregulation, repression, or suppression – decrease the rate of gene transcription
  • coactivator – a protein (or a small molecule) that works with transcription factors to increase the rate of gene transcription
  • corepressor – a protein (or a small molecule) that works with transcription factors to decrease the rate of gene transcription
  • response element – a specific sequence of DNA that a transcription factor binds to

TFs work alone or with other proteins in a complex, by promoting (as an activator), or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA) to specific genes. [4] [5] [6]

A defining feature of TFs is that they contain at least one DNA-binding domain (DBD), which attaches to a specific sequence of DNA adjacent to the genes that they regulate. [7] [8] TFs are grouped into classes based on their DBDs. [9] [10] Other proteins such as coactivators, chromatin remodelers, histone acetyltransferases, histone deacetylases, kinases, and methylases are also essential to gene regulation, but lack DNA-binding domains, and therefore are not TFs. [11]

TFs are of interest in medicine because TF mutations can cause specific diseases, and medications can be potentially targeted toward them.

T-Cells and Systemic Lupus Erythematosus

Robert Hoffman DO , Marcos E. Maldonado MD , in Systemic Lupus Erythematosus , 2007


Self-reactive T-cells may escape normal mechanisms of immunologic tolerance, expand, and be detectable in the peripheral blood of patients with SLE. 10–16 T-cells reactive with a number of lupus nuclear autoantigens [including DNA; histones; the small nuclear ribonucleic proteins Sm-B, Sm-D, U1-70kD, and U1-A; and heterogeneous nuclear ribonucleoprotein (hnRNP) A2 protein] have been isolated from the peripheral blood of SLE patients and characterized. These are outlined in Table 10.2 . 10–12 , 17–32

Rajagopalan et al. were the first to describe human T-cell lines reactive with double-stranded DNA isolated from patients with SLE. 17 The activated T-cells they identified selectively augmented the production of pathogenic IgG anti-DNA antibodies ex vivo, supporting the conclusion that they might have a role in pathogenesis. 17 Datta and colleagues subsequently characterized chromatin-reactive T-cells in SLE in detail and reported that these T-cells are typically CD4+, can provide help to anti-DNA and antihistone antibody-producing B-cells, and that they have restricted T-cell receptor CDR3 usage with characteristics of antigen selection by a limited number of cationically charged antigenic epitopes. They mapped the major T-cell epitopes present on the core nucleosomal histone protein complex to four regions: histone H2B amino acid residues 10 through 33, histone H3 residues 85 through 105, histone H4 residues 16 through 39, and histone H4 residues 71 through 94. They demonstrated that these autoantigenic peptides can be promiscuously presented by several HLA-DR alleles. Furthermore, they found that nucleosome-reactive human T-cells produce substantial quantities of IFN gamma. They found in parallel studies done in a murine model system that such nephrogenic complement-fixing antinucleosome autoantibodies belong to IFN gamma–dependent IgG subclasses. They subsequently proposed that expansion of these low-affinity chromatin autoantigen-reactive T-cells is essential for sustaining anti-DNA/histone autoantibody-producing B-cells. 17–19

In addition to DNA and nucleosomes, human T-cells reactive with various small nuclear ribonucleoprotein self-antigens (including Sm-B, Sm-D, U1-70kD, and U1-A) have been identified and characterized. The characteristic features of autoantigen-reactive T-cells that have been described are outlined in Table 10.3 . 10–12 , 20, 21, 23–32 These small nuclear ribonucleoproteins are ubiquitous self-antigens that are components of the spliceosome complex, which physiologically functions to excise introns and generate messenger RNA transcripts lacking intervening RNA. 33, 34 Sm-reactive T-cell lines and T-cell clones reactive with the Sm-D or Sm-B small nuclear ribonucleoproteins were first described by Hoffman and colleagues from patients with SLE. 10 U1 small nuclear ribonucleoprotein-reactive peripheral blood T-cells were first reported by O'Brien and colleagues from patients classified as SLE. 20 T-cell clones from connective tissue disease patients reactive with the U1-70kD small nuclear ribonucleoprotein antigen were described by Hoffman and colleagues. 23, 26 Okubo and colleagues were the first to describe peripheral blood mononuclear-cell-derived CD4+ T-cells from SLE or MCTD patients that reacted to the U1-A small nuclear ribonucleoprotein. 21 Subsequently, various such small nuclear ribonucleoprotein-reactive T-cell clones have been extensively characterized (see Table 10.3 ). Typically, they are CD4-positive T-cells that produce large amounts of IFN gamma, moderate quantities of IL-2, and variable quantities of IL-4 and IL-10. 4, 23–30 They recognize antigen in the context of HLA-DR. T- and B-cell responses are linked in SLE, and both small nuclear ribonucleoprotein and hnRNP-reactive T-cells can provide B-cell help for autoantibody production. 31

T-cell epitope mapping studies of human T-cell clones reactive with the small nuclear ribonucleoproteins U1-70kD, Sm-B, and Sm-D have been done to determine the precise regions recognized on the autoantigen by T-cells. These studies have revealed that there are limited T-cell epitopes on these antigens. Interestingly, virtually all T-cell antigen recognition regions (or so-called T-cell epitopes) reside within functional regions of the protein—either within the Sm motifs for Sm-B and Sm-D or within the RNA binding domain for U1-70kD and hnRNP. 27, 29, 34

T-cell clones have recently been identified and characterized from patients with SLE that are reactive with another nuclear ribonucleoprotein antigen known as hnRNP A2. 31 Greidinger and colleagues cloned human T-cells reactive with hnRNP A2 from SLE patients and found that such hnRNP-reactive T-cells when cocultured in vitro with autologous B-cells could augment anti-hnRNP autoantibody production. 31 Haffman and Steiner also identified and characterized hnRNP-reactive T-cells and found that similar to the findings described previously for U1-70kD T-cell epitope mapping, hnRNP-reactive T-cells also recognize the RNA binding domain portion of the antigen. 32 Collectively, these studies reveal a recurring theme: ribonucleoprotein-reactive T-cells are directed against highly conserved regions that function to bind their associated RNA (the RNA binding domain of U1-RNP and hnRNP) or their associated proteins (Sm protein-protein binding domains).

Finally, a novel mechanism for autoantigen “cross reactivity” by T-cells in SLE has recently been reported. De Silva-Udawatta and colleagues reported that T-cell receptor usage by small nuclear ribonucleoprotein-reactive T-cells can have significant flexibility or “plasticity”. 30 For example, they found that a single T-cell receptor can recognize two distinct small nuclear ribonucleoprotein autoantigenic peptides that have no apparent sequence homology. 30 This cross reactivity is limited to the U1-70kD and a Sm-B peptide. However, a series of other closely related small nuclear ribonucleoprotein-derived peptides did not cross stimulate the T-cell receptor. These studies indicate that there are now a number of distinct mechanisms for immunologic cross reactivity that may result in loss of tolerance in SLE, including cross reactivity occurring at the level of the T-cell receptor.

The Outside of the DNA Helix Can Be Read by Proteins

As discussed in Chapter 4, the DNA in a chromosome consists of a very long double helix (Figure 7-6). Gene regulatory proteins must recognize specific nucleotide sequences embedded within this structure. It was originally thought that these proteins might require direct access to the hydrogen bonds between base pairs in the interior of the double helix to distinguish between one DNA sequence and another. It is now clear, however, that the outside of the double helix is studded with DNA sequence information that gene regulatory proteins can recognize without having to open the double helix. The edge of each base pair is exposed at the surface of the double helix, presenting a distinctive pattern of hydrogen bond donors, hydrogen bond acceptors, and hydrophobic patches for proteins to recognize in both the major and minor groove (Figure 7-7). But only in the major groove are the patterns markedly different for each of the four base-pair arrangements (Figure 7-8). For this reason, gene regulatory proteins generally bind to the major groove, as we shall see.

Figure 7-6

Double-helical structure of DNA. The major and minor grooves on the outside of the double helix are indicated. The atoms are colored as follows: carbon, dark blue; nitrogen, light blue; hydrogen, white; oxygen, red; phosphorus, yellow.

Figure 7-7

How the different base pairs in DNA can be recognized from their edges without the need to open the double helix. The four possible configurations of base pairs are shown, with potential hydrogen bond donors indicated in blue, potential hydrogen bond ...

Figure 7-8

A DNA recognition code. The edge of each base pair, seen here looking directly at the major or minor groove, contains a distinctive pattern of hydrogen bond donors, hydrogen bond acceptors, and methyl groups. From the major groove, each of the four base-pair ...

Although the patterns of hydrogen bond donor and acceptor groups are the most important features recognized by gene regulatory proteins, they are not the only ones: the nucleotide sequence also determines the overall geometry of the double helix, creating distortions of the “idealized” helix that can also be recognized.

Driving forces in the origins of life

What were the physico-chemical forces that drove the origins of life? We discuss four major prebiotic ‘discoveries’: persistent sampling of chemical reaction space; sequence-encodable foldable catalysts; assembly of functional pathways; and encapsulation and heritability. We describe how a ‘proteins-first’ world gives plausible mechanisms. We note the importance of hydrophobic and polar compositions of matter in these advances.

1. What forces drove the origins of biology?

How did life begin? What drove the transition, more than 3 billion years ago, from physical chemistry to biology (Pchem2Bio)? We seek the origins of biology’s forces of sustainability and persistent innovation. To be clear, this is not the same as seeking mechanisms of self-replication. Here is a metaphor. Consider an imaginary self-replicating mouse trap. This device is outfitted so that it can reach into a bin of metal and wood parts and assemble a copy of itself. But what happens when the bin runs out of parts? Self-replication, by itself, is not a sustaining force. Nor does it explain how its self-replication abilities arose from physico-chemical stochastic processes in the first place. Here, we are interested in the causative actions that could have driven physical chemistry (Pchem) to discover biology (Bio), with its unique abilities to propagate in ways that are resourceful, adaptive and persistent.

First, an overview of related research. The origins field has a long history, dating back, at least, to Darwin’s idea in 1871 of a ‘warm little pond’ [1,2] and then of a ‘primordial soup’ [3,4]. Many are studies of prebiotic chemistry, including prominent early ones by Urey [5] and Miller [6] in the early 1950s, and Orgel in 1968 [7], which have sought molecules and conditions that were plausible on the early earth and their possible reactions. Others have focused on what biological precursor molecules might have come from space, for example, in the Murchison and other meteorites [8]. There have been speculations on chicken-and-egg ‘what-came-first’ problems. Metabolism first [9]? Proteins and functionality? Nucleic acids and information? An RNA world first [10,11]? A world of encapsulated replicating RNAs [11]? A lipid world [12]? What interactions might have led to the genetic code [13–15]? For general reviews, see [16–19]. And since there are no definitive experiments yet, much work is speculation using theory and modelling, such as of primitive replication, in Eigen’s quasi-species models [20,21], the GARD model [22] and others [23–25]. The present work is aimed in a different direction: to seek plausible origins of biology’s drive towards persistence and long-term innovation. Here are our starting points.

2. Our premises about prebiotic chemistry

Life originated on Earth. We assume that origins happened on Earth. While amino acids and simple organics are found on meteorites from space, we are interested in more life-like complexity, which is unlikely to have come from panspermia (i.e. originating in space before coming to Earth [26,27]).

Life arose by natural laws, including chemical transformations of simpler molecules into more complex ones as well as physical processes such as diffusion, binding, catalysis, chemical reactions and changes in molecular concentrations and conformations.

Like today, it was far away from equilibrium. Life is a non-equilibrium (NEQ) state. It requires continual input of energy and matter. Earth’s energy input from the sun is huge [28]. At some point during life’s origin, some chemical reactions became linked with energy to drive them. Chemistry ‘learned’ to harness energy, through gradients of ions or protons, or daily cycles—of light and dark, or heating and drying, or changes in salts, temperature, or redox or pH states, for example.

It started with simple chemicals, maybe in a special environment, like a prebiotic soup, a shared space, maybe ‘Darwin’s warm little pond’ [2] or a hot hydrothermal vent in a sea floor. That medium contained prebiotically plausible simple molecules, such as methane, ammonia, water, some amino acids and nucleic acids, catalysed by surfaces, minerals and metals [3,4].

3. Distinguishing between life and non-life

A living system:

  • is ‘wet’ (i.e. made of molecules);
  • has units of agency, such as cells;
  • metabolizes, taking in matter and energy;
  • grows and replicates independently; and
  • has lineages and heritable variation.

3.1. The dynamics is different: persistence versus relaxation to equilibria

The biological dynamics we consider is evolutionary change. Both living and non-living matter have dynamical behaviours that entail stochastic searching of degrees of freedom (DOF), sampled by the actions of random forces and driven toward macrostates that can be predicted by a variational principle (the second law of thermodynamics in physical systems; survival of the fittest in biology). But the details are very different; see table 1. For one thing, biology’s evolutionary tendencies are not a drive toward equilibrium. For more than 3 billion years, life has been in a stable non-equilibrium. Survival of the fittest (SOF) is a principle of long-term sustained dynamics, not equilibrium. For another thing, different dynamical processes dominate biological evolution versus chemistry. In Pchem, atoms and molecules search positions, velocities and conformations, sampled by random thermal forces. In biology, cells search different growth rates, sampled by random changes in monomer sequences in proteins and nucleic acids. And the nature of disorder is different: their corresponding entropies do not even have the same units. How did Pchem come upon, and enable, biology’s processes and forces?

Table 1. Dynamical processes in biological evolution are different than in physical chemistry.

4. Survival of the fittest is a persistence principle

4.1. Evolution is sustained by positive feedback

What is the nature of SOF as a dynamical variational principle? Much of textbook physical chemistry describes systems subject to negative feedback: they are stable, subject to restoring forces, having states of equilibria to which they return after perturbation. By contrast, the centerpiece of biology’s evolutionary dynamics—SOF—is a principle of persistence, i.e. a sustained capacity for a particular type of positive feedback 1 or what, in simpler chemical systems, would be called autocatalysis [33,34]. One example of autocatalysis is A + B → 2B. Another is a forest fire, where burning is cooperative among fuel elements that are at high density. Here, we refer to this positive feedback as bootstrapping, taken from the expression: lifting yourself up by your bootstraps. 2 Biological evolution is sustained by SOF. What physical chemistry begat that principle?
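As a rough illustration of the autocatalytic step A + B → 2B, the sketch below integrates the mass-action rate law dB/dt = k·A·B with a simple forward-Euler scheme. The function name, the rate constant `k`, the time step and the starting amounts are all assumptions chosen for illustration; they are not taken from the cited work.

```python
def simulate_autocatalysis(a0, b0, k, dt, steps):
    """Integrate dB/dt = k*A*B for A + B -> 2B with forward Euler.

    All parameter values passed in are illustrative assumptions.
    """
    a, b = a0, b0
    trajectory = [b]
    for _ in range(steps):
        rate = k * a * b      # mass-action rate of A + B -> 2B
        a -= rate * dt        # each reaction event consumes one A ...
        b += rate * dt        # ... and yields a net gain of one B
        trajectory.append(b)
    return trajectory


# A tiny seed of B grows almost exponentially at first, then levels off
# once the supply of A is used up (A + B is conserved).
traj = simulate_autocatalysis(a0=1.0, b0=0.001, k=1.0, dt=0.01, steps=3000)
print(traj[0], traj[-1])
```

In this closed toy system the growth stops when A runs out, which echoes the point made earlier about the bin of parts: positive feedback alone does not sustain itself without continued input.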

4.2. The SOF principle, described in general terms

Suppose you have some operational device that has persistent input and output; for example, a cell, a machine or a company. You can tweak the inner workings of the device to alter its productivity. Fitness is a measure of how effectively (by some metric) the input resources are converted to output. A company can tweak its process to make more product from less resource. In this context, survival measures the amount of input resource the device takes in. If a company makes product more efficiently, then the company gains a bigger market. This gives it access to even more resources, allowing it to outcompete other such companies for resources. In SOF, there is a feedback loop: advantageous actions are rewarded by new capacity to take more actions. The better the performance, the greater the access to even more resources, creating a virtuous cycle of improvement and dominance over the resource pool.

4.3. Biology implements SOF in a specific, clever and convoluted way

The pawn that the hand of evolution moves is not the cell, but cell lineages. The metric of survival is the population of a cell lineage relative to others. The ‘knob’ that evolution turns to change that population is the growth rates of cells. Evolution ‘turns that knob’ by random mutations of proteins (and also recombination, lateral gene transfer, plasmids and gene duplication today). A cell’s growth rate is largely determined by its rate of protein production. Hence, here is how the SOF positive feedback loop is implemented in biology: a change such as a mutation increases a cell’s growth rate, causing the cell to duplicate faster, increasing the population of that cell’s lineage of ancestors relative to other lineages. This gives that cell’s lineage greater access to resources in the next generation. This positive feedback principle leads to some of biology’s most marvellous features, described below.

4.4. SOF acts by advantages, not by averages

Positive feedback processes can be controlled by small fluctuations. Compare to a river. A river’s flow properties are dominated by the largest and deepest channels, not the small tributaries, because the typical observables are averages, which are dominated by the biggest flows. By contrast, a key feature of positive feedback is that it can become dominated by the very smallest metaphorical tributaries, provided that those flows are somehow advantageous to the process [35,36]. It allows for ratcheting of advantage. It raises up winners: the few and the good can bootstrap up to dominate over the many and the average. If a single individual cell happens to be well fit for its environment, it grows rapidly. Its lineage can come to dominate the population. This positive feedback manifests as adaptability, innovation, improved match for environments and apparent goal directedness. We note that once an improvability process such as SOF is discovered, there are no limits to the marvellous intricacies it can lead to. 3
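The claim that a rare but slightly fitter lineage can come to dominate can be sketched with a minimal two-lineage model. It assumes, purely for illustration, that both lineages grow exponentially at constant rates with no resource limits, so that only the difference in growth rate matters. The names `lineage_fraction` and `advantage`, and the numerical values, are hypothetical.

```python
import math


def lineage_fraction(f0, advantage, generations):
    """Fraction of the population held by the faster-growing lineage.

    f0          initial fraction of that lineage (assumed value)
    advantage   growth-rate difference per generation (assumed value)
    """
    # The odds f/(1-f) are multiplied by exp(advantage) every generation.
    odds = f0 / (1.0 - f0) * math.exp(advantage * generations)
    return odds / (1.0 + odds)


# One cell in a million with a 5% growth-rate advantage.
for g in (0, 100, 200, 400):
    print(g, lineage_fraction(1e-6, 0.05, g))
```

With zero advantage the fraction never changes, which is the 'averages' regime; any positive advantage, however small, eventually takes over under these assumptions.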

In the Pchem2Bio transition, how did stochastic physical dynamics ‘discover’ stochastic biological dynamics? How did polymer chain sequences emerge as the searchable degrees of freedom? What random processes searched and sampled them? And what autocatalytic chemical or physical process could have bootstrapped its way to becoming cellular SOF? Below are four important ‘discoveries’ that Pchem made to reach biology, three of which are positive-feedback bootstrap processes.

4.5. Pchem2Bio in steps

Consider Pchem2Bio as a kinetic process. We are free to divide the average pathway into two sequential steps, real or conceptual, since we can arbitrarily choose the barrier heights, one of which could be zero. The point of division into two steps is to help elucidate the mechanism. The second kinetic barrier, the final step to biology, as defined above, must have had all ingredients present: proteins for function, RNA or DNA for information, and encapsulation and metabolites. But two-state kinetics gives no mechanistic insight; it happens as a single event. Keeping in mind the primacy of understanding driving forces, we postulate below a prior step: proteins develop primitive functions before RNA and proteins together create a genetic code. We argue that protein folding offers a driving principle.

In this view, the first step is amino acids becoming linked into short random peptides by Pchem processes, catalysed by surfaces or metals, for example. Proteins grew longer and catalytic through an autocatalytic foldamer-catalysis process (the foldcat bootstrap), generating a diversity of actions. Proteins and metabolites assembled into primitive biochemical pathways, through the catpath bootstrap. This results in a stable community of molecules, a nearbiotic soup. This soup, however, does not satisfy our definition of a system that is alive. Rather, this is just a non-equilibrium chemical intermediate state along the way.

In the second step, the nearbiotic soup could then divide into compartmentalized units of individuals (i.e. proto-cells) that could compete for resources. Those units have heritability, encoded in informational memory molecules, defining lineages on which SOF can act.

5. Major discoveries in Pchem2Bio

Below we list key discoveries made by physico-chemical processes on the road to biology. (1) Coupling drivers to chemistry. Non-equilibria (NEQ) sampled and drove chemical reactions and molecular processes. (2) Proteins as mobile programmable catalysts. Monomer sequences in proteins became searchable degrees of freedom, giving programmable catalysts and molecular machines. (3) Assembling biochemical pathways. Functionally similar reactions associated into spatially localized pathways. (4) Creating individuals and lineages. Encapsulation into cells allowed for a distinction of SELF and competition. A genetic code, memory and heritability allowed for survival of the fittest. Our proposition here is that they needn’t have happened all at once. A first step of (1)–(3) would require only proteins. Even today, the existence of horizontal gene transfer implies that linear heritability is not an obligatory early step.

5.1. Dynamical processes can sample and drive molecular processes

Was there some special aspect of dynamics in general that created or enabled life [38]? We consider two roles of dynamics, per se, in origins: (i) as a mixer and random driver of chemical reactions, and (ii) through specific mechanisms that can drive particular relevant innovations.

5.1.1. Forces of disorder can explore chemical reaction space

In general, NEQ per se is not a driver towards order. The sun, winds, waves and volcanoes drive randomness, mixing and disorder. Even so, disordering can give predictable outcomes. For example, thermal forces that randomize the velocities of gas atoms lead to the ideal gas law, a precise relationship. But the randomization that matters on the road to biology is over a very different space than that of gas velocities; it is over the space of chemical reactions. Early earth dynamics could drive different molecules together randomly, sometimes reacting with each other, sometimes catalysed by surfaces, and continually producing product wherever there are continual inputs of appropriate energy and matter [39].

And although organic-molecule reaction space is very large [40], the space of today’s biochemical reactions is relatively small and simple [41] (figure 1), hence ancestral versions of them must have been similar [42,43]. There is no reason to believe there was a specific goal-driven force to select out those reactions that would become biochemistry. But geophysical mixing dynamics could at least have searched and sampled some simple reactions, which, through particular dynamical mechanisms described below, could have led to biology.

Figure 1. Overview of the biochemistry of living systems. These processes at the core of life are relatively simple, few and coupled. Figure adapted with permission from [41].

5.1.2. Far-from-equilibrium drivers toward persistence and innovation, not just restoring forces

Prigogine and colleagues popularized the view that biology-like spatio-temporal patterning (in chemical oscillators like the Belousov–Zhabotinsky reaction, for example) can arise from NEQ processes [44,45]. Non-equilibrium forces are special; they differ in at least two ways from equilibrium forces.

First, non-equilibrium forces are zero at equilibrium. For example, while bar magnets have a static pull, electromagnets have no pull when the electric current is turned off. In Fick’s law, particles stop flowing when there is no concentration gradient. Also, hurricanes operate only when the underlying thermal conditions drive them. Non-equilibrium structures and organization are sustained by non-equilibrium inputs of matter and energy. Second, NEQ differs by push versus pull, i.e. by supply versus demand. Near-equilibrium processes are pulled toward equilibria, a tendency towards a state of minimum free energy. They are governed by the second law of thermodynamics. By contrast, far-from-equilibrium (FFE) processes are pushed by input energy and matter that are out of equilibrium. Imagine a flood that carves a new river bed; it does not aim to go any particular place, it just pushes water, which flows through a path of least resistance. Evolution does not steadily march towards predetermined goals [46], like second law equilibrium restoring processes do. 4 The NEQ realm is broad and innovative, through particular mechanisms, many of which are not yet fully understood, and two of which are described below.

5.2. The foldcat bootstrap: protein foldamers as programmable catalysts

5.2.1. The importance of proteins as programmable catalysts

Biology would be impossible without its machines and catalysts, protein enzymes. On the one hand, Orgel and others argued that there is severe difficulty in achieving biochemistry-like reactions with only prebiotically available catalysts [18,47]. On the other hand, important recent experiments have achieved significant reactions using prebiotically available catalysts [48–52]. Even so, chemistry in the prebiotic era was hostile to chemical innovation. The catalysts for those reactions were mineral surfaces or metal ions, many of which were spatially immobile (not accessible to substrates), capable only of catalysing limited reactions, each only under limited and different conditions, and only where substrates were sufficiently concentrated.

Biology is more innovative than prebiotic chemistry. Biology’s catalysts, mostly proteins, are mobile and can go where the substrates are; can be altered to work in different environments, including just in water, or in membranes; can operate at whatever ambient temperature is needed for the organism; and are readily tunable to any degree that is needed to fit within whole reaction pathways and cycles. Protein catalysts could be called programmable, in the sense that their extraordinarily wide range of capabilities can be controlled by just a simple single kind of process, namely mutating amino acid sequences.

The importance of this breakthrough (discovering programmable catalysts) can be illuminated with a metaphor. Compare a fictitious prebiotic organic chemist ‘demon’ (i.e. working with random processes) to a corresponding biology demon. The Ochem demon cannot create a complex multi-step process without many different specific catalysts, each chosen for different conditions, some with intermediate products produced in particular ways. This is sufficiently challenging that academic organic chemists can publish research papers about them! By contrast, the Bio demon just spins some dials on a big dashboard, picking a reaction type, picking the solvent and temperature conditions, picking the desired acceleration and linking multiple reactions together by stringing together pathways of multiple enzymes. Of course, much trial and error is needed for both demons. The early discovery, by physical chemical processes, of catalysts that are explorable and optimizable through random changes of sequences of monomers in a polymer chain is arguably one of the most important steps made during the origins of life because of its capacity for rapid trial-and-error invention of complex chemical processes and diverse functionalities, all brought together under single conditions. Our term ‘programmability’ here does not refer to heritability or a genetic encoding; rather, it is simply intended to express that changing an amino acid sequence can change a molecule’s functional capability.

Here, we describe a mechanism for the origins of proteins as programmable catalysts, controllable through their amino acid sequences. We call it the foldcat bootstrap mechanism. It is an autocatalytic process by which short peptides become elongated and sequence selective, and develop primitive versions of today’s protein enzymes and machines. It addresses the following question: what physical process might drive particular subpopulations of chain sequences to self-amplify at the expense of other subpopulations? In this mechanism, random peptides fold and help catalyse the elongation of others in a primitive ribosome-like way. In this way, short-chain peptides grow longer and more plentiful, growing protein mass.

There are many plausible prebiotic processes that can polymerize individual amino acids into peptides, or nucleic acids into short DNA or RNA molecules. But these polymerizations all suffer from the so-called Flory problem, namely that the resultant chains are mostly very short (≈2–8-mers); longer chains are exponentially less probable (figure 2a).
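The exponential fall-off with chain length can be made concrete with the standard Flory 'most probable' distribution, in which each additional bond forms independently with probability p, so the number fraction of n-mers is (1 − p)·p^(n−1). The sketch below evaluates it; the value p = 0.5 is an assumption chosen only for illustration, not a measured prebiotic value.

```python
def flory_fraction(n, p):
    """Number fraction of chains of length n when each bond forms with probability p."""
    return (1.0 - p) * p ** (n - 1)


p = 0.5  # assumed bond probability, for illustration only
for n in (1, 2, 4, 8, 16, 32):
    print(n, flory_fraction(n, p))
```

Each extra monomer multiplies the population by the same factor p, so under this assumption the mean chain length is 1/(1 − p) and long chains are vanishingly rare unless some additional mechanism amplifies them.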

Figure 2. (a) The Flory problem. Typical polymerizations create mostly short chains. Longer chains are exponentially less populated. (b) The HP model of foldamers. H(ydrophobic) are red, P(olar) are blue. Different HP sequences fold to different native structures. HP sequences fold in ways that lead to maximal burial of the H monomers into a core, minimizing contact with water. Figures reproduced from [53].

Known prebiotic polymerizations also do not address (i) how the randomness in polymerized sequences leads to ordered and informational sequences, and (ii) how such processes became autocatalytic, leading to stable steady states of production of long-chain informational-sequence polymers.

The foldamer catalyst hypothesis [53] offers an explanation. In this hypothesis, chains are polymerized using two types of monomers: hydrophobic (H) and polar (P), as modern-day proteins are. 5 When H and P monomers are linked into long chains, like today’s proteins, different HP sequences spontaneously fold in water to different ‘native’ structures [55] (figure 2b). The structures are driven by the oil–water principle that hydrophobic monomers seek to minimize contact with water.
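To make the HP picture concrete, the following sketch enumerates every self-avoiding conformation of a short chain on a 2D square lattice and scores each by its number of H–H contacts, a common proxy for hydrophobic burial. It is a minimal brute-force illustration, not the enumeration code used in [53]; the function names are hypothetical, and conformations related by rotation or reflection are counted separately.

```python
def conformations(n):
    """Yield all self-avoiding walks of n beads on the 2D square lattice."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def grow(path):
        if len(path) == n:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in path:          # self-avoidance: no site used twice
                path.append(nxt)
                yield from grow(path)
                path.pop()

    yield from grow([(0, 0)])


def hh_contacts(seq, path):
    """Count H-H pairs that are lattice neighbours but not chain neighbours."""
    count = 0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):
            if seq[i] == "H" and seq[j] == "H":
                (xi, yi), (xj, yj) = path[i], path[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    count += 1
    return count


def best_fold(seq):
    """Return (maximum H-H contacts, number of conformations reaching it)."""
    best, degeneracy = -1, 0
    for path in conformations(len(seq)):
        c = hh_contacts(seq, path)
        if c > best:
            best, degeneracy = c, 1
        elif c == best:
            degeneracy += 1
    return best, degeneracy


print(best_fold("HPPH"))
print(best_fold("HPHPPHHPHH"))
```

Sequences whose best score is reached by only a few conformations behave like 'folders' in this toy model, whereas an all-P chain scores zero in every conformation. Brute-force enumeration grows exponentially with chain length, so this approach is practical only for short chains.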

According to this hypothesis, some short-chain HP sequences will compactify in aqueous solutions into structures that have some exposure of their hydrophobic residues on their surface. Call those hydrophobic surfaces landing pads, and those chains catalysts. If a second short peptide chain lands its own H monomers on the sticky hydrophobic surface of the first one, a catalyst, then the second chain will undergo an enhanced rate of covalent elongation because of the sticky localization of the chain and an H monomer to be added (figure 3a).

Figure 3. (a) The blob chain elongates the string chain. It folds, and has a landing pad, putting the string chain next to new monomers, thus elongating the second chain. (b) This foldcat mechanism (orange) bootstraps to longer-chain populations, overcoming the Flory length problem (green). Figures reproduced from [53].

The HP foldcat mechanism gives the three properties sought above. First, exact enumeration in the HP lattice model shows that this mechanism leads to amplified populations of longer chains (figure 3b). It also leads to a reduced subspace of HP sequences, initiating a process of converting random sequences to informational polymers. And it generates an autocatalytic set that continues propagating other sequences in that set; see figure 4. The following paragraphs give arguments for the plausibility of this mechanism.

Figure 4. Only a small subset of all HP sequences form an autocatalytic set, propagating themselves, at the expense of other sequences. In this way, random short chains become informational longer chains that populate their volume in steady state. Figures reproduced from [53].

5.2.2. Evidence for folding in HP polymers

Today’s protein folding code is dominated by the binary HP patterning in the sequence [53,55]. This is proven in experiments where proteins that have been massively mutated, in ways that preserve only a given HP pattern, still fold to their appropriate native structures [56–59]. 6 Moreover, HP foldability does not even require that the polymer backbone be a peptide. Peptoid chains (polymers of N-substituted glycines) can also fold into HP-sequence-dominated structures [60]. Further evidence for the early role of hydrophobicity is that ancestral proteomes are more hydrophobic [61,62].

5.2.3. Catalysis and binding are ubiquitous in peptides and proteins

Functional peptides are ubiquitous in today’s biology (the Handbook of Biologically Active Peptides [63] is more than 2000 pages long!). Short proteins function as hormones, signalling molecules, growth factors, venoms, antibiotics and more. Enzymatic activities are known in chains even as short as dipeptides [64–66], including ATP binding activity [67]. 7-mer amyloid peptides can catalyse reactions and auto-catalyse their own formation [68,69]. So, amyloid structures might have been prebiotic catalysts [70]. Moreover, proteins are highly promiscuous binders. For example, half the yeast proteome has protein–protein binding affinities stronger than 1 kcal mol⁻¹ [71]. And regarding whether simple peptides could help elongate others, we note that non-ribosomal peptide extension and chemical modification is done on peptide scaffolds [72]. Furthermore, once a protein has a binding site, that site often readily mutates to become an active site [73].

5.2.4. Perspectives on the foldcat mechanism

Here, we note some caveats and suggest some experimental tests. First, we are not aware of any evidence yet for simple peptides folding and catalysing chain elongation in other peptides. But we are also not aware of any tests of it. The value, as we see it, in the present theoretical speculation, is in giving a mechanism that is sufficiently detailed that it can be tested through experiments.

Second, what are the limitations of the model? Figures 2–4 illustrate the foldcat mechanism with graphic simplifications: two dimensions, a code that is only binary (H and P), and conformations confined to a lattice. Nevertheless, extensive studies with larger code alphabets and in 3D [56,57] have shown that this simple model recapitulates important behaviours of real proteins. The 2D HP model has its equivalent of secondary and tertiary structures; the thermodynamic behaviours of short chains in 2D resemble those of longer chains in 3D because of the dominance of surface-to-volume ratios and hydrophobic interactions; and, as noted above, the sequence-to-structure code degeneracy in real proteins is known from experiments to be close to binary [55]. For understanding the nature of both conformational and sequence spaces, microscopic atomic details often matter much less than an ability to do coarse-grained enumerations, which is readily done in simple models. At present, it is not possible to draw unbiased inferences about the nature of sequence space with more atomistically detailed models than HP lattices. And, while the mechanism illustrated here adds only H monomers, driven only by hydrophobicity, this is just an illustration because any broader distribution of amino acids that would have been used in primitive proteins would have likely harnessed additional interactions as well.

Third, while the example above of the foldcat mechanism illustrates ‘inventing’ primitive ribosomes, it also follows that there would be broad random coverage of sequence-structure space, so other (weak) protein machines would be generated too. We infer that proteins and functional diversity could have been a first step in Pchem2Bio, followed by encapsulation, heritability and memory.

5.3. The catpath mechanism assembles functional pathways

Imagine the prebiotic stew above, of small molecules and catalysts. How could that stew have been divided up and encapsulated into individual cells? Physico-chemical actions would only aggregate them together randomly into vesicles or droplets. That would not lead to biology. Each cell needs assemblies of reactions that form functional pathways, cycles and hypercycles (i.e. interlinked cycles). What would cause different enzymes with related functions to come together in space, like bucket brigades, in which the output of one reaction is close enough to become the input of another reaction? Here, we describe such a process.

The catpath mechanism is a non-equilibrium reaction-diffusion mechanism that brings reactions together in space based on their related functionalities [74]. In this process, a catalyst A, fixed at a given location, draws a catalyst B into its spatial neighbourhood; the effective attraction between the catalysts (cats) is mediated by a common substrate or product, on which they both act.

Figure 5 (top) shows the catpath mechanism. The square-box objects in the figure are catalysts, such as enzymes. The letter inside each catalyst box is an identifier of the reaction it catalyses. The catalysts are mobile and free to diffuse, towards or away from other such catalysts. The circular objects are the substrates and products, typically small molecules. Inside the circles are numbers that identify or label them. The arrow in each icon shows the direction of catalysis, from substrate to product.

Figure 5. The catpath attraction: one catalyst, A, attracts another, B, mediated by a common substrate. A converts 1s to 2s. B converts 2s to 3s. In steady state, A produces concentrated 2s, which bind to B, attracting B to A.

In the catpath mechanism, a mobile catalyst molecule B, which converts 2s to 3s, diffuses toward the position of a catalyst molecule A, which converts 1s to 2s; see figure 5 (bottom). This attraction is a reaction–diffusion process [75]. Because the A cats are continuously supplied with 1s, they continuously produce 2s. These product 2s will diffuse away from the parent A at some rate, but will concentrate around A for certain relative speeds. The B cats have a binding affinity for their substrates, 2s in this case. So Bs will diffuse toward the 2s, thus toward the A cats. In this way, A and B cats are attracted to each other, mediated by a small molecule substrate/product in common.
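The first half of this argument, that product 2s pile up around a continuously supplied catalyst A, can be sketched with a toy one-dimensional lattice model. This is not the model of [74]: it assumes a single fixed source site, nearest-neighbour hopping, and a first-order removal of product (added here so that a steady state exists). All parameter names and values are hypothetical.

```python
def product_profile(n_sites=41, source=20, supply=1.0,
                    hop=0.2, loss=0.05, steps=5000):
    """Relax the 1D concentration profile of product '2' made at one fixed site.

    hop   fraction of molecules moving to each neighbour per step (assumed)
    loss  fraction of molecules removed per step (assumed)
    """
    c = [0.0] * n_sites
    for _ in range(steps):
        new = [0.0] * n_sites
        for i, ci in enumerate(c):
            stay = ci * (1.0 - loss)            # first-order removal of product
            new[i] += stay * (1.0 - 2.0 * hop)  # fraction that does not hop
            for j in (i - 1, i + 1):            # hop to either neighbour
                if 0 <= j < n_sites:            # hops off the ends are lost
                    new[j] += stay * hop
        new[source] += supply                   # catalyst A keeps making product
        c = new
    return c


profile = product_profile()
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak, profile[peak], profile[peak + 5])
```

In this sketch the steady-state concentration is highest at the source and falls off with distance. A catalyst B that binds 2s would therefore encounter more substrate near A, which is the ingredient the catpath attraction relies on. Setting `supply` to zero makes the profile decay away, consistent with the attraction being a non-equilibrium effect.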

5.3.1. In the catpath mechanism, function dictates structure

The catpath process contrasts with two standard situations: (1) two independent particles will simply diffuse away from each other, or (2) two particles with mutual affinity will come together and bind each other. The catpath attraction is not based on an A–B binding affinity; rather, it is an example of function driving structure: processes that have a common mediator come together. Unlike simple A–B binding affinity, catpath is a non-equilibrium force; there is no attraction unless 1s are continuously supplied. It is driven only by the commonality of the small-molecule agent that is the product of one cat and the substrate of the other. We note two additional points. First, the catpath mechanism is not unique to protein catalysts, and would also apply, for example, to RNA catalysts. Second, the catpath mechanism bears some resemblance to, and might have been a molecular precursor to, chemotaxis in bacteria [76] (see figure 6), when the due distinctions are taken into account [77,78].

Figure 6. Simulations of the catpath mechanism show that enzymes A and B can attract each other if they have a substrate or product in common (from C. Kocher, L. Agozzino & K.A. Dill 2021, unpublished data and [74]). This non-equilibrium force can drive assembly of functional pathways.

5.3.2. The catpath mechanism could assemble transducers and machines

Critically important in biology is energy transduction coupled to chemical reactions. Often one domain of a protein performs an energetically uphill reaction, driven by an energetically downhill reaction in another domain, typically by converting ATP to ADP or by flows of protons or ions down their concentration gradients. Without such coupling, it would be impossible to metabolize food, to run molecular motors, chaperones, ribosomes or other machines, to perform signalling, or to synthesize biomolecules such as proteins and nucleic acids. Today’s processes are well understood through the physical chemistry of binding events coupled to conformational changes in proteins (see figure 7). These processes, such as in ATPases and GTPases, entail multiple protein domains that are bound together into a complex: one domain performs the uphill action and the other domain converts ATP to ADP to ‘pay the energetic price’ for the uphill step. A crucial ‘discovery’ during origins of life must have been the combining of two protein domains in such transduction processes [79]. The innovation this allowed, on the road from chemistry to biology, was the ability to power energy cycles and biochemical circuits. Protein domains may have been driven to assemble by the catpath mechanism, but as far as we know there are no studies of this yet.

Figure 7. An essential process in origins of biology: couplings that drive uphill chemistry by downhill energy changes. (a) Downhill conversion from ATP to ADP can drive uphill processes like moving molecules uphill against their concentration gradients.

5.3.3. The catpath mechanism can drive SOF-like bootstrapping

Where does the SOF principle come from? Might its prebiotic precursor have been some simple autocatalytic chemical cycle, such as shown in figure 8? Here is what we are seeking to explain. If a chemical process is changed in a way that causes it to run faster (in biological language, a mutation increases the fitness), how does that lead the process to recruit more resources for itself (more survival)? For the autocatalytic cycle in figure 8, the catpath mechanism can link survival to fitness. Catalyst A converts substrate 1s and substrate 2s to product 3s. Catalyst B converts substrate 3s and substrate 4s to product 1s. The two catalysts are linked as a cycle: the head of each reaction is the foot of the other. The substrates and products, 1s and 3s, are common to the two reactions. Mutating catalyst A to a better one, A′, increases the cycle speed. Because of the catpath force, the greater cycle speed drives greater attraction to B of A′ relative to A. The machine A′B is more stable and persistent than the machine AB, and hence is the more reliable consumer of new resources.

Figure 8. Simple chemical autocatalysis, i.e. positive feedback, based on 2 reactions: 1 + 2 → 3 and 3 + 4 → 1. Swapping in a better A (call it A’) speeds up the cycle, making 3s even faster, further accelerating the cycle, etc.
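How a better catalyst speeds up the two-reaction cycle can be made concrete with a small worked sketch. It assumes, beyond what the text states, mass-action kinetics, fixed (continuously replenished) pools of 2s and 4s, and a conserved total amount of 1s plus 3s; the rate constants and the function name are hypothetical. It captures only the cycle speed, not the catpath attraction itself.

```python
def cycle_flux(k_a, k_b, s2=1.0, s4=1.0, total=1.0):
    """Steady-state turnover of the cycle 1 + 2 -> 3 (catalyst A), 3 + 4 -> 1 (catalyst B).

    Assumptions of this sketch: mass-action kinetics, fixed pools s2 and s4,
    and a conserved pool [1] + [3] = total.
    """
    rate_a = k_a * s2  # pseudo-first-order rate for converting 1s to 3s
    rate_b = k_b * s4  # pseudo-first-order rate for converting 3s to 1s
    # At steady state rate_a*[1] = rate_b*[3], which gives the flux below.
    return total * rate_a * rate_b / (rate_a + rate_b)

flux_ab = cycle_flux(k_a=1.0, k_b=1.0)        # original catalyst A
flux_a_prime_b = cycle_flux(k_a=2.0, k_b=1.0)  # improved catalyst A'
print(flux_ab, flux_a_prime_b)
```

Under these assumptions the flux rises when k_a increases but can never exceed the capacity of the slower partner, so the gain from improving A alone saturates.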

5.4. The heritability bootstrap: replicating the ‘self’

Achieving SOF requires informational linkage between how fast a cell replicates, on the one hand, and the size of the population of its lineage, on the other hand. This requires, first, that living systems come in discrete units, i.e. individual agents such as a cell (call it ‘the self’). This compartmentalization is enforced by lipid bilayers and related boundaries. The cell must contain information about how it achieves its growth speed. And it also requires a mechanism for transmitting information down generations, from parents to daughters. Below, we just make brief points about the physical chemistry of encapsulation and heritability.

5.4.1. Encapsulation distinguishes individuals and lineages, enabling competition

In the origins of life, compartmentalization could have arisen from oil droplets or vesicles in a lipid world [12,80,81]. They readily grow and divide. Droplets or vesicles or containers can grow in proportion to the amount of material inside them, providing the first step in a growth-based SOF mechanism. Natural surface-to-volume forces will cause such compartments to split into two when they get big enough, giving a physico-chemical basis for the divide and replicate aspects of SOF. The interiors of such primitive cells would be concentrated proteins, as in today’s cells. Their growth could come from the foldcat mechanism, for example. It would be interesting to see more detailed modelling.

5.4.2. Genomes implement memory for precise heritability

SOF requires accurate information transmission, from cell growth rates to lineage populations. This is achieved today by covalent memory in RNA and DNA genomes. A plausible explanation for the physico-chemical origin of the genetic code is the stereochemical hypothesis [13,82–86]. In this view, the genetic code arose from weak stereochemical binding affinities between nucleic acids and peptides, ultimately leading to codons and anticodons in today’s more complex machinery. Here are the lines of evidence supporting that mechanism. mRNA coding sequences undergo co-aligned binding to protein sequences [87]. In pyrimidine solvents, amino acids bind to pyrimidine and purine bases in proportion to their hydrophobicities [88] (see figure 9). Nucleic acid base-stacking is driven by hydrophobic interactions and hydrogen bonding [90]. Nucleic acids at high concentrations assemble into non-covalent base stacks even without a backbone [91]. Free histidine binds an RNA aptamer when selected for affinity [92], and adenine binds to peptide backbones [93]. Evidence of physical affinities also appears in the identity recognition elements by which AA-tRNA synthetases recognize cognate tRNAs [94].

Figure 9. Support for the stereochemical hypothesis. From known structures of protein–RNA complexes, the purine content of the RNA codons (x-axis) correlates with the base binding preferences of the amino acid sequences they code for. Shown here is the guanine preference (reproduced from [89]).

6. First a soup of protein machines, then encapsulation and lineages

We have postulated two stages in Pchem2Bio: forming a nearbiotic soup requiring only peptide foldamers and metabolites, followed by cellular encapsulation and informational molecules. Here, we give additional context.

6.1. Unlike an RNA world, a ‘proteins-first’ world has a plausible sustainability mechanism

The final step to biology, whatever it may have been, would have required all ingredients: proteins, informers, metabolites and encapsulation. But dividing Pchem2Bio into two steps, either real or conceptual, allows us to model possible mechanisms in more granular detail. The principal argument here for proteins first is simply that we can identify a possible mechanism. Here, we have argued for the importance and primacy of establishing the driving forces. By contrast, if instead, nucleic acids as a vehicle for memory and information, were to have come first, what would have driven it? We know of no principle or force that would have caused it to happen. Moreover, if memory were first, what machinery would construct it? It is not clear how or why it would construct itself in a self-sustaining way.

6.2. Why not an RNA world first?

The idea that origins could have started with RNA came after the discovery by Cech [95] and Altman [96] of ribozymes, namely that RNA can catalyse reactions, making RNA a type of molecule that bridges the folding and function world with the information/genes world [10,97–99]. The RNA-first view has driven many important experiments in prebiotic and nanotech research [100–103].

But the RNA-first idea has some notable difficulties [104–106]. First, RNA just names a type of molecule, and not a driving principle that would sustain it. RNA is useless without a copying machine. Second, proteins are better catalysts [17]. Even where a protein and an RNA molecule can catalyse the same reaction, such as an RNase, which breaks down RNA molecules, the protein version is 100 000-fold better than the hammerhead ribozyme [17]. And RNA-based catalysts are limited, mostly phosphoryl transferases, such as RNA polymerases, ligases and RNA nucleases. The catalytic power of proteins, with 20 amino acids of very different chemical moieties, is much broader than of RNA molecules, with only the four bases, with recognition driven largely by hydrogen bonding.

Third, the most common reaction products from many prebiotic syntheses of small molecules are amino acids, possibly because with only around 15 atoms each, they are easier to synthesize than nucleic acid bases, which have around 35 atoms each. And the yields of the different amino acids in those experiments resemble the compositions in today’s proteins [21]. Fourth, Carter & Wills show that aa-tRNA synthetases came before ribozymes, not the other way around [107]. Fifth, and more importantly, the implication of the Guseva mechanism [53] is that the foldability of polymer chains is the crucial ingredient that enables the autocatalytic explosion of functionality in Pchem2Bio. Foldability is mainly a property of proteins, not RNA molecules.

6.3. Proteins are better for function; DNA is better for information

There is a plausible explanation for biology’s current division of labour in which proteins are functional and DNA is informational. For functionality, you need sequence-structure relations: changing the sequence changes the structure, which changes the function. The physics that enables this is folding. Proteins fold better than RNA does, and do so for essentially all sequences. For information, and for memory-like actions, you specifically want the opposite. You want a type of molecule that can store all information the same, with no preferences, with the absolute minimum possible sequence-structure relationships. DNA is an almost perfect informational molecule: it is very stiff, has no fold and its double-strandedness protects either strand from binding to external agents (apart, of course, from transcription and such).

6.4. A full story of Pchem2Bio would entail informers and proteins emerging together

After a nearbiotic soup, the emergence of a genetic code requires both proteins and informational molecules to develop together [108]. Here is evidence for their concurrent development. For one thing, nucleic acids and amino acids can both arise in common from the same prebiotic processes [52]. For another thing, RNA and peptides have binding complementarity, like hands in gloves [86]. So, if a peptide-first world already drives preferences for some peptides over others, it’s easy to imagine them coupling with companion informational molecules. Interestingly, frameshifting at the mRNA/DNA level leads to protein sequences with largely unchanged hydrophobicity profiles [109], indicating how even coarse-grained hydrophobic composition alone, in the absence of specific sequences, could have carried information. In addition, Carter has shown that ‘urzymes’, which are shrunken cores of amino-acid-tRNA synthetase (aaRS) proteins, and which may have been evolutionary precursors, are unstructured small proteins having hydrophobic cores that can work with low-fidelity peptides [110,111]. Carter & Wills have argued that aminoacylated-tRNA molecules must have evolved in parallel with the proteins that they are responsible for helping to make, not preceding them [107].

6.5. First, a single happy pond; later, bickering individual lineages

Modelling has suggested that the origins of life started from a single community in a cauldron, something like a localized pond, before becoming individual competing cellular lineages, perhaps through an autocatalytic phase-transition-like event [112]. Community in a cauldron as a first step has the advantage that it can be communally supportive since there are no predators yet. The pond doesn’t need to compete, just to survive. Crick speculated [13] that the community-first mechanism explains today’s single genetic code, i.e. that ‘all life evolved from a single organism (more strictly, from a single closely interbreeding population)’. Although there are now counter-examples and non-universality in codes, for example in mitochondria and some nuclear genomes, the differences are small [113].

7. Conclusion

By what stochastic physical chemistry did dead matter ‘invent’ live matter? We cannot look to equilibrium principles because life has remained far from equilibrium (FFE) for 3 billion years. Unlike equilibria, which are pulled by goal-like end states, FFE dynamics are driven by the pushing flows of available matter and energy. Fitness is a tendency towards matching to environments, a driver for effective utilization of resources.

What mechanisms might have led to the autocatalysis and SOF? We describe three bootstraps. In the foldcat bootstrap, proteins became controllable catalysts, programmable through their sequences. In the catpath bootstrap, different enzymes come together in space to form pathways. In the encapsulation/heritability bootstrap, biochemistry becomes encapsulated and compartmentalized into cells, and outfitted with genetic memory to link past to future. Proteins and biochemistry, through the first two bootstraps, could have been stably self-sustaining, prior to encapsulation and heritability. Of course, this is presently just a speculation. But, there is no evident alternative mechanism by which nucleic acids could achieve persistent sustainability prior to proteins. A thread through these mechanisms is the antipathy between hydrophobic and polar interactions, in protein chains, in folding, in encapsulation, and in protein-nucleic acid interactions.

Related Biology Terms

  • Monocistronic mRNA – mRNA transcript that codes for a single protein.
  • Transposons – Small segments of DNA that can move around the genome, inserting themselves into loci far removed from their original site, often involving an RNA intermediate.
  • hnRNA – Heterogeneous nuclear RNAs are considered the original products of transcription and consist mostly of mRNA precursors.
  • Poly-A polymerase – Enzyme that adds a stretch of adenine nucleotides to the end of a primary transcript.

1. Which of these properties makes DNA a more stable genetic material?
A. The hydrogen bonds between the bases are stronger
B. DNA is longer than RNA
C. Presence of thymine bases
D. Resistance to degradation through alkaline hydrolysis

2. What is the size of a nuclear pore in eukaryotes?
A. Less than 10 nm
B. More than 10 nm
C. Over 2000 nm
D. 25-30 nm

3. Which of these is NOT a feature of prokaryotic gene expression?
A. Coupled transcription and translation
B. Extensive post-transcriptional modification of the RNA transcript
C. Sigma factor for transcription initiation
D. None of the above

5 Major Stages of Protein Synthesis (explained with diagram) | Biology

Some of the major stages of Protein Synthesis are: (a) Activation of amino acids, (b) Transfer of amino acid to tRNA, (c) Initiation of polypeptide chain, (d) Chain Termination, (e) Protein translocation

There are five major stages in protein synthesis, each requiring a number of components, in E. coli and other prokaryotes.

Protein synthesis in eukaryotic cells follows the same pattern with some differences.

(a) Activation of amino acids:

This reaction is brought about by the binding of an amino acid to ATP. The step requires enzymes called aminoacyl-tRNA synthetases. In this reaction, the amino acid (AA) and adenosine triphosphate (ATP), mediated by the above enzyme, form an amino acyl – AMP – enzyme complex (Fig. 6.40).

AA + ATP + Enzyme → AA – AMP – enzyme complex + PPi

It should be noted that each aminoacyl-tRNA synthetase is specific for a particular amino acid.

(b) Transfer of amino acid to tRNA:

The AA – AMP – enzyme complex formed reacts with specific tRNA. Thus amino acid is transferred to tRNA. As a result the enzyme and AMP are liberated.

AA – AMP – enzyme complex + tRNA → AA – tRNA + AMP + enzyme

(c) Initiation of polypeptide chain:

Charged tRNA shifts to the ribosome (Fig. 6.41). The ribosome consists of structural RNAs and many different proteins (more than 50 in bacteria and about 80 in eukaryotes). The ribosome is the site where protein synthesis occurs. The mRNA binds to the 30S sub-unit of the 70S ribosome.

It has already been discussed that ribosomes are made up of rRNA (ribosomal RNA) and proteins. The ribosome also acts as a catalyst for the formation of the peptide bond (the 23S rRNA in bacteria is the enzyme, a ribozyme). Ribosomes consist of two subunits, a larger and a smaller one.

The information for the sequence of amino acids is present in the sequence of nitrogenous bases of mRNA. Each amino acid is coded by a three-letter word (codon) of nucleic acid. The initiation of the polypeptide chain in prokaryotes is always brought about by the amino acid methionine, which is regularly coded by the codon AUG but rarely also by GUG (normally the codon for valine) acting as an initiating codon. In prokaryotes, formylation of the initiating amino acid methionine is an essential requirement.

Ribosomes have two sites for binding amino-acyl- tRNA.

(i) Amino-acyl or A site (acceptor site).

(ii) Peptidyl site or P site (donor site). Each site is a composite of specific portions of the 50S and 30S sub-units. The initiating formyl methionine tRNA, i.e. (AA1, fMet-tRNA), can bind only to the P site (Fig. 6.41).

However, it is an exception. All other newly coming amino-acyl- tRNAs (AA2, AA3 — tRNA) bind to A site. Thus, P site is the site from which empty tRNA leaves and to which growing peptidyl tRNA becomes bound.

In the first step of elongation, the next amino acyl-tRNA is bound to a complex of elongation factor Tu containing a molecule of bound GTP; the resulting amino-acyl-tRNA-Tu-GTP complex is then bound to the 70S initiation complex. GTP is hydrolysed and the Tu-GDP complex is released from the 70S ribosome (Fig. 6.42). The new amino acyl tRNA is now bound to the amino acyl or A site on the ribosome.

In the second step of elongation, the new peptide bond is formed between the amino acids whose tRNAs are located on the A and P sites on the ribosomes. This step occurs by the transfer of initiating formyl methionine acyl group from its tRNA to the amino group of new amino acid that has just entered the A site.

The peptide bond formation is catalysed by the peptidyl transferase activity of the 50S sub-unit, which resides in the 23S rRNA rather than in a ribosomal protein. A dipeptidyl tRNA is formed on the A site and the now empty tRNA remains bound to the P site.

In the third step of elongation, the ribosome moves along the mRNA towards its 3′ end by a distance of one codon (i.e., 1st to 2nd codon and 2nd to 3rd on the mRNA). Since the dipeptidyl tRNA is still attached to the second codon (Fig. 6.43), the movement of the ribosome shifts the dipeptidyl tRNA from the A site to the P site. This shifting causes the release of the empty tRNA.

Now the third codon of mRNA is on the A-site and the second codon on P-site. This shift of ribosomes along mRNA is called translocation step. This step requires elongation factor G (also called translocase). And also simultaneously the hydrolysis of another molecule of GTP takes place. The hydrolysis of GTP provides energy for the translocation.

The ribosome with its attached dipeptidyl tRNA and mRNA is ready for another elongation cycle to attach the third amino acid (Fig. 6.44). It takes place in the same way as the addition of the second.

As a result of this repetitive action, the polypeptide chain elongates. The ribosome moves from codon to codon along the mRNA towards its 3′ end until the last amino acid of the polypeptide chain has been inserted.
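The three elongation steps described above can be summarized as a small bookkeeping sketch. It assumes that the first codon is already paired with the initiator tRNA in the P site and simply records, for each further codon, A-site binding (one GTP, via Tu), peptide bond formation, and translocation (one more GTP, via factor G). The function name and event labels are hypothetical, and the sketch ignores termination.

```python
def elongate(codons):
    """Walk a ribosome along a list of codons, tracking the A and P sites.

    Returns (events, gtp_used). The first codon is assumed to sit in the
    P site with the initiator tRNA already bound.
    """
    events = []
    gtp_used = 0
    p_site = 0  # index of the codon currently in the P site
    for a_site in range(1, len(codons)):
        # Step 1: aminoacyl-tRNA is delivered to the A site by Tu-GTP; GTP is hydrolysed.
        gtp_used += 1
        events.append(("bind_A", codons[a_site]))
        # Step 2: peptide bond forms; the growing chain is now on the A-site tRNA.
        events.append(("peptide_bond", codons[p_site], codons[a_site]))
        # Step 3: translocation by one codon, driven by factor G and a second GTP.
        gtp_used += 1
        p_site = a_site
        events.append(("translocate", codons[p_site]))
    return events, gtp_used

events, gtp = elongate(["AUG", "UUU", "GGC"])
print(gtp)  # two GTP per added amino acid in this bookkeeping
```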

(d) Chain Termination:

The termination of polypeptide is signalled by one of the three terminal triplets (codons) in the mRNA. The three terminal codons are UAG (Amber), UAA (Ochre) and UGA (Opal). They are also called stop signals.

At the time of termination, the terminal codon immediately follows the last amino acid codon. After this, the polypeptide chain, tRNA, mRNA are released. The subunits of ribosomes get dissociated.

Termination also requires the activities of three termination or releasing factors named R1, R2 and S.
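The roles of the initiating codon and the three stop codons can be illustrated with a short translation sketch. It uses the standard genetic code written as one-letter amino acid symbols, starts at the first AUG, and stops at UAG, UAA or UGA. It is a simplification: it ignores the rare GUG start, formylation of the initiating methionine, and all ribosomal factors, and the example mRNA is made up for illustration.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks the stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
STOP_CODONS = {"UAG": "amber", "UAA": "ochre", "UGA": "opal"}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiating codon found
    peptide = []
    for pos in range(start, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        peptide.append(CODON_TABLE[codon])
    return "".join(peptide)

# Hypothetical mRNA: AUG UUU GGC UAA, with two flanking bases on each side.
print(translate("GGAUGUUUGGCUAAGG"))  # MFG
```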

(e) Protein translocation:

Two classes of polyribosomes have been identified (Fig. 6.45).

(i) Free polyribosomes.

(ii) Membrane bound polyribosomes.

For free ribosomes, termination of protein synthesis leads to the release of completed protein into cytoplasm. Some of these specific proteins are translocated to mitochondria and nucleus by special type of mechanisms.

On the other hand, in membrane bound polyribosomes, the polypeptide chain growing on the mRNA is inserted into the lumen of the ER. Some of these proteins become an integral part of the membrane.

Outline of the families of DNA-binding proteins

A complete outline of the families of DNA-binding proteins and their functional, structural and binding properties follows. Box 1 shows the selection process by which the dataset was compiled. Table 1 provides a summary of the families and Table 2 lists the 240 structures of protein-DNA complexes in the database. Figures 1-8 show ribbon diagrams of the relevant structures.

Group I: Helix-turn-helix (HTH) group

1. Cro and Repressor family

Function. The Cro and Repressor proteins (Figure 1a) are part of the lysogenic/lytic growth switch mechanism in bacteriophages and function as transcriptional regulators at a set of six related operator sites.

Structure. Both protein types function as homodimers. Each Repressor subunit has two domains: an amino-terminal five-helix bundle whose second and third α helices comprise a HTH motif and a carboxy-terminal domain that mediates dimerization (Figure 1a). Cro is a single-domain protein with a structure homologous to the amino-terminal region of Repressor. The fourth and fifth α helices mediate dimerization [66].

Binding. Cro and Repressor bind six related operator sites with varying affinities. Each site is 14 bp long and pseudosymmetrical; four bases at either end are conserved between sites, and variation in the sequence of the central 6 bp is thought to modulate the binding affinity of the protein. The recognition helix of the HTH motif contacts base edges in the DNA major groove.

2. Homeodomain family

Function. These are transcription regulators for a wide range of genes; in particular, many have a vital role in development and cell differentiation (for example, Mat α2, 1apl). Some are expressed broadly whereas others are tissue specific.

Structure. The proteins are small (just over 100 amino-acid residues in length) and consist of four helices.

Binding. The protein binds DNA either as a monomer or a dimer, depending on the protein, and many are capable of both. Typical HTH binding is displayed in Figure 1b, with the second helix of the motif inserted in the DNA major groove.

3. LacI repressor family

Function. Lac repressor regulates the lac operon, which codes for proteins required to transport and degrade lactose. The purine repressor proteins of the LacI repressor family regulate de novo purine and pyrimidine synthesis by repression of genes encoding enzymes that participate in the synthesis pathway. Guanine and hypoxanthine act as co-repressors on binding to the protein. Other members of the LacI repressor family, not represented in the current dataset, display high structural and sequence similarity and control a wide range of biosynthetic pathways [21].

Structure. Purine repressors function as homodimers, as do most other family members (Figure 1c). The lactose, fructose and raffinose repressors are exceptions, and appear to exist as tetramers [67]. Each subunit is a two-domain structure. The amino-terminal domain (approximately 60 residues) contains a three-helix bundle followed by a loop and an additional helix. The first two α helices form the HTH motif and the fourth is called the hinge helix. The larger carboxy-terminal domain (about 280 residues) is a mixture of α helices and β strands and binds the co-repressor.

Binding. Binding sites are typically 16-18 bp long and pseudo-palindromic. The recognition helix of the HTH motif binds in the major groove and phosphate backbone contacts are mediated by the remainder of the helical bundle. The hinge helices from the two subunits are inserted in the same DNA minor groove at the center of the binding site and jointly introduce a kink by intercalation of leucine sidechains [68,69].
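What "pseudo-palindromic" means for such a binding site can be sketched in a few lines: a perfectly palindromic DNA site reads the same as its reverse complement, and a pseudo-palindromic site matches at most, but not all, positions. The scoring function below is a hypothetical illustration, and the example sequences are made up rather than taken from any real operator.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-case A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def palindrome_score(site):
    """Fraction of positions at which a site matches its own reverse complement."""
    rc = reverse_complement(site)
    matches = sum(1 for a, b in zip(site, rc) if a == b)
    return matches / len(site)

print(palindrome_score("GAATTC"))  # 1.0, a perfect palindrome
print(palindrome_score("GAATTA"))  # below 1.0, a pseudo-palindrome
```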

4. Endonuclease FokI family

Function. Endonuclease FokI is a bipartite restriction enzyme which recognizes a specific DNA sequence and non-specifically cleaves at a position a short distance away.

Structure. The protein acts as a monomer with two functional regions (Figure 1d). The amino-terminal DNA-recognition region (about 390 residues) may be divided into three further subregions. D1 is a roughly 160-residue subregion made of an amino-terminal arm, ten α helices and a two-stranded β sheet. Helices 5, 6 and 8 form a pseudo-HTH motif. Helices 5 and 6 lie on the same helical axis, jointly forming the first α helix, and helix 8 acts as the recognition helix. D2, a subregion of about 110 residues, contains six α helices and a three-stranded β sheet, with the α helices packing in a triangular formation and the second and fifth α helices arranged in a HTH-like manner; the turn is replaced by an extensive loop region. D3 is an approximately 80-residue segment containing five α helices and a three-stranded β sheet. The carboxy-terminal catalytic domain (about 180 residues) is made of a five-stranded β sheet flanked by seven α helices. The active site is situated on the first three β strands in the region [70].

Binding. Binding is to a site containing the sequence 5'-GGATG-3' and staggered cleavage occurs 9 and 13 bp away from the target sequence. All base contacts to the recognition sequence are made by subregions D1 and D2. The amino-terminal arm and second α helix from D1 bind in the major groove and a loop preceding this recognition helix is found in the minor groove. The recognition helix from the HTH motif in D2 contacts the major groove. The catalytic region is positioned adjacent to the DNA-recognition region.
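The relationship between the recognition sequence and the staggered cut can be sketched as a simple coordinate calculation. The sketch below searches one strand for GGATG and reports cut positions 9 and 13 bases beyond the site. Counting the offsets from the last base of the site toward the 3' end of the given strand is an assumption of this sketch, as are the function name and the example sequence; it does not search the complementary strand.

```python
SITE = "GGATG"

def foki_cut_positions(top_strand, top_offset=9, bottom_offset=13):
    """Return (site_start, top_cut, bottom_cut) for each GGATG found on top_strand.

    Positions are 0-based indices into top_strand; a cut position p means the
    break falls between p - 1 and p. Offsets are counted from the end of
    the site toward the 3' end (an assumption of this sketch).
    """
    results = []
    start = top_strand.find(SITE)
    while start != -1:
        site_end = start + len(SITE)  # index just past the recognition site
        top_cut = site_end + top_offset
        bottom_cut = site_end + bottom_offset
        if bottom_cut <= len(top_strand):  # skip sites too close to the end
            results.append((start, top_cut, bottom_cut))
        start = top_strand.find(SITE, start + 1)
    return results

# Hypothetical sequence: one site followed by 20 filler bases.
print(foki_cut_positions("AAGGATG" + "C" * 20))
```

The 4-base difference between the two offsets is what makes the cut staggered rather than blunt.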

5. γδ-resolvase family

Function. The γδ-resolvase is a site-specific recombinase which converts negatively supercoiled circular DNA containing two directly repeated copies of the recombination site into two interlinked rings.

Structure. The protein functions as a homodimer (Figure 1e). Each subunit is made of two domains. The amino-terminal domain (about 120 residues) contains the catalytic center and the dimerization interface. It consists of a five-stranded β sheet flanked by three α helices on one side and a single α helix on the other. The longest α helix packs with its counterpart in the other subunit to stabilize the dimer. The carboxy-terminal domain (approximately 40 residues) is a three-helix bundle with the second and third α helices forming a HTH motif. An extended arm region (about 20 residues), comprising the carboxy-terminal half of the dimerization helix and a loop, connects the two domains.

Binding. Each 114 bp recombination region consists of three resolvase-binding sites, I, II and III. Each site binds a resolvase dimer and is made of an inverted repeat of a 12 bp recognition sequence with varying base sequence and spacing between the half-sites. The structure found in 1gdt (Figure 1e) is thought to represent the conformation found prior to the recombination process. Two main DNA-binding regions are found in each subunit. The recognition helix of the carboxy-terminal HTH motif binds in the major groove at the outer ends of the binding site. The extended helix in the arm region is inserted in the minor groove near the center of the binding site in a similar manner to the recognition helices from leucine-zipper structures. The DNA is bent 60° away from the main body of the protein. The DNA is slightly kinked at the center of the site owing to partial intercalation of threonine residues from the arm region [71].

6. Hin recombinase family

Function. The Hin recombinase protein catalyzes site-specific recombination in the Salmonella chromosome.

Structure. The structure 1hcr is of the domain involved in DNA sequence recognition (Figure 1f). It is a three-helix bundle flanked by short peptide chains at either end (about 50 residues). The second two α helices form the HTH motif [18].

Binding. The full protein cooperatively binds as a homodimer at a 26 bp site. The recognition helix in the HTH motif is inserted in the major groove and surrounding helices make contacts with the phosphate backbone. The amino- and carboxy-terminal tails bind in adjacent minor grooves although their importance in sequence recognition is unknown.

7. RAP1 family

Function. The RAP1 protein performs two functions. The first is the periodic binding of DNA to regulate telomere length. Telomeres are nucleoprotein complexes found at the ends of eukaryotic chromosomes where the DNA consists of a repeated array of short, species-specific sequence motifs. The second function is that of transcription regulation: RAP1 functions as an activator or repressor for a large number of genes.

Structure. RAP1 is a monomeric protein with two homologous domains and a carboxy-terminal tail (Figure 1g). Domain 1 (about 80 residues) contains a three-helix bundle and an amino-terminal tail, whereas domain 2 (about 80 residues) contains an additional fourth α helix. In each, the second and third α helices form the HTH motif. The two domains are connected by a 30 residue linker region and are positioned 8 bp apart. The carboxy-terminal tail is a 20 residue segment which emerges from domain 2 and folds back towards domain 1 [72].

Binding. The binding site is 16 bp long and shows a tandem repeat at an 8 base interval. The two domains bind in a similar fashion at opposite ends of the binding site; the recognition helices of the HTH motif are inserted in the major groove and the remaining α helices contact the neighboring DNA backbone. The amino-terminal and the linker regions interact with the minor groove and the carboxy-terminal tail interacts with the major groove as it folds back. The flexibility of the linker allows for slight variations in spacing between tandem repeats.

8. Prd paired domain family

Function. The Prd paired domain is a functional domain found in a set of transcription regulatory proteins which are important in cell development.

Structure. The protein acts as a monomer with two structural domains (Figure 1h). The amino-terminal domain (about 70 residues) contains a short antiparallel β sheet and a β turn followed by a three-helix bundle and extended carboxy-terminal tail. The second and third α helices in the bundle form an HTH motif. The carboxy-terminal domain (approximately 50 residues) also contains a three-helix bundle which has an HTH motif [73].

Binding. Prd proteins bind to 13-20 bp sites which share a common core sequence. The recognition helix in the HTH and the ß turn of the amino-terminal domain make base contacts in the major and minor grooves respectively. The rest of the domain interacts with the DNA backbone. The carboxy-terminal domain does not contact the DNA, but domain structure and biochemical evidence suggest it does bind DNA in certain family members (for example Pax proteins).

9. Tc3 transposase family

Function. The structure contained in 1tc3 (Figure 1i) is of the DNA-recognition domain found in the amino terminus of Tc3 transposase. The function of the enzymes is to move specific segments of DNA from one position of the genome to another.

Structure. The domain (about 50 residues) contains a three-helix bundle and an amino-terminal tail (Figure 1i). The last two α helices form the HTH motif [74].

Binding. Binding is to a 20 bp site. The recognition helix of the HTH motif is bound in the major groove and other α helices make DNA backbone contacts. The amino-terminal tail binds in an adjacent minor groove although the interactions are not thought to be specific.

10. Trp repressor family

Function. The Trp repressor is involved in the regulation of tryptophan synthesis by binding three different operator sites. L-tryptophan acts as co-repressor.

Structure. Each subunit (about 100 residues) forms a six-helix bundle (Figure 1j). Helices 4 and 5 correspond to the HTH motif whereas the remaining four α helices provide the dimerization interface. Tryptophan also binds the helical bundle [23].

Binding. Binding is to three related 16 bp operator sites which the protein binds in the presence of tryptophan. The HTH motifs are reoriented on binding of the co-repressor to enable DNA-binding. The recognition helix is positioned in the major groove and most base contacts are made through a network of intermediate water molecules. Operator sites are symmetrical and also show approximate symmetry within the half-site, which leads to two alternative modes of binding. In the first, the dimer subunits bind each half-site symmetrically about the central base-pairs. This is similar to what is observed for the other prokaryotic HTH proteins. In the second, two dimers co-operatively bind to a single operator site in tandem. Dimers are staggered by 8 bp and rotated through 270° about the DNA axis and the crystal structure 1trr (Figure 1j) displays a superhelix of dimers binding successive binding sites.
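The stagger and rotation quoted for the tandem binding mode are roughly what ordinary helical geometry predicts. The sketch below is only an illustration: it assumes B-form DNA with about 10.5 bp per helical turn, a textbook value that is not given in the text, and the function name is hypothetical.

```python
# Assumption (not from the text): B-form DNA with about 10.5 bp per turn.
BP_PER_TURN = 10.5


def rotation_for_stagger(stagger_bp, bp_per_turn=BP_PER_TURN):
    """Rotation about the DNA axis, in degrees (0-360), for a given stagger."""
    return (stagger_bp * 360.0 / bp_per_turn) % 360.0


if __name__ == "__main__":
    angle = rotation_for_stagger(8)
    print(f"8 bp stagger -> about {angle:.0f} degrees about the DNA axis")
```

Under that assumption an 8 bp stagger corresponds to a rotation close to the 270° described above; a different helical repeat would shift the value slightly.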

11. Diphtheria tox repressor family

Function. The virulent phenotype of the pathogenic bacterium Corynebacterium diphtheriae is conferred by diphtheria toxin, whose expression is an adaptive response to low concentrations of iron. The expression of the toxin gene (tox) is regulated by the repressor diphtheria Tox, which is activated by transition metal ions.

Structure. Diphtheria tox is a 225-residue protein that binds as a dimer to DNA. Each monomer consists of six helices and a short two-stranded β sheet, with helices 2 and 3 constituting the HTH motif (Figure 1k).

Binding. The DNA interacts with two dimers bound to opposite sides of the tox operator, with each dimer interacting with two major groove regions. Together, the two HTH motifs (one in each dimer) bind a 24 bp sequence.

12. Transcription factor TFIIB

Function. The transcription factor TFIIB is an essential part of the multiprotein transcription initiator complex that assembles on RNA polymerase II promoters. TFIIB binds a 7 bp region upstream of the TATA box called the B recognition box.

Structure. TFIIB is composed almost entirely of α helices and is approximately 200 residues long (Figure 1l).

Binding. TFIIB binds DNA in two places as a result of the nucleic acid distortion caused by the interaction of the TATA box-binding protein. The main interactions are due to a carboxy-terminal HTH motif binding DNA in the major groove at the upstream site. The protein also binds DNA in the minor groove at a downstream site using the amino terminus of a helix to contact the DNA backbone.

'Winged' HTH proteins

13. Interferon regulatory factor family

Function. The family of interferon regulatory factor (IRF) transcription factors is important in the regulation of interferons in response to infection by virus and in the regulation of interferon-inducible genes.

Structure. The IRF family is characterized by a unique 'tryptophan cluster' DNA-binding region of five tryptophan residues. The protein binds as a monomer with an HTH motif binding DNA through three of the five conserved tryptophans. The IRF DNA-binding region has an α/β architecture consisting of a cluster of three α helices flanked on one side by a mixed four-stranded β sheet (Figure 1m).

Binding. Helices 2 and 3 comprise the HTH motif, with helix 3 lying in the DNA major groove. Contacts to bases within the major groove are localized to a GAAA core sequence within a 13 bp DNA element in the interferon promoter.

14. Catabolite gene activator (CAP) family

Function. CAP is a cAMP-dependent transcription regulator. A rise in cAMP concentration leads to increased affinity of CAP for catabolite-sensitive operons.

Structure. The protein functions as a homodimer, and each subunit comprises a two-domain structure (Figure 1n). The carboxy-terminal domain (about 60 residues) mainly consists of a three-helix bundle with the second two α helices forming the HTH motif. The domain contains a small β sheet that also contributes to DNA binding. The larger amino-terminal domain (approximately 130 residues) has an extensive β sheet that mediates cAMP binding, and a long α helix that forms the dimer interface [75].

Binding. The consensus binding sequence is a symmetric 22 bp site. Binding by the recognition helix of the HTH motif in the major groove induces a sharp, highly localized bend in the DNA and additional contacts with the phosphate backbone are made by the β strands from the same domain.

15. Transcription factor family

Heat-shock and E2F/DP transcription factors

Function. The protein 3hts (Figure 1o) recognizes the promoters of the heat-shock protein genes through upstream DNA sequences (heat-shock elements, HSEs). An HSE consists of alternating, inverted repeats of the sequence nGAAn, where n can be any nucleotide. The E2F and DP protein families form heterodimeric transcription factors that have a central role in the expression of cell-cycle-regulated genes and recognize a c/gGCGCg/c sequence.

Structure. The DNA-binding domains of these proteins have a 'winged' HTH fold, that is, a three-helix bundle capped by an antiparallel β sheet. Helices 2 and 3 constitute the HTH motif.

Binding. The third helix of the HTH is docked into the major groove. The DNA-binding domain makes additional contacts to the DNA through the amino terminus of the first helix and the turn of the HTH motif. The only other HTH fold that contacts the DNA with the residues of the turn is the Ets family.
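The heat-shock element described above (alternating, inverted repeats of nGAAn) lends itself to a simple sequence scan. The following is a minimal sketch rather than an established tool: the function names are hypothetical, the units are treated as contiguous 5 bp blocks, and the default requirement of three units is an assumption that is not stated in the text.

```python
CORES = ("GAA", "TTC")  # core of nGAAn and of its inverted repeat nTTCn


def count_units(seq, start):
    """Count consecutive 5 bp units from `start` that alternate nGAAn / nTTCn."""
    first_core = seq[start + 1:start + 4]
    if first_core not in CORES:
        return 0
    phase = CORES.index(first_core)
    units, pos = 0, start
    while pos + 5 <= len(seq) and seq[pos + 1:pos + 4] == CORES[(phase + units) % 2]:
        units += 1
        pos += 5
    return units


def find_hse(seq, min_units=3):
    """Return (start, number_of_units) for each run of alternating units.

    min_units=3 is an illustrative threshold, not a value from the text.
    """
    seq = seq.upper()
    hits, start = [], 0
    while start <= len(seq) - 5:
        units = count_units(seq, start)
        if units >= min_units:
            hits.append((start, units))
            start += 5 * units  # skip past the run just reported
        else:
            start += 1
    return hits


if __name__ == "__main__":
    # Made-up test sequence: three alternating units flanked by filler bases.
    print(find_hse("CCAGAACGTTCTAGAATGG"))
```

Only one strand is scanned because an array of inverted repeats reads as the same pattern on the opposite strand.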

16. Ets domain family

Function. The Ets family of transcription factors, of which there are now about 35 members, regulate gene expression during growth and development. They share a conserved domain of around 85 amino acids which binds as a monomer to the DNA sequence 5'-C/AGGAA/T-3'.

Structure. The 'winged' HTH motif interacts with a 10 bp region of duplex DNA that takes up a uniform curve of 8° (Figure 1p).

Binding. The domain contacts the DNA by a loop-helix-loop architecture, the turn of the HTH motif and the loop at the end of helix 1 before the β sheet contacting the DNA backbone.
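The Ets core sequence 5'-C/AGGAA/T-3' can be written as a short pattern with two ambiguous positions. As an illustration only, the sketch below searches both strands of a DNA string for that core; the function names and the returned tuple layout are hypothetical choices.

```python
import re

# 5'-C/AGGAA/T-3' as a regular expression; the lookahead allows overlapping hits.
ETS_CORE = re.compile(r"(?=([CA]GGA[AT]))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_ets_cores(seq):
    """Return sorted (forward_start, strand, matched_core) tuples."""
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in ETS_CORE.finditer(seq)]
    for m in ETS_CORE.finditer(revcomp(seq)):
        # Convert the reverse-strand coordinate back to the forward strand.
        hits.append((len(seq) - m.start() - 5, "-", m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    print(find_ets_cores("TTCAGGAAGTT"))
```

A match to the 5 bp core is of course only a candidate site; the domain contacts a longer stretch of DNA than the core alone.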

Group II: zinc-coordinating proteins

17. ββα zinc-finger family

Function. The ββα zinc-finger proteins constitute the largest individual family in this group. The DNA-binding motif is found in many transcription regulators and more than a thousand distinct motifs have been identified through sequence analysis [26].

Structure. The structure of the finger is characterized by a short two-stranded antiparallel β sheet followed by an α helix (Figure 2a) and a single zinc ion bound by two pairs of conserved histidine and cysteine residues situated in the α helix and second β strand. Proteins generally contain multiple copies of fingers in a single peptide chain which wrap round the DNA along the major groove in a spiral manner.

Binding. The recognition pattern of the probe α helix has been well characterized: each finger binds adjacent 3 bp sub-sites on the DNA using amino acids at positions -1, 2, 3 and 6 relative to the start of the α helix, -1 being the residue position preceding the helix [2,24,76]. Although exceptions to this rule have been observed in specific examples [29,77], experiments have shown that by altering the amino-acid types at the key positions, different subsite sequences are recognized, suggesting that these residue positions are usually sufficient for specific binding [30,78]. By varying the number of fingers used in a protein chain, this relatively simple motif allows recognition of a wide range of binding sites with different degrees of specificity. For example, a protein with five fingers is expected to bind a site very selectively, whereas a protein with only a single finger would bind a wide range of sites containing the required 3 bp sequence. However, the structure of the human glioblastoma protein suggests that binding is not always straightforward: of the five fingers in the structure, one does not contact the DNA at all and only two appear to make specific contacts with bases [31]. As described earlier, the protein subunits in this study have been split into distinct domains, each containing a single zinc-finger motif. The pairwise sequence identities of the aligned domains are all high, ranging from 73% (for example, human zinc-finger protein, 1udbA1, and Drosophila tramtrack protein, 2drpA1) to 100% (for example, mouse Zif268 protein, 1aayA1, and artificial protein, 1mey). All domains are structurally very similar, returning SSAP scores of over 90.
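The link between the number of fingers and selectivity can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that every finger reads exactly 3 bp and that the genome is a random sequence with equal base frequencies; both are simplifications, and the genome size used in the example is an illustrative round number rather than a value from the text.

```python
def site_length(n_fingers, bp_per_finger=3):
    """Length of the DNA site read by a chain of fingers (3 bp per finger)."""
    return n_fingers * bp_per_finger


def expected_random_sites(n_fingers, genome_bp):
    """Expected chance occurrences of one exact site on one strand.

    Assumes a random genome with equal base frequencies (a simplification).
    """
    return genome_bp / 4 ** site_length(n_fingers)


if __name__ == "__main__":
    genome_bp = 3_000_000_000  # illustrative, roughly human-sized genome
    for n in (1, 3, 5):
        print(n, "finger(s):", site_length(n), "bp site,",
              f"{expected_random_sites(n, genome_bp):.3g} expected chance matches")
```

The estimate is an upper bound on selectivity, since fingers that make no base-specific contacts (as in the glioblastoma protein) contribute nothing to discrimination.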

18. Hormone receptor family

Function. Members of the hormone receptor family translocate from the cytoplasm to the nucleus and regulate transcription at DNA sequences called hormone response elements on binding of steroid and other hormones [2,32].

Structure. Hormone receptors function as homo- or hetero-dimers and each monomer typically consists of a ligand-binding, a DNA-binding and a transcription regulatory domain (Figure 2b). The zinc-coordinating motif is found in the DNA-binding domain and is characterized by two antiparallel α helices capped by loops at their amino-terminal ends. Each helix-loop pair coordinates a single zinc ion using four conserved cysteines. The two α helices lie approximately at right angles to each other: the first is inserted in the DNA major groove to provide interactions with bases whereas the loops and the second α helix contact the DNA backbone. The DNA-binding domain alone is sufficient for dimerization, the interface being formed by the loops leading into the second α helix.

Binding. All receptor subunits bind to one of two half-site sequences, 5'-AGAACA-3' or 5'-AGGTCA-3'. A hormone-response element contains two half-sites and the identity of the response element is determined by the sequences that are present, the relative orientation between them (either symmetric or palindromic) and the spacing between them (between 3 and 6 bp). Thus recognition of the target sequence by the whole hormone receptor depends on read-out of half-site sequences by each subunit and the structure of the homo- or heterodimeric protein [33]. The sequences of all subunits in the current dataset are very similar (sequence identities > 90%) except for the DNA-binding domain of the thyroid hormone receptor (for example, 1bsx), which has two extra helices in the carboxy-terminal tail. The structures are all very similar with pairwise SSAP scores of over 90.
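Because a response element is defined by its two half-sites, their relative orientation and their spacing, candidate elements can be found with a simple scan. The sketch below is a hedged illustration: the function name, the orientation labels ("direct repeat" for two half-sites in the same direction, "palindromic" when the second is the reverse complement of a half-site) and the tuple layout are choices made here, and only the top strand is searched for the first half-site.

```python
HALF_SITES = ("AGAACA", "AGGTCA")  # half-site sequences quoted in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_response_elements(seq, min_spacer=3, max_spacer=6):
    """Return (start, first_half, spacer_bp, second_half, orientation) tuples."""
    seq = seq.upper()
    forward = set(HALF_SITES)
    inverted = {revcomp(h) for h in HALF_SITES}
    hits = []
    for i in range(len(seq) - 5):
        first = seq[i:i + 6]
        if first not in forward:
            continue
        for spacer in range(min_spacer, max_spacer + 1):
            j = i + 6 + spacer
            second = seq[j:j + 6]
            if len(second) < 6:
                break  # ran off the end of the sequence
            if second in forward:
                hits.append((i, first, spacer, second, "direct repeat"))
            elif second in inverted:
                hits.append((i, first, spacer, second, "palindromic"))
    return hits


if __name__ == "__main__":
    # Made-up example: AGAACA, a 3 bp spacer, then its reverse complement.
    print(find_response_elements("AGAACACCCTGTTCT"))
```

The spacer limits default to the 3 to 6 bp range given above and can be changed through the `min_spacer` and `max_spacer` arguments.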

19. Loop-sheet-helix family

Function. The loop-sheet-helix zinc-binding motif is represented solely by the DNA-binding region of p53, a transcriptional activator implicated in tumor suppression [2,34].

Structure. As the name indicates, the DNA-binding domain consists of a loop leading out of the main body of the protein, followed by a small β sheet, an α helix and then another loop that leads back into the protein (Figure 2c). The zinc ion is coordinated by three cysteines and a histidine in the two loop regions.

Binding. Base contacts are supplied by the α helix in the DNA major groove and by the loops in the minor groove, although the latter are not thought to confer much specificity. The protein functions as a tetramer, with each subunit contacting a separate 5 bp recognition sequence positioned one after another. All intersubunit interactions are made by regions outside the DNA-binding motif.

20. Gal4-type family

Function. The final zinc-coordinating family contains only the Gal4 protein [24,35]. It is a transcriptional regulator of galactose-induced genes and its zinc-coordinating motif has so far only been identified in proteins from Saccharomyces cerevisiae.

Structure. The motif consists of a pair of α helices that coordinate two zinc ions through six cysteine residues, where two of the cysteines are shared by both metal atoms (Figure 2d).

Binding. The first α helix is presented in the DNA major groove for binding with bases, and backbone interactions are made by the second α helix. Gal4 functions as a homodimer and the dimerization interface is located outside the zinc-coordinating motif.

Group III: zipper-type proteins

21. Leucine zipper family

Function. The leucine zipper family consists of the yeast GCN4 proteins that bind promoter regions of genes encoding enzymes involved in amino-acid biosynthesis, and the Fos-Jun heterodimer, which activates the expression of many immune-response genes.

Structure. The structure of the zipper-type proteins may be split into two parts: the dimerization and DNA-binding regions. As shown in Figure 3a, each subunit in the leucine zipper protein consists of a single α helix about 60 amino acids long. Dimerization is mediated through the formation of a coiled coil by a section of 30 amino acids at the carboxy-terminal end of each helix. The segment, known as the zipper region, consists of leucine or a similar hydrophobic amino acid every seven residue positions, roughly every two turns of the α helix. Corresponding side chains from each subunit mediate hydrophobic contacts at the interface through side-by-side packing. The DNA-binding region, also known as the basic region, is found in the amino terminus, and for the leucine zipper proteins, the binding segment is a direct extension of the dimerization region.

Binding. The α helices of the two subunits diverge from the coiled coil and enter the DNA major groove in opposing directions, each binding to half of the target sequence [2,36,79].
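The regular spacing of hydrophobic residues in the zipper region can be checked directly on a protein sequence. The sketch below assumes the seven-residue (heptad) spacing characteristic of coiled coils, which corresponds to roughly two turns of an α helix at about 3.6 residues per turn; the residue set standing in for "leucine or a similar hydrophobic amino acid", the function name and the example sequence are all hypothetical.

```python
# Hypothetical stand-in for "leucine or a similar hydrophobic amino acid".
HYDROPHOBIC = set("LIVM")


def zipper_starts(seq, period=7, min_repeats=4):
    """Start indices where a hydrophobic residue recurs every `period` residues.

    period=7 assumes the heptad spacing of coiled coils; min_repeats is an
    illustrative threshold.
    """
    seq = seq.upper()
    span = period * (min_repeats - 1)
    return [
        i for i in range(len(seq) - span)
        if all(seq[i + k * period] in HYDROPHOBIC for k in range(min_repeats))
    ]


if __name__ == "__main__":
    toy = "LEEEEEE" * 4  # made-up sequence with a leucine every seventh residue
    print(zipper_starts(toy))
```

Changing `period` shows how sensitive the pattern is: the toy sequence matches with a spacing of seven but not with a spacing of eight.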

22. Helix-loop-helix family

Function. The helix-loop-helix proteins are transcription factors that control the expression of a wide range of genes involved in differentiation and development.

Structure. As the name suggests, helix-loop-helix proteins are a modification of the continuous α helices of the leucine zipper proteins in which the DNA-binding and dimerization regions are separated by a loop, resulting in a four-helix bundle (Figure 3b).

Binding. Like the leucine zippers, the dimerization helices interact with each other in a coiled-coil arrangement and the DNA-binding helices are inserted into the DNA major groove. By separating the two segments, more flexibility is allowed in positioning the probe helices on binding nucleic acid [2,37,38].

The helix-loop-helix family is represented by the mouse and human forms of Max, Srebp-1, mouse MyoD and human USF proteins. Sequence identities range from 66% (Max protein, 1an2A, and USF protein, 1an4A) to 97% (mouse Max protein, 1an2A, and human Max protein, 1hloA) and with the exception of the MyoD (1mdyA) and USF (1an4A) protein pair (pairwise SSAP score 70), SSAP scores are above 80. Structural differences between proteins mainly arise from the variation in lengths and positioning of the loops.

Group IV: Other α-helix proteins

23. Papillomavirus-1 E2 family

Function. This family has a single member, the papillomavirus-1 E2 protein, which uses a probe helix as part of the DNA-recognition domain. The protein is a viral transcription regulator that acts at all viral promoters and also functions as a viral replication initiator.

Structure. The DNA-binding region of the E2 protein (Figure 4a) is about 85 residues long and consists of four β strands and two interstrand α helices. Two subunits combine to form an eight-strand β-barrel, which provides the interface for the resulting homodimer.

Binding. The larger α helix from each subunit is symmetrically inserted in the DNA major groove making base and backbone contacts. Additional interactions to the backbone are provided by interstrand loops [41].

24. Histone family

Function. DNA in chromatin is organized in arrays of nucleosomes. The nucleosome, in its role as the principal packaging element of DNA within the nucleus, is the primary determinant of DNA accessibility.

Structure. Two copies of each of four histone proteins are assembled into an octamer that has 145-147 bp of DNA wrapped in a superhelix around it to form a nucleosome core.

Binding. The protein octamer is divided into four 'histone-fold' dimers, each dimer being defined by H3-H4 and H2A-H2B histone pairs. The central histone-fold domains of all four core histone proteins share a highly similar structural motif constructed from three α helices connected by two loops. The two H3-H4 pairs interact through a four-helix bundle formed only from the two H3 histone folds to define the H3-H4 tetramer. Each H2A-H2B pair interacts with this tetramer through a second, homologous four-helix bundle between H2B and H4 histone folds. The histone-fold regions of each tetramer bind to the center of the DNA, which is wrapped into a superhelix. Further α helices and coil elements extend from the histone-fold regions and are also an integral part of the core protein within the confines of the DNA superhelix.

25. EBNA1 protein (Epstein-Barr nuclear antigen 1)

Function. EBNA1 binds to four recognition sites in the origin of latent DNA replication of Epstein-Barr virus and activates latent-phase replication of the viral genomes.

Structure. EBNA1 comprises two domains (Figure 4c), a flanking domain and a core domain (which is structurally homologous to the complete DNA-binding domain of the bovine papilloma virus E2 protein), and binds DNA as a dimer.

Binding. The flanking domain, which includes a helix that projects into the major groove and an extended chain that travels along the minor groove, makes all of the sequence-determining contacts with the DNA. The core domain makes no direct contacts with the DNA bases.

26. Skn-1

Function. Skn-1 is a developmental transcription factor that specifies mesoderm in Caenorhabditis elegans.

Structure. Skn-1 consists of a compact four-helix unit with one helix more than twice as long as any of the others (Figure 4d).

Binding. It binds DNA as a monomer at two contact points. At the carboxy terminus, the longest helix extends from the domain to occupy the major groove of DNA in a manner similar to zipper proteins. Skn-1, however, lacks the leucine zipper found in all zipper proteins. Additional contacts with the DNA are made by a short basic segment at the amino terminus of the domain, reminiscent of the 'homeodomain arm'.

27. Cre recombinase family

Function. Cre recombinase catalyzes a site-specific recombination reaction between two 34 bp loxA and loxP sites in bacteriophage P1.

Structure. Cre is a 320-residue protein and folds into two distinct domains that are separated by a short linker. The amino-terminal domain contains five helices and the large carboxy-terminal domain is primarily α-helical, with a small β sheet packing against a nine-helix domain (Figure 4e).

Binding. The protein binds DNA as a dimer, each monomer binding the outermost 15 bp of one lox half-site. The amino- and carboxy-terminal domains form a clamp around the half-sites making extensive contacts with both major and minor grooves. Helices 2 and 4 of the amino-terminal domain cross each other, both contacting the major groove of the lox half-site. The interface of DNA with the carboxy-terminal domain is complex, involving the entire face of the domain, with both helices and connecting loops interacting with the major and minor grooves and the DNA backbone.

28. High-mobility group family

Function. The high-mobility group (HMG) chromosomal proteins, which are common to all eukaryotes, bind DNA in a non-sequence-specific fashion to promote chromatin function and gene regulation. They interact directly with nucleosomes and are believed to be modulators of chromatin structure. They are also important in activating a number of regulators of gene expression, including p53, Hox transcription factors and steroid hormone receptors, by increasing their affinity for DNA.

Structure. Chromosomal HMG proteins have a global fold of three helices stabilized in an 'L-shaped' configuration by two hydrophobic cores (Figure 4f).

Binding. The HMG domain binds to an AT-rich DNA sequence using a large surface on the concave face of the protein, to bind the minor groove of the DNA. This bends the DNA helix axis away from the site of contact. The first and second helices contact the DNA, their amino termini fitting into the minor groove, whereas helix 3 is primarily exposed to solvent. Partial intercalation of aliphatic and aromatic residues in helix 2 occurs in the minor groove.

29. MADS-box family

Function. The MADS-box motif is found in various DNA-binding proteins, commonly transcription factors, and specifies DNA binding, dimerization and interaction with accessory factors.

Structure. MADS proteins bind DNA as dimers as part of a larger cooperative DNA-binding complex containing other DNA-binding proteins. The MADS domain is a 56-residue motif consisting of a pair of antiparallel coiled-coil α helices packed against an antiparallel two-stranded β sheet. This β sheet of the motif is also involved in interprotein interactions with other accessory proteins.

Binding. MADS dimerization occurs along the extensive flat side of the monomer involving the helices and β sheet. The MADS protein shown here, MCM-1 (Figure 4g), interacts with DNA predominantly with its long α helices located nearly parallel to the minor groove at the center of the binding site. These α helices extend into the major groove on either side of the dyad; direct contacts made within the major groove and along the phosphate backbone cause the DNA to bend around the MADS box. The amino-terminal strand of the MADS region (before the first helix of the MADS motif) often passes over and interacts with the DNA backbone.

Group V: β-sheet proteins

30. TATA box-binding family

This group, which only contains the TATA box-binding protein family, is characterized by the use of large β-sheet structures to bind the DNA (Figure 5).

Function. TATA box-binding proteins are an essential component of the multiprotein transcription initiator complex that assembles on promoters bound by RNA polymerase II.

Structure. Although they are single-chain molecules, their structures are generally considered to consist of two pseudoidentical domains. A ten-stranded antiparallel β sheet joins the domains.

Binding. The β sheet covers the DNA minor groove and creates two substantial kinks away from the main body of the protein, by intercalating phenylalanine side chains from either end of the sheet [46,47].

The family is represented by Pyrococcus woesei, Saccharomyces cerevisiae and human forms of the protein. Unsurprisingly, both sequence and structural alignments of the various subunits yield very high scores (> 90% and 90 or more respectively).

Group VI: β-hairpin/ribbon proteins

31. MetJ repressor family

Function. Transcriptional regulator of the expression of methionine biosynthetic enzymes in E. coli.

Structure. The MetJ repressor binds DNA as a dimer (Figure 6a), each subunit comprising a helical bundle and a single β strand; the strands from each subunit form the antiparallel sheet for DNA-binding (colored red).

Binding. The two β strands fit into the major groove and do not alter the DNA structure significantly on binding. They lie flat against the base of the groove and interactions are only made from one face of the sheet. Supporting backbone contacts are made by the surrounding helices and the amino-terminal loop regions [48].

32. Tus replication terminator family

Function. Tus protein terminates replication of DNA in E. coli.

Structure. The protein consists of two α-helical bundles at the amino and carboxy termini, connected by a large β-sheet region, and binds DNA as a monomer.

Binding. The DNA-binding region of the Tus family is made of four antiparallel β strands (colored red in Figure 6b) which link the amino- and carboxy-terminal domains and produce a large central cleft in the protein. The DNA is bound in this cleft, with the interdomain β strands contacting bases in the major groove. DNA backbone contacts are provided by the whole protein. The β strands are positioned almost perpendicular to the base edges in the groove, enabling contacts from amino acids that expose their side chains on either face of the sheet [50].

33. Integration host factor family

Function. Integration host factor (IHF) is a small heterodimeric protein that specifically binds to DNA and functions as an architectural factor in many cellular processes in prokaryotes.

Structure. The protein is a heterodimer of two related subunits each made of three helices and a two-stranded β sheet.

Binding. In contrast to the two families above, the integration host factor forces an enormous distortion in the DNA by inserting a β hairpin from each subunit in the minor groove (red in Figure 6c). As seen in the TATA box-binding family, the protein produces kinks by intercalating side chains between base steps at the edges of the binding sites. The intercalating prolines are found at the tips of the β hairpins that extend from the protein towards the other side of the DNA. The nucleic acid is bent towards the main body of the protein and the deformation is stabilized by contacts with the phosphate groups [52,80].

34. T-domain family

Function. The T domain (Figure 6d) is an approximately 180-residue homodimeric domain found in transcriptional regulators for genes essential in tissue specification, morphogenesis and organogenesis.

Structure. Each subunit consists of a seven-strand antiparallel β barrel; one opening of this barrel forms a dimer interface with the equivalent segment of the other subunit while the other end points towards the DNA.

Binding. Two β strands protrude from the barrel, one of which extends into the DNA major groove. The probe helix is situated in a three-helix bundle in the carboxy-terminal tail. In contrast to many protein families, the α helix binds base and backbone groups from the DNA minor groove [51].

35. Hyperthermophile chromosomal proteins

Function. These proteins are found in hyperthermophilic archaebacteria and have high thermal, acid and chemical stability. They bind DNA without marked sequence preference and increase the Tm of DNA by about 40°C.

Structure. The proteins consist of an incomplete five-stranded β-barrel capped by an α helix abutting three β strands (Figure 6e).

Binding. The proteins bind the minor groove with the three-stranded β sheet causing the DNA to kink severely. The kink results from the intercalation of specific hydrophobic side chains into the DNA structure, but without causing any significant distortion of the protein structure relative to the uncomplexed protein in solution.

36. Arc repressor

Function. Transcription of the ant gene during lytic growth of bacteriophage P22 is regulated by the cooperative binding of two Arc repressor dimers to a 21-bp operator site.

Structure. Arc is a small (about 100 residues), homodimeric repressor of the ribbon-helix-helix family of transcription factors. Each monomer consists of a pair of helices connected by an antiparallel β sheet (Figure 6f).

Binding. Each Arc dimer uses the β sheet to recognize bases in the major groove and the amino termini of the second helix in each pair contact the DNA backbone.

Group VII: other

37. Rel homology region family

Function. The Rel homology region is found in the amino terminus of proteins that act at the κB DNA recognition site, and mediates DNA binding, dimerization and nuclear localization (Figure 7a). Proteins that contain the region act as transcription regulators for genes commonly involved in cellular defense and differentiation. The carboxy-terminal domains located outside the region are variable between proteins.

Structure. The Rel homology region binds symmetrically as a homo- or hetero-dimer. Each subunit (of about 300 residues) has two distinct domains, both consisting of a β sandwich.

Binding. Interactions in the DNA major groove are made along the whole length of the 10 bp site using a total of ten interstrand loops [54,81].

38. STAT protein family

Function. STATs are a family of eukaryotic transcription factors that mediate the response to a large number of cytokines and growth factors. Upon activation by cell-surface receptors or their associated kinases, STAT proteins dimerize, translocate to the nucleus and bind to specific promoter sequences.

Structure. STAT proteins are between 750 and 850 residues long and bind as dimers to DNA target sites with a 9 bp consensus sequence, TTCCGGGAA. Each monomer is composed of four domains: an amino-terminal four-helix bundle, an eight-stranded β barrel (residues 321-465), a helix-loop-helix 'connector' domain (residues 466-585) and an SH2 domain.

Binding. The STAT homodimer grips the DNA like a pair of pliers (Figure 7b). The monomers are held together by the carboxy-terminal SH2 domains, and the large four-helix bundle domains form the 'handles' of the pliers. The DNA is almost entirely enclosed by the protein dimer, and contacts the loops from the β barrel and the connector domains.

Group VIII: enzymes

39. Methyltransferase family

Function. The methyltransferase enzyme is represented by a single homologous family [82,83]. The protein catalyzes the transfer of a methyl group from S-adenosyl-L-methionine to the C5 position of cytosine. In prokaryotes the reaction is most commonly found in the protection of the DNA from restriction enzymes. In eukaryotes, however, DNA methylation is implicated in a wider range of cellular processes including transcriptional regulation, DNA repair, developmental regulation and chromatin organization. The current dataset only includes the prokaryotic HhaI methyltransferase (for example, 4mht).

Structure. The protein functions as a monomer (about 320 residues) containing two domains that are separated by a large DNA-binding cleft (Figure 8a). The catalytic domain (about 220 residues) consists of a seven-stranded β sheet flanked by a total of five α helices on either side. This domain contains the cofactor-binding site and the active sites. The DNA-recognition domain (about 100 residues) comprises five antiparallel strands that form a twisted β sheet.

Binding. The protein preferentially binds the sequence 5'-GCGC-3' with the first cytosine base methylated in the enzyme reaction. The DNA is bound in the protein cleft so that the major groove faces the recognition domain and the minor groove faces the catalytic domain. The 4 bp in the target sequence are contacted from the major groove using two glycine-rich interstrand loops, and the substrate cytosine is flipped out of the DNA helix into the catalytic domain. The DNA structure is underwound and the base-pairing is rearranged over 3 bp either side of the substrate base. The three structures in the family all have identical sequences and return high pairwise SSAP scores (> 90).
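Locating the substrate cytosine in a sequence is a direct application of the rule above: find each 5'-GCGC-3' site and take the first cytosine. The sketch below is illustrative only; the function name and tuple layout are hypothetical, and the bottom-strand position is inferred purely from the palindromic symmetry of the site, which is an assumption about how the rule applies to the second strand.

```python
def hhai_target_cytosines(seq):
    """Return (site_start, top_c_index, bottom_partner_index) for each GCGC site.

    top_c_index is the first cytosine of the site on the given strand.
    bottom_partner_index is the top-strand base paired with the first cytosine
    of the same site read 5'->3' on the opposite strand (by symmetry).
    """
    seq = seq.upper()
    sites = []
    start = seq.find("GCGC")
    while start != -1:
        sites.append((start, start + 1, start + 2))
        start = seq.find("GCGC", start + 1)  # step by one to allow overlaps
    return sites


if __name__ == "__main__":
    print(hhai_target_cytosines("AAGCGCTT"))
```

Overlapping sites such as those in GCGCGC are reported separately because the search advances one base at a time.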

40-44. Endonucleases

Seven endonuclease families are represented in the current dataset. The FokI family also belongs to the HTH group and has already been described. Figure 8b-8f display MolScript diagrams for representative structures of all the families, viewed parallel and perpendicular to the DNA axis. EcoRV, PvuII, EcoRI and BamHI (1rva, 1pvi, 1eri and 1bhm, respectively) are type II restriction endonucleases that recognize DNA sites of 6 bp in length and cleave the phosphate backbone at precise positions within the target sequence. Although there is little sequence similarity between the four protein types, their U-shaped homodimeric structures display some very common features [57,84,85,86,87].
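As a rough illustration of what recognizing a 6 bp site means at the sequence level, the sketch below scans a DNA string for the sites of the four type II enzymes named above. The recognition sequences are not given in this text; they are the commonly tabulated ones for these enzymes and are included here as an assumption. The names `SITES`, `reverse_complement` and `find_sites` are hypothetical.

```python
# Commonly tabulated 6 bp recognition sequences (not taken from this text).
SITES = {
    "EcoRV": "GATATC",
    "PvuII": "CAGCTG",
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_sites(seq):
    """Map each enzyme name to the 0-based start positions of its site."""
    seq = seq.upper()
    found = {}
    for enzyme, site in SITES.items():
        positions = []
        index = seq.find(site)
        while index != -1:
            positions.append(index)
            index = seq.find(site, index + 1)
        found[enzyme] = positions
    return found

if __name__ == "__main__":
    print(find_sites("TTGAATTCAAGGATCC"))
```

Each of these sites is palindromic (identical to its own reverse complement), which fits the homodimeric binding described here: each subunit can contact one half-site.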

The subunits of PvuII (about 140 residues per subunit) and EcoRV (approximately 240 residues per subunit) may be divided into three segments: the amino-terminal dimerization region, the core catalytic region and the carboxy-terminal DNA-recognition region (Figure 8b, 8c). The catalytic regions of both comprise a five- or six-stranded mixed parallel/antiparallel β sheet (colored blue), which forms part of the cavity base. Most of the DNA-recognition segments extend from the carboxy-terminal end of the catalytic region (red). In PvuII, the region comprises two parallel α helices and in EcoRV, a mixture of α helices and β strands. Both proteins approach the minor groove, and the DNA-recognition regions reach around the side of the DNA to contact bases in the major groove using a pair of loops. The dimerization regions of the two proteins are very different (colored green) and complete the base of the cavity [84,85].

The catalytic (or core) regions of endonucleases EcoRI (about 250 residues per monomer) and BamHI (about 200 residues per monomer) also consist of five-stranded parallel/antiparallel β sheets (Figure 8d, 8e). The positioning of the sheets is different from EcoRV and PvuII, and they form the sides of the cavities. Included in the core region of both proteins are two α helices that pack against their counterparts in the other subunit to form a four-helix bundle at the base of the cavity. EcoRI and BamHI both approach the DNA and make most of the base contacts from the major groove, although their methods of sequence recognition differ greatly. EcoRI uses an extra set of interstrand loops and strands that follow the major groove towards the outer edges of the target sequence from the center (green in Figure 8d). BamHI lacks these extra regions and uses the amino-terminal end of the helical bundle for binding [87,88].

44. Endonuclease V

This protein (for example, 1vas) catalyzes the first step in the pyrimidine-specific base-excision repair pathway. In contrast to the type II enzymes described above, endonuclease V functions as a monomer (about 130 residues) whose structure comprises a four-helix bundle arranged to form a concave surface in which the DNA is bound (Figure 8f). Binding is centered on a damaged pyrimidine dimer. Most of the interactions are to the DNA backbone, and the only base contacts are made to the central adenine, which is flipped out of the DNA helix into a cavity on the protein surface.

45. DNase I

Function. DNase I is an endonuclease that degrades double-stranded DNA in a non-specific but sequence-dependent manner. Its function is dependent on the presence of divalent cations such as Ca2+, Mg2+ and Mn2+.

Structure. DNase I is an α,β protein with two six-stranded β-pleated sheets packed against each other forming the core of a 'sandwich'-type structure. The two predominantly antiparallel β sheets are flanked by three longer α helices and extensive loop regions.

Binding. DNase I binds in the minor groove of the DNA duplex with an exposed loop region forming contacts in and along both sides of the minor groove and extending over a total of 6 bp (Figure 8g). As a consequence of DNase I binding, the minor groove opens by about 3 Å and the duplex bends towards the major groove by about 20°.

46. DNA mismatch endonuclease

Function. In E. coli, the enzyme recognizes a TG mismatched base pair, generated after spontaneous deamination of methylated cytosines, and cleaves the phosphate backbone on the 5' side of the thymine.

Structure. The protein contains three helices surrounding a β sheet, with one other helix used to intercalate the DNA.

Binding. Three aromatic residues from one helix intercalate into the major groove of the DNA to strikingly deform the base pair stacking (Figure 8h).

47-50. Polymerase group

Polymerases must provide sequence-independent interactions with their DNA substrate, yet retain the specificity to distinguish correctly paired bases from mismatches. DNA polymerases synthesize DNA strands by catalyzing the stepwise addition of a deoxyribonucleotide to the 3'-OH end of a polynucleotide chain that is paired to a second, template strand. Four polymerases have been classified: Pol β, Pol I, Pol T7 and Pol RT (reverse transcriptase).
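The stepwise, template-directed addition described above can be sketched as a toy model. This is an illustrative simplification under stated assumptions, not a model of any real polymerase: the template is written 3' to 5' so that position i pairs with position i of the growing strand (written 5' to 3'), and the names `PAIR` and `extend_primer` are hypothetical. Kinetics, proofreading and nucleotide availability are ignored.

```python
# Watson-Crick partners used to choose each incoming deoxyribonucleotide.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend_primer(template_3to5, primer_5to3):
    """Extend a primer along a template, one nucleotide per step.

    template_3to5: template strand written 3'->5'.
    primer_5to3:   primer written 5'->3', annealed at the template's 3' end.
    Returns the full new strand written 5'->3'.
    Raises ValueError if the primer is too long or is mispaired.
    """
    if len(primer_5to3) > len(template_3to5):
        raise ValueError("primer is longer than the template")
    # Check that every primer base is correctly paired with the template.
    for i, base in enumerate(primer_5to3):
        if PAIR[template_3to5[i]] != base:
            raise ValueError(f"mismatch at position {i}")
    strand = primer_5to3
    # Each loop iteration adds one nucleotide to the 3'-OH end.
    while len(strand) < len(template_3to5):
        strand += PAIR[template_3to5[len(strand)]]
    return strand

if __name__ == "__main__":
    print(extend_primer("TACGGA", "AT"))
```

The mismatch check stands in, very loosely, for the requirement that the enzyme distinguish correctly paired bases from mismatches.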

47. DNA polymerase β (pol β)

48. DNA polymerase I (pol I)

49. DNA polymerase T7 (pol T7)

Pol β (Figure 8i) and Pol I (Figure 8j) have three structural domains that perform three separate functions, not only polymerizing the DNA but also editing and repairing it by 3'-5'- and 5'-3'-exonuclease activity, respectively. T7 DNA polymerase (Figure 8k) possesses no 5'-3'-exonuclease activity. For Pol I and T7, the larger carboxy-terminal domain has both the polymerase and 3'-5'-exonuclease activity with an α+β structure that can be likened to that of a right hand. A large cleft formed from a six-stranded antiparallel β sheet surrounded by α helices forms the 'palm' and binds the DNA minor groove along with the 'thumb' region (Figure 8j, 8k). Extensive sequence-independent interactions exist in the minor groove. The major groove, with its sequence-specific pattern of hydrogen-bond donors and acceptors, which forms the primary means of recognition for many sequence-specific DNA-binding proteins, does not contact the protein and is solvent-accessible.

The smaller amino-terminal domain of Pol I has 5'-3'-exonuclease activity. It is folded into an αβ structure with a mixed β sheet of five strands.

50. HIV reverse transcriptase

Function. Reverse transcriptases have two enzymatic activities: a DNA polymerase that can copy either DNA or RNA templates and an RNase H. The two crystal structures of HIV reverse transcriptase which have been solved are only of the polymerase region.

Structure. HIV-1 reverse transcriptase (Figure 8l) is a heterodimer consisting of p66 (about 550 residues) and p51 (about 430 residues), two subunits of α helices and β strands which share a common amino terminus. The p51 subunit corresponds to the polymerase domain of the p66 subunit. The carboxy terminus of p66 forms the RNase H domain.

Binding. Loops and helices of p66 make extensive interactions with the DNA. p51 also binds, but its interactions are mainly at the protein dimer interface with p66.

51. Uracil-DNA glycosylase

Function. Any uracil bases in DNA, a result of either misincorporation or deamination of cytosine, are removed by uracil-DNA glycosylase (UDG).

Structure. UDG is 225 residues long and contains a central four-stranded β-sheet region partly surrounded by eight α helices (Figure 8m).

Binding. Damaged DNA binds to UDG near the carboxy-terminal end of its central four-stranded β sheet. Conserved UDG residues in loop regions contact the DNA, with the loop between sheet 4 and helix 8 inserting into the DNA minor groove. A few contacts with the DNA backbone are made by two helices.

52. 3-Methyladenine DNA glycosylase

Function. DNA N-glycosylases are base excision-repair proteins that locate and cleave damaged bases from DNA as the first step in restoring the sequence.

Structure. The protein is 216 residues in length and is composed mainly of β strands (Figure 8n).

Binding. The enzyme intercalates into the minor groove of DNA using two β strands, causing the damaged base to flip into the enzyme active site for base excision.

53. Homing endonuclease family

Function. Homing endonucleases are a diverse collection of proteins that are encoded by genes with mobile, self-splicing introns. These enzymes promote the movement of the DNA sequences that encode them from one chromosome location to another; they do this by making a site-specific double-strand break at a target site in an allele that lacks the corresponding mobile intron.

Structure. The protein binds DNA as a dimer and displays mixed αβ topology (Figure 8o). Each monomer contains three antiparallel β sheets flanked by two long α helices, and a long carboxy-terminal tail that extends around the surface of the second subunit in the dimer and is stabilized by two bound zinc ions 15 Å apart.

Binding. The zinc-binding motifs are critical primarily for structural stabilization of the protein core and are not involved in DNA binding. The primary sequence-specific contacts made to homing-site DNA are from residues in the second β sheet of each enzyme monomer, which contact the major groove of each half-site. Additional contacts are made in the center of the complex within the minor groove and with several phosphate groups in the cleavage site.

54. Topoisomerase I

Function. Type I topoisomerases promote the relaxation of DNA superhelical tension by introducing a transient single-stranded break in duplex DNA and are vital for the processes of DNA replication, transcription and recombination.

Structure. No crystal structure has been solved for the whole protein - only for the central core and the carboxy-terminal domains (592 residues; see Figure 8p). The central core domain is connected to the carboxy-terminal domain by a linker. This linker assumes a coiled-coil configuration and protrudes away from the remainder of the enzyme.

Binding. The enzyme completely surrounds the DNA, contacting the backbone with loops, and a β sheet binds in the major groove.

Flow diagram showing the selection of the protein-DNA complexes from the PDB (04/01/00). The protein-DNA complexes were grouped into structurally related families using the secondary structure alignment program SSAP (see text).

Group I, HTH proteins. The DNA-binding motif is red. The protein binds as a dimer; one monomer is colored blue and the other yellow. The DNA is shown as a space-filling model. Family names and numbers are as listed in Table 2; PDB codes are bracketed.

Group II, zinc-coordinating proteins. Colors, numbers and names are as in Figure 1.

Group III, zipper-type proteins. Colors, numbers and names are as in Figure 1.

Group IV, 'other α helix proteins'. Colors, numbers and names are as in Figure 1.

Group V, β-sheet proteins. Colors, numbers and names are as in Figure 1.

Group VI, the β-hairpin/ribbon proteins. Colors, numbers and names are as in Figure 1.

Group VII, 'other DNA-binding proteins'. Colors, numbers and names are as in Figure 1.

Group VIII, the enzymes. Colors, numbers and names are as in Figure 1.

3. Conclusions

Knowledge of the structural, molecular, and functional biology of bone is essential for a better comprehension of this tissue as a multicellular unit and a dynamic structure that can also act as an endocrine tissue, a function still poorly understood. In vitro and in vivo studies have demonstrated that bone cells respond to different factors and molecules, contributing to a better understanding of bone cell plasticity. Additionally, integrin-dependent interactions between bone cells and the bone matrix are essential for bone formation and resorption. Studies have addressed the importance of the lacunocanalicular system and the pericellular fluid, by which osteocytes act as mechanosensors, for the adaptation of bone to mechanical forces. Hormones, cytokines, and factors that regulate bone cell activity, such as sclerostin, ephrinB2, and semaphorins, play a significant role in bone histophysiology under normal and pathological conditions. Thus, a deeper understanding of the dynamic nature of bone tissue will certainly help to develop new therapeutic approaches to bone diseases.
