How do you calculate or predict the charge of a protein at pH 7?

How do you calculate or predict the charge of a protein at pH 7 given a fasta sequence?

Any papers or online servers that do this would be much appreciated.

ExPASy to the rescue! Although I didn't comb through all the tools, this nifty website provides a myriad of bioinformatics resources, which almost certainly include a tool to calculate what you want.

Bear in mind, though, that most tools will tell you the isoelectric point (pI) of your protein. From the relationship between pI and pH (if pI < pH, the protein carries a net negative charge; if pI > pH, a net positive charge), you can easily figure out the sign of the charge of a protein at pH 7.

http://expasy.org/tools/

*Update

Here's the exact tool to calculate isoelectric points: http://web.expasy.org/compute_pi/

To calculate the charge at different pH:

At pH 3: K, R and H are + and D, E have no charge, so add up all of the K, R and H in the sequence and that is your net charge at pH 3.
At pH 6: K, R and H are + but now D, E are (-), so subtract one total from the other to figure out whether your net charge is + or -.
At pH 8: K and R are +, H has no charge and D, E are (-).
At pH 10: R is +, K and H have no charge and D, E are (-).

That can give you a general idea. You can estimate the charge at pHs in between.
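The rule of thumb above can be sketched as a small lookup table. This is only an illustration of the counting rule as stated (termini ignored, integer charges only); the function name and the example peptide are made up for the sketch.

```python
from collections import Counter

# Charge assigned to each residue type at the four reference pH values,
# following the rule of thumb above (chain termini are ignored).
RULES = {
    3: {"K": +1, "R": +1, "H": +1},
    6: {"K": +1, "R": +1, "H": +1, "D": -1, "E": -1},
    8: {"K": +1, "R": +1, "D": -1, "E": -1},
    10: {"R": +1, "D": -1, "E": -1},
}


def rough_net_charge(sequence: str, ph: int) -> int:
    """Integer net charge estimate at pH 3, 6, 8 or 10."""
    counts = Counter(sequence.upper())
    return sum(charge * counts[aa] for aa, charge in RULES[ph].items())


peptide = "MKRHDEEK"  # hypothetical sequence: 2 K, 1 R, 1 H, 1 D, 2 E
for ph in (3, 6, 8, 10):
    print(ph, rough_net_charge(peptide, ph))
```

For pH values in between, a fractional estimate based on pKa values is more appropriate than this integer count.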

pKa values of amino acid side chains play an important role in defining the pH-dependent characteristics of a protein. The pH-dependence of the activity displayed by enzymes and the pH-dependence of protein stability, for example, are properties that are determined by the pKa values of amino acid side chains.

The pKa value of an amino acid side chain in solution is typically inferred from the pKa values of model compounds (compounds that are similar to the side chains of amino acids). See the Amino acid article for the pKa values of all amino acid side chains inferred in such a way. There are also numerous experimental studies that have yielded such values, for example by use of NMR spectroscopy.

The table below lists the model pKa values that are often used in a protein pKa calculation, and contains a third column based on protein studies. 

Amino acid       Model pKa   pKa in proteins
Asp (D)          3.9         4.0
Glu (E)          4.3         4.4
Arg (R)          12.0        13.5
Lys (K)          10.5        10.4
His (H)          6.08        6.8
Cys (C) (–SH)    8.28        8.3
Tyr (Y)          10.1        9.6
N-term           8.0
C-term           3.6
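To answer the original question directly, the values in the first pKa column of the table above can be plugged into the Henderson–Hasselbalch equation to estimate the net charge at pH 7 from a FASTA sequence. The sketch below assumes an unfolded chain in which every group titrates independently with its model pKa, so it ignores the environment effects discussed next. The function names and the example record are hypothetical.

```python
# Model pKa values copied from the first pKa column of the table above.
BASIC_PKA = {"R": 12.0, "K": 10.5, "H": 6.08}            # protonated form is +1
ACIDIC_PKA = {"D": 3.9, "E": 4.3, "C": 8.28, "Y": 10.1}  # deprotonated form is -1
N_TERM_PKA = 8.0
C_TERM_PKA = 3.6


def read_first_fasta_sequence(fasta_text: str) -> str:
    """Return the first sequence from FASTA text (header lines start with '>')."""
    sequence_lines = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header or sequence_lines:
                break  # a second record starts here: keep only the first one
            seen_header = True
            continue
        sequence_lines.append(line)
    return "".join(sequence_lines).upper()


def net_charge(sequence: str, ph: float = 7.0) -> float:
    """Henderson-Hasselbalch estimate of the net charge of an unfolded chain."""

    def plus(pka: float) -> float:
        # Fraction of a basic group that is protonated (charge +1).
        return 1.0 / (1.0 + 10.0 ** (ph - pka))

    def minus(pka: float) -> float:
        # Fraction of an acidic group that is deprotonated (charge -1).
        return 1.0 / (1.0 + 10.0 ** (pka - ph))

    charge = plus(N_TERM_PKA) - minus(C_TERM_PKA)  # chain termini
    for residue in sequence:
        if residue in BASIC_PKA:
            charge += plus(BASIC_PKA[residue])
        elif residue in ACIDIC_PKA:
            charge -= minus(ACIDIC_PKA[residue])
    return charge


example = ">hypothetical_peptide\nMKRHDEEK\nGGSY\n"
seq = read_first_fasta_sequence(example)
print(seq, round(net_charge(seq, 7.0), 2))
```

Swapping in the second pKa column, or another published pKa set, changes the result slightly, which is one reason different servers report different charges for the same sequence.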

When a protein folds, the titratable amino acids in the protein are transferred from a solution-like environment to an environment determined by the 3-dimensional structure of the protein. For example, in an unfolded protein an aspartic acid typically is in an environment which exposes the titratable side chain to water. When the protein folds the aspartic acid could find itself buried deep in the protein interior with no exposure to solvent.

Furthermore, in the folded protein the aspartic acid will be closer to other titratable groups in the protein and will also interact with permanent charges (e.g. ions) and dipoles in the protein. All of these effects alter the pKa value of the amino acid side chain, and pKa calculation methods generally calculate the effect of the protein environment on the model pKa value of an amino acid side chain.    

Typically the effects of the protein environment on the amino acid pKa value are divided into pH-independent effects and pH-dependent effects. The pH-independent effects (desolvation, interactions with permanent charges and dipoles) are added to the model pKa value to give the intrinsic pKa value. The pH-dependent effects cannot be added in the same straightforward way and have to be accounted for using Boltzmann summation, Tanford–Roxby iterations or other methods.

The interplay of the intrinsic pKa values of a system with the electrostatic interaction energies between titratable groups can produce quite spectacular effects such as non-Henderson–Hasselbalch titration curves and even back-titration effects. 

The image below shows a theoretical system consisting of three acidic residues. One group is displaying a back-titration event (blue group).

Several software packages and web servers are available for the calculation of protein pKa values.

Using the Poisson–Boltzmann equation

Some methods are based on solutions to the Poisson–Boltzmann equation (PBE), often referred to as FDPB-based methods (FDPB is for "finite difference Poisson–Boltzmann"). The PBE is a modification of Poisson's equation that incorporates a description of the effect of solvent ions on the electrostatic field around a molecule.

The H++ web server, the pKD webserver, MCCE, Karlsberg+, PETIT and GMCT use the FDPB method to compute pKa values of amino acid side chains.

FDPB-based methods calculate the change in the pKa value of an amino acid side chain when that side chain is moved from a hypothetical fully solvated state to its position in the protein. To perform such a calculation, one needs theoretical methods that can calculate the effect of the protein interior on a pKa value, and knowledge of the pKa values of amino acid side chains in their fully solvated states.    

Empirical methods

A set of empirical rules relating the protein structure to the pKa values of ionizable residues has been developed by Li, Robertson, and Jensen. These rules form the basis for the web-accessible program PROPKA, which gives rapid predictions of pKa values. A more recent empirical pKa prediction program was released by Tan KP et al. and is available through the DEPTH web server.

Molecular dynamics (MD)-based methods

Molecular dynamics methods of calculating pKa values make it possible to include full flexibility of the titrated molecule.   

Molecular dynamics based methods are typically much more computationally expensive, and not necessarily more accurate, ways to predict pKa values than approaches based on the Poisson–Boltzmann equation. Limited conformational flexibility can also be realized within a continuum electrostatics approach, e.g., for considering multiple amino acid sidechain rotamers. In addition, current commonly used molecular force fields do not take electronic polarizability into account, which could be an important property in determining protonation energies.

Determining pKa values from titration curves or free energy calculations

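The Henderson–Hasselbalch pKa value of a site, \$pK_a^{HH}\$, is related to the protonation free energy of the site, \$\Delta G^{prot}\$, via

\$\Delta G^{prot} = RT \ln 10 \, (pH - pK_a^{HH})\$

The protonation free energy can in principle be computed from the protonation probability of the group ⟨x⟩(pH), which can be read from its titration curve:

\$\Delta G^{prot} = -RT \ln \frac{\langle x \rangle}{1 - \langle x \rangle}\$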
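As a small worked sketch of this step, the code below assumes the standard relations ΔG = -RT ln(⟨x⟩ / (1 - ⟨x⟩)) and pKa(HH) = pH - ΔG / (RT ln 10), with ⟨x⟩ the protonation probability read from a titration curve. The function names and the choice of kJ/mol at 298.15 K are assumptions made for the illustration.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def protonation_free_energy(x: float, temperature: float = 298.15) -> float:
    """Protonation free energy in kJ/mol from protonation probability x."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return -R_KJ * temperature * math.log(x / (1.0 - x))


def henderson_hasselbalch_pka(ph: float, x: float, temperature: float = 298.15) -> float:
    """pKa(HH) of a site at the given pH, from its protonation probability."""
    delta_g = protonation_free_energy(x, temperature)
    return ph - delta_g / (R_KJ * temperature * math.log(10.0))


# A site that is protonated 1/11 of the time at pH 8 behaves like a pKa 7 group.
print(henderson_hasselbalch_pka(8.0, 1.0 / 11.0))
```

The logarithm diverges as ⟨x⟩ approaches 0 or 1, which is the numerical side of the convergence problem described further down.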

Titration curves can be computed within a continuum electrostatics approach with formally exact but more elaborate analytical or Monte Carlo (MC) methods, or inexact but fast approximate methods. MC methods that have been used to compute titration curves  are Metropolis MC  or Wang–Landau MC. Approximate methods that use a mean-field approach for computing titration curves are the Tanford–Roxby method and hybrids of this method that combine an exact statistical mechanics treatment within clusters of strongly interacting sites with a mean-field treatment of intercluster interactions.     

In practice, it can be difficult to obtain statistically converged and accurate protonation free energies from titration curves if ⟨x⟩ is close to a value of 1 or 0. In this case, one can use various free energy calculation methods to obtain the protonation free energy  such as biased Metropolis MC,  free-energy perturbation,   thermodynamic integration,    the non-equilibrium work method  or the Bennett acceptance ratio method. 

Note that the \$pK_a^{HH}\$ value does in general depend on the pH value.

This dependence is small for weakly interacting groups like well solvated amino acid sidechains on the protein surface, but can be large for strongly interacting groups like those buried in enzyme active sites or integral membrane proteins.   

pH and pKa

Once you have pH or pKa values, you know certain things about a solution and how it compares with other solutions:

• The lower the pH, the higher the concentration of hydrogen ions [H+].
• The lower the pKa, the stronger the acid and the greater its ability to donate protons.
• pH depends on the concentration of the solution. This is important because it means a weak acid could actually have a lower pH than a diluted strong acid. For example, concentrated vinegar (acetic acid, which is a weak acid) could have a lower pH than a dilute solution of hydrochloric acid (a strong acid).
• On the other hand, the pKa value is constant for each type of molecule. It is unaffected by concentration.
• Even a chemical ordinarily considered a base can have a pKa value because the terms "acids" and "bases" simply refer to whether a species will give up protons (acid) or remove them (base). For example, if you have a base Y with a pKa of 13, it will accept protons and form YH, but when the pH exceeds 13, YH will be deprotonated and become Y. Because Y removes protons at a pH greater than the pH of neutral water (7), it is considered a base.
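The point about concentration can be checked with a short calculation. The sketch below assumes a monoprotic weak acid obeying Ka = x² / (C - x) with x = [H+], ignores water autoionization and activity effects, and uses pKa 4.76 for acetic acid. The two concentrations are illustrative choices, not values from the text.

```python
import math


def ph_strong_acid(concentration: float) -> float:
    """pH of a fully dissociated monoprotic acid (water autoionization ignored)."""
    return -math.log10(concentration)


def ph_weak_acid(concentration: float, pka: float) -> float:
    """pH of a monoprotic weak acid, solving Ka = x^2 / (C - x) for x = [H+]."""
    ka = 10.0 ** (-pka)
    # Positive root of x^2 + Ka*x - Ka*C = 0.
    h = (-ka + math.sqrt(ka * ka + 4.0 * ka * concentration)) / 2.0
    return -math.log10(h)


acetic = ph_weak_acid(1.0, 4.76)   # 1 M acetic acid (illustrative)
hcl = ph_strong_acid(1e-3)         # 1 mM hydrochloric acid (illustrative)
print(round(acetic, 2), round(hcl, 2))
```

Under these assumptions the concentrated weak acid comes out more acidic than the dilute strong acid, even though its pKa is far higher.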

How does pH change protein structure? Changes in pH change the attractions between the groups in the side chains of the protein.

Explanation:

The interactions between the side chains of the amino acids determine the shape of a protein. Four types of attractive interactions determine the shape and stability of a protein. The two that pH changes affect are salt bridges (a) and hydrogen bonding (b).

Salt Bridges

Salt bridges are ionic bonds between positively and negatively charged side chains of amino acids. An example is the attraction between a #"-COO"^"-"# ion of aspartic acid and an #"-NH"_3^"+"# ion of lysine.

Increasing the pH by adding a base converts the #"-NH"_3^"+"# ion to a neutral #"-NH"_2# group.

Decreasing the pH by adding an acid converts the #"–COO"^"-" # ion to a neutral #"-COOH"# group.

In each case the ionic attraction disappears, and the protein shape unfolds.

Hydrogen Bonding

Various amino acid side chains can hydrogen bond to each other. Examples are:

• Two alcohols: Ser, Thr, and Tyr.
• Alcohol and amine: Ser and Lys
• Alcohol and amide: Ser and Asn
• Alcohol and acid: Asp and Tyr
• Two acids: Asp and Glu

Changing the pH disrupts the hydrogen bonds, and this changes the shape of the protein.

1. Introduction

One of the most successful methods for detecting and analyzing protein posttranslational modifications (PTMs) has been two-dimensional gel electrophoresis (2D-GE). Since many PTMs, such as phosphorylation, introduce charged groups into the protein, there is often a detectable change in the position of the protein on a 2D gel. Although the change in the mass of the protein due to the PTM is often too small to be easily detected by standard sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE), the modification can cause a change in the net charge of the protein, leading to a change in the isoelectric point, or pI, of the protein. The first dimension of the 2D gel, usually shown horizontally, is the isoelectric focusing dimension, so changes in protein pI are reflected as changes in the horizontal position of the protein spot within the 2D pattern of spots. Often it is observed that there are ‘trains’ of spots on the gel that are presumably formed by multiple versions of the same protein that differ in isoelectric point due to increasing numbers of posttranslational modifications such as phosphorylation or deamidation (1, 2).

Although 2D-GE is a sensitive method for determining that there are posttranslationally modified forms of proteins present, it does not directly indicate what the modification is or how many of the residues in the protein are modified. Since proteins vary greatly in their ability to buffer the change in pI due to posttranslational modifications, to examine these results more closely it is necessary to calculate the predicted pI changes caused by the modification in the context of the protein sequence.

To meet this need, we have developed ProMoST, a web based application that allows users considerable freedom in calculating the pI values of modified and unmodified proteins (3). ProMoST has predefined modifications so that casual users are able to rapidly determine the predicted pI values of modified proteins and peptides. In addition, ProMoST also provides additional options for more advanced users allowing them to define additional custom modifications, change the pKa values for the defined modifications and even make changes to the default pKa values for charged amino acids used to calculate pI values. The results of the calculations can be displayed both in a tabular format as well as in a graphic representation of the migration of the protein on a 2D gel.

1.1 pI

The pKa values of the side chains of the twenty common amino acids that comprise most proteins vary from approximately pH 2.8 to pH 11.2 (4). Three amino acids that are positively charged under physiological conditions (lysine, arginine, and histidine) are termed basic amino acids, and two amino acids that are negatively charged under physiological conditions (glutamic acid and aspartic acid) are termed acidic amino acids. In addition, the amino (N) and carboxyl (C) termini of the protein can also be charged. To determine the total charge of a protein at a given pH, the fractional charge of each ionizable group in the protein’s sequence is determined, and the sum of the fractional charges is equal to the charge on the protein.

The isoelectric point, or pI of the protein is the pH value at which the total charge on the protein is zero. At this pH value the negative and positive charges of the protein are equal and the protein is at neutral charge. The pI of the protein therefore gives an indication of whether the protein will carry a net positive or negative charge under physiological conditions. Proteins that have a pI > 7.0 are considered to be basic proteins and proteins that have a pI < 7.0 are considered to be acidic proteins.
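Because the net charge falls monotonically as the pH rises, the pI can be located numerically by bisection. The sketch below is not the ProMoST algorithm. It assumes independent Henderson–Hasselbalch titration of each group, a root between pH 0 and 14, and the model pKa values quoted earlier in this document (Asp 3.9, Glu 4.3, Cys 8.28, Tyr 10.1, His 6.08, Lys 10.5, Arg 12.0, N-term 8.0, C-term 3.6).

```python
POSITIVE_PKA = {"K": 10.5, "R": 12.0, "H": 6.08}
NEGATIVE_PKA = {"D": 3.9, "E": 4.3, "C": 8.28, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.0, 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Sum of fractional charges of all ionizable groups at the given pH."""
    positive = [N_TERM_PKA] + [POSITIVE_PKA[a] for a in sequence if a in POSITIVE_PKA]
    negative = [C_TERM_PKA] + [NEGATIVE_PKA[a] for a in sequence if a in NEGATIVE_PKA]
    gained = sum(1.0 / (1.0 + 10.0 ** (ph - pka)) for pka in positive)
    lost = sum(1.0 / (1.0 + 10.0 ** (pka - ph)) for pka in negative)
    return gained - lost


def isoelectric_point(sequence: str, low: float = 0.0, high: float = 14.0,
                      tolerance: float = 1e-4) -> float:
    """Bisection search for the pH at which the net charge crosses zero."""
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle   # still positive: the pI lies at higher pH
        else:
            high = middle  # zero or negative: the pI lies at lower pH
    return (low + high) / 2.0


for peptide in ("KKKK", "DDDD", "MKRHDEEK"):  # hypothetical sequences
    pi = isoelectric_point(peptide)
    print(peptide, round(pi, 2), "basic" if pi > 7.0 else "acidic")
```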

In addition to giving an indication of the charge of the protein, the pI is also a good indicator of the solubility of the protein at a given pH. One of the most important aspects of a protein’s physicochemical properties that determines solubility is its charge. Thus at a pH equal to the pI of the protein, the protein carries no net charge and is therefore usually at its least soluble. Manipulating protein charge, either by changing pH or by adding salt to neutralize charge, is the basis for many of the early methods for protein purification by differential solubility (5).

The loss of charge at a protein’s pI is also part of the fractionation process during the first dimension of the 2D gel that is based on isoelectric focusing (6). Proteins are introduced to a strip on which a pH gradient has been established, and in the presence of a high electric field they migrate to the position on the strip at which the protein has a net neutral charge, where it stops migrating. This pH value corresponds to the pI of the protein. Thus the final migration position of the protein in the horizontal dimension of the 2D gel is determined by the pI value of the protein.

1.2 Modifications and mutations change pI

The fact that the migration in the isoelectric focusing dimension of proteins in 2D-GE is very sensitive to changes in pI makes 2D-GE a valuable technique for identifying modifications and mutations (1). Modifications such as phosphorylation that add highly charged groups to the protein can cause easily detectable changes in pI and therefore mobility of the protein in the isoelectric focusing dimension. Similarly, the changes in protein mass and pI due to mutations that cause a net loss or gain of charge on the protein by altering the number of charged acidic and basic residues present in the protein can also be calculated and displayed. The amount of mobility shift that is observed due to modification or mutation is dependent on three factors. First, the pKa value for the modification or change induced by mutation is very important to the final change in the protein pI. Modifications, such as phosphorylation, that introduce a group with either a strongly acidic or basic pKa will have a greater effect than those with a pKa value closer to neutrality. Similarly, a mutation that causes a change from an acidic residue to a basic residue will lead to a larger change in pKa than a change from a charged residue to a neutral residue. The larger pKa alteration will lead to a larger change in protein pI and therefore a larger mobility shift in the isoelectric focusing dimension of the 2D gel. Second, the number of modifications or residue changes will also have an impact on the mobility shift observed on the gels. Often for modifications, a train of spots will be observed. Interestingly, the shift in mobility is often not constant and the distance between spots can vary. This is explained by the third factor that determines the magnitude of the observed pI shift: the charge buffering capacity of the protein at a given pH. 
Since different proteins are composed of different mixtures of positively and negatively charged amino acids depending on their primary amino acid sequence, the charge titration profile for each protein is unique. Thus the extent to which a modification changes the pI of the protein, and with it the mobility of the protein, is different, since the charge titration profile changes with pH. Figure 1 shows an example of this for human cyclin-dependent kinase 2 (CDK2). Figure 1 Panel A shows the titration of the unmodified protein. Panels B–D show the titrations with 1, 2 or 3 phosphate groups. For comparison, Figure 1 Panel E shows the 2D gel spot positions calculated by ProMoST. Note that the magnitude of the shift in the calculated spot position due to additional phosphorylation varies from spot to spot. This variance correlates with the titration curves for CDK2 shown in Figure 1, Panels A–D.

pI, pH and pKa

The kinetics of protons dissociating (and associating) from an acid, A, can be treated like any other reaction. The dissociation reaction is:

\$\qquad\large{\text{HA} \rightleftharpoons \text{H}^+ + \text{A}^-}\$

where HA is the acid (e.g. hydrochloric acid, HCl) and A⁻ is what is known as the conjugate base (Cl⁻ is the conjugate base of HCl). An acid dissociation constant, \$K_a\$, can be defined just like any other equilibrium constant. It is the ratio of the concentrations of the products to the concentration of the undissociated acid at equilibrium:

\$\qquad\large{K_a = \frac{[\text{H}^+][\text{A}^-]}{[\text{HA}]}}\$

A strong acid is good at losing protons, so equilibrium will only be reached when [H⁺] and [A⁻] are high and [HA] is low. A strong acid therefore has a large \$K_a\$. In the same way that pH is the negative logarithm of [H⁺], \$pK_a\$ is the negative logarithm of \$K_a\$:

\$\qquad\large{pK_a = -\log_{10} K_a}\$

On a side note, an analogous equation can be written for the dissociation of a proton from the protonated form of a base:

\$\qquad\large{\text{BH}^+ \rightleftharpoons \text{H}^+ + \text{B}}\$

Protein Structure: pKa and Protonation States

To refresh your memory, pKa is related to the acid dissociation constant Ka through the following equation:

\$pK_a = -\log_{10} K_a\$

The pKa is helpful because it allows us to easily discern how strong an acid is. Ka also provides this information, but pKa expresses it in more easily understood values. The lower the pKa, the stronger the acid.

Let us also think of the Henderson–Hasselbalch equation, which relates pH and pKa:

\$pH = pK_a + \log_{10} \frac{[A^-]}{[HA]}\$

It follows that:

• If pH < pKa, then the species is protonated.
• If pH > pKa, then the species is deprotonated.
• If pH = pKa, then half of the species has dissociated. In other words, there are equal concentrations of deprotonated species and protonated species in solution.
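The three rules above are the limiting cases of a smooth curve. Rearranging the Henderson–Hasselbalch equation gives the protonated fraction as 1 / (1 + 10^(pH - pKa)), which the sketch below evaluates. The function name is made up for the illustration, and the pKa values of about 9 and 3 are the rough terminal values quoted in this section.

```python
def fraction_protonated(ph: float, pka: float) -> float:
    """Fraction of a titratable group that carries its proton at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


physiological_ph = 7.4
for name, pka in (("N-terminus", 9.0), ("C-terminus", 3.0)):
    print(name, fraction_protonated(physiological_ph, pka))
```

One pH unit away from the pKa the group is already about 91% in one form, and two units away about 99%, which is why the all-or-nothing rules work well except near pH = pKa.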

This background information is exceptionally important when understanding the protonation states of amino acids. Let us look over our list of 20 naturally occurring amino acids again. Note that these are represented at physiological pH, which is about 7.4. At this pH, for all of the amino acids, the N and C termini are charged. This is because the pKa of the N-term is about 9, while the pKa of the C-term is about 3. The protonation states of these key elements in amino acids can be better visualized by drawing them at a different pH.

For many of our amino acids, however, it does get more complicated. These amino acid side chains have their own pKa, and could end up with a charge at physiological pH as a result. The residues that are charged at physiological pH are Asp, Glu, Lys and Arg. These are not the only residues with pKa values: both tyrosine and cysteine have relevant pKa values. These residues are not charged at physiological pH but can be deprotonated under other pH conditions. Additionally, histidine is mostly uncharged at physiological pH, even though it is a basic residue. Under acidic conditions, where histidine is protonated, it carries a positive charge.

The protonation states of residue side chains work exactly the same way as described above. The acidic residues Asp and Glu (with pKa < 7) maintain a negative charge because their pKa is lower than physiological pH. In contrast, the basic residues Lys and Arg maintain a positive charge because their pKas are higher than physiological pH, and they have not been deprotonated as a result.

An important definition to know is the isoelectric point (pI). This is defined as the pH at which an amino acid carries no net charge. For an amino acid without an ionizable side chain, it is the average of the pKa values of the carboxyl and amino groups: \$pI = (pK_{a1} + pK_{a2})/2\$. Another key term is a zwitterion. This is an electroneutral molecule that carries both positive and negative charges, which cancel each other out.

*****Note the difference between a zwitterion and the isoelectric point of a molecule, and do NOT get them confused. A molecule at its isoelectric point may be a zwitterion, but the molecule can also exist as a zwitterion at pH values other than the isoelectric point.*****
The major connection between these two definitions is that the concentration of the zwitterion form of a molecule is at its highest when pH = pI.

4.4 The interaction with other titratable groups

The desolvation energies and the background interaction energies can be regarded as being largely pH-independent. The interaction energy between titratable groups is obviously not pH-independent, and it is therefore not possible just to add the interaction energies with all the other titratable groups to the intrinsic pKa in order to get the true pKa value of the residue. We therefore have to use a calculation protocol that takes the pH-dependence of the interactions between titratable groups into account. This can be done if we calculate the energy for each of the possible protonation states of the protein, and use these energies to evaluate the partition function for these states at a range of pH-values.

State  Group 1  Group 2  Group 3  Energy
1      +        +        +        dGpH(1) + dGpH(2) + dGpH(3) + (1=2) + (1=3) + (2=3)
2      +        +        0        dGpH(1) + dGpH(2) + (1=2)
3      +        0        +        dGpH(1) + dGpH(3) + (1=3)
4      +        0        0        dGpH(1)
5      0        +        +        dGpH(2) + dGpH(3) + (2=3)
6      0        +        0        dGpH(2)
7      0        0        +        dGpH(3)
8      0        0        0        0

Table 4.1 Possible protonation states for a hypothetical protein consisting of three titratable groups. +: charged, 0: neutral. Energy is relative to state 8. (X=Y) indicates the interaction energy between the charged forms of groups X and Y. dGpH(X) is the free energy difference between the charged and neutral forms of group X at a fixed pH value (see text for explanation).

Let us consider a protein with three titratable groups. Each of these groups can exist in two states: charged and neutral. The protein can thus occupy 2^3 = 8 different protonation states. These are summarised in Table 4.1. At a given pH we want to determine the free energy of all the states in Table 4.1 relative to the free energy of state 8, which we have defined to be zero. The free energy of each of the other states consists of two terms A and B:

A) For each residue: the energy difference between the charged and neutral form of the residue disregarding the interactions between the titratable groups.

B) The interactions between the titratable groups.

4.4.1 Term A

Term A can be calculated from the intrinsic pKa for each residue by rearranging Eq. 4.10:

This gives an expression for the free energy difference between the charged and neutral state of a titratable group at a fixed pH value:

4.4.2 Term B

Term B is the sum of the interaction energies between the titratable groups in this particular protonation state. For state five, for example, term B should hold the following three interaction energies ([X : Y] denotes the interaction energy between X and Y):

E1 = [G1:0 : G2:+]
E2 = [G1:0 : G3:+]
E3 = [G2:+ : G3:+]

(G1 = Group 1, G2 = Group 2, G3 = Group 3, :+ = charged, :0 = neutral)

The energies E1 and E2 are already contained in the intrinsic pKa, because it is calculated by determining the energy of charging a single group in a form of the protein where all other titratable groups are in their neutral state (see section 4.2.3 and Fig. 6.2).

Thus only E3 has to be added to term A to obtain the free energy for state five. The intrinsic pKa, however, does also contain the energies E4 and E5 (in the same way that the intrinsic pKa contains E1 and E2).

We have to correct for this in the energy that we add to the intrinsic pKa [dGpH(2) and dGpH(3) in Table 4.1] for the interaction between the charged forms of groups two and three. A simple evaluation shows that:

E3 - (E4 + E5) = [G2(+) : G3 (+)] - [G2 (+) : G3 (0)] - [G2 (0) : G3 (+)] + [G2 (0) : G3 (0)]

and this is therefore the energy which is listed as (2=3) in Table 4.1.

4.5 Calculating titration curves

We now know the energy of every possible protonation state of a protein at a given pH value, and the next step is the conversion of these energies into fractional charges at each pH value for each residue in order to get the titration curves.

A straightforward way to find the occupancy of the different states in Table 4.1 is to evaluate the Boltzmann sum for each state:

\$p_i = \frac{e^{-E_i / kT}}{\sum_j e^{-E_j / kT}}\$

Here pi is the fraction of molecules in state i. Ei is the energy of state i, and the sum in the denominator is over all possible states of the system. k is Boltzmann's constant and T is the temperature in Kelvin.

The fractional charge of a particular group is simply the sum of the pi's for all the states where the group is charged. Thus for group 1 in Table 4.1, for example, the charge is the sum of p1, p2, p3 and p4.
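The enumeration described above can be sketched for a small system. The expression for dGpH(X) is not reproduced in the text here, so the sketch assumes the usual form: in units of kT, charging a base costs ln(10)·(pH - pKa_int) and charging an acid costs ln(10)·(pKa_int - pH). Interaction energies between charged forms are also given in kT. All names and numbers are hypothetical.

```python
import itertools
import math

LN10 = math.log(10.0)


def state_energy(state, ph, pka_int, kinds, interactions):
    """Energy in kT of one protonation state relative to the all-neutral state.

    state: tuple of 1 (charged) or 0 (neutral) per group.
    kinds: "acid" or "base" per group.
    interactions: {(i, j): energy in kT between the charged forms of i and j}.
    """
    energy = 0.0
    for i, charged in enumerate(state):
        if charged:
            sign = 1.0 if kinds[i] == "base" else -1.0
            energy += sign * LN10 * (ph - pka_int[i])  # term A (assumed form)
    for (i, j), pair_energy in interactions.items():
        if state[i] and state[j]:
            energy += pair_energy                       # term B
    return energy


def fractional_charges(ph, pka_int, kinds, interactions):
    """Fraction of molecules in which each group is charged (Boltzmann sum)."""
    n = len(pka_int)
    states = list(itertools.product((1, 0), repeat=n))  # 2^n protonation states
    weights = [math.exp(-state_energy(s, ph, pka_int, kinds, interactions))
               for s in states]
    partition_function = sum(weights)
    return [sum(w for s, w in zip(states, weights) if s[i]) / partition_function
            for i in range(n)]


# Three acidic groups with intrinsic pKa 4.0; groups 0 and 1 repel by 2 kT.
for ph in (3.0, 4.0, 5.0, 6.0):
    print(ph, fractional_charges(ph, [4.0, 4.0, 4.0], ["acid"] * 3, {(0, 1): 2.0}))
```

Scanning the pH in this way gives one titration curve per group. The coupled groups no longer reach half-charge exactly at their intrinsic pKa.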

It is clear from Table 4.1 that the number of states equals 2^N, where N is the number of titratable groups. For values of N significantly larger than 30, it is therefore no longer possible to evaluate (Eq. 4.18). For large systems it is thus customary to use a Monte Carlo protocol [Beroza et al., 1991] to obtain pi.
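A minimal Metropolis sketch of such a Monte Carlo protocol is shown below. It is not the Beroza et al. implementation. It assumes energies in kT, the same charging-energy convention as a simple Henderson–Hasselbalch site (ln(10)·(pH - pKa_int) for a base, the opposite sign for an acid), single-group flips only, and no equilibration period. All names are hypothetical.

```python
import math
import random

LN10 = math.log(10.0)


def metropolis_fractional_charges(ph, pka_int, signs, interactions,
                                  sweeps=20000, seed=0):
    """Estimate how often each group is charged by Metropolis sampling.

    signs[i] is +1 for a base (charged when protonated) and -1 for an acid
    (charged when deprotonated). interactions[i][j] is the energy in kT
    between the charged forms of groups i and j (symmetric matrix).
    """
    rng = random.Random(seed)
    n = len(pka_int)
    state = [0] * n            # start from the all-neutral state
    charged_counts = [0] * n
    for _ in range(sweeps):
        for _ in range(n):     # one sweep = n attempted single-group flips
            i = rng.randrange(n)
            # Energy of charging group i given the current state of the others.
            charging = signs[i] * LN10 * (ph - pka_int[i])
            charging += sum(interactions[i][j]
                            for j in range(n) if j != i and state[j])
            delta = -charging if state[i] else charging
            # Metropolis criterion: accept downhill moves, uphill with exp(-delta).
            if delta <= 0.0 or rng.random() < math.exp(-delta):
                state[i] = 1 - state[i]
        for i in range(n):
            charged_counts[i] += state[i]
    return [count / sweeps for count in charged_counts]


# A single base with pKa 7 at pH 8 should be charged about 1/11 of the time.
print(metropolis_fractional_charges(8.0, [7.0], [+1], [[0.0]]))
```

The cost per sweep grows with N rather than 2^N, which is what makes this approach usable for proteins with many titratable groups.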

From the calculated titration curves the pKa value for each group is determined as the pH where the group is half-protonated. This gives an accurate result only if the titration curve follows a Henderson-Hasselbalch shape. This is the case for most groups, but especially in active sites it is quite common to find groups that have very irregular titration curves. In these cases manual inspection of the titration curves is necessary in order to obtain meaningful results.

4.6 Performance of pKa calculation methods

Several pKa calculation packages are presently available. Most of these, however, struggle to reach better agreement with experimentally determined pKa values than the so-called null model. The null model assumes that the pKa values of protein side chains are not shifted at all compared to their value in water.

This poor performance of pKa calculations is not due to an incorrect theory, though, but rather to an incorrect description of the protein in the calculations. A fundamental problem with pKa calculations is that crystal structures are used as source of coordinates for the protein. The crystal symmetry induces structural changes in the protein, and thereby causes some pKa values to be shifted compared to their value in solution. It is therefore not surprising that the pKa values calculated from a crystal structure will differ from the pKa values measured in solution by NMR.

The description of the protein used in pKa calculations is, however, also often too simple. Protons are, for example, often omitted, and methods that include protons often do not model the deprotonation of a titratable group explicitly. It is our opinion that pKa calculations can improve greatly by including a more detailed description of the protein and its dynamics.

General sequence analysis with SequenceParameters

The approach we recommend for accessing SequenceParameters objects is to use the following Python code

By opening your code with this line, you now have direct access to the SequenceParameters class, which takes either a string of an amino acid sequence or the filename of a file containing an amino acid sequence, which is then read and parsed. As an example

Both these code snippets create a SequenceParameters object - here that object is called SeqOb , but obviously this variable could be named anything. We can run a huge range of analysis routines on this object. The complete function list is shown below for reference.

Many of these functions don’t take arguments. Optional arguments are prefixed with a question mark (?). For each function we use the seqOb.<function> syntax - e.g.

Single value sequence analysis functions

The functions below perform various analysis over sequences and return a single value. NOTE: Where pH values can be provided, if left blank we assume a neutral pH where only R/K/D/E are charged. If a pH value is provided, then R/K/D/E/C/Y/H are all considered titratable residues using EMBOSS pKa values, listed below: ‘C’: 8.5, ‘Y’: 10.1, ‘H’: 6.5, ‘E’: 4.1, ‘D’: 3.9, ‘K’: 10.0, ‘R’: 12.5
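To make the neutral-pH convention concrete, the sketch below computes the fraction of charged residues and the net charge per residue by simple counting, with only R/K/D/E treated as charged. It is an independent illustration of the definitions, not the localCIDER implementation. The function names and the example sequence are hypothetical.

```python
POSITIVE = set("RK")
NEGATIVE = set("DE")


def fraction_charged_residues(sequence: str) -> float:
    """FCR: fraction of residues that are R, K, D or E."""
    charged = sum(1 for aa in sequence if aa in POSITIVE or aa in NEGATIVE)
    return charged / len(sequence)


def net_charge_per_residue(sequence: str) -> float:
    """NCPR: (number of R/K minus number of D/E) divided by sequence length."""
    positive = sum(1 for aa in sequence if aa in POSITIVE)
    negative = sum(1 for aa in sequence if aa in NEGATIVE)
    return (positive - negative) / len(sequence)


seq = "MKRHDEEKGGSY"  # hypothetical sequence
print(fraction_charged_residues(seq), net_charge_per_residue(seq))
```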

Function name Operation
get_length() Get the sequence length
get_FCR(pH=None) Get the fraction of charged residues in the sequence  (pH keyword allows for a pH specific value)
get_NCPR(pH=None) Get the net charge per residue of the sequence 
get_isoelectric_point() Get the isoelectric point of the sequence
get_molecular_weight() Get the molecular weight of the protein associated with a given amino acid sequence
get_countNeg() Get the number of negatively charged residues in the sequence (D/E)
get_countPos() Get the number of positively charged residues in the sequence (R/K)
get_countNeut() Get the number of neutral amino acids
get_fraction_negative() Get the fraction of residues which are negatively charged (F-)
get_fraction_positive() Get the fraction of residues which are positively charged (F+)
get_fraction_expanding(pH=None) Get the fraction of residues which are predicted to contribute to chain expansion (E/D/R/K/P)
get_amino_acid_fractions() Get a dictionary of the fractions of each amino acid in the sequence
get_fraction_disorder_promoting() Get the fraction of residues predicted to be ‘disorder promoting’ . Note this is NOT a disorder prediction!
get_kappa() Get the sequence’s kappa value 
get_Omega() Get the sequence’s Omega value. Omega defines the patterning between charged/proline residues and all other residues [14].
get_mean_net_charge(pH=None) Get the absolute mean net charge of your sequence
get_phase_plot_region() Get the region on the Das-Pappu diagram of states where your sequence falls 
get_mean_hydropathy() Get the mean hydropathy as calculated from a skewed Kyte-Doolittle hydrophobicity scale* 
get_uversky_hydropathy() Get the mean hydropathy as calculated from a normalized Kyte-Doolittle hydrophobicity scale** [3,4]
get_PPII_propensity(mode='hilser') Get the overall sequence’s PPII propensity as defined by one of three PPII propensity scales. By default, the scale by Elam et al. is used, but modes creamer and kallenbach are also available, which use values from scales by Rucker or Shi, respectively [12,13].
get_delta() Returns the delta value of the sequence, as defined when calculating kappa
get_deltaMax() Returns the maximum possible delta value (delta-max) for a sequence of this composition

* The skewed hydrophobicity scale shifts the normal KD scale such that the lowest value is 0 (instead of -4.5) and the highest value is 9 (instead of 4.5)

** The normalized Kyte-Doolittle scale converts all values on the scale to fall between 0 and 1
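
The two rescalings described in the footnotes above can be sketched as follows. This is a minimal illustration and not localCIDER's own code: the function names skewed_kd, normalized_kd and mean_hydropathy are hypothetical, and the table holds the standard Kyte-Doolittle values.

```python
# Standard Kyte-Doolittle hydropathy values (range -4.5 to 4.5).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def skewed_kd(residue):
    """Shift the KD value so the scale runs from 0 to 9 (hypothetical helper)."""
    return KD[residue] + 4.5


def normalized_kd(residue):
    """Rescale the KD value so the scale runs from 0 to 1 (hypothetical helper)."""
    return (KD[residue] + 4.5) / 9.0


def mean_hydropathy(sequence, scale=skewed_kd):
    """Average the chosen per-residue scale over the whole sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(scale(r) for r in sequence.upper()) / len(sequence)
```

For example, mean_hydropathy("RI") averages the lowest value (R) and the highest value (I) on the scale, so it lands at the midpoint of whichever scale is passed in.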

Position-specific sequence analysis functions

The following functions generate an array of values which describes some property associated with the sequence as a function of sequence position.

complexityType defines the complexity measure being employed. localCIDER provides three complexity measures, selected by passing one of the strings ‘WF’, ‘LC’, or ‘LZW’. WF is Wootton-Federhen complexity, which reports on the sequence’s local Shannon entropy and is the complexity measure used in the SEG algorithm. LC is linguistic complexity, which reports the number of distinct subsequences observed relative to the maximum number of different subsequences possible given the alphabet size and the word size. Finally, LZW is Lempel-Ziv-Welch complexity, which effectively asks how efficiently the sequence can undergo lossless compression using unique subsequences.
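
As a rough illustration of an entropy-based window complexity of the WF kind, the sketch below computes the Shannon entropy of each window's composition. The function names are hypothetical, and dividing by log(alphabet_size) is an assumption made here so that values fall between 0 and 1; the exact normalization used by localCIDER is not stated above.

```python
import math
from collections import Counter


def window_entropy(window, alphabet_size=20):
    """Shannon entropy of the residue composition of one window.

    Dividing by log(alphabet_size) is an assumption of this sketch.
    """
    counts = Counter(window)
    n = len(window)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(alphabet_size)


def linear_complexity(sequence, window_size=10, step_size=1):
    """Entropy of each window, moving along the sequence in steps of step_size."""
    seq = sequence.upper()
    if not 1 <= window_size <= len(seq):
        raise ValueError("window_size must be between 1 and the sequence length")
    return [
        window_entropy(seq[i:i + window_size])
        for i in range(0, len(seq) - window_size + 1, step_size)
    ]
```

A homopolymer window scores zero, while a window in which every residue is different scores the maximum possible for that window length.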

alphabetSize defines the size of the alphabet being used; a pre-defined alphabet is selected based on that size. Those pre-defined alphabets are defined below this table for clarity. By default an alphabetSize of 20 is used (i.e. no reduction in amino acid complexity). userAlphabet allows the user to define their own reduced alphabet. The format is a dictionary in which each key-value pair maps an amino acid to X. This means you need a dictionary of length 20 where each amino acid is mapped to another amino acid. This is somewhat tedious, but it helps avoid user error where specific amino acids are missed (default=None).
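
A user-defined reduced alphabet of the kind described above could be built and checked along these lines. The helper names and the two-letter hydrophobic/polar grouping are hypothetical examples, not alphabets shipped with localCIDER.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def validate_user_alphabet(user_alphabet):
    """Check that all 20 amino acids are mapped, each to an amino acid letter."""
    missing = sorted(set(AMINO_ACIDS) - set(user_alphabet))
    if missing:
        raise ValueError("amino acids missing from alphabet: " + ", ".join(missing))
    extra = sorted(set(user_alphabet) - set(AMINO_ACIDS))
    if extra:
        raise ValueError("unexpected keys in alphabet: " + ", ".join(extra))
    bad = sorted(v for v in set(user_alphabet.values()) if v not in AMINO_ACIDS)
    if bad:
        raise ValueError("values must be amino acid letters, got: " + ", ".join(bad))


def reduce_sequence(sequence, user_alphabet):
    """Rewrite the sequence in the reduced alphabet."""
    validate_user_alphabet(user_alphabet)
    return "".join(user_alphabet[r] for r in sequence.upper())


# Hypothetical two-letter alphabet: hydrophobic residues map to L, all others to S.
TWO_LETTER = {aa: ("L" if aa in "AVILMFWC" else "S") for aa in AMINO_ACIDS}
```

Building the dictionary with a comprehension over all 20 residues, as in TWO_LETTER, is one way to avoid accidentally leaving a residue out.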

Reduced alphabets

Predefined alphabets are shown below; all except eleven are based on alphabets defined in the reference below.

Phosphorylation functions

The following functions augment your sequence to consider the impact of phosphorylation on the electrostatic properties. Note this makes the highly simplifying assumption that the phosphorylation of a Ser/Thr/Tyr residue simply adds a negative charge to your protein chain. In reality, many other properties of the chain are impacted by phosphorylation than simply the linear charge patterning.
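
The simplifying assumption above can be sketched as follows: each phosphorylated Ser/Thr/Tyr is counted as one extra negative charge when computing the net charge per residue (NCPR) and the fraction of charged residues (FCR). The function name, the 0-based position indexing, and the choice of R/K as positive and D/E as negative are assumptions of this sketch rather than localCIDER's API.

```python
def charge_summary(sequence, phosphosites=()):
    """Return (NCPR, FCR), counting each phosphosite as one added negative charge.

    phosphosites holds 0-based positions (an assumption of this sketch).
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    sites = set(phosphosites)
    for i in sites:
        if seq[i] not in "STY":
            raise ValueError(f"position {i} is {seq[i]}, not Ser/Thr/Tyr")
    positive = sum(r in "RK" for r in seq)
    negative = sum(r in "DE" for r in seq) + len(sites)
    n = len(seq)
    return (positive - negative) / n, (positive + negative) / n
```

For the toy sequence "KKSA", phosphorylating the serine lowers the NCPR from 0.5 to 0.25 and raises the FCR from 0.5 to 0.75.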

Miscellaneous functions

The functions below represent a variety of miscellaneous functions.

Function name Operation
get_HTMLColorString()
Returns a fully formatted HTML string which can be used to represent your sequence. The coloring used has a default, but can be defined using the set_HTMLColorResiduePalette function
set_HTMLColorResiduePalette(colorDictionary)
Allows you to define a custom color palette. The colorDictionary must be a dictionary object that maps each of the 20 amino acids to a color. Currently 17 possible colors can be assigned to the 20 amino acids. These are aqua, black, blue, fuchsia, gray, green, lime, maroon, navy, olive, orange, purple, red, silver, teal, white, and yellow. This set of 17 colors represents the HTML browser compatible set of colors.
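
To illustrate what a color dictionary and the resulting HTML string might look like, here is a small standalone sketch. The span-based markup, the function name and the example palette are all hypothetical; the actual markup produced by get_HTMLColorString is not documented above.

```python
ALLOWED_COLORS = {
    "aqua", "black", "blue", "fuchsia", "gray", "green", "lime", "maroon",
    "navy", "olive", "orange", "purple", "red", "silver", "teal", "white",
    "yellow",
}

# Hypothetical palette: one color per residue group, expanded to all 20 residues.
GROUP_COLORS = {
    "KR": "blue", "DE": "red", "AVILMFWY": "black",
    "STNQGHC": "green", "P": "fuchsia",
}
EXAMPLE_PALETTE = {aa: color for group, color in GROUP_COLORS.items() for aa in group}


def html_color_string(sequence, palette):
    """Wrap each residue in a colored span (markup is an assumption of this sketch)."""
    parts = []
    for residue in sequence.upper():
        color = palette[residue]
        if color not in ALLOWED_COLORS:
            raise ValueError(f"{color!r} is not one of the 17 allowed colors")
        parts.append(f'<span style="color:{color}">{residue}</span>')
    return "".join(parts)
```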

Sequence permutation functions

These functions perform some operation on the sequence, returning a permuted SequenceParameter object populated with a different sequence. The underlying sequence object the function is called on is not altered.

Function name Operation
get_shuffle(frozen=None)
Returns a SequenceParameter object with the primary amino acid sequence shuffled. If residue index positions are passed as a list to ‘frozen’, those residues are considered immutable and are not shuffled.
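
The idea of shuffling while holding some positions fixed can be sketched on plain strings as below. This is not the localCIDER implementation: the function name, the seed argument and the 0-based indexing of frozen positions are assumptions made for illustration.

```python
import random


def shuffle_sequence(sequence, frozen=None, seed=None):
    """Return a shuffled copy of sequence, leaving frozen positions untouched.

    frozen is a list of 0-based positions (an assumption of this sketch).
    The input string itself is never modified.
    """
    rng = random.Random(seed)
    frozen_positions = set(frozen or ())
    residues = list(sequence)
    free = [i for i in range(len(residues)) if i not in frozen_positions]
    pool = [residues[i] for i in free]
    rng.shuffle(pool)  # permute only the residues that are free to move
    for position, residue in zip(free, pool):
        residues[position] = residue
    return "".join(residues)
```

Because only the free residues are permuted among the free positions, the overall composition is preserved and the frozen residues keep both their identity and their location.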

Plotting functions (on-screen ‘show’ functions)

The following functions let you plot parameters from your sequence and display the results immediately on screen.

Function name Operation
show_phaseDiagramPlot(label="", title='Diagram of states', legendOn=True, xLim=1, yLim=1, fontSize=10, getFig=False) Renders a matplotlib Das-Pappu diagram of states plot with your sequence on the diagram. If a label is provided this is a string which annotates your sequence on the plot. If a title is provided this sets the plot title. legendOn defines if the region labels are included as a legend. xLim and yLim define the max values for the X and Y axes. fontSize defines the size of the label font. getFig defines if a matplotlib object is returned instead of being rendered on screen.
show_uverskyDiagramPlot(label="", title='Uversky Plot', legendOn=True, xLim=1, yLim=1, fontSize=10, getFig=False) Renders a matplotlib Uversky plot with your sequence on the diagram. label can be a string which labels your sequence on the plot. If a title is provided this sets the plot title. legendOn defines if the region labels are included as a legend. xLim and yLim define the max values for the X and Y axes. fontSize defines the size of the label font. getFig defines if a matplotlib object is returned instead of being rendered.
show_linearHydropathy(blobLen=5, getFig=False) Renders a matplotlib plot of the moving average hydropathy along the sequence, where the hydropathy is calculated in overlapping windows of size blobLen. Typically a blob length of 5-7 is used. getFig defines if a matplotlib object is returned instead of being rendered.
show_linearNCPR(blobLen=5, getFig=False) Renders a matplotlib plot of the moving average net charge per residue (NCPR) along the sequence, where the NCPR is calculated in overlapping windows of size blobLen. Typically a blob length of 5-7 is used. getFig defines if a matplotlib object is returned instead of being rendered.
show_linearFCR(blobLen=5, getFig=False) Renders a matplotlib plot of the moving average fraction of charged residues (FCR) along the sequence, where the FCR is calculated in overlapping windows of size blobLen. Typically a blob length of 5-7 is used. getFig defines if a matplotlib object is returned instead of being rendered.
show_linearSigma(blobLen=5, getFig=False) Renders a matplotlib plot of the moving average sigma parameter along the sequence, where sigma is calculated in overlapping windows of size blobLen. Typically a blob length of 5-7 is used. Recall that sigma is calculated as NCPR^2 / FCR. getFig defines if a matplotlib object is returned instead of being rendered.
show_linearComplexity(complexityType='WF', alphabetSize=20, userAlphabet=<>, windowSize=10, stepSize=1, wordSize=3, getFig=False) Renders a matplotlib plot of the linear sequence complexity. For a discussion of the various options see the get_linear_complexity description under the SequenceParameters functions table. getFig defines if a matplotlib object is returned instead of being rendered.
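
The quantities plotted by the linear NCPR, FCR and sigma functions above can be sketched without any plotting as follows. This is a minimal standalone illustration: the function name is hypothetical, R/K are counted as positive and D/E as negative, only full windows are used (how localCIDER treats the sequence ends is not stated above), and sigma is set to 0 for a window with no charged residues, which is an assumption of this sketch.

```python
def window_charge_profiles(sequence, blob_len=5):
    """Return (NCPR, FCR, sigma) lists over overlapping windows of blob_len."""
    seq = sequence.upper()
    if not 1 <= blob_len <= len(seq):
        raise ValueError("blob_len must be between 1 and the sequence length")
    ncpr, fcr, sigma = [], [], []
    for start in range(len(seq) - blob_len + 1):
        blob = seq[start:start + blob_len]
        positive = sum(r in "RK" for r in blob)
        negative = sum(r in "DE" for r in blob)
        net = (positive - negative) / blob_len
        charged = (positive + negative) / blob_len
        ncpr.append(net)
        fcr.append(charged)
        # sigma = NCPR^2 / FCR; defined as 0 here when the window has no charges
        sigma.append(net * net / charged if charged > 0 else 0.0)
    return ncpr, fcr, sigma
```

A window with equal numbers of positive and negative residues has an NCPR and a sigma of zero even though its FCR is non-zero, which is why the three profiles are worth examining together.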

Plotting functions (file-creating ‘save’ functions)

The following functions let you plot parameters from your sequence and save those plots to file for future use.

Function name Operation
save_phaseDiagramPlot(filename, label='', title='Diagram of states', legendOn=True, xLim=1, yLim=1, fontSize=10, saveFormat='png') Generates a matplotlib Das-Pappu diagram of states plot which is then saved to disk. filename is required and defines the file to be saved. Adding extensions is recommended but not required. All options are the same as in show_phaseDiagramPlot, with the addition of the saveFormat keyword, which defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz. (DEFAULT = png)
save_uverskyPlot(filename, label='', title='Uversky plot', legendOn=True, xLim=1, yLim=1, fontSize=10, saveFormat='png') Generates a matplotlib Uversky plot with your sequence on the diagram, which is then saved to disk. filename is required and defines the file to be saved. Adding extensions is recommended but not required. All options are the same as in show_uverskyPlot, with the addition of the saveFormat keyword, which defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz. (DEFAULT = png)
save_linearHydropathy(filename, blobLen=5, saveFormat='png') Renders a matplotlib plot of the moving average hydropathy along the sequence, where the hydropathy is calculated in overlapping windows of size blobLen. Typically 5-7 is used. The plot is saved in the filename location. Adding extensions is recommended but not required. The saveFormat keyword defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.
save_linearNCPR(filename, blobLen=5, saveFormat='png') Renders a matplotlib plot of the moving average net charge per residue (NCPR) along the sequence, where the NCPR is calculated in overlapping windows of size blobLen. Typically 5-7 is used. The plot is saved in the filename location. Adding extensions is recommended but not required. The saveFormat keyword defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.
save_linearFCR(filename, blobLen=5, saveFormat='png') Renders a matplotlib plot of the moving average fraction of charged residues (FCR) along the sequence, where the FCR is calculated in overlapping windows of size blobLen. Typically 5-7 is used. The plot is saved in the filename location. Adding extensions is recommended but not required. The saveFormat keyword defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.
save_linearSigma(filename, blobLen=5, saveFormat='png') Renders a matplotlib plot of the moving sigma value, where sigma defines the local charge asymmetry and is used in the calculation of kappa. Sigma is calculated over blobs of blobLen size, typically with blobs of 5-7 residues. The plot is saved in the filename location. Adding extensions is recommended but not required. The saveFormat keyword defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.
save_linearComplexity(filename, complexityType='WF', alphabetSize=20, userAlphabet=<>, windowSize=10, stepSize=1, wordSize=3, saveFormat='png') Renders a matplotlib plot of the linear sequence complexity. For a discussion of the various options see the get_linear_complexity description under the SequenceParameters functions table. The plot is saved in the filename location. Adding extensions is recommended but not required. The saveFormat keyword defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.
save_linearCoposition(filename, blobLen=5, saveFormat='png', title='', plot_data=False) Renders a matplotlib plot of the local, position-specific amino acid composition. In version 0.1.9 the only grouping is the standard physicochemical grouping of amino acids, but in future versions we plan to add customizable groups to the plotting functions (customizable groups are available in the equivalent analysis function get_linear_sequence_composition). The local density is initially calculated and then fit to a univariate spline to remove noise and make the local sequence features more easily identifiable. If the plot_data variable is set to True, the raw data is plotted alongside this spline fit, to ensure the fitting procedure is capturing the relevant sequence features. The plot is saved in the filename location. Adding extensions is recommended but not required. The saveFormat keyword defines the output format - this parameter is passed to matplotlib's savefig command which supports the following filetypes: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.

Abstract

Protein interactions of α-chymotrypsinogen A (aCgn) were quantified using light scattering from low to high protein concentrations. Static light scattering (SLS) was used to determine the excess Rayleigh ratio (R_ex) and osmotic second virial coefficients (B22) as a function of pH and total ionic strength (TIS). Repulsive (attractive) protein–protein interactions (PPI) were observed at pH 5 (pH 7), with decreasing repulsions (attractions) upon increasing TIS. Simple colloidal potential of mean force models (PMF) that account for short-range nonelectrostatic attractions and screened electrostatic interactions were used to fit model parameters from data for B22 vs TIS at both pH values. The parameters and PMF models from low-concentration conditions were used as the sole input to transition matrix Monte Carlo simulations to predict high concentration R_ex behavior. At conditions where PPI are repulsive to slightly attractive, experimental R_ex data at high concentrations could be predicted quantitatively by the simulations. However, accurate predictions were challenging when PPI were strongly attractive due to strong sensitivity to changes in PMF parameter values. Additional simulations with higher-resolution coarse-grained molecular models suggest an approach to qualitatively predict cases when anisotropic surface charge distributions will lead to overall attractive PPI at low ionic strength, without assumptions regarding electrostatic “patches” or multipole expansions.