There are roughly 10,000 to 20,000 protein species in the human proteome (although I've also seen numbers of 500,000 to 1,000,000). Furthermore, there are roughly 200 different cell types in the human body. My question is:
How are the protein species distributed over the cell types?
This means specifically:
How many proteins are expressed by n cell types?
How many cell types express n proteins?
These numbers are probably not known exactly, because not all the proteins that a given cell type expresses may be known.
But there might be evidence for the general form of the two distribution curves. Are they Poisson distributions?
Or some other kind of distribution, e.g. Gaussian or even multi-modal?
From my searches, there is no single resource that includes an atlas of all human proteins produced across all human cell types. However, there are several recent mass spectrometry studies that look at cell-type-resolved proteomes for specific human and mouse tissues that provide some insight into the distribution of proteins across different cells.
A Cell-type-resolved Liver Proteome
This study identified between 6200 and 8500 mouse gene products in each of four cell types -- hepatocytes, hepatic stellate cells, Kupffer cells, and liver sinusoidal endothelial cells -- and a total of 10,075 gene products across all four cell types. Figure 1D shows that there is significant overlap between the proteomes; 5,246 proteins (52.1%) are shared by all four cell types, and only 1,451 proteins (14.4%) are cell-type exclusive in this set. Besides cell-resident proteins, the authors also report secreted proteins unique to hepatocytes and Kupffer cells. These results by-and-large corroborate an earlier publication, Cell-Type-Resolved Quantitative Proteomics of Murine Liver, where the authors reported 8,338 of 11,520 (72.4%) proteins were common to the five hepatic cell types tested.
Cell type- and brain region-resolved mouse brain proteome
This study looked at four types of isolated neurons from specific regions of the mouse brain as well as five types of primary cultured neurons. Of 13,061 total proteins, 10,529 (80.6%) are common to the five cultured cell types tested, and only 194 (1.4%) are unique to a single cell type. Figure 2D makes clear that protein abundance across cell types - not just binary presence/absence - is important to cell identity. (More on that below… )
Region and cell-type resolved quantitative proteomic map of the human heart
Getting to the human proteins mentioned in your question, this publication analyzed three cardiac cell types and adipose fibroblasts across 16 anatomical regions, giving both spatial and functional perspectives on protein type distribution in the human heart. Figure 5A, like Figure 1D for the mouse liver proteome paper, shows a great amount of overlap of proteins between cell types: of 11,163 total proteins, 7,965 (71.4%) are common to all cell types (including adipose fibroblasts), and 617 (5.5%) are unique to one type of cell. Interestingly, and perhaps not surprisingly, the subsets of proteins unique to specific cells are enriched for cell surface markers. (Related: The in silico human surfaceome)
Social network architecture of human immune cells unveiled by quantitative proteomics
These authors identified more than 10,000 different proteins across 28 primary human hematopoietic cell populations, with three or four biological replicates per cell type, including 17 distinct types of immune cells. They identified an average of 9,500 proteins per cell type, and, for the immune cells, generated both "steady state" and "activated" proteomes, giving insight into how the proteome changes both across cell types and within each type between states.
To directly answer your question,
How are the protein species distributed over the cell types?
I took the data from Supplementary Table 6, averaged the replicates, converted protein copy-number values to a binary presence/absence matrix, and computed the number of immune cell types (1 - 17) represented by each unique protein ID. The inset graph combines cell type and activation state labels to ask whether the effect of activation dominates over cell-type differences in proteome divergence.
Presence/Absence, no threshold -- Unique protein distribution across 17 primary immune cell types. A protein is considered "present" in a cell type if its average copy-number value is greater than zero. Data from Rieckmann et al. Nat Immunol. 2017.
Because the inferred protein copy-numbers from the mass spectrometry data cover a high dynamic range, I also looked at the distribution of proteins that had average copy-number values at least double the zero-depleted cell-type-specific median copy-number value.
Highly abundant proteins -- Unique protein distribution across 17 primary immune cell types. Protein-cell pairs are only counted if the average protein copy-number is at least double the median protein copy-number for each cell type. Data from Rieckmann et al. Nat Immunol. 2017.
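For readers who want to repeat this kind of count on their own data, here is a minimal sketch of the two steps described above: binarize copy numbers per cell type, then tally how many cell types each protein appears in. The protein names, cell-type labels, and copy numbers are invented toy values, not values from Rieckmann et al., and the helper names `count_cell_types` and `distribution` are hypothetical.

```python
from collections import Counter
from statistics import median

# Hypothetical toy data: average copy number per protein per cell type.
# These numbers are invented for illustration only.
copy_numbers = {
    "P1": {"T cell": 1200.0, "B cell": 900.0, "NK cell": 1500.0},
    "P2": {"T cell": 0.0, "B cell": 40.0, "NK cell": 0.0},
    "P3": {"T cell": 300.0, "B cell": 0.0, "NK cell": 250.0},
    "P4": {"T cell": 50.0, "B cell": 60.0, "NK cell": 0.0},
}
cell_types = ["T cell", "B cell", "NK cell"]


def count_cell_types(data, keep):
    """For each protein, count the cell types for which keep(cell_type, value) is true."""
    return {
        protein: sum(keep(ct, value) for ct, value in values.items())
        for protein, values in data.items()
    }


def distribution(counts):
    """Map 'number of cell types' to 'number of proteins'; proteins counted in no cell type are dropped."""
    return dict(sorted(Counter(n for n in counts.values() if n > 0).items()))


# Analysis 1: strict presence/absence (copy number greater than zero).
presence = distribution(count_cell_types(copy_numbers, lambda ct, v: v > 0))

# Analysis 2: keep a protein-cell pair only if the copy number is at least
# double the zero-depleted median copy number of that cell type.
cutoffs = {
    ct: 2 * median(v[ct] for v in copy_numbers.values() if v[ct] > 0)
    for ct in cell_types
}
abundant = distribution(count_cell_types(copy_numbers, lambda ct, v: v >= cutoffs[ct]))

print(presence)  # number of cell types -> number of proteins
print(abundant)
```

Plotting either dictionary as a bar chart gives the kind of distribution shown in the figures; with real data the keys would run from 1 to 17.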
Taken together, these distributions suggest a few things:
- for this diverse set of immune cells, a majority of proteins are present in all cell types when looking at strict presence/absence data
- after subsetting for highly abundant proteins, a bimodal distribution appears, suggesting that protein abundance is a better metric for functional comparison of proteomes than a simple binary metric
- the proteomes of one cell type between different activation states are more similar than the proteomes of different cell types in the same state
For anyone interested, I've made the cleaned-up data available on Dropbox.
A more complete answer might combine the data from the immune cell and heart cell publications to get a sense of proteome concordance across tissues, but, just from a cursory analysis, differences in protein labels between the datasets would make comparison tedious, so I'll leave that to someone else!
4 Mixture Models
One of the main challenges of biological data analysis is dealing with heterogeneity. The quantities we are interested in often do not show a simple, unimodal “textbook distribution”. For example, in the last part of Chapter 2 we saw how the histogram of sequence scores in Figure 2.25 had two separate modes, one for CpG-islands and one for non-islands. We can see the data as a simple mixture of a few (in this case: two) components. We call these finite mixtures. Other mixtures can involve almost as many components as we have observations. These we call infinite mixtures. (We will see that, as for so many modeling choices, the right complexity of the mixture is in the eye of the beholder and often depends on the amount of data and the resolution and smoothness we want to attain.)
In Chapter 1 we saw how a simple generative model with a Poisson distribution led us to make useful inference in the detection of an epitope. Unfortunately, a satisfactory fit to real data with such a simple model is often out of reach. However, simple models such as the normal or Poisson distributions can serve as building blocks for more realistic models using the mixing framework that we cover in this chapter. Mixtures occur naturally for flow cytometry data, biometric measurements, RNA-Seq, ChIP-Seq, microbiome and many other types of data collected using modern biotechnologies. In this chapter we will learn from simple examples how to build more realistic models of distributions using mixtures.
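As a small illustration of a finite mixture, the sketch below draws values from a two-component normal mixture: each observation first picks a component with a fixed probability, then samples from that component. The weight, means, and standard deviations are arbitrary assumptions chosen so that the two modes are clearly separated, and `sample_two_component_mixture` is a hypothetical helper name.

```python
import random


def sample_two_component_mixture(n, weight_a, mean_a, sd_a, mean_b, sd_b, seed=0):
    """Draw n values: component A with probability weight_a, otherwise component B."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < weight_a:
            samples.append(rng.gauss(mean_a, sd_a))
        else:
            samples.append(rng.gauss(mean_b, sd_b))
    return samples


# Assumed parameters: 30% of values from N(-2, 0.5), 70% from N(3, 0.5).
scores = sample_two_component_mixture(10_000, 0.3, -2.0, 0.5, 3.0, 0.5)

# A histogram of `scores` would show two separate modes, one per component.
share_low_mode = sum(s < 0.5 for s in scores) / len(scores)
print(round(share_low_mode, 2))
```

The share of values in the lower mode should land close to the mixing weight, which is the quantity a mixture-fitting method would try to recover from the data alone.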
Molecular Biology of the Cell. 4th edition.
From a chemical point of view, proteins are by far the most structurally complex and functionally sophisticated molecules known. This is perhaps not surprising, once one realizes that the structure and chemistry of each protein has been developed and fine-tuned over billions of years of evolutionary history. We start this chapter by considering how the location of each amino acid in the long string of amino acids that forms a protein determines its three-dimensional shape. We will then use this understanding of protein structure at the atomic level to describe how the precise shape of each protein molecule determines its function in a cell.
How Do Cellular Innovations Arise?
For practical reasons, cell biology has historically focused on the average features of the members of large populations of genetically uniform cells. However, natural selection does not operate directly on population means but on variation among individuals. Moreover, the evolutionary response to selection on a trait is not a simple matter of variation, but a function of the fraction of variation that has a genetic basis. Estimation of these key parameters is now within reach as new technologies allow assays of single cells in a high-throughput manner. Applications of these methods to genetically uniform populations reveal substantial cell-to-cell variation in gene-specific numbers of transcripts and proteins in all domains of life (43–45), and such variation (intrinsic cellular noise) seems to be a natural outcome of biophysical features of interactions between transcription factors and their binding sites, which can be quantified in mechanistic terms (46, 47). These kinds of observations, which can be extended to other intracellular traits (48), are essential to understanding the limits to the evolvability of cellular features. This is because environmental variance (intracellular noise) reduces the ability of a population to respond to selection by overshadowing the heritable genetic component of variation (49).
Although conceptually straightforward, resolving the degree to which variation (and covariation) of phenotypes in populations of cells is a consequence of genetic vs. environmental causes will require large-scale experimental designs including genetically variable isolates. When applied in this way, single-cell phenotyping down to the level of individual molecules has the potential to revolutionize the field of quantitative genetics by elucidating the precise sources of variation underlying the expression of higher-order cellular features. Notably, the statistical framework of quantitative genetics is also fully equipped to address the evolutionary consequences of transient epigenetic effects (49), whose influences are dissipated over time with various levels of reinforcement (e.g., refs. 50–52).
3. Comparisons of more than two means
The two-sample t-test works well in situations where we need to determine if differences exist between two populations for which we have sample means. But what happens when our analyses involve comparisons between three or more separate populations? Here things can get a bit tricky. Such scenarios tend to arise in one of several ways. Most common to our field is that we have obtained a collection of sample means that we want to compare with a single standard. For example, we might measure the body lengths of young adult-stage worms grown on RNAi-feeding plates, each targeting one of 100 different collagen genes. In this case, we would want to compare mean lengths from animals grown on each plate with a control RNAi that is known to have no effect on body length. On the surface, the statistical analysis might seem simple: just carry out 100 two-sample t-tests where the average length from each collagen RNAi plate is compared with the same control. The problem with this approach is the unavoidable presence of false-positive findings (also known as Type I errors). The more t-tests you run, the greater the chance of obtaining a statistically significant result through chance sampling. Even if all the populations were identical in their lengths, 100 t-tests would result on average in five RNAi clones showing differences supported by P-values of <0.05, including one clone with a P-value of <0.01. This type of multiple comparisons problem is common in many of our studies and is a particularly prevalent issue in high-throughput experiments such as microarrays, which typically involve many thousands of comparisons.
Taking a slightly different angle, we can calculate the probability of incurring at least one false significance in situations of multiple comparisons. For example, with just two t-tests and a significance threshold of 0.05, there would be a roughly 10% chance that we would obtain at least one P-value that was <0.05 just by chance [1 − (0.95)^2 = 0.0975]. With just fourteen comparisons, that probability leaps to >50% [1 − (0.95)^14 = 0.512]. With 100 comparisons, there is a 99% chance of obtaining at least one statistically significant result by chance. Using probability calculators available on the web (also see Section 4.10), we can determine that for 100 tests there is a 56.4% chance of obtaining five or more false positives and a 2.8% chance of obtaining ten or more. Thinking about it this way, we might be justifiably concerned that our studies may be riddled with incorrect conclusions! Furthermore, reducing the chosen significance threshold to 0.01 will only help so much. In this case, with 50 comparisons, there is still a roughly 40% probability [1 − (0.99)^50 ≈ 0.395] that at least one comparison will sneak below the cutoff by chance. Moreover, by reducing our threshold, we run the risk of discarding results that are both statistically and biologically significant.
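The probabilities quoted above can be reproduced in a few lines. This is a minimal sketch that assumes the tests are independent and that every null hypothesis is true, so the number of false positives follows a binomial distribution; the function names are hypothetical.

```python
from math import comb


def prob_at_least_one(alpha, n_tests):
    """Chance of one or more false positives among n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def prob_at_least_k(alpha, n_tests, k):
    """Binomial tail: chance of k or more false positives among n independent tests."""
    below_k = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i) for i in range(k)
    )
    return 1 - below_k


for n in (2, 14, 100):
    print(n, round(prob_at_least_one(0.05, n), 4))

print(round(prob_at_least_k(0.05, 100, 5), 3))   # five or more false positives
print(round(prob_at_least_k(0.05, 100, 10), 3))  # ten or more false positives
print(round(prob_at_least_one(0.01, 50), 3))     # stricter threshold, 50 tests
```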
A related but distinct situation occurs when we have a collection of sample means, but rather than comparing each of them to a single standard, we want to compare all of them to each other. As in the case of multiple comparisons to a single control, the problem lies in the sheer number of tests required. With only five different sample means, we would need to carry out 10 individual t-tests to analyze all possible pair-wise comparisons [5(5 − 1)/2 = 10]. With 100 sample means, that number skyrockets to 4,950 [100(100 − 1)/2 = 4,950]. Based on a significance threshold of 0.05, this would lead to about 248 statistically significant results occurring by mere chance! Obviously, common sense, as well as the use of specialized statistical methods, will come into play when dealing with these kinds of scenarios. In the sections below, we discuss some of the underlying concepts and describe several practical approaches for handling the analysis of multiple means.
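The counting argument for all-against-all comparisons is easy to check. The sketch below computes the number of pairwise tests for a given number of sample means and the expected number of chance "hits", assuming every population is in fact identical; the function names are hypothetical.

```python
from math import comb


def n_pairwise_tests(n_means):
    """Number of distinct pairs among n_means samples, i.e. n(n - 1)/2."""
    return comb(n_means, 2)


def expected_false_positives(n_means, alpha=0.05):
    """Expected count of chance significances if no real differences exist."""
    return alpha * n_pairwise_tests(n_means)


print(n_pairwise_tests(5))            # pairs among five means
print(n_pairwise_tests(100))          # pairs among 100 means
print(expected_false_positives(100))  # expected chance hits at alpha = 0.05
```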
3.2. Safety through repetition
Before delving into some of the common approaches used to cope with multiple comparisons, it is worth considering an experimental scenario that would likely not require specialized statistical methods. Specifically, we may consider the case of a large-scale “functional genomics screen”. In C. elegans, these would typically be carried out using RNAi-feeding libraries (Kamath et al., 2003) or chemical genetics (Carroll et al., 2003) and may involve many thousands of comparisons. For example, if 36 RNAi clones are ultimately identified that lead to resistance to a particular bacterial pathogen from a set of 17,000 clones tested, how does one analyze this result? No worries: the methodology is not composed of 17,000 statistical tests (each with some chance of failing). That's because the final reported tally, 36 clones, was presumably not the result of a single round of screening. In the first round (the primary screen), a larger number (say, 200 clones) might initially be identified as possibly resistant (with some “false significances” therein). A second or third round of screening would effectively eliminate virtually all of the false positives, reducing the number of clones that show a consistent biological effect to 36. In other words, secondary and tertiary screening would reduce to near zero the chance that any of the clones on the final list are in error because the chance of getting the same false positives repeatedly would be very slim. This idea of “safety through independent experimental repeats” is also addressed in Section 4.10 in the context of proportion data. Perhaps more than anything else, carrying out independent repeats is often the best way to solidify results and avoid the presence of false positives within a dataset.
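To get a rough feel for why repeated screening is so effective, the sketch below estimates how many false positives would survive successive rounds. It rests on assumptions that are not stated in the text: each round is independent, a clone with no real effect scores as a hit with a fixed per-round probability (set here to a hypothetical 0.01), and a clone is kept only if it scores in every round.

```python
def expected_false_positives(n_clones, per_round_rate, rounds):
    """Expected number of null clones that score as hits in every one of `rounds` rounds."""
    return n_clones * per_round_rate**rounds


# 17,000 clones, with an assumed (hypothetical) 1% false-hit rate per round.
for r in (1, 2, 3):
    print(r, round(expected_false_positives(17_000, 0.01, r), 3))
```

Under these assumed numbers the expected count falls by a factor of 100 with each additional round, which is the sense in which the chance of getting the same false positives repeatedly becomes very slim.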
3.3. The family-wise error rate
To help grapple with the problems inherent to multiple comparisons, statisticians came up with something called the family-wise error rate. This is also sometimes referred to as the family-wide error rate, which may provide a better description of the underlying intent. The basic idea is that rather than just considering each comparison in isolation, a statistical cutoff is applied that takes into account the entire collection of comparisons. Recall that in the case of individual comparisons (sometimes called the per-comparison or comparison-wise error rate), a P-value of <0.05 tells us that for that particular comparison, there is less than a 5% chance of having obtained a difference at least as large as the one observed by chance. Put another way, in the absence of any real difference between two populations, there is a 95% chance that we will not render a false conclusion of statistical significance. In the family-wise error rate approach, the criterion used to judge the statistical significance of any individual comparison is made more stringent as a way to compensate for the total number of comparisons being made. This is generally done by lowering the P-value cutoff (α level) for individual t-tests. When all is said and done, a P-value of <0.05 will mean that there is less than a 5% chance that the entire collection of declared positive findings contains any false positives.
We can use our example of the collagen experiment to further illustrate the meaning of the family-wise error rate. Suppose we test 100 genes and apply a family-wise error rate cutoff of 0.05. Perhaps this leaves us with a list of 12 genes that lead to changes in body size that are deemed statistically significant. This means that there is only a 5% chance that one or more of the 12 genes identified is a false positive. This also means that if none of the 100 genes really controlled body size, then 95% of the time our experiment would lead to no positive findings. Without this kind of adjustment, and using an α level of 0.05 for individual tests, the presence of one or more false positives in a data set based on 100 comparisons could be expected to happen >99% of the time. Several techniques for applying the family-wise error rate are described below.
3.4. Bonferroni-type corrections
The Bonferroni method, along with several related techniques, is conceptually straightforward and provides conservative family-wise error rates. To use the Bonferroni method, one simply divides the chosen family-wise error rate (e.g., 0.05) by the number of comparisons to obtain a Bonferroni-adjusted P-value cutoff. Going back to our example of the collagen genes, if the desired family-wise error rate is 0.05 and the number of comparisons is 100, the adjusted per-comparison significance threshold would be reduced to 0.05/100 = 0.0005. Thus, individual t-tests yielding P-values as low as 0.0006 would be declared insignificant. This may sound rather severe. In fact, a real problem with the Bonferroni method is that for large numbers of comparisons, the significance threshold may be so low that one may fail to detect a substantial proportion of true positives within a data set. For this reason, the Bonferroni method is widely considered to be too conservative in situations with large numbers of comparisons.
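In code, the uniform Bonferroni adjustment amounts to a single division. The sketch below applies it to a list of P-values; the P-values are invented for illustration and the function names are hypothetical.

```python
def bonferroni_cutoff(family_wise_rate, n_comparisons):
    """Per-comparison significance threshold under a uniform Bonferroni correction."""
    return family_wise_rate / n_comparisons


def bonferroni_significant(p_values, family_wise_rate=0.05):
    """Flag each P-value that falls below the Bonferroni-adjusted cutoff."""
    cutoff = bonferroni_cutoff(family_wise_rate, len(p_values))
    return [p < cutoff for p in p_values]


# 100 invented P-values: two small ones and 98 unremarkable ones.
p_values = [0.0004, 0.0006] + [0.5] * 98
flags = bonferroni_significant(p_values)

print(bonferroni_cutoff(0.05, 100))
print(flags[:2])  # 0.0004 passes the cutoff, 0.0006 does not
```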
Another variation on the Bonferroni method is to apply significance thresholds for each comparison in a non-uniform manner. For example, with a family-wise error rate of 0.05 and 10 comparisons, a uniform cutoff would require any given t-test to have an associated P-value of <0.005 to be declared significant. Another way to think about this is that the sum of the 10 individual cutoffs must add up to 0.05. Interestingly, the integrity of the family-wise error rate is not compromised if one were to apply a 0.04 significance threshold for one comparison, and a 0.00111 (0.01/9) significance threshold for the remaining nine. This is because 0.04 + 9(0.00111) ≈ 0.05. The rub, however, is that the decision to apply non-uniform significance cutoffs cannot be made post hoc based on how the numbers shake out! For this method to be properly implemented, researchers must first prioritize comparisons based on the perceived importance of specific tests, such as if a negative health or environmental consequence could result from failing to detect a particular difference. For example, it may be more important to detect a correlation between industrial emissions and childhood cancer rates than to detect effects on the rate of tomato ripening. This may all sound rather arbitrary, but it is nonetheless statistically valid.
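The bookkeeping behind the non-uniform variant can be sketched as follows: one prioritized comparison receives a larger share of the family-wise error rate and the remainder is split evenly among the others, so that the thresholds still sum to the chosen rate. The function name `split_alpha` is hypothetical, and the split must of course be fixed before the data are examined.

```python
def split_alpha(family_wise_rate, priority_threshold, n_comparisons):
    """Thresholds for one prioritized test plus an even split of what is left."""
    leftover = family_wise_rate - priority_threshold
    shared = leftover / (n_comparisons - 1)
    return [priority_threshold] + [shared] * (n_comparisons - 1)


thresholds = split_alpha(0.05, 0.04, 10)

print(round(thresholds[1], 5))    # threshold for each of the nine remaining tests
print(round(sum(thresholds), 5))  # the thresholds add up to the family-wise rate
```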
3.5. False discovery rates
As discussed above, the Bonferroni method runs into trouble in situations where many comparisons are being made because a substantial proportion of true positives are likely to be discarded for failing to score below the adjusted significance threshold. Stated another way, the power of the experiment to detect real differences may become unacceptably low. Benjamini and Hochberg (1995) are credited with introducing the idea of the false discovery rate (FDR), which has become an indispensable approach for handling the statistical analysis of experiments composed of large numbers of comparisons. Importantly, the FDR method has greater power than does the Bonferroni method. In the vernacular of the FDR method, a statistically significant finding is termed a “discovery”. Ultimately, the FDR approach allows the investigator to set an acceptable level of false discoveries (usually 5%), which means that any declared significant finding has a 5% chance of being a false positive. This differs fundamentally from the idea behind the family-wise model, where an error rate of 5% means that there is a 5% chance that any of the declared significant findings are false. The latter method starts from the position that no differences exist. The FDR method does not suppose this.
The FDR method is carried out by first making many pairwise comparisons and then ordering them according to their associated P-values, with lowest to highest displayed in a top to bottom manner. In the examples shown in Table 3, this was done with only 10 comparisons (for three different data sets), but this method is more commonly applied to studies involving hundreds or thousands of comparisons. What makes the FDR method conceptually unique is that each of the test-derived P-values is measured against a different significance threshold. In the example with 10 individual tests, the one giving the lowest P-value is measured against 0.005 (0.05/10). Conversely, the highest P-value is measured against 0.05. With ten comparisons, the other significance thresholds simply play out in ordered increments of 0.005 (Table 3). For example, the five significance thresholds starting from the top of the list would be 0.005, 0.010, 0.015, 0.020, and 0.025. The formula is k(α/C), where C is the number of comparisons and k is the rank order (by sorted P-values) of the comparison. If 100 comparisons were being made, the highest threshold would still be 0.05, but the lowest five in order would be 0.0005, 0.0010, 0.0015, 0.0020, and 0.0025. Having paired off each experimentally derived P-value with a different significance threshold, one checks to see if the P-value is less than the prescribed threshold. If so, then the difference is declared to be statistically significant (a discovery), at which point one moves on to the next comparison, involving the second-lowest P-value. This process continues until a P-value is found that is higher than the corresponding threshold. At that point, this and all remaining results are deemed not significant.
Table 3. Illustration of FDR method, based on artificial P-values from 10 comparisons.
The highlighted values indicate the first P-value that is larger than the significance threshold (i.e., the FDR critical value).
*Comparisons that were declared significant by the method.
Examples of how this can play out are shown in Table 3. Note that even though some of the comparisons below the first failed test may themselves be less than their corresponding significance thresholds (Data Set #3), these tests are nevertheless declared not significant. This may seem vexing, but without this property the test would not work. This is akin to a “one strike and you're out” rule. Put another way, that test, along with all those below it on the list, are declared persona non grata and asked to leave the premises!
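The ranking procedure and its “one strike and you're out” rule can be sketched in a few lines. This is a minimal illustration of the steps as described in this section, using the thresholds k(α/C); the P-values are invented and are not those of Table 3, and the function name is hypothetical.

```python
def fdr_discoveries(p_values, alpha=0.05):
    """Mark discoveries by walking up the sorted P-values and stopping at the first failure."""
    n = len(p_values)
    # Indices of the comparisons, ordered from lowest to highest P-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        threshold = rank * alpha / n  # k * (alpha / C)
        if p_values[idx] < threshold:
            significant[idx] = True
        else:
            break  # this test and all those ranked below it are not significant
    return significant


# Invented P-values for 10 comparisons.
p_values = [0.001, 0.012, 0.014, 0.019, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
flags = fdr_discoveries(p_values)
print(flags)
```

In this invented example the second-ranked value (0.012) exceeds its threshold of 0.010, so the walk stops there even though 0.014 and 0.019 lie below their own thresholds of 0.015 and 0.020. Note that many software implementations of the Benjamini-Hochberg procedure instead use a step-up rule (every test up to the largest rank that meets its threshold is declared a discovery), which can yield more discoveries than the rule sketched here.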
Although the FDR approach is not hugely intuitive, and indeed the logic is not easily tractable, it is worth considering several scenarios to see how the FDR method might play out. For example, with 100 independent tests of two populations that are identical, chance sampling would be expected to result on average in a single t-test having an associated P-value of 0.01. However, given that the corresponding significance threshold would be 0.0005, this test would not pass muster and the remaining tests would also be thrown out. Even if by chance a P-value of <0.0005 was obtained, the next likely lowest P-value on the list, 0.02, would be measured against 0.001, underscoring that the FDR method will be effective at weeding out imposters. Next, consider the converse situation: 100 t-tests carried out on two populations that are indeed different. Furthermore, based on the magnitude of the difference and the chosen sample size, we would expect to obtain an average P-value of 0.01 for all the tests. Of course, chance sampling will lead to some experimental differences that result in P-values that are higher or lower than 0.01, including on average one that is 0.0001 (0.01/100). Because this is less than the cutoff of 0.0005, this would be classified as a discovery, as will many, though not all, of the tests on this particular list. Thus, the FDR approach will also render its share of false-negative conclusions (often referred to as Type II errors). But compared with the Bonferroni method, where the significance threshold always corresponds to the lowest FDR cutoff, the proportion of these errors will be much smaller.
3.6. Analysis of variance
Entire books are devoted to the statistical method known as analysis of variance (ANOVA). This section will contain only three paragraphs. This is in part because of the view of some statisticians that ANOVA techniques are somewhat dated or at least redundant with other methods such as multiple regression (see Section 5.5). In addition, a casual perusal of the worm literature will uncover relatively scant use of this method. Traditionally, an ANOVA answers the following question: are any of the mean values within a dataset likely to be derived from populations that are truly different? Correspondingly, the null hypothesis for an ANOVA is that all of the samples are derived from populations whose means are identical and that any differences in their means are due to chance sampling. Thus, an ANOVA will implicitly compare all possible pairwise combinations of samples to each other in its search for differences. Notably, in the case of a positive finding, an ANOVA will not directly indicate which of the populations are different from each other. An ANOVA tells us only that at least one sample is likely to be derived from a population that is different from at least one other population.
Because such information may be less than totally satisfying, an ANOVA is often used in a two-tiered fashion with other tests; these latter tests are sometimes referred to as post hoc tests. In cases where an ANOVA suggests the presence of different populations, t-tests or other procedures (described below) can be used to identify differences between specific populations. Moreover, so long as the P-value associated with the ANOVA is below the chosen significance threshold, the two means that differ by the greatest amount are assured of being supported by further tests. The converse, however, is not true. Namely, it is possible to “cherry pick” two means from a data set (e.g., those that differ by the greatest amount) and obtain a P-value that is <0.05 based on a t-test even if the P-value of the ANOVA (which simultaneously takes into account all of the means) is >0.05. Thus, ANOVA will provide a more conservative interpretation than t-tests using chosen pairs of means. Of course, focusing on certain comparisons may be perfectly valid in some instances (see discussion of planned comparisons below). In fact, it is generally only in situations where there is insufficient structure among treatment groups to inspire particular comparisons where ANOVA is most applicable. In such cases, an insignificant ANOVA finding might indeed be grounds for proceeding no further.
In cases of a positive ANOVA finding, a commonly used post hoc method is Tukey's test, which goes by a number of different names including Tukey's honest significant difference test and the Tukey-Kramer test. The output of this test is a list of 95% CIs of the differences between means for all possible pairs of populations. Real differences between populations are indicated when the 95% CI for a given comparison does not include zero. Moreover, because of the family-wise nature of this analysis, the entire set of comparisons has only a 5% chance of containing any false positives. As is the case for other methods for multiple comparisons, the chance of obtaining false negatives increases with the number of populations being tested, and, with post hoc ANOVA methods, this increase is typically exponential. For Tukey's test, the effect of increasing the number of populations is manifest as a widening of 95% CIs, such that a higher proportion will encompass zero. Tukey's test does have more power than the Bonferroni method but does not generate precise P-values for specific comparisons. To get some idea of significance levels, however, one can run Tukey's test using several different family-wise significance thresholds (0.05, 0.01, etc.) to see which comparisons are significant at different thresholds. In addition to Tukey's test, many other methods have been developed for post hoc ANOVA including Dunnett's test, Holm's test, and Scheffe's test. Thus if your analyses take you heavily into the realm of the ANOVA, it may be necessary to educate yourself about the differences between these approaches.
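For readers curious about what a one-way ANOVA actually computes, the sketch below calculates its test statistic (the F ratio of between-group to within-group variability) in plain Python. This is an illustration under simplifying assumptions, not a substitute for a statistics package: it stops at the F statistic and does not compute the P-value or any post hoc comparisons, and the group values are invented.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Invented measurements for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F ratio indicates that the group means are more spread out than the scatter within groups would predict; as noted above, it does not say which groups differ.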
3.7. Summary of multiple comparisons methods
Figure 10 provides a visual summary of the multiple comparisons methods discussed above. As can be seen, the likelihood of falsely declaring a result to be statistically significant is highest when conducting multiple t-tests without corrections and lowest using Bonferroni-type methods. Conversely, incorrectly concluding no significant difference even when one exists is most likely to occur using the Bonferroni method. Thus the Bonferroni method is the most conservative of the approaches discussed, with FDR occupying the middle ground. Additionally, there is no rule as to whether the uniform or non-uniform Bonferroni method will be more conservative as this will always be situation dependent. Though discussed above, ANOVA has been omitted from Figure 10 since this method does not apply to individual comparisons. Nevertheless, it can be posited that ANOVA is more conservative than uncorrected multiple t-tests and less conservative than Bonferroni methods. Finally, we can note that the statistical power of an analysis is lowest when using approaches that are more conservative (discussed further in Section 6.2).
Figure 10. Strength versus weakness comparison of statistical methods used for analyzing multiple means.
3.8. When are multiple comparison adjustments not required?
There is no law that states that all possible comparisons must be made. It is perfectly permissible to choose a small subset of the comparisons for analysis, provided that this decision is made prior to generating the data and not afterwards based on how the results have played out! In addition, with certain datasets, only certain comparisons may make biological sense or be of interest. Thus one can often focus on a subset of relevant comparisons. As always, common sense and a clear understanding of the biology is essential. These situations are sometimes referred to as planned comparisons, thus emphasizing the requisite premeditation. An example might be testing for the effect on longevity of a particular gene that you have reason to believe controls this process. In addition, you may include some negative controls as well as some “long-shot” candidates that you deduced from reading the literature. The fact that you included all of these conditions in the same experimental run, however, would not necessarily obligate you to compensate for multiple comparisons when analyzing your data.
In addition, when the results of multiple tests are internally consistent, multiple comparison adjustments are often not needed. For example, if you are testing the ability of gene X loss of function to suppress a gain-of-function mutation in gene Y, you may want to test multiple mutant alleles of gene X as well as RNAi targeting several different regions of X. In such cases, you may observe varying degrees of genetic suppression under all the tested conditions. Here you need not adjust for the number of tests carried out, as all the data are supporting the same conclusion. In the same vein, it could be argued that suppression of a mutant phenotype by multiple genes within a single pathway or complex could be exempt from issues of multiple comparisons. Finally, as discussed above, carrying out multiple independent tests may be sufficient to avoid having to apply statistical corrections for multiple comparisons.
3.9. A philosophical argument for making no adjustments for multiple comparisons
Imagine that you have written up a manuscript that contains fifteen figures (fourteen of which are supplemental). Embedded in those figures are 23 independent t-tests, none of which would appear to be obvious candidates for multiple comparison adjustments. However, you begin to worry. Since the chosen significance threshold for your tests was 0.05, there is nearly a 70% chance [1 - (0.95)^23 = 0.693] that at least one of your conclusions is wrong (see footnote 34). Thinking about this more, you realize that over the course of your career you hope to publish at least 50 papers, each of which could contain an average of 20 statistical tests. This would mean that over the course of your career you are 99.9999999999999999999947% likely to publish at least one error and will undoubtedly publish many (at least those based on statistical tests). To avoid this humiliation, you decide to be proactive and impose a career-wise Bonferroni correction to your data analysis. From now on, for results with corresponding statistical tests to be considered valid, they must have a P-value of <0.00005 (0.05/1000). Going through your current manuscript, you realize that only four of the 23 tests will meet your new criteria. With great sadness in your heart, you move your manuscript into the trash folder on your desktop.
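The arithmetic in this thought experiment can be checked with a few lines of Python. This is a minimal sketch that assumes independent tests and that every null hypothesis is true; the function names are illustrative only.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P-value cutoff that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


manuscript_rate = familywise_error_rate(23)      # 23 tests in one manuscript
career_tests = 50 * 20                           # 50 papers x 20 tests each
career_no_error = (1.0 - 0.05) ** career_tests   # chance of zero false positives
career_cutoff = bonferroni_threshold(career_tests)

print(f"manuscript: {manuscript_rate:.3f}")
print(f"career, no error at all: {career_no_error:.2e}")
print(f"career-wise cutoff: {career_cutoff:.5f}")
```

The career-long figure is printed as the chance of making no error at all, because one minus a number on the order of 10^-23 rounds to exactly 1.0 in double-precision arithmetic.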
Although the above narrative may be ridiculous (indeed, it is meant to be so), the underlying issues are very real. Conclusions based on single t-tests, which are not supported by additional complementary data, may well be incorrect. Thus, where does one draw the line? One answer is that no line should be drawn, even in situations where multiple comparison adjustments would seem to be warranted. Results can be presented with corresponding P-values, and readers can be allowed to make their own judgments regarding their validity. For larger data sets, such as those from microarray studies, an estimation of either the number or proportion of likely false positives can be provided to give readers a feeling for the scope of the problem. Even without this, readers could in theory look at the number of comparisons made, the chosen significance threshold, and the number of positive hits to come up with a general idea about the proportion of false positives. Although many reviewers and readers may not be satisfied with this kind of approach, know that there are professional statisticians who support this strategy. Perhaps most importantly, understand that whatever approaches are used, data sets, particularly large ones, will undoubtedly contain errors, including both false positives and false negatives. Wherever possible, seek to confirm your own results using multiple independent methods so that you are less likely to be fooled by chance occurrence.
29 This discussion assumes that the null hypothesis (of no difference) is true in all cases.
30 Notice that this is the Bonferroni critical value against which all P-values would be compared.
31 If the null hypothesis is true, P-values are random values, uniformly distributed between 0 and 1.
32 The name is a bit unfortunate in that all of statistics is devoted to analyzing variance and ascribing it to random sources or certain modeled effects.
33 These are referred to in the official ANOVA vernacular as treatment groups.
34 This is true supposing that none are in fact real.
Plant Materials and Growth Conditions
Plants used in this study were of the Arabidopsis thaliana Col-0 ecotype, the A17 ecotype of Medicago truncatula, the M82 LA3475 cultivar of tomato (Solanum lycopersicum), and the Nipponbare cultivar of rice (Oryza sativa). Transgenic plants of each species for INTACT were produced by transformation with a binary vector carrying both a constitutively expressed biotin ligase and constitutively expressed NTF protein containing a nuclear outer membrane association domain ( Ron et al., 2014). The binary vector used for M. truncatula was identical to the tomato vector ( Ron et al., 2014) but was constructed in a pB7WG vector containing the phosphinothricin resistance gene for plant selection and it retains the original AtACT2p promoter. The binary vector used for rice is described elsewhere ( Reynoso et al., 2017). Transformation of rice was performed at UC Riverside and tomato transformation was performed at the UC Davis plant transformation facility. Arabidopsis plants were transformed by the floral dip method ( Clough and Bent, 1998), and composite transgenic M. truncatula plants were produced according to established procedures ( Limpens et al., 2004).
For root tip chromatin studies, constitutive INTACT transgenic plant seeds were surface sterilized and sown on 0.5× Murashige and Skoog (MS) medium (Murashige and Skoog, 1962) with 1% (w/v) sucrose in 150-mm-diameter Petri plates, except for tomato and rice, where full-strength MS medium with 1% (w/v) sucrose and without vitamins was used. Seedlings were grown on vertically oriented plates in controlled growth chambers for 7 d after germination, at which point the 1-cm root tips were harvested and frozen immediately in liquid N2 for subsequent nuclei isolation. The growth temperature and light intensity were 20°C and 200 μmol/m²/s for Arabidopsis and M. truncatula, 23°C and 80 μmol/m²/s for tomato, and 28°C/25°C day/night and 110 μmol/m²/s for rice. Light cycles were 16 h light/8 h dark for all species, and light was produced with a 50:50 mixture of 6500K and 3000K T5 fluorescent bulbs.
For studies of the Arabidopsis root hair and non-hair cell types, previously described INTACT transgenic lines were used ( Deal and Henikoff, 2010). These lines are in the Col-0 background and carry a constitutively expressed biotin ligase gene (ACT2p:BirA) and a transgene conferring cell-type-specific expression of the NTF gene (from the GLABRA2 promoter in non-hair cells or the ACTIN DEPOLYMERIZING FACTOR8 promoter in root hair cells). Plants were grown vertically on plates as described above for 7 d, at which point 1.25-cm segments from within the fully differentiated cell zone were harvested and flash frozen in liquid N2. This segment of the root contains only fully differentiated cells and excludes the root tip below and any lateral roots above.
For comparison of ATAC-seq using crude and INTACT-purified Arabidopsis nuclei, a constitutive INTACT line was used (ACT2p:BirA/UBQ10p:NTF) (Sullivan et al., 2014), and nuclei were isolated as described previously (Bajic et al., 2018). In short, after growth and harvesting as described above, 1 to 3 g of root tips was ground to a powder in liquid N2 in a mortar and pestle and then resuspended in 10 mL of NPB (20 mM MOPS, pH 7, 40 mM NaCl, 90 mM KCl, 2 mM EDTA, 0.5 mM EGTA, 0.5 mM spermidine, 0.2 mM spermine, and 1× Roche Complete protease inhibitors) with further grinding. This suspension was then filtered through a 70-μm cell strainer and centrifuged at 1200g for 10 min at 4°C. After decanting, the nuclei pellet was resuspended in 1 mL of NPB and split into two 0.5-mL fractions in new tubes. Nuclei from one fraction were purified by INTACT using streptavidin-coated magnetic beads as previously described (Bajic et al., 2018) and kept on ice prior to counting and subsequent transposase integration reaction. Nuclei from the other fraction were purified by nonionic detergent lysis of organelles and sucrose sedimentation, as previously described (Bajic et al., 2018). Briefly, these nuclei in 0.5 mL of NPB were pelleted at 1200g for 10 min at 4°C, decanted, and resuspended thoroughly in 1 mL of cold EB2 (0.25 M sucrose, 10 mM Tris, pH 8, 10 mM MgCl2, 1% Triton X-100, and 1× Roche Complete protease inhibitors). Nuclei were then pelleted at 1200g for 10 min at 4°C, decanted, and resuspended in 300 μL of EB3 (1.7 M sucrose, 10 mM Tris, pH 8, 2 mM MgCl2, 0.15% Triton X-100, and 1× Roche Complete protease inhibitors). This suspension was then layered gently on top of 300 μL of fresh EB3 in a 1.5-mL tube and centrifuged at 16,000g for 10 min at 4°C. Pelleted nuclei were then resuspended in 1 mL of cold NPB and kept on ice prior to counting and transposase integration.
For INTACT purification of total nuclei from root tips of M. truncatula, tomato, and rice, as well as purification of Arabidopsis root hair and non-hair cell nuclei, 1 to 3 g of starting tissue was used. In all cases, nuclei were purified by INTACT and nuclei yields were quantified as described previously ( Bajic et al., 2018).
Freshly purified nuclei to be used for ATAC-seq were kept on ice prior to the transposase integration reaction and never frozen. Transposase integration reactions and sequencing library preparations were then performed as previously described ( Bajic et al., 2018). In brief, 50,000 purified nuclei or 50 ng of Arabidopsis leaf genomic DNA was used in each 50 μL transposase integration reaction for 30 min at 37°C using Nextera reagents (Illumina FC-121-1030). DNA fragments were purified using the Minelute PCR purification kit (Qiagen), eluted in 11 μL of elution buffer, and the entirety of each sample was then amplified using High Fidelity PCR Mix (NEB) and custom bar-coded primers for 9 to 12 total PCR cycles. These amplified ATAC-seq libraries were purified using AMPure XP beads (Beckman Coulter), quantified by qPCR with the NEBNext Library Quantification Kit (NEB), and analyzed on a Bioanalyzer High Sensitivity DNA Chip (Agilent) prior to pooling and sequencing.
Sequencing was performed using the Illumina NextSeq 500 or HiSeq 2000 instrument at the Georgia Genomics Facility at the University of Georgia. Sequencing reads were either single-end 50-nucleotide or paired-end 36-nucleotide and all libraries that were to be directly compared were pooled and sequenced on the same flow cell.
Sequence Read Mapping, Processing, and Visualization
Sequencing reads were mapped to their corresponding genome of origin using Bowtie2 software (Langmead and Salzberg, 2012) with default parameters. Genome builds used in this study were Arabidopsis version TAIR10, M. truncatula version Mt4.0, tomato version SL2.4, and rice version IRGSP 1.0.30. Mapped reads in .sam format were converted to .bam format and sorted using Samtools 0.1.19 (Li et al., 2009). Mapped reads were then filtered using Samtools to retain only those reads with a mapping quality score of 2 or higher (Samtools “view” command with option “-q 2” to set mapping quality cutoff). Arabidopsis ATAC-seq reads were further filtered with Samtools to remove those mapping to either the chloroplast or mitochondrial genomes, and root hair and non-hair cell data sets were also subsampled such that the experiments within a biological replicate had the same number of mapped reads prior to further analysis. For normalization and visualization, the filtered, sorted .bam files were converted to bigwig format using the “bamCoverage” script in deepTools 2.0 (Ramírez et al., 2016) with a bin size of 1 bp and RPKM normalization. Use of the term normalization in this article refers to this process. Heat maps and average plots displaying ATAC-seq data were also generated using the “computeMatrix” and “plotHeatmap” functions in the deepTools package. Genome browser images were made using the Integrative Genomics Viewer (IGV) 2.3.68 (Thorvaldsdóttir et al., 2013) with bigwig files processed as described above.
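As a rough illustration of what RPKM normalization does to a coverage bin, the standard definition (reads per kilobase of region per million mapped reads) can be written out directly. This is only a sketch of the formula with invented numbers; it is not the deepTools implementation, whose read-counting conventions may differ.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    kilobases = region_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical example: a 1-bp bin covered by 12 reads
# in a library of 20 million mapped reads.
bin_value = rpkm(read_count=12, region_length_bp=1, total_mapped_reads=20_000_000)
print(bin_value)
```

Because the value is scaled by library size, bins from libraries of different sequencing depth become directly comparable.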
Identification of Orthologous Genes among Species
Orthologous genes among species were selected exclusively from syntenic regions of the four genomes. Syntenic orthologs were identified using a combination of CoGe SynFind (https://genomevolution.org/CoGe/SynFind.pl) with default parameters, and CoGe SynMap (https://genomevolution.org/coge/SynMap.pl) with the QuotaAlign feature selected and a minimum of six aligned pairs required (Lyons and Freeling, 2008; Lyons et al., 2008).
Peak Calling to Detect THSs
Peak calling on ATAC-seq data was performed using the “findPeaks” function of the HOMER package (Heinz et al., 2010). The parameters “-region” and “-minDist 150” were used to allow identification of variable length peaks and to set a minimum distance of 150 bp between peaks before they are merged into a single peak, respectively. We refer to the peaks called in this way as transposase hypersensitive sites or THSs.
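The effect of a minimum-distance setting can be pictured with a simple interval merge. The sketch below is not HOMER's algorithm; it assumes peaks on one chromosome given as (start, end) pairs and merges any two whose gap is below 150 bp. The function name and coordinates are hypothetical.

```python
def merge_close_peaks(peaks, min_dist=150):
    """Merge (start, end) intervals on one chromosome whose gap is below min_dist."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start - merged[-1][1] < min_dist:
            # Gap to the previous peak is too small: extend the previous peak.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical peaks: the first two are 100 bp apart, the third is 380 bp away.
example_peaks = [(100, 300), (400, 520), (900, 1000)]
merged_peaks = merge_close_peaks(example_peaks)
print(merged_peaks)
```

Merging in this way yields variable-length regions, which is the general behavior the two parameters are described as producing.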
Genomic Distribution of THSs
For each genome, the distribution of THSs relative to genomic features was assessed using the PAVIS web tool ( Huang et al., 2013) with “upstream” regions set as the 2000 bp upstream of the annotated transcription start site and “downstream” regions set as 1000 bp downstream of the transcription termination site.
TF Motif Analyses
ATAC-seq THSs that were found in two replicates of each sample were used for motif analysis. The regions were adjusted to the same size (500 bp for root tip THSs or 300 bp for cell-type-specific dTHSs). The MEME-ChIP pipeline ( Machanick and Bailey, 2011) was run on the repeat-masked fasta files representing each THS set to identify overrepresented motifs, using default parameters. For further analysis, we used the motifs derived from the DREME, MEME, and CentriMo programs that were significant matches (E value < 0.05) to known motifs. Known motifs from both Cis-BP ( Weirauch et al., 2014) and the DAP-seq database ( O’Malley et al., 2016) were used in all motif searches.
Assignment of THSs to Genes
For each ATAC-seq data set, the THSs were assigned to genes using the “TSS” function of the PeakAnnotator 1.4 program ( Salmon-Divon et al., 2010). This program assigns each peak/THS to the closest TSS, whether upstream or downstream, and reports the distance from the peak center to the TSS based on the genome annotations described above.
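The assignment rule described here (closest TSS to the peak center, in either direction) can be sketched as follows. This is an illustration rather than the PeakAnnotator code; it ignores strand, treats one chromosome at a time, and uses hypothetical gene names and coordinates.

```python
def assign_to_nearest_tss(peak_start, peak_end, tss_positions):
    """Return (gene, offset) for the TSS closest to the peak center.

    tss_positions maps gene name -> TSS coordinate on the same chromosome.
    The offset is peak center minus TSS, so its sign gives the side of the TSS
    in genome coordinates (strand is ignored in this sketch).
    """
    center = (peak_start + peak_end) // 2
    gene, tss = min(tss_positions.items(), key=lambda item: abs(item[1] - center))
    return gene, center - tss


# Hypothetical annotation with two genes.
tss_positions = {"geneA": 1000, "geneB": 5000}
nearest_gene, offset = assign_to_nearest_tss(1800, 2200, tss_positions)
print(nearest_gene, offset)
```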
To examine motif-centered footprints for TFs of interest, we used the “dnase_average_profile.py” script in the pyDNase package ( Piper et al., 2013). The script was used in ATAC-seq mode [“-A” parameter] with otherwise default parameters.
Defining High-Confidence Target Sites for Transcription Factors
We used FIMO ( Grant et al., 2011) to identify motif occurrences for TFs of interest, and significant motif occurrences were considered to be those with a P value < 0.0001. Genome-wide high confidence binding sites for a given transcription factor were defined as transposase hypersensitive sites in a given cell type or tissue that also contain a significant motif occurrence for the factor and also overlap with a known enriched region for that factor from DAP-seq or ChIP-seq data (see also Supplemental Figure 2 for a schematic diagram of this process).
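The three criteria can be expressed as a filter over genomic intervals. The sketch below assumes half-open (start, end) coordinates on a single chromosome, motif hits given as (start, end, P value), and invented example data; it is meant to show the logic, not the scripts used in the study.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def contains(outer, inner):
    """True if interval inner lies entirely within interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def high_confidence_sites(ths_list, motif_hits, bound_regions, p_cutoff=1e-4):
    """Keep THSs that contain a significant motif and overlap a bound region."""
    significant = [(s, e) for s, e, p in motif_hits if p < p_cutoff]
    selected = []
    for ths in ths_list:
        has_motif = any(contains(ths, m) for m in significant)
        has_binding = any(overlaps(ths, r) for r in bound_regions)
        if has_motif and has_binding:
            selected.append(ths)
    return selected


# Hypothetical inputs for one transcription factor.
ths_list = [(100, 400), (1000, 1300), (2000, 2300)]
motif_hits = [(150, 160, 1e-5), (1100, 1110, 1e-3), (2050, 2060, 5e-5)]
bound_regions = [(350, 600), (1200, 1500)]  # e.g. DAP-seq or ChIP-seq peaks

sites = high_confidence_sites(ths_list, motif_hits, bound_regions)
print(sites)
```

In this example only the first THS passes: the second has a motif hit that misses the P value cutoff, and the third has no overlapping bound region.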
GO analyses using only Arabidopsis genes were performed using the GeneCodis 3.0 program (Nogales-Cadenas et al., 2009; Tabas-Madrid et al., 2012). Hypergeometric tests were used with P value correction using the false discovery rate (FDR) method. AgriGO was used for comparative GO analysis of gene lists among species, using default parameters (Du et al., 2010; Tian et al., 2017).
The raw and processed ATAC-seq data described here have been deposited in the NCBI Gene Expression Omnibus database under record number GSE101482. The characteristics of each data set (individual accession number, read numbers, mapping characteristics, and THS statistics) are included in Supplemental Data Set 8. For comparison to our ATAC-seq data from root tips, we used a published DNase-seq data set from 7-d-old whole Arabidopsis roots (SRX391990), which was generated from the same INTACT transgenic line used in our experiments (Sullivan et al., 2014). Publicly available ChIP-seq and DAP-seq data sets were also used to identify genomic binding sites for transcription factors of interest. These include ABF3 (AT4G34000; SRX1720080) and MYB44 (AT5G67300; SRX1720040) (Song et al., 2016), HY5 (AT5G11260; SRX1412757), CBF2 (AT4G25470; SRX1412036), MYB77 (AT3G50060; SRX1412453), ABI5 (AT2G36270; SRX670505), MYB33 (AT5G06100; SRX1412418), NAC083 (AT5G13180; SRX1412546), WRKY27 (AT5G52830; SRX1412681), and At5g04390 (SRX1412214) (O’Malley et al., 2016). Raw reads from these files were mapped and processed as described above for ATAC-seq data, including peak calling with the HOMER package. Published RNA-seq data from Arabidopsis root hair and non-hair cells (Li et al., 2016a) were used to define transcripts that were specifically enriched in the root hair cell relative to the non-hair cell (hair-cell-enriched genes), and vice versa (non-hair-enriched genes). We defined cell-type-enriched genes as those whose transcripts were at least 2-fold more abundant in one cell type than the other and had an abundance of at least five RPKM in the cell type with higher expression.
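The definition of cell-type-enriched genes translates into a short classification rule. The following sketch uses invented RPKM values and a hypothetical function name; it applies the two stated thresholds (at least 2-fold difference, at least five RPKM in the higher cell type) and nothing more.

```python
def classify_gene(hair_rpkm, non_hair_rpkm, min_fold=2.0, min_rpkm=5.0):
    """Label a gene as enriched in one cell type, or None if it fails a threshold."""
    higher = max(hair_rpkm, non_hair_rpkm)
    lower = min(hair_rpkm, non_hair_rpkm)
    if higher < min_rpkm:
        return None  # too lowly expressed in both cell types
    if higher < min_fold * lower:
        return None  # less than a 2-fold difference (multiplying avoids dividing by zero)
    return "hair-cell-enriched" if hair_rpkm > non_hair_rpkm else "non-hair-enriched"


# Hypothetical (hair, non-hair) RPKM pairs.
examples = [(12.0, 4.0), (3.0, 1.0), (6.0, 10.0), (2.0, 20.0)]
labels = [classify_gene(hair, non_hair) for hair, non_hair in examples]
print(labels)
```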
Supplemental Figure 1. Comparison of read counts at enriched regions in DNase-seq versus ATAC-seq and Crude-ATAC-seq versus INTACT-ATAC-seq.
Supplemental Figure 2. Analysis of reproducibility in tomato ATAC-seq data
Supplemental Figure 3. Analysis of ATAC-seq signals at orthologous genes.
Supplemental Figure 4. Defining high-confidence binding sites and target genes for each TF.
Supplemental Figure 5. Overlaps of root tip transcription factor target genes.
Supplemental Figure 6. Wild-type and hy5-1 root tip morphology and gravitropism phenotypes.
Supplemental Figure 7. Comparison of ATAC-seq read counts between data sets.
Supplemental Figure 8. Footprinting at motifs of cell-type-enriched TFs in genomic DNA and cell-type-specific ATAC-seq data sets.
Supplemental Data Set 1. Characteristics of THSs in Arabidopsis, M. truncatula, rice, and tomato.
Supplemental Data Set 2. Syntenic orthologous genes in all four species.
Supplemental Data Set 3. Expressolog gene sets in four species.
Supplemental Data Set 4. Motifs common to THSs in all species.
Supplemental Data Set 5. Predicted target genes for ABF3, CBF2, HY5, and MYB77 in all four species.
Supplemental Data Set 6. Motifs overrepresented in cell-type-enriched differential transposase hypersensitive sites.
Supplemental Data Set 7. Binding sites and target genes for cell-type-enriched TFs.
Supplemental Data Set 8. ATAC-seq data set characteristics.
Examples of Cells
As mentioned above, archaebacteria are a very old form of prokaryotic cells. Biologists actually put them in their own “domain” of life, separate from other bacteria.
Key ways in which archaebacteria differ from other bacteria include:
- Their cell membranes, which are made of a type of lipid not found in either bacteria or eukaryotic cell membranes.
- Their DNA replication enzymes, which are more similar to those of eukaryotes than those of bacteria, suggesting that bacteria and archaea are only distantly related, and archaebacteria may actually be more closely related to us than to modern bacteria.
- Some archaebacteria have the ability to produce methane, which is a metabolic process not found in any bacteria or any eukaryotes.
Archaebacteria’s unique chemical attributes allow them to live in extreme environments, such as superheated water, extremely salty water, and some environments which are toxic to all other life forms.
Scientists became very excited in recent years at the discovery of Lokiarchaeota – a type of archaebacteria which shares many genes with eukaryotes that had never before been found in prokaryotic cells!
It is now thought that Lokiarchaeota may be our closest living relative in the prokaryotic world.
You are most likely familiar with the type of bacteria that can make you sick. Indeed, common pathogens like Streptococcus and Staphylococcus are prokaryotic bacterial cells.
But there are also many types of helpful bacteria – including those that break down dead waste to turn useless materials into fertile soil, and bacteria that live in our own digestive tract and help us digest food.
Bacterial cells can commonly be found living in symbiotic relationships with multicellular organisms like ourselves, in the soil, and anywhere else that’s not too extreme for them to live!
Plant cells are eukaryotic cells that are part of multicellular, photosynthetic organisms.
Plant cells have chloroplast organelles, which contain pigments that absorb photons of light and harvest the energy of those photons.
Chloroplasts have the remarkable ability to turn light energy into cellular fuel, and use this energy to take carbon dioxide from the air and turn it into sugars that can be used by living things as fuel or building material.
In addition to having chloroplasts, plant cells also typically have a cell wall made of rigid sugars, which enables plant tissues to maintain upright structures such as leaves, stems, and tree trunks.
Plant cells also have the usual eukaryotic organelles including a nucleus, endoplasmic reticulum, and Golgi apparatus.
For this exercise, let’s look at a type of animal cell that is of great importance to you: your own liver cell.
Like all animal cells, it has mitochondria which perform cellular respiration, turning oxygen and sugar into large amounts of ATP to power cellular functions.
It also has the same organelles as most animal cells: a nucleus, endoplasmic reticulum, Golgi apparatus, etc.
But as part of a multicellular organism, your liver cell also expresses unique genes, which give it unique traits and abilities.
Liver cells in particular contain enzymes that break down many toxins, which is what allows the liver to purify your blood and break down dangerous bodily waste.
The liver cell is an excellent example of how multicellular organisms can be more efficient by having different cell types work together.
Your body could not survive without liver cells to break down certain toxins and waste products, but the liver cell itself could not survive without nerve and muscle cells that help you find food, and a digestive tract to break down that food into easily digestible sugars.
And all of these cell types contain the information to make all the other cell types! It’s simply a matter of which genes are switched “on” or “off” during development.
In this study, we determined the most extensively measured human cell proteome so far. We identified >10,000 proteins expressed in the commonly used human tissue culture cell line U2OS and demonstrate that protein discovery has reached saturation under the experimental conditions used, i.e., that further measurements of the same type would not be expected to identify additional proteins. We furthermore describe a large-scale estimate of protein abundances in a human cell. We and others have previously shown that the dynamic range of protein concentrations spans more than three orders of magnitude in the bacterium L. interrogans (Malmstrom et al, 2009) and five orders of magnitude in yeast (Ghaemmaghami et al, 2003; de Godoy et al, 2008; Picotti et al, 2009). In the present study, we demonstrate that the protein copy numbers of a human cell span at least seven orders of magnitude. This range is similar to that determined in mouse cells (Schwanhausser et al, 2011). This finding is furthermore in good agreement with the volume of the relevant cell types, namely ∼0.2 μm³ in L. interrogans (Beck et al, 2009), ∼30 μm³ in S. cerevisiae, and ∼4000 μm³ in U2OS (assuming spherical shape and 4 and 20 μm diameter for yeast and U2OS, respectively).
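The quoted cell volumes follow from the stated assumption of spherical cells with diameters of 4 μm (yeast) and 20 μm (U2OS). A quick check, using the sphere volume formula V = πd³/6:

```python
import math


def sphere_volume_um3(diameter_um):
    """Volume of a sphere in cubic micrometers, given its diameter in micrometers."""
    return math.pi * diameter_um ** 3 / 6


yeast_volume = sphere_volume_um3(4)    # S. cerevisiae, 4 um diameter
u2os_volume = sphere_volume_um3(20)    # U2OS, 20 um diameter

print(f"yeast: {yeast_volume:.1f} um^3")
print(f"U2OS:  {u2os_volume:.0f} um^3")
print(f"ratio: {u2os_volume / yeast_volume:.0f}x")
```

The results, roughly 34 and 4,190 μm³, round to the ∼30 and ∼4000 μm³ given in the text, and the 125-fold ratio corresponds to about two orders of magnitude in volume between yeast and U2OS.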
Interestingly, the bacterium L. interrogans expresses a relatively small number of proteins at very high copy number, e.g., proteins of the translation and protein folding system, metabolic enzymes, as well as components of the cell wall. Those proteins make up the majority of the total protein mass (Malmstrom et al, 2009) and a considerable fraction of the cytoplasmic volume (Beck et al, 2009), while proteins functioning in signaling, protein transport, or regulatory pathways, e.g., transcription factors, comprise a minority of the quantitative proteome. To investigate whether the same holds true for eukaryotes, we systematically compared the four available data sets mentioned above (Figure 3). We arbitrarily grouped all functional categories into three major classes: (i) cellular core functions containing carbohydrate, nucleobase, nucleoside, nucleotide, nucleic acid metabolic processes, lipid and other metabolic processes as well as transcription, translation, DNA replication, transport, and other core functions; (ii) regulatory functions, namely cytoskeleton organization, cell adhesion, cell division, phosphorylation, protein metabolic processes, signaling, developmental process, cell communication, and other regulatory functions; and (iii) others. The bacterium L. interrogans devotes most of its protein mass (∼75%) to core and <25% to regulatory functions. In contrast, less than half of the analyzed protein mass of U2OS fulfills core functions, and 51% carries out regulatory functions. In particular, the total fraction of protein devoted to cytoskeleton organization, protein metabolic processes and signaling is largely expanded in U2OS cells, while other processes with the exception of central metabolic processes are largely reduced. A very similar picture emerged for mouse cells. Yeast, at first glance, does not seem to follow this trend. However, it devotes only one third of the total protein mass to metabolism, while the corresponding number is >50% in L. interrogans. As a single cell eukaryote, yeast expends a significant fraction of its protein mass (∼30%) on translation and protein sorting. Taken together, this analysis indicates that the fraction of total protein mass devoted to regulatory functions is largely expanded in higher eukaryotes.
In multicellular species, domain families fulfilling regulatory functions have been more frequently subject to gene expansion than domains fulfilling core functions (Vogel and Chothia, 2006; Ori et al, 2011). We therefore investigated, using the quantitative data generated in this study, how this effect is linked to protein abundance. We and others showed that protein abundance is linked to function, namely that high-abundant proteins are often responsible for core functions, such as energy metabolism and translation, while regulatory functions such as protein phosphorylation and transcriptional regulation are often carried out by low-abundant proteins (Figure 2B and C; Supplementary Table S3; Schwanhausser et al, 2011). There are several lines of evidence suggesting that protein abundance is also linked to evolvability. It has been previously shown that highly expressed proteins evolve more slowly than proteins expressed at lower levels, i.e., they display a reduced protein divergence on the sequence level (Pal et al, 2001; Subramanian and Kumar, 2004), while low-abundant proteins display decreased sequence conservation across organisms (Schrimpf et al, 2009). It was further shown that protein families displaying lower abundance variability across species less often underwent gene duplication and that abundance variability scales inversely with protein expression (Weiss et al, 2010). These findings indirectly suggest a link between protein abundance and gene duplicability. Our data support this hypothesis. We show a negative correlation between the frequency of domain families in the human genome and their median copy number per cell (Figure 2D; Supplementary Figure S4A; Supplementary Table S5). We also show that proteins that have a higher number of paralogs tend to be expressed at lower copy number (Supplementary Figure S4B).
These findings underline the view that duplications of genes encoding for proteins expressed at high level are maintained under purifying selection, likely because of energy constraints (Lane and Martin, 2010) or higher risk of protein aggregation and toxicity (Drummond et al, 2005). Interestingly, a recent study that compares the relative expression level of gene products of three human cell lines on proteome and transcriptome level showed that proteins involved in regulatory functions more often vary in their expression levels as compared with core functions (Lundberg et al, 2010). One might thus speculate that the large fraction of the human proteome expressed at low copy number and involved in regulatory function was the main source of biological innovation during evolution. This hypothesis is supported by the following lines of evidence: (i) domain families occurring in low-abundant proteins are significantly more correlated with increase in organism complexity than the ones present in highly expressed proteins (P=7.8e−9, one-sided Wilcoxon rank sum test; Supplementary Figure S4C; Supplementary Table S5); (ii) the abundance of proteins involved in core functions is more strongly conserved across species than for proteins involved in regulatory functions (Schrimpf et al, 2009); and (iii) the fraction of the proteome devoted to regulatory functions significantly expanded during the course of evolution (Figure 3).
Regulatory, often low-abundant proteins are key players in mediating the integration of external stimuli with the cell's internal state and they control fundamental biological processes such as cell proliferation, migration, and cell differentiation. It was recently shown for mouse cells that low-abundant proteins and mRNAs are less stable than high-abundant ones (Schwanhausser et al, 2011). Therefore, expression at low copy numbers might provide an efficient way of dynamic regulation by translation and rapid turnover. Vice versa, cellular core functions might be more efficiently regulated by other means than degradation.
Current limitations of protein abundance indices determined from MS data are the availability of PTPs accounting for the multitude of isoforms within protein families and a bias toward proteins that produce fewer well-ionizing peptides. In particular, GO analysis reveals an underrepresentation of transmembrane proteins in the identified proteome (Supplementary Table S4). Such an effect has been observed before (Schrimpf et al, 2009) and is likely a result of the reduced accessibility of membrane proteins for MS analysis, although we had used an MS-compatible detergent during sample preparation. This finding is further underlined by the fact that a significant fraction of high-abundant mRNAs not discovered on the protein level encodes for membrane proteins. Otherwise, the distribution of functional categories on the genome and proteome level is quite similar, suggesting high proteome coverage and that the assumption of an even extractability of proteins holds true for the majority of proteins but not for membrane proteins. We demonstrate the feasibility of establishing protein abundance scales in very complex proteomes with precision that is likely sufficient to allow the analysis of biological systems by means of computational modeling. The method used in this study is in principle applicable to the majority of cell types and might be useful to study a multitude of cellular states and organisms in the future.
Materials And Methods
Cells, DNA Construct, Antibodies, and Reagents
COS-7 cells (African green monkey; American Type Culture Collection, Rockville, MD) were used in all experiments. They were maintained in DME (Biofluids, Rockville, MD) supplemented with 10% FBS, 2 mM glutamine, 100 U/ml penicillin, and 100 μg/ml streptomycin at 37°C in a 5% CO2 incubator. The cloning and expression of VSVG–GFP are as previously described (Presley et al., 1997). In brief, pCDM 8.1 vector carrying VSVG-ts045 with EGFP (Clontech, Palo Alto, CA) directly linked to the carboxy terminus was expressed in COS-7 cells using electroporation. VSVG–GFP expressing cells were grown on either 13-mm glass coverslips or in chambered coverglasses (Lab Tek, Naperville, IL). They were imaged in 2–3 ml RPMI without phenol red (Biofluids) which contained 20 mM Hepes buffer, pH 7.4, 150 μg/ml cycloheximide, and 20% fetal calf serum. The concentration of cycloheximide was sufficient to inhibit protein synthesis by 90% (Cole et al., 1998). All drugs were purchased from Sigma Chemical Co (St. Louis, MO). The following antibodies were used: rabbit polyclonal antiserum to AP1 and furin (J. Bonifacino, National Institute of Child Health and Human Development [NICHD], National Institutes of Health [NIH]); rabbit polyclonal antiserum to GM130 (G. Warren, Imperial Cancer Research Fund, London, UK); rabbit polyclonal antiserum to β-COP; and mouse monoclonal antibodies to hemagglutinin (HA) (HA.11; Berkeley Antibody, Richmond, CA). Rhodamine-conjugated secondary antibodies were purchased from Southern Biotechnology (Birmingham, AL).
Fluorescence Microscopy and Image Processing
Cells were imaged at 40° or 32°C using a Zeiss LSM 410 (Carl Zeiss Inc., Thornwood, NY) with a 100× Zeiss PlanApochromat oil immersion objective NA 1.4, or a Zeiss upright model 3 photomicroscope with a Nikon Planapo 60× oil immersion objective NA 1.4 equipped with a silicon-intensified target video (SIT) camera VE1000SIT (Dage-MTI, Michigan City, IN) attached to an Argus-10 image processor (Hamamatsu, Hamamatsu City, Japan). Temperature was controlled with a Nevtek air stream stage incubator (Burnsville, VA). On the confocal microscope, GFP molecules were excited with the 488 line of a krypton-argon laser and imaged with a 515–540 bandpass filter. Rhodamine-labeled antibodies were excited with the 568 line and imaged with a long-pass 590 filter. Filter sets for conventional fluorescein imaging and a neutral density filter were used for imaging VSVG–GFP expressing cells on the SIT video microscope system. Images from the SIT camera were digitized and collected directly to RAM (8–15 frames/s) with an Apple Power Macintosh 9600/200 equipped with a PCI-based LG-5 video grabbing card (Scion, Frederick, MD) and 768 Mbytes of RAM space. Image capturing, processing, and automatic and manual data acquisition were performed using NIH Image 1.62 (Wayne Rasband Analytics, Research Services Branch, NIH, Bethesda, MD). Export to analogue video was performed with a Targa 1000 image capturing board (Truevision, Santa Clara, CA).
Confocal Image Acquisition for Kinetic Analysis and Quantitation
Confocal digital images (see Figs. 1–3) were collected using a Zeiss Plan-Neofluor 25× oil immersion objective NA 0.8 with a pinhole of 150 (corresponding to a focal depth of ∼22 μm) in order to maintain the entire cell within the center of the focal depth and thus to minimize changes in fluorescence efficiency due to VSVG–GFP moving away from the plane of focus. Time-lapse images were captured at 30–120 s intervals with 30–50% maximum laser power and 99% attenuation. The combination of low energy, high attenuation, and the less concentrated excitation laser beam caused by the low NA objective resulted in negligible photobleaching during repetitive imaging for over 3 h. Thus, VSVG-GFP–expressing cells incubated for 20 h at 40°C and imaged for 3 h in the presence of brefeldin A (5 μg/ml) and cycloheximide (150 μg/ml) showed no change in total fluorescence intensity. Average intensities for total cellular fluorescence and Golgi-associated fluorescence were measured using NIH Image 1.62 software after subtraction of background outside the cell. The overlap by ER and plasma membrane in Golgi regions of interest (ROI) was accounted for by fitting the measured Golgi fluorescence intensity values against the Golgi compartment plus small contributions from ER plus plasma membrane. The magnitudes of these small contributions were estimated directly by least squares fitting of the experimental data.
Conversion of Fluorescence Intensity to Number of GFP Molecules
The number of VSVG–GFP molecules expressed in a single cell was estimated by comparing the total cellular pixel intensity value in digitized images to a standard curve generated with solutions of known concentrations of recombinant purified GFP (Clontech) at identical power, attenuation, contrast, and brightness settings on the confocal microscope (Lippincott-Schwartz et al., 1998). The fluorescence in a 100-μm² ROI was then plotted against the number of GFP molecules estimated to be within the volumetric region of interest, determined by the product of the 100-μm² area and the full width half maximum (FWHM). The FWHM is the distance in the z direction between planes where the intensity is 50% that at the plane of focus. It can be calculated as in the equation (obtained from Zeiss confocal microscope manual),
where λ1 = emission wavelength (nm), λ2 = excitation wavelength (nm), NA = numerical aperture, M = magnification, n = refractive index of immersion medium, and P = pinhole setting in digital units. During data acquisition for kinetic analysis, a pinhole setting of 9.84 was used, yielding a FWHM of 22 μm. Plots of GFP molecules versus total fluorescence (the sum of the pixel values in an area) were fitted precisely with a linear function in the range of GFP concentrations used throughout this paper.
In generating the standard curve, it was assumed that the FWHM approximates the z-dimensional thickness of the sample which contributes to the fluorescence signal, and that the efficiency of detection of fluorescence within the volume is constant. The validity of these assumptions was confirmed by imaging spherical droplets of GFP solution in oil with diameters smaller than the FWHM but also within the range of analyzed cells and organelles. In applying the standard curve to living cells, we assumed that the fluorescence parameters (i.e., quantum yield, detection efficiency, and proportion of properly folded and fluorescent GFP molecules) were similar for GFP chimeras in cells and for GFP in aqueous solution (Piston et al., 1998).
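The conversion described above is essentially arithmetic: a known GFP concentration multiplied by the sampled volume (ROI area times FWHM) gives a molecule count, and a straight line relates that count to summed pixel intensity. The following Python sketch illustrates the idea. Only the 100-μm² area and the 22-μm FWHM come from the text; the calibration concentrations and intensities are invented for illustration, and fitting the line through the origin is an assumption (the text only says the fit was linear).

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_in_roi(conc_molar, area_um2=100.0, fwhm_um=22.0):
    """Estimate GFP molecules in the volume sampled by the ROI.

    The sampled volume is approximated as area x FWHM, as assumed in the text.
    1 cubic micrometre equals 1e-15 litres.
    """
    volume_litres = area_um2 * fwhm_um * 1e-15
    return conc_molar * volume_litres * AVOGADRO


def fit_line_through_origin(xs, ys):
    """Least-squares slope for y = slope * x (no intercept; an assumption)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical calibration data: GFP concentrations (mol/l) and the summed
# pixel values measured in the ROI after background subtraction.
concs = [0.5e-6, 1.0e-6, 2.0e-6, 4.0e-6]
intensities = [1300.0, 2650.0, 5290.0, 10600.0]

molecules = [molecules_in_roi(c) for c in concs]
slope = fit_line_through_origin(molecules, intensities)  # intensity per molecule


def intensity_to_molecules(total_intensity):
    """Convert a summed pixel intensity to an estimated number of molecules."""
    return total_intensity / slope
```

A cell's total background-subtracted intensity, acquired at the same microscope settings as the calibration, would then be passed to `intensity_to_molecules`.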
Kinetic Modeling of the Secretory Pathway
To extract information from the time-lapse digital images, standard techniques of kinetic analysis were adapted to fluorescence microscopy. The model used is shown in Fig. 2 A and contains seven parameters whose values could be determined from the fluorescence data. There are three independent rate constants: KER is the effective overall rate constant for a two-compartment ER, which was used to approximate the lag associated with VSVG folding, sorting, and export; KG characterizes the rate-limiting steps between arrival in the Golgi and arrival in the PM; and KPM represents the rate-limiting steps between arrival at the PM and lysosomal degradation. There are also three sampling efficiencies and one initial value for VSVG–GFP in the ER at the switch to permissive temperature. Values for the physiologically important parameters are given in Results (Table I). A small portion of the ER and PM is sampled in the Golgi ROI simply because a portion of these membranes overlays the Golgi region. These contaminating fractions were found to be 16.2 ± 8.8 (SD)% for ER contamination of the Golgi fluorescence and 15.5 ± 9.8 (SD)% for PM contamination of the Golgi fluorescence. These numbers were obtained by least squares fitting and represent the fractions of the ER and the PM that fell within the Golgi ROI outlined in Fig. 1 B. The sampling efficiency for Golgi fluorescence itself was found to be 86.5 ± 24.7 (SD)% and represents the efficiency with which light was collected from the Golgi compartment. We suspect this is less than 100% due to the complex geometry of the Golgi apparatus, the possibility that there is significant Golgi thickness flanking the focal plane, or because the fluorescent molecules are more tightly packed in the Golgi, resulting in quenching of the fluorescence signal.
In contrast, when the entire cell is taken as the region of interest, all the ER fluorescence was collected (sampling efficiency, 100%) and the resulting sampling efficiency for the plasma membrane was found to be 100% (measured over the entire cell, not just the Golgi) in nearly all cells analyzed (overall average: 97.8 ± 6.2 [SD]%).
These compartments and processes were translated to the corresponding system of mass-balance ordinary differential equations using the SAAM II (v 1.1) software (SAAM Institute, Seattle, WA) (Foster et al., 1994). The two data sets, total cell fluorescence and Golgi fluorescence, were fitted simultaneously using the generalized nonlinear least squares optimization procedure (Bell et al., 1996) in the SAAM II software to quantify the rate constants and sampling efficiencies for each cell. In other words, for each cell, the same rate constants and sampling efficiencies account for both data sets. This simultaneous fitting is essential: it is impossible to estimate KER and KG from the time course of Golgi fluorescence alone. Mean residence times reported in Results are calculated as reciprocals of the effective exit rate constant for the cellular compartment in question. Convergence was achieved for all data sets with only an occasional (10 out of 67 cells analyzed) requirement for inclusion of a Bayesian, a priori, term based on the entire population. All the individual cells' rate constants and sampling efficiencies were readily estimated from the experimental data. Summarizing for all the cells in Table I, the mean coefficients of variation were 2.1% for KER, 2.5% for KG, 5.0% for KPM, 2.8% for the Golgi sampling efficiency, 4.3% for the contribution of ER to the Golgi ROI, and 3.5% for the contribution of the PM to the Golgi ROI. This means that all of the individual values, which contribute to the mean and standard deviations reported in Results, were determined with precision. Statistical significance of differences among the experimental groups (reported in Table I) was assessed using the standard t test for populations with unequal variance. Model hypothesis testing applying Michaelis–Menten rate laws was performed using SAAM II (Foster et al., 1994).
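To make the structure of such a model concrete, the sketch below integrates a linear chain ER1 → ER2 → Golgi → PM → degradation in plain Python and computes the two observables that were fitted (total cell fluorescence and Golgi ROI fluorescence). It is not the SAAM II model itself. The rate constants are hypothetical placeholders, the assumption that both ER sub-compartments empty with the same constant is ours, and treating total fluorescence as the unweighted sum of compartments is a simplification. Only the sampling fractions (86.5%, 16.2%, 15.5%) are taken from the text.

```python
def simulate(k_er, k_g, k_pm, er0, t_end, dt=0.01):
    """Integrate ER1 -> ER2 -> Golgi -> PM -> degraded with classical RK4.

    Returns a list of (time, (er1, er2, golgi, pm)) tuples.
    """
    def deriv(state):
        er1, er2, golgi, pm = state
        return (
            -k_er * er1,                 # export from first ER sub-compartment
            k_er * er1 - k_er * er2,     # second ER sub-compartment (adds lag)
            k_er * er2 - k_g * golgi,    # arrival in and exit from the Golgi
            k_g * golgi - k_pm * pm,     # arrival at the PM and degradation
        )

    def step(state, slope, h):
        return tuple(s + h * k for s, k in zip(state, slope))

    state = (er0, 0.0, 0.0, 0.0)
    trajectory = [(0.0, state)]
    for i in range(int(round(t_end / dt))):
        k1 = deriv(state)
        k2 = deriv(step(state, k1, 0.5 * dt))
        k3 = deriv(step(state, k2, 0.5 * dt))
        k4 = deriv(step(state, k3, dt))
        state = tuple(
            s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trajectory.append(((i + 1) * dt, state))
    return trajectory


def observed(state, s_golgi=0.865, f_er=0.162, f_pm=0.155):
    """Map compartment contents to the two measured signals.

    s_golgi, f_er and f_pm are the mean sampling fractions quoted in the text.
    """
    er1, er2, golgi, pm = state
    total = er1 + er2 + golgi + pm
    golgi_roi = s_golgi * golgi + f_er * (er1 + er2) + f_pm * pm
    return total, golgi_roi


# Hypothetical rate constants (per minute), chosen only for illustration.
K_ER, K_G, K_PM = 0.05, 0.03, 0.005

# Mean residence time is the reciprocal of the exit rate constant.
residence_min = {"ER step": 1.0 / K_ER, "Golgi": 1.0 / K_G, "PM": 1.0 / K_PM}
```

In a fit, a least squares optimizer would adjust the rate constants and sampling fractions until both `observed` outputs match the measured time courses for a given cell. Fitting both signals together is what allows KER and KG to be separated.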
Kinetic Modeling of PGCs
The same techniques and software were applied to the analysis of data on post-Golgi traffic via PGCs, shown in Fig. 7. In this case the model (see Fig. 7 C) contains four parameters whose values could be determined from the simultaneous fitting of PGC fluorescence and total fluorescence in the ROI. These parameters are the initial postbleach fluorescence in the Golgi compartment, the rate constant distributing VSVG–GFP to PGCs, the rate constant distributing VSVG–GFP to the other pathway, and the residence time for VSVG–GFP in PGCs. The least squares optimizer determined these parameters with coefficients of variation of 16.8, 19.6, 21.5, and 7.9%, respectively. Interestingly, the optimizer could not determine the residence time for VSVG–GFP arriving in the ROI by the other pathway, except to require that it be much (at least 17-fold) shorter than the residence time in PGCs. This suggests that the other pathway is not likely to represent post-Golgi intermediates that travel via microtubular tracks (since their residence times would be similar to PGCs), and may instead represent VSVG–GFP that has diffused into the ROI from regions of the PM outside the ROI. Derived parameters such as the fraction of post-Golgi traffic distributed to measured PGCs, the fraction distributed to the other pathway, and the residence time in the PGCs were determined with coefficients of variation of 5.2, 8.2, and 7.9%, respectively. Sampling efficiencies were taken as 100% for all structures in this ROI since they are in the most peripheral and flattened portion of the imaged cell.
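The derived parameters mentioned above follow from the fitted rate constants by simple arithmetic, provided the Golgi compartment empties by first-order kinetics through exactly these two pathways (an assumption made for this sketch, since the model of Fig. 7 C is not reproduced here). Under that assumption, the fraction of traffic taking each pathway is its rate constant divided by the sum of both. The function names and example values below are hypothetical.

```python
def branch_fractions(k_pgc, k_other):
    """Fractions of post-Golgi traffic sent to PGCs and to the other pathway.

    Assumes two competing first-order exits from a single compartment.
    """
    total = k_pgc + k_other
    return k_pgc / total, k_other / total


def residence_time(k_exit):
    """Mean residence time as the reciprocal of the exit rate constant."""
    return 1.0 / k_exit


# Hypothetical rate constants (per minute), for illustration only.
frac_pgc, frac_other = branch_fractions(0.03, 0.01)
```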
Cell Lines: Types, Nomenclature, Selection and Maintenance (With Statistics)
The development and various other aspects of primary culture are described above. The term cell line refers to the propagation of culture after the first subculture.
In other words, once the primary culture is sub-cultured, it becomes a cell line. A given cell line contains several cell lineages of either similar or distinct phenotypes.
It is possible to select a particular cell lineage by cloning, physical cell separation, or some other selection method. A cell line derived by such selection or cloning is referred to as a cell strain. Cell strains do not have an infinite life; they die after a limited number of divisions.
Types of Cell Lines:
Finite Cell Lines :
The cells in culture divide only a limited number of times before their growth rate declines and they eventually die. Cell lines with limited culture life spans are referred to as finite cell lines. The cells normally divide 20 to 100 times (i.e. 20-100 population doublings) before extinction. The actual number of doublings depends on the species, cell lineage differences, culture conditions etc. Human cells generally divide 50-100 times, while murine cells divide 30-50 times before dying.
Continuous Cell Lines :
A few cells in culture may acquire a different morphology and get altered. Such cells are capable of growing faster resulting in an independent culture. The progeny derived from these altered cells has unlimited life (unlike the cell strains from which they originated). They are designated as continuous cell lines.
The continuous cell lines are transformed, immortal and tumorigenic. The transformed cells for continuous cell lines may be obtained from normal primary cell cultures (or cell strains) by treating them with chemical carcinogens or by infecting them with oncogenic viruses. Table 36.1 compares the properties of finite cell lines and continuous cell lines.
The most commonly used terms while dealing with cell lines are explained below.
Split ratio:
The divisor of the dilution ratio of a cell culture at subculture. For instance, when each subculture divides the culture in half, the split ratio is 1:2.
Passage number:
It is the number of times that the culture has been sub-cultured.
Generation number:
It refers to the number of doublings that a cell population has undergone. It must be noted that the passage number and the generation number are not the same.
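The difference between passage number and generation number can be shown with a short calculation. The Python sketch below assumes a constant split ratio and that the culture regrows to its pre-split density before each subculture, so each passage adds log2(divisor) population doublings. The function names are illustrative and not part of any standard tool.

```python
import math


def doublings_per_passage(split_divisor):
    """Doublings needed to regrow to the pre-split density.

    A 1:2 split needs one doubling, a 1:4 split needs two, and so on.
    """
    return math.log2(split_divisor)


def generation_number(passages, split_divisor, start=0.0):
    """Cumulative population doublings after a number of passages.

    Assumes the same split ratio at every passage (a simplification).
    """
    return start + passages * doublings_per_passage(split_divisor)
```

Under these assumptions, ten passages at a split ratio of 1:2 correspond to ten generations, whereas ten passages at 1:4 correspond to twenty.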
Nomenclature of Cell Lines:
It is a common practice to give codes or designations to cell lines for their identification. For instance, the code NHB 2-1 represents the cell line from normal human brain, followed by cell strain (or cell line number) 2 and clone number 1. The usual practice in a culture laboratory is to maintain a log book or computer database file for each of the cell lines.
While naming cell lines, it is absolutely necessary to ensure that each cell line designation is unique so that no confusion arises when reports are given in the literature. Further, at the time of publication, the cell line should be prefixed with a code designating the laboratory from which it was obtained, e.g. NCI for National Cancer Institute, WI for Wistar Institute.
Commonly used cell lines:
There are thousands of cell lines developed in laboratories the world over. A selected list of some commonly used cell lines, along with their origin, morphology and other characters, is given in Table 36.2.
Selection of Cell Lines:
Several factors need to be considered while selecting a cell line.
Some of them are briefly described:
1. Species:
In general, non-human cell lines carry less risk of biohazards and are hence preferred. However, species differences need to be taken into account while extrapolating the data to humans.
2. Finite or continuous cell lines:
Cultures with continuous cell lines are preferred as they grow faster, are easy to clone and maintain, and produce a higher yield. But it is doubtful whether continuous cell lines express the right and appropriate functions of the cells. Therefore, some workers suggest the use of finite cell lines, although this is more difficult.
3. Normal or transformed cells:
The transformed cells are preferred as they are immortalized and grow rapidly.
4. Availability:
The ready availability of cell lines is also important. Sometimes, it may be necessary to develop a particular cell line in a laboratory.
5. Growth characteristics:
The following growth parameters need to be considered:
i. Population doubling time
ii. Ability to grow in suspension
iii. Saturation density (yield per flask)
6. Stability:
The stability of the cell line, with particular reference to cloning, generation of adequate stock and storage, is important.
7. Phenotypic expression:
It is important that the cell lines possess cells with the right phenotypic expression.
Maintenance of Cell Cultures:
For the routine and good maintenance of cell lines in culture (primary culture or subculture) the examination of cell morphology and the periodic change of medium are very important.
Cell Morphology:
The cells in the culture must be examined regularly to check the health status of the cells, the absence of contamination, and any other serious complications (toxins in medium, inadequate nutrients etc.).
Replacement of Medium:
Periodic change of the medium is required for the maintenance of cell lines in culture, whether the cells are proliferating or non-proliferating. For proliferating cells, the medium needs to be changed more frequently than for non-proliferating cells. The time interval between medium changes depends on the rate of cell growth and metabolism.
For instance, for rapidly growing transformed cells (e.g. HeLa), the medium needs to be changed twice a week, while for slowly growing non-transformed cells (e.g. IMR-90) the medium may be changed once a week. Further, for rapidly proliferating cells, the sub-culturing has to be done more frequently than for the slowly growing cells.
The following factors need to be considered for the replacement of the medium:
1. Cell concentration:
Cultures with a high cell concentration use up the nutrients in the medium faster than those with a low concentration; hence the medium needs to be changed more frequently for the former.
2. Fall in pH:
A fall in the pH of the medium is an indication for a change of medium. Most cells grow optimally at pH 7.0, and they almost stop growing when the pH falls to 6.5. With a further drop in pH (between 6.5 and 6.0), the cells may lose their viability.
The rate of fall in pH is generally estimated for each cell line with a chosen medium. If the fall is less than 0.1 pH units per day, there is no harm even if the medium is not immediately changed. But when the fall is 0.4 pH units per day, the medium should be changed immediately.
3. Cell type:
Embryonic cells, transformed cells and continuous cell lines grow rapidly and require more frequent sub-culturing and change of medium. This is in contrast to normal cells, which grow slowly.
4. Morphological changes:
Frequent examination of cell morphology is very important in culture techniques. Any deterioration in cell morphology may lead to irreversible damage to the cells. The medium has to be changed promptly to avoid the risk of cell damage.
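The pH-based rule among the factors above can be written as a small decision helper. The sketch below is only an illustration in Python. The thresholds of 0.1 and 0.4 pH units per day and the pH 6.5 limit come from the text, while the handling of the intermediate range (between 0.1 and 0.4 units per day) is an assumption, since the text does not specify it. The function names are hypothetical.

```python
def ph_fall_per_day(ph_start, ph_end, hours_elapsed):
    """Rate of pH fall, scaled to pH units per 24 hours."""
    return (ph_start - ph_end) * 24.0 / hours_elapsed


def medium_change_advice(ph_start, ph_end, hours_elapsed):
    """Suggest an action from two pH readings taken hours_elapsed apart."""
    rate = ph_fall_per_day(ph_start, ph_end, hours_elapsed)
    # Growth almost stops at pH 6.5, and a fall of 0.4 units per day
    # calls for an immediate change according to the text.
    if ph_end <= 6.5 or rate >= 0.4:
        return "change now"
    # A fall below 0.1 units per day is described as harmless.
    if rate < 0.1:
        return "no immediate change"
    # Intermediate range: not specified in the text, assumed here.
    return "change soon"
```

In practice the rate of fall would be estimated for each cell line and medium, as noted above, and the readings should be interpreted together with cell concentration and morphology.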