Information

What is known about the coding sequence of Factor H in the human genome?



Factor H is a protein coded in 20 domains.

My question is whether these domains form some kind of repeat cluster in the human genome. Basically, I'm interested in the coding sequence from an assembly point of view: Is the coding region fully resolved? Apparently there is only one copy of the gene per haplotype, but are coding domains maybe similar to each other or repetitive in themselves?


As clearly stated in the Wikipedia article:

The molecule is made up of 20 complement control protein (CCP) modules (also referred to as Short Consensus Repeats or sushi domains) connected to one another by short linkers (of between three and eight amino acid residues) and arranged in an extended head to tail fashion.

So in your terminology yes, it is a repeat cluster. Each of those 20 domains will be similar to the others, although not identical.


The Ensembl entry, which was easy to get to from the Wikipedia page, shows that the protein is the product of a single gene and has 20 domains:

http://uswest.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000000971;r=1:196651878-196747504;t=ENST00000367429


Understanding the human genome: ENCODE at BioMed Central

The completion of the human genome project in 2003 was an immeasurably important milestone, but (like a book written in code) left many biologists wondering what the sequence might actually mean. Consequently, the focus of human genomics that year began the transition from generating sequence to annotating the functional elements hidden within the human genome's 3.2 billion As, Cs, Gs and Ts. With this goal in mind the ENCODE (Encyclopedia of DNA Elements) consortium was formed.

Some combinations of these nucleotides would together constitute the exons and introns that make up genes, while some would form regulatory elements. ENCODE set out to comprehensively annotate these elements in as much functional detail as possible; the results can now be found in the ENCODE explorer, a novel micro-site allowing seamless navigation between articles.

Following nearly ten years of data generation, the project's findings have now been published as a series of over 30 articles in a ground-breaking, multi-publisher collaboration between BioMed Central, Nature and Genome Research.

Genome Biology, one of BioMed Central's flagship journals, has published six articles from this project. A further article is published in BMC Genetics. These articles address important questions relating to regulatory elements, specifically how they are defined and how they correlate with gene expression. In particular, The GENCODE pseudogene resource, one of the Genome Biology papers, describes genes that have suffered a lethal number of mutations but whose 'fossil' traces are still apparent in the genome. The article shows that some of these pseudogenes may still be functional, in some cases having been partially resurrected from gene death.

Professor Mark Gerstein, the lead author of two of these articles, explained, "Among the oddities turned up by the ENCODE project are pseudogenes -- stretches of fossil DNA, evolutionary relics of the biological past. Moreover, the project's data has shown that a number of pseudogenes may be active, not as protein-coding genes but as ncRNAs."

BioMed Central has been publishing peer reviewed, open access journals for 12 years and now has a portfolio of 270 journals in science and medicine. All of the 30 ENCODE articles will be open access -- meaning these articles will be freely accessible online for all and will be available as a collection on the micro-site ENCODE explorer, which is hosted by Nature and also through an iPad App.

Professor Ewan Birney, the head of the ENCODE consortium stated that, "The ENCODE consortium was very excited to work with Genome Biology, BMC Genetics and BioMed Central to make the details of our work widely available. By coordinating these publications and creating clear paths to the original data, and by ensuring that it is all open access, we have made these large-scale ENCODE resources truly transparent and accessible for the scientific community."


Introduction

Sleep plays a vital function for survival in animals [1–3], especially vertebrates and even some invertebrates [4]. It is essential in maintaining both physical and mental health, especially in humans where sleep deprivation is linked to diabetes, high blood pressure, obesity, and decreased immune function [5,6,7]. The timing and duration of sleep varies widely among mammals [8] and is regulated by a plethora of intricate mechanisms including many circadian clock genes [9].

Among the genes responsible for circadian regulation in mammals is the basic helix-loop-helix family member e41 [5, 10, 11], also known as “differentially expressed in chondrocytes protein 2” (DEC2). It is an essential clock protein that acts as a transcription factor which maintains the negative feedback loop in the circadian clock by repressing E-box-mediated transcription [5]. Specifically, by binding to the promoter region on the prepro-orexin gene, BHLHE41 acts as a repressor of orexin expression in mammals. Furthermore, disabling orexin results in narcolepsy in mammals, confirming that orexin plays a vital role in sleep regulation [5].

BHLHE41 has several conserved functional domains including a bHLH region and the “orange” domain. As a member of the bHLH family, BHLHE41 contains a 60 amino acid bHLH conserved domain that promotes dimerization and DNA binding [10]. Specifically, the bHLH domain is composed of a DNA-binding region, E-box/N-box specificity site, and a dimerization interface for polypeptide binding. The DNA-binding region is followed by two alpha-helices surrounding a variable loop region. As a member of the group E bHLH family, this protein specifically binds to an N-box sequence (CACGCG or CACGAG) based on BHLHE41 amino acid site 53 (glutamate) [12]. The other well studied conserved domain in BHLHE41 is the orange domain which provides specificity as a transcriptional repressor [13]. These domains are conserved between humans and zebrafish in both their amino acid composition and function [14]. Unfortunately, there is no 3D structure described for a mammalian BHLHE41 in Genbank’s Protein Data Bank [15] to determine the spatial effects of amino acid variants.

Because clock genes are essential to sleep regulation, anomalies in them can lead to abnormal patterns of sleep that can manifest in a wide variety of ways, ranging from insomnia to oversleeping [1]. A rare point mutation in the BHLHE41 gene of Homo sapiens (P384R in NM_030762, also referred to as P385R as in [10]) confers a “short-sleeper phenotype”. The mutation involves a transversion from a C to G in the DNA sequence of BHLHE41, which results in a non-synonymous substitution from proline to arginine at amino acid position 385 of the BHLHE41 protein. Since proline (nonpolar) and arginine (electrically charged, basic) have chemically dissimilar structures and since substituting these amino acids is relatively rare (BLOSUM62 value of -2), it is not surprising that this mutation has a substantial phenotypic effect. Subjects with this allele reported shorter daily sleep patterns than those with the wild type allele, without reporting any other adverse effects [10]. The function of BHLHE41 in controlling sleep and circadian clocks is conserved between humans and mice, but untested in most other mammals [10]. In zebrafish, BHLHE41 has a similar structure (five exons separated by four introns) and high sequence similarity to the human homologue [14], but no variation at this residue. In Drosophila melanogaster, the most similar gene to BHLHE41 is CG17100 (Clockwork Orange), but it is only weakly similar (<11% amino acid identity [16]). However, transgenically introducing the short-sleeper allele P385R into Drosophila still resulted in the short-sleeper phenotype [10], suggesting the existence of a similar regulatory network. Another nonsynonymous substitution in BHLHE41 that correlates with altered sleep behavior in humans is Y362H [17]. This mutation reduced the ability of BHLHE41 to suppress CLOCK/BMAL1 and NPAS2/BMAL1 transactivation in vitro [17].
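The description of the short-sleeper mutation (a C to G transversion turning proline into arginine) can be checked against the standard genetic code. The sketch below is only an illustration: a single C to G change that converts proline to arginine has to fall at the second codon position (CCN to CGN), but the text does not say which proline codon is involved, so all four are tried. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated with bases in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

PURINES = {"A", "G"}


def substitution_type(ref_base, alt_base):
    """Return 'transition' or 'transversion' for a single-base change."""
    same_class = (ref_base in PURINES) == (alt_base in PURINES)
    return "transition" if same_class else "transversion"


def classify_codon_change(ref_codon, alt_codon):
    """Return (reference amino acid, new amino acid, synonymous or not)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# The exact codon is not given in the text, so every proline codon (CCN) is tried.
for third in BASES:
    print(classify_codon_change("CC" + third, "CG" + third), substitution_type("C", "G"))
```

Each of the four lines printed reports a proline to arginine, non-synonymous change caused by a transversion, which matches the description above.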

These short-sleeper variants could provide adaptive functions in other mammals. In such a case, we may detect the signature of positive selection on those codons. However, genes such as BHLHE41 are essential for survival and reproduction and are therefore often highly conserved and are more likely to show patterns of purifying selection. Purifying selection can be manifested as higher rates of synonymous substitutions compared to rates of non-synonymous substitutions (dN-dS) [18]. Negative overall dN-dS values indicate purifying selection and are often evidence that a gene is involved in some essential function (like the circadian clock), yet a codon-by-codon dN/dS analysis can detect signs of positive selection (e.g., adaptation at the molecular level) on specific codons. To date, no one has examined patterns of selection in BHLHE41.

In fact, very few nucleotide or amino acid comparisons have been made in mammals beyond human vs. mouse. With the rapid accumulation of mammalian genome sequences, a plethora of homologous sequences likely exist (see [12] for a phylogenetic analysis of all bHLH proteins, which includes only two mammals, human and mouse; see [14] for a comparison of zebrafish and human that calls for further sampling of mammals). Furthermore, the well-resolved mammalian phylogeny [19, 20] provides a robust foundation on which to test for homology and confirm orthology. For most non-model mammalian species with whole-genome sequences, genes are predicted using algorithms that locate open reading frames (e.g., [21]), yet rarely are the predicted genes validated experimentally [22, 23]. Some algorithms compare putative open reading frames with model species to confirm length and expected sequence variation. Accounting for any differences in the length of coding sequences can be a challenge, due to both the existence of alternative mRNA isoforms and an increasing time of divergence [24]. A comparative approach across a diversity of lineages can help elucidate any unusual patterns of sequence variation.

In order to further explore the function of the BHLHE41 gene, we analyzed the evolutionary relationships among BHLHE41 coding sequences in humans and other mammals. There are two clear aims of this study: (1) to utilize pre-existing data in Genbank to determine whether any mammals other than humans have the “short-sleeper” allele or exhibit variation at amino acid sites P385R and Y362H, and (2) to assess the degree of biochemical changes at all amino acid substitutions and search for the footprints of selection (dN-dS). To address these goals, we compared BHLHE41 sequences from 27 species of mammals and a reptilian outgroup that came from sequenced cDNA and full genome sequencing projects. After creating a multiple sequence alignment, we used Bayesian and maximum likelihood analyses to investigate the evolutionary relationships underlying this gene among mammals to confirm orthology. Finally, we used the multiple sequence alignment to test for purifying and positive selection across codons.


Results

Genome occupancy of Myc and Max correlates with Pol II

Using the UCSC Genome Browser [31] and ChIP-Seq datasets generated from HeLa cells [32] with antibodies stringently validated by the ENCODE project [33], occupancies of Myc and Max visually correlate with Pol II better than with the E-box element CACGTG. For example, a broad view of 10 genes across a 200 kb region shows almost identical patterns for Myc and Max and a high level of visual correlation with promoter proximal paused polymerases on each of the genes (Figure 1A). Many genes exhibit divergent transcription as indicated by GRO-Seq [34] that can result in paused Pol II in both orientations. A closer view of one such gene demonstrates that Myc and Max reside in a position between the two peaks of Pol II (Figure 1B). It is important to remember that the position of the immunoprecipitated factor is not indicated by the envelope of mapped DNA fragments, but rather by the peak of that envelope. Visual analysis of highly expressed genes, exemplified by MYC, provides further evidence that Myc and Max occupancy is tied to Pol II, including polymerases within the transcribed regions and downstream of the Poly(A) addition site (Figure 1C). For the three regions shown there is almost no correlation of Myc or Max with the canonical CACGTG E-box (Figure 1). In comparison, distributions of CTCF [35] and a number of other DNA-binding transcription factors (Additional file 1: Figure S1) are distinct from Myc, Max, and Pol II. When entire datasets were analyzed, genomic regions occupied by Myc exhibited a much more significant overlap with Pol II ChIP-Seq peaks than with the E-box element CACGTG (Fisher’s exact test: P value < 10^-300 vs. 4.5 × 10^-7).

Examples of Pol II, Myc, and Max occupancy. Genome browser tracks show occupancy determined by ChIP-Seq for Pol II, Myc, Max, and CTCF over the indicated gene regions in HeLa cells. The positions of the canonical CACGTG E-boxes are indicated. Regions around (A) chromosome 19 containing 10 genes, (B) PSMB2, and (C) MYC are shown. GRO-Seq data are for IMR90 cells from GSE13518 [34].

Several straightforward bioinformatic tools were used to obtain a global view of the correlation of Myc and Max compared to Pol II and CTCF. The average occupancy around the TSS of 20,886 genes in HeLa cells was calculated and plotted. Promoter proximal paused Pol II peaked on average 83 bp downstream of the TSS. Myc and Max on average peaked upstream of the TSS at -20 and -35, respectively (Figure 2A). Myc and Max also exhibited a slope transition at around +300 which has been previously noted for Pol II, the Med1 subunit of Mediator, and other transcription factors [36,37]. High resolution heatmaps were generated to assess the uniformity of these distributions in the 4 kb region centered on the TSSs across the same gene set (Figure 2B). Genes were ranked by the amount of Pol II in all four heatmaps. The patterns for Myc and Max occupancy are essentially identical and they closely match the occupancy pattern for Pol II, but not CTCF. These results indicate that Myc and Max are found about 100 bp upstream of the promoter proximal paused Pol II on most of the genes occupied by Pol II. In addition, Myc and Max were also positioned very closely with Pol II in enhancer regions (Additional file 1: Figure S1C).
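The metagene averaging described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes each gene has already been reduced to a fixed-width window of signal values centred on its TSS and oriented in the direction of transcription. The function names and the toy numbers are hypothetical.

```python
def metagene_profile(windows):
    """Position-wise mean of equally sized per-gene signal windows centred on the TSS."""
    n_genes = len(windows)
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    return [sum(w[i] for w in windows) / n_genes for i in range(width)]


def peak_offset(profile, upstream):
    """Offset in bp, relative to the TSS, of the highest point in the profile."""
    best_index = max(range(len(profile)), key=profile.__getitem__)
    return best_index - upstream


# Toy input (not real data): three genes, windows from -2 to +2 bp around the TSS.
toy_windows = [
    [0, 1, 2, 5, 1],
    [0, 0, 3, 6, 2],
    [1, 1, 1, 4, 0],
]
profile = metagene_profile(toy_windows)
print(profile, peak_offset(profile, upstream=2))
```

With real data the windows would span thousands of base pairs, and genes on the minus strand would need their windows reversed before averaging.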

Correlation of Myc and Max with Pol II occupancy. (A) Metagene analysis showing the average of 20,886 genes. (B) High resolution heatmaps of the same genes rank-ordered by Pol II occupancy. The region shown is from -2 kb to +2 kb around the TSS. (C) Correlation of the occupancy of the indicated proteins. (D) Metagene analyses of Myc, Max, and Pol II ChIP-Seq datasets from eight different cell lines (HeLa, GM12878, K562, H128, H2171, MM1S, P493, and U87). (E) Metagene analysis of Myc, Max, Med1, and Pol II ChIP-Seq datasets from four different cell lines (H2171, MM1S, P493, and U87). Average occupancies of regions from -1,000 to +1,000 bp around the TSS are shown.

These ChIP-Seq datasets were also compared using an algorithm that measures the similarity of peak positions and heights in any two datasets (Figure 2C). A value of 0 means there is no overlap of the signals at any position and 1 indicates the datasets are identical. Myc and Max most closely correlate with each other, as expected. Importantly, the second highest genome-wide correlation for both Myc and Max was Pol II. The correlation of Myc with Pol II would not be expected to be as high as its correlation with Max because of the approximately 100 bp offset of Myc (and Max) from the peaks of promoter proximal paused Pol II. As expected, CTCF was the least well correlated with all datasets because it is bound by its CTC-containing motif mainly in intragenic regions [35]. The correlation analysis was extended to include Fos, Jun, and E2F1 and none of these factors correlated as well with Pol II as Myc and Max (Additional file 1: Figure S2).

We extended our analyses to eight human cell lines with Myc, Max, and Pol II ChIP-Seq datasets. All eight datasets were combined into a multi-genome metagene analysis and the results clearly indicated that on average, as was found in HeLa cells, Myc and Max were about 100 bp upstream of the promoter proximal paused Pol II and Myc is shifted downstream from Max (Figure 2D). Datasets for the Med1 subunit of Mediator were available for four of these cell lines and the multi-genome analysis displayed a similar distribution for Myc and Med1 including a downstream bulge over the promoter proximal paused Pol II (Figure 2E). These analyses strongly suggest that Myc might be recruited to these genomic loci by the transcription machinery, with Mediator as a reasonable candidate.

Under stoichiometric conditions with high concentrations of proteins and DNA, Myc-Max heterodimers display relaxed sequence specificity

Because of the low correlation between Myc-Max genome occupancy and CACGTG sequences, we re-examined the DNA binding properties of the Myc and Max proteins. Full length versions of Myc and two isoforms of Max, Max S and Max L, were expressed in E. coli and purified to homogeneity (Figure 3A). The two Max isoforms were also individually mixed with Myc under denaturing conditions, allowed to refold using a step dialysis protocol, and then purified to obtain native, homogeneous heterodimers of Myc-Max S and Myc-Max L (Figure 3A). Electrophoretic mobility shift assays were carried out using three 26 bp dsDNA oligos that were identical except for the center 6 bps that contained the canonical CACGTG E-box, GTGGTG, or a completely unrelated sequence ATCTAG (Figure 3B). Native gels were silver stained to examine the shift in the position of 200 ng of protein. As expected, both homodimeric Max isoforms bound stoichiometrically to the CACGTG containing probe producing protein/DNA complexes that migrated further than the free proteins. Max S displayed only very weak, transient binding to the other two probes while Max L had reduced, but significant affinity for GTGGTG and low affinity for the ATCTAG probe (Figure 3B). Both Myc-Max complexes, regardless of Max isoform, produced a discrete protein-DNA complex with the CACGTG probe. Surprisingly, both heterodimers bound stoichiometrically to the other two non-E-box probes (Figure 3B). Two individual studies assaying DNA binding with the same full-length proteins yielded identical shifting patterns [5,38]. The differences in the relative levels of staining of free and DNA-bound forms of Max versus Myc-Max were caused by differences in the staining (development time) of the four representative gels shown. When Max L and Myc-Max L were analyzed on the same gel they displayed similar staining levels and comparable increases in staining when bound to DNA (Figure 3C).
It is important to understand that these EMSAs (Figure 3B and C) were carried out under stoichiometric conditions with high concentrations of proteins and DNA. These conditions do not allow the determination of dissociation constants and, especially for Myc-Max, do not display the sequence specific differences in binding that are known to exist. Instead they show that Myc-Max can bind to any DNA sequence at the high, but not unreasonable concentration tested (125 nM). The Myc-Max-DNA complexes showed only a small change in mobility compared to the free proteins. This could be due to a change in conformation of Myc-Max that leads to a lowering of the mobility like that seen for HEXIM1 bound to 7SK RNA [39].

Biochemical analysis of Myc and Max. (A) SDS-PAGE of the indicated recombinant proteins that were expressed in E. coli and purified as described in Methods. (B) EMSA using native polyacrylamide gel electrophoresis with 200 ng of the indicated proteins (250 nM Max dimer and 125 nM Myc-Max) with 0, 0.1, 0.3, 1, or 3-fold molar excess of the indicated dsDNA. The gels were silver stained to show the mobility of the proteins. The arrows indicate protein-DNA complexes. (C) EMSA with simultaneous staining of Max L and Myc-Max L. A total of 2.5 pmole of each protein (125 nM) per lane with two levels of the indicated DNA probes. Complexes containing indicated proteins are indicated with arrows. Note that in the Myc-Max prep some dissociation of Max has occurred leading to a low level of Max and Max-DNA species. (D, E, F, G, and H) EMSAs using 0.01 nM of the indicated radiolabeled probe (blue) with the indicated concentration of proteins and competitor DNAs.

Dissociation constants of the protein-DNA complexes were determined under the required non-stoichiometric conditions using 0.01 nM radiolabeled probe. Max L and Myc-Max L displayed tight binding to CACGTG (Kds of 0.4 nM and 0.1 nM, respectively) (Figure 3D). Max L did not form a discrete complex with the ATCTAG probe with the concentrations of protein tested (Kd >1 μM), but instead gave only a smeary band below the position of a tightly bound complex (arrow) (Figure 3E). This is due to initial binding followed by release of the probe during the running of the gel. Myc-Max L displayed significant affinity for the ATCTAG probe (Kd = 20 nM) (Figure 3F). Competition binding assays under these non-stoichiometric conditions demonstrated that CACGTG containing DNA was able to compete with the binding of Max L and Myc-Max L to the CACGTG probe (Figure 3G and H). At 1,000-fold higher concentration, the ATCTAG containing DNA was also able to compete for binding of both Max and Myc-Max to the CACGTG probe (Figure 3G and H). These results indicate that both Max and Myc-Max prefer to bind to the probe containing CACGTG as expected. In the stoichiometric assay described above, 125 nM Myc-Max but not 250 nM Max dimer formed discrete complexes with ATCTAG DNA. In the non-stoichiometric assay, Myc-Max displayed significantly higher affinity for the ATCTAG probe than Max and this difference was seen at 10 and 100 nM protein (Figure 3F). In the competition assay (1 nM protein) the difference between Myc-Max and Max was not seen. The concentration dependent change in the relative binding of Myc-Max and Max to non-specific DNA we observed could be related to the different on and off rates for the two proteins [40]. From all of the in vitro binding studies shown so far, we conclude that Myc-Max demonstrates a sequence preference, but that it also has significant affinity for DNA lacking a canonical E-box.
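To see what the quoted dissociation constants imply, the sketch below applies a simple 1:1 binding isotherm, fraction bound = [P] / (Kd + [P]). This is an idealization rather than the authors' analysis: it assumes protein is in large excess over the 0.01 nM probe, so free protein is approximately total protein, and it ignores dissociation while the gel is running. The function name is hypothetical; the Kd values are the ones given in the text.

```python
def fraction_bound(protein_nM, kd_nM):
    """Fraction of probe bound for simple 1:1 binding with protein in large excess."""
    return protein_nM / (kd_nM + protein_nM)


KD_NM = {  # dissociation constants quoted in the text, in nM
    ("Max L", "CACGTG"): 0.4,
    ("Myc-Max L", "CACGTG"): 0.1,
    ("Myc-Max L", "ATCTAG"): 20.0,
}

for (protein, probe), kd in KD_NM.items():
    for conc in (1, 10, 100):
        print(f"{protein} on {probe}: {conc} nM -> {fraction_bound(conc, kd):.2f}")
```

Under this model a probe is half occupied when the protein concentration equals the Kd. That is why a 20 nM site is expected to be largely unoccupied at 1 nM protein and substantially occupied at 100 nM.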

Determination of the complete sequence preference for Myc-Max and comparison with occupancy in cells

In our first attempts at trying to compare the in vivo occupancy of Myc and Max to the location of E-boxes, we ran into difficulty because of the existence of a large number of reported non-canonical E-boxes. Without quantification of the relative affinity of Myc-Max for all these sites it was difficult to correlate them with in vivo occupancy. Because of this, protein-binding microarray (PBM) assays using ‘all 10-mer’ universal array designs [41,42] were used to quantify the relative occupancies of the Myc-Max L heterodimer and the Max L homodimer across all possible 8 bp sequences (that is, 8-mers). After normalization, relative Myc-Max occupancy for each of the 32,896 8-mers exhibited a 56-fold range, from 0.018 to 1 (Figure 4A, inset). Although the method is very different from the EMSA assay described above, the PBM results also reflect the relaxed sequence preferences of Myc-Max. Most of the sequences containing CACGTG had high occupancy, but flanking bases had a significant influence (Figure 4A). In addition, we found several E-box variants and other core 6-mers with relatively high Myc-Max occupancy. The top 12 core 6-mers and the effect of the flanking bases are shown in Figure 4A. Like the canonical CACGTG core, Myc-Max occupancy of the other core 6-mers was significantly affected by flanking bases.

Binding of Myc to all possible 8-mers and comparison with genomic occupancy. (A) Fluorescent signal generated by Myc in vitro binding with an array containing all possible 8-mers was normalized. Twelve core 6-mer sequences with the highest in vitro occupancy are shown. The relative affinity of all 8-mers for each 6-mer is shown (10 points if the 6-mer is a palindrome or 16 if it is not). The inset shows the sorted in vitro binding signal for all possible 8-mers. (B) Genome browser view of a region on chromosome 19 comparing Myc, Max, and Pol II occupancy with the distribution of the top 12 6-mers (from A). The height of each 6-mer peak is equal to its relative in vitro occupancy (shown as percent). (C, D) Zoomed in views of two regions shown in (B) that demonstrate the lack of correlation of Myc and Max occupancy with the intrinsic affinity for the underlying DNA determined in vitro.
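The counts quoted for the PBM data (32,896 distinct 8-mers, and 10 or 16 points per core 6-mer) follow from treating a double-stranded sequence and its reverse complement as the same site. The sketch below reproduces them by brute-force enumeration. It assumes the 8-mers plotted for each core are those with one flanking base on each side, which is an inference from the 10/16 counts rather than something the text states. The helper names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Represent a sequence and its reverse complement by one shared key."""
    return min(seq, reverse_complement(seq))


def count_distinct_kmers(k):
    """Number of k-mers when each is merged with its reverse complement."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def count_flanked_8mers(core):
    """Distinct strand-collapsed 8-mers of the form N + core + N."""
    return len({canonical(a + core + b) for a, b in product("ACGT", repeat=2)})


print(count_distinct_kmers(8))        # 32896
print(count_flanked_8mers("CACGTG"))  # 10 (palindromic core)
print(count_flanked_8mers("GTGGTG"))  # 16 (non-palindromic core)
```

The first number also follows from the closed form (4^8 + 4^4) / 2, because the 4^4 palindromic 8-mers are their own reverse complements.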

The problem of not knowing the relative affinity of Myc-Max for the previously proposed non-canonical E-boxes was resolved by the PBM assays so we used that information to examine the role intrinsic DNA affinity plays in the occupancy of the heterodimer in cells. A genome browser track comprising the location and relative in vitro occupancy (percent of the top binding site) of each of the top 12 6-mers was generated that graphically displays the range of intrinsic affinities across the genome (Figure 4B). This is an improvement compared to just marking canonical and non-canonical E-boxes without regard to relative affinities of the different sites. Visual comparison of the occupancy of Myc, Max, and Pol II in HeLa cells to the accurate distribution of intrinsic affinities does not provide evidence for a strong correlation between intrinsic affinity and occupancy in cells (Figure 4B). Closer inspection revealed that strong binding sites were not occupied and Myc and Max were found in regions that did not have any of the top 12 6-mer sites (Figure 4C and D).

Several analyses were performed to compute the correlation between the 8-mer sequence preferences determined by PBM and the actual genomic occupancy of Myc, as measured by ChIP-Seq. The ChIP-Seq Peak algorithm [36] was used to determine the genomic location of each of the top 30,000 Myc peaks in HeLa cells. A 100 bp interval surrounding each peak was scanned to find the 8-mer with the highest possible in vitro occupancy and this score was assigned to each ChIP-Seq peak. These in vitro occupancy scores were normalized to 1, rank-ordered from highest to lowest values, and then plotted for all 30,000 peaks (Figure 5A, blue plot). Seventy-four percent of these Myc peaks were associated with low affinity 8-mers with in vitro occupancies below 0.2. To determine if the distribution of 8-mers around sites of Myc occupancy is different from what occurs by chance, the same analysis was performed on 30,000 100 bp regions randomly chosen from accessible DNA (DNase I sensitive regions [43]) (Figure 5A, black plot). The choice of DNase I sensitive regions as control sequences for this analysis is justified by the fact that 95% of the Myc peaks fall within such regions. Comparison of the two plots indicated that, as expected, genomic loci occupied by Myc contain more sites with high in vitro Myc occupancy compared to random accessible DNA regions (Wilcoxon rank-sum test: P value < 2.2 × 10^-16). This enrichment is further shown by means of a receiver operating characteristic (ROC) curve (Figure 5A, inset). ROCs are commonly used in genomic analyses to assess whether a specific quantitative feature (here, in vitro Myc occupancy) can distinguish between two classes of sequences (here, ChIP-Seq peaks versus random accessible regions). Although the area under the ROC curve is better than expected by chance (0.637 vs. 0.5), the ROC analysis shows that the in vitro 8-mer occupancies cannot be used to accurately predict whether an accessible genomic region will be bound by Myc in cells. Here, the ROC plot shows that at a false positive rate of 0.1, the true positive rate is only 0.25. To make only 10% false positive predictions of Myc in vivo binding using the in vitro 8-mer scores, we would only be able to capture 25% of the true Myc ChIP-Seq peaks. This means that the vast majority of sites occupied by Myc are associated with low scoring 8-mers, as graphically indicated in Figure 5A.
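The two computational steps in this paragraph, scoring a window by its best 8-mer and summarizing the separation of two score sets by the area under the ROC curve, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The score table is a hypothetical dictionary assumed to hold both orientations of every 8-mer, and the area is computed from its pairwise definition (the probability that a peak outscores a control region), which is equivalent to the rank-sum statistic but too slow for 30,000 by 30,000 comparisons.

```python
def best_kmer_score(sequence, scores, k=8):
    """Highest score among all k-mers in a sequence; unknown k-mers score 0.0."""
    return max(
        (scores.get(sequence[i:i + k], 0.0) for i in range(len(sequence) - k + 1)),
        default=0.0,
    )


def roc_auc(positives, negatives):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values (not real data) to show the calls.
toy_scores = {"CCACGTGG": 1.0, "ACACGTGT": 0.6}
print(best_kmer_score("TTCCACGTGGAA", toy_scores))
print(roc_auc([0.9, 0.4, 0.3], [0.5, 0.3, 0.1]))
```

An area of 0.5 means the score does not separate the two classes at all and 1.0 means perfect separation, so the reported 0.637 sits much closer to chance than to a usable predictor.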

Comparison of Myc ChIP-Seq occupancy with in vitro binding affinities. (A) The top 30,000 sites occupied by Myc (blue) were rank-ordered and scored by the in vitro occupancy of the best 8-mer in a 100 bp window (y-axis). This was repeated at 30,000 random locations of DNase I-sensitivity (black) and the results were directly compared by ROC analysis (inset). (B) The top 30,000 sites occupied by Myc were rank-ordered by ChIP-Seq signal and scored logarithmically by either normalized ChIP-Seq signal (blue line) or the in vitro occupancy of the best 8-mer in a 100 bp window (black dots). (C) The data in (B) are presented using a default R boxplot (box: 1st to 3rd quartile, line: median, whiskers: 1.5 × interquartile range beyond the box, outliers are stacked) with ChIP-Seq signal in blue and in vitro 8-mers in grey.

To further assess whether the intrinsic binding specificity of Myc-Max determines its level of genomic occupancy in the cell, the same Myc sites were rank-ordered by their ChIP-Seq occupancy and compared to the signal of the best 8-mer within a 100 bp window around each peak. The Myc ChIP-Seq signal of the top 30,000 peaks varies about 30-fold (Figure 5B, blue line showing decreasing occupancy from left to right). Using the same x-axis, a second plot was generated that displays the relative affinity of the best 8-mer associated with each of these Myc peaks (Figure 5B, black dots). A slight preference for high affinity 8-mers is visible over the top 5,000 Myc peaks, but the overwhelming conclusion is that 8-mers with a wide range of in vitro occupancies are found around Myc peaks irrespective of the level of in vivo occupancy (Figure 5B). While a statistically significant correlation can be observed between Myc ChIP-Seq occupancy and in vitro 8-mer binding strength, this relationship is weak (Spearman correlation coefficient: ρ = 0.22, P value < 2.2 × 10^-16). Had the cellular occupancy correlated well with the affinity for the underlying DNA sequences, there would have been a cloud of black dots clustered around the blue curve in Figure 5B and the Spearman correlation coefficient would have been close to 1. A plot of the same data after ChIP-Seq peaks were grouped into log-scaled bins provides a more detailed view of the high occupancy sites in cells that might be expected to correlate better with intrinsic DNA affinities. However, the huge range of in vitro occupancy scores is clearly found even for the highest occupancy sites (Figure 5C). All these analyses suggest that Myc occupancy is driven only to a small extent by its intrinsic sequence preference, and additional mechanisms are required to recruit Myc to its genomic binding locations in the cell.
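The Spearman coefficient used here is the Pearson correlation of the ranks of the two variables, which is why it would approach 1 if occupancy rose monotonically with in vitro affinity. Below is a minimal pure-Python sketch of that definition, with tied values given their average rank. It is meant to make the statistic concrete and is not the software the authors used; the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (not real data): a perfectly monotonic pair and a partly shuffled one.
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
print(spearman([1, 2, 3], [1, 3, 2]))
```

Because only ranks matter, the 30-fold range in ChIP-Seq signal and the 56-fold range in in vitro occupancy can be compared without rescaling either axis.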

Genomic sites with higher relative levels of Max

Apart from associating with Myc, Max can form Max-Max homodimers or bind with Mad proteins to form Mad-Max heterodimers [44] and these can also bind E-box DNA sites [45]. We reasoned that such sites might have more Max than Myc. To identify these sites the HeLa Myc and Max datasets were normalized and a new track was generated in which the ChIP-Seq signal for Myc was subtracted from the signal for Max. Several thousand peaks with significant levels of extra Max were found. A representative region of chromosome 17, covering about 1 million bps that contains more than a dozen genes occupied by Pol II, Myc, and Max, is shown in Figure 6. The region contains about 20 peaks of Myc and Max and two of these sites have significant levels of extra Max. Both peaks of extra Max are on top of high scoring CACGTG sites (Figure 6B and C). Interestingly, the top 5,000 sites with extra Max (difference values greater than 0.5) were more tightly associated with high scoring 8-mers than were Myc sites (Additional file 1: Figure S3A) and had a more significant overlap with CACGTG than the Myc sites (Fisher’s exact test: P value < 10^-300 for extra Max sites vs. 4.5 × 10^-7 for Myc sites). The top 1,487 peaks of extra Max (difference values greater than 1.0) were selected for further analysis (Additional file 2: Table S1). These sites were always close to peaks of Myc, Max, and Pol II, but only 417 of these peaks were within 250 bp of an annotated TSS. Gene Ontology (GO) analysis was performed on the associated genes, but no significant enrichment in any type of gene was uncovered. To determine if sites of extra Max might affect gene expression, the mRNA levels of those genes were compared to the mRNA levels of the top 12,000 expressed genes as determined by RNA-Seq.
The RNA levels of 351 (of the 417) genes that were identifiable in the RNA-Seq dataset were distributed uniformly across the entire range of top 12,000 expressed genes covering more than three orders of magnitude in RNA levels (Additional file 1: Figure S3B). Thus, the sites with extra Max do not seem to be associated with any particular set of genes and do not correlate with the expression level of the genes they are associated with. Overall, sites with extra Max showed a stronger preference for E-box elements compared to Myc.
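The "Max minus Myc" track can be pictured with a small sketch. It assumes the two signals have already been normalized to a common scale (the normalization method is not given in the text) and are stored as equal-length lists indexed by genomic bin; the function name, the bin layout, and the toy values are hypothetical.

```python
def extra_max_sites(max_signal, myc_signal, threshold=1.0):
    """Return (bin_index, difference) pairs where Max exceeds Myc by more than threshold.

    Both inputs are assumed to be normalized signals over the same genomic bins.
    """
    if len(max_signal) != len(myc_signal):
        raise ValueError("signals must cover the same bins")
    differences = [mx - my for mx, my in zip(max_signal, myc_signal)]
    return [(i, d) for i, d in enumerate(differences) if d > threshold]


# Hypothetical normalized signals over four bins.
max_track = [2.0, 3.5, 1.0, 4.2]
myc_track = [1.8, 2.0, 1.1, 2.9]

for bin_index, difference in extra_max_sites(max_track, myc_track, threshold=1.0):
    print(bin_index, round(difference, 2))
```

The two thresholds mentioned in the text (0.5 and 1.0) correspond to different values of the `threshold` argument in this sketch.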

Examples of sites with more Max than Myc. Genome browser views of normalized Myc, Max, and ‘Max minus Myc’ occupancy and peaks generated by ChIP-Seq Peak. The distribution of the top 12 6-mers with their relative in vitro occupancies is also displayed. (A) A large region from chromosome 17. (B, C) Close-ups of the two regions with extra Max showing alignment with high scoring 6-mers.


New Study Reveals 1 Million Human Genome Sequence Errors Across Two NGS Platforms

April 1, 2011 | “What does it mean to have a ‘healthy’ genome?” That was the question that University of Utah geneticist Mark Yandell and colleagues set out to address in an important recent paper in the journal Genetics in Medicine. Among the key conclusions: there are 1.1 million discrepancies when the identical human genome sample is sequenced using two popular next-generation sequencing (NGS) platforms.

As Yandell and coworkers point out in the paper’s introduction, neither J. Craig Venter’s nor James Watson’s genomes were found to contain any strongly deleterious gene variants likely to cause or strongly predispose them to genetic illness, prompting some commentators to express skepticism regarding the prognostic value of personal genome sequences.

“To date,” the authors write, “the standard reply to the skeptic has been that healthy adults have healthy genomes. Although reasonable, this rebuttal presumes that we know what a healthy genome is. No doubt, a clean bill of genomic health will be the most common clinical scenario in genomic medicine. However just what does a healthy genome look like? What is the impact of sequencing technology on prognostic accuracy? What role will ethnicity play in prognosis? Finally, how useful will existing resources, such as OMIM, be for categorizing personal genome variants as deleterious? The answers to these questions are of immediate importance for the future of genomic medicine.”

Yandell recently spoke to Bio-IT World about his team’s results, including the release of the 10Gen set of personal genome variant data. While his own group works on tools for genome annotation and functional genomics, he is increasingly interested in developing tools for personal genome analysis. Late last year, in collaboration with Martin Reese and colleagues at San Francisco-based software firm Omicia, Karen Eilbeck (University of Utah), Gabor Marth (Boston College), Paul Flicek (EBI) and Lincoln Stein (Ontario Institute for Cancer Research), the group published a paper in Genome Biology describing a standardized file format called GVF (Genome Variation Format) for exchanging and comparing personal genome sequences.

In the new paper, the collaboration presents an analysis of the first ten publicly-available human genome sequences, including the genomes of Watson, Venter, Steve Quake, two Asian and four HapMap individuals, one of which has been sequenced on two platforms. A major goal is to explore ways to interpret personal genome sequences for clinical diagnostic purposes, says Yandell, rather than from a population genetics viewpoint.

Yandell’s team looked at the first ten human genomes sequenced and publicly released using six different platforms (Sanger, Illumina, Life Tech, Complete, Roche/454, Helicos), and found that the platform differences were not sufficient to obscure the ethnic relationships between the genomes. There was, however, a striking result from the side-by-side comparison of two published sequence datasets on the same HapMap sample. The sample, from an anonymous African individual (NA18507), was sequenced independently both by David Bentley’s team at Illumina (published in Nature in 2008) and Kevin McKernan’s group at Life Technologies on the SOLiD platform (published in Genome Research in 2009).

Although the two sequences shared some 77% of the total variants, Yandell and colleagues found that they differ at more than 1.1 million positions. (The Life Technologies and Illumina versions of the NA18507 genome had 575,099 and 526,836 unique positions, respectively.)
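The comparison above boils down to set arithmetic on two lists of variant calls. The sketch below is only illustrative: the variant representation (chromosome, position, alternate allele), the function name, and the toy calls are assumptions, and the shared fraction is computed over the union of calls, which may not be how the authors defined their 77% figure. The final check uses the two unique-position counts quoted in the article.

```python
def compare_calls(calls_a, calls_b):
    """Summarize agreement between two sets of variant calls on the same sample."""
    shared = calls_a & calls_b
    only_a = calls_a - calls_b
    only_b = calls_b - calls_a
    union = calls_a | calls_b
    return {
        "shared": len(shared),
        "only_a": len(only_a),
        "only_b": len(only_b),
        # Assumption: the shared fraction is taken over the union of both call sets.
        "shared_fraction": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical toy calls: (chromosome, position, alternate allele).
platform_a = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G")}
platform_b = {("chr1", 100, "A"), ("chr2", 40, "G"), ("chr3", 7, "C")}
print(compare_calls(platform_a, platform_b))

# The two unique-position counts quoted in the article add up to just over 1.1 million.
total_discordant = 575_099 + 526_836
print(total_discordant)
```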

“Most people are quite shocked,” says Yandell. “But is it glass half full or half empty? From the standpoint of whole genomes consisting of 3 billion bases, there is actually very good congruence. If you’re trying to do population genetics, it’s pretty good to do platform cross comparisons.”

The view is less rosy from a diagnostics point of view, however. “Congruence is better within the coding regions of genes but it’s still a long way from perfect. We find 99% congruence within coding regions, but even then, if you’re trying to do diagnostics, taking into effect platform considerations is something that has to be done.”

Yandell stresses that sequence discrepancies are not simply a matter of which NGS platform is selected. “It’s also the variant calling procedures,” he says. “Depending upon which tool you use, you can see pretty big differences between even the same genome called with different tools—nearly as big as the two Life Tech/Illumina genomes.”

It also depends on the parameters used with the software tools, an issue that is not as broadly recognized in the NGS community as it should be, says Yandell. “There’s still a bit of black art in variant calling. It’s not so much the accuracy of the sequencing platforms, it’s also how you’re post-processing the data and calling the variants. Right now, there’s no right answer, but a lot of smart people are working very hard on this.”

On average, each personal genome contains between 20,000 and 25,000 single nucleotide variants in protein-coding genes compared to the reference genome. In the collaboration with the Omicia group, Yandell also found that focusing on the OMIM (Online Mendelian Inheritance in Man) collection of disease genes provides the same result as whole genome sequences in defining ethnicity with 80% certainty. “The magnitude of that signal struck us as interesting,” says Yandell. “There’s a long-term bias towards disease studies in particular ethnic groups.”

Another result was that the African genomes are typically homozygous for many more OMIM variants than the Caucasian genomes. “That’s probably due to what we might call background effects,” says Yandell. “You’ve got alleles that do you no harm as an African or African-American, but in a Caucasian or Asian background, they are legitimately disease predisposing.”

“That has implications for diagnostic medicine,” Yandell continues. “It can’t be ethnically blind. The right decision will depend upon the ethnicity of the individual. That’s a touchy subject in the field, because people get concerned when you mention ethnicity. There are already [some areas of medicine] that take ethnicity into account. We will likely have to do that in the diagnostics domain as well.”

Following the development of a standardized file format called GVF for personal genome sequences, Yandell needed a trial set of personal genomes to use for software development, both for his own group and the broader community: the 10Gen set. (Those data are available from the Sequence Ontology website.)

Yandell’s next goal is to establish methods to automatically analyze newly resequenced genomes. A priority is to provide what he calls “clinical decision support”—relating individual DNA variants to known disease-causing variants. The goal here—primarily the Omicia side of the collaboration—is to mine a personal genome sequence, identify all known alleles associated with ill health, and then relate that to known variants in an easy manner for rapid reports.

Another focus is developing an ontology to classify disease genes for even broader clinical decision support. “The idea is you’re not just asking if someone has a nasty allele in the cystic fibrosis (CF) or BRCA1 gene, but looking at sets of genes, e.g. all genes in cardiovascular health or cancer. Does this individual have an especially unlucky combination of slightly deleterious alleles spread among several genes all involved in the same disease, which might give them a red light for cardiovascular health, even though there’s no one bad allele for that disease?”

The flip side of this clinical decision support is what to do with ‘private’ variants: novel variants that look potentially problematic. “What does it mean when you sequence someone and they have a stop codon smack in the middle of a growth factor receptor?” says Yandell. “What do you do then? How do you know if you have a problem?”

That aspect of analyzing novel variants—of which every individual has hundreds—has prompted the Yandell lab in collaboration with Omicia to develop software called VAAST (Variant Annotation and Selection Tool). “It’s a tool to automatically identify damaged genes and disease-causing variants, even if they’re completely novel and never been seen before,” says Yandell, who thinks it could have a big impact. (A manuscript describing the software has been submitted, and the software will be made publicly available for academic use—and commercially through Omicia—once that paper is published.)

But Yandell has already demonstrated the potential of the VAAST tool, testing it on the same dataset used in a 2010 study identifying the Mendelian gene mutation for Miller syndrome. An earlier analysis of the genomes of the two affected siblings and their parents using a popular tool called SIFT, which predicts the phenotypic severity of amino acid changes, resulted in hundreds of variants flagged as highly deleterious, which then had to be sorted through by hand. VAAST, by comparison, identifies the disease causing alleles automatically.

The Yandell group set out to come up with a probabilistic tool that not only considers the severity of the DNA variant but also frequency information. “If everyone in the case dataset is homozygous for a stop codon in some particular gene, but 75% of humans are homozygous for that allele, you can say this is unlikely to be deleterious,” says Yandell. “Those probabilistic arguments are what things like SIFT don’t do… We wanted to develop a tool that would deal with all those frequencies in a truly probabilistic fashion, so you could identify disease-causing genes with greater accuracy.”
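The frequency argument in that quote can be made concrete with a generic likelihood-ratio sketch. This is not VAAST's algorithm, which the article does not describe in detail; it is a textbook binomial comparison, and the function names and counts are hypothetical. It asks whether the case and background groups are better explained by one shared genotype frequency or by two separate ones.

```python
import math


def binomial_log_likelihood(carriers, total, frequency):
    """Log-likelihood of observing `carriers` out of `total` at a given frequency."""
    def term(count, p):
        # Treat 0 * log(0) as 0 so frequencies of exactly 0 or 1 are handled.
        return 0.0 if count == 0 else count * math.log(p)
    return term(carriers, frequency) + term(total - carriers, 1.0 - frequency)


def frequency_lrt(case_carriers, case_total, bg_carriers, bg_total):
    """Likelihood-ratio statistic: separate frequencies vs. one pooled frequency."""
    p_case = case_carriers / case_total
    p_bg = bg_carriers / bg_total
    p_pool = (case_carriers + bg_carriers) / (case_total + bg_total)
    separate = (binomial_log_likelihood(case_carriers, case_total, p_case)
                + binomial_log_likelihood(bg_carriers, bg_total, p_bg))
    pooled = (binomial_log_likelihood(case_carriers, case_total, p_pool)
              + binomial_log_likelihood(bg_carriers, bg_total, p_pool))
    return 2.0 * (separate - pooled)


# Hypothetical counts: every case is homozygous for the variant in both scenarios.
common_in_background = frequency_lrt(10, 10, 750, 1000)  # 75% of the background
rare_in_background = frequency_lrt(10, 10, 1, 1000)      # 0.1% of the background
print(common_in_background, rare_in_background)
```

A variant that is common in the background yields a small statistic even when every case carries it, while a variant that is rare in the background yields a large one, which is the intuition behind the quote.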

Importantly, says Yandell, it’s fast. “You can process the genome in just a few minutes, which really cuts down on the cost of analysis,” he says. Yandell says he also has unpublished data in which VAAST identified a mystery X-linked gene mutation in a large Utah family in a matter of 15 minutes.

“I think this is huge,” says Yandell. “I was skeptical at first. I wasn’t like a member of ‘the personal genomes cult,’ if you will. I just started playing with these data, and wow: there really are prognostic and diagnostic answers to be found in them. Now I’m truly a believer.”


The HBBP1 Pseudogene Is Functional

Not only did Moleirinho et al. (2013) determine that the HBBP1 pseudogene was markedly non-variable and likely functional, they verified the hypothesis of inferred functionality with data from the ENCODE project (Dunham et al. 2012). Moleirinho et al. detected significant interactions between a segment of the β-globin cluster comprising both HBD and HBBP1, and different regions upstream of the lead gene HBE, which overlap the locus control region (LCR). As mentioned earlier, the LCR is the main control region approximately 6,000 to 18,000 bases from the β-globin cluster that engages in long range interactions via complex chromatin loops with the globin genes in the cluster (Dean 2011; Xu et al. 2010; Xu et al. 2012). Moleirinho et al. also elaborated that

A variety of previous papers have documented complex transcriptional control combined with long range chromatin interactions in the β-globin locus (Deng et al. 2012; Dostie et al. 2006; Xu et al. 2010).

These observations by Moleirinho et al. are further validated and augmented by yet another recent report by Sheffield et al. (2013) in which the HBBP1 pseudogene was shown to have at least eight network correlations within a wide variety of open and active transcriptional control sites across the β-globin locus. This is more than any other gene in the β-globin cluster. These data were derived from hematopoietic (blood) stem cells. The integrated data was based on the analysis of open and active chromatin determined by DNase1 sensitivity—a highly accurate regulatory indicator of functional chromatin (Thurman et al. 2012). The DNase results were also correlated with large-scale gene expression data.

Additional proof of gene function is the identification and characterization of a transcriptional product(s). In the case of the HBBP1 pseudogene, its multiple gene products are regulatory RNAs. In the UCSC genome browser ENCODE version 14 comprehensive gene annotation set for the HBBP1 pseudogene, annotation tracks are shown for the localization of 14 spliced expressed sequences (chr11: 5263100-5265425) that align within the HBBP1 locus as shown in Fig. 2. Many of these represent processed transcripts and/or the products of alternative splicing/transcription. The Ensembl database lists two main consensus (reference) transcripts (ENST00000433329, ENST00000454892) of 439 and 455 bases in length. One consensus version contains two exons while the other has three, and there are alternatively spliced variants of these as shown in the Vega Genome Browser (ensembl.org) for the manually curated “Havana” annotations for HBBP1. The Ensembl gene variation data set for HBBP1 lists 16 different exon variant transcripts and 42 different intron variant transcripts (useast.ensembl.org/Homo_sapiens/Gene/Variation_Gene/Table?g=ENSG00000229988r=11: 5263184-5264767). This diversity of transcript variation is partially facilitated by six different sets of exon start/end sites within the HBBP1 gene as described at the GeneLoc database at the Weizmann Institute (genecards.weizmann.ac.il/geneloc). These expressed sequence regions in the HBBP1 gene overlap and correspond with annotation tracks for transcriptionally active chromatin and transcription factor binding discussed in more detail below (see fig. 2).

Fig. 2. UCSC genome browser data showing selected gene annotation and ENCODE-related tracks for the HBBP1 locus. Analysis image accessed at genome.ucsc.edu on May 7, 2013.

A breakdown of the transcriptional profiles for a wide variety of human pseudogenes is also located at the pseudoMap database (Chan et al. 2013a; pseudomap.mbc.nctu.edu.tw). The current entry for the HBBP1 pseudogene lists the two ENSEMBL consensus IDs and indicates that the target gene regulated by the two HBBP1 transcripts is HBE1, the assumed parent (via hypothetical gene duplication) and the first gene in the β-globin cluster. However, the recent data published by Sheffield et al. (2013) clearly show that the regulatory activity of the HBBP1 pseudogene associates across a wide variety of functional chromatin sites in the β-globin cluster.


References:

(1) Chimpanzee Sequencing and Analysis Consortium. Initial Sequence of the Chimpanzee Genome and Comparison with the Human Genome. Nature 2005, 437 (7055), 69–87. doi: 10.1038/nature04072.

(2) International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature 2001, 409 (6822), 860–921. doi: 10.1038/35057062.

(3) Pollard, K. S. Salama, S. R. King, B. Kern, A. D. Dreszer, T. Katzman, S. Siepel, A. Pedersen, J. S. Bejerano, G. Baertsch, R. Rosenbloom, K. R. Kent, J. Haussler, D. Forces Shaping the Fastest Evolving Regions in the Human Genome. PLoS Genet 2006, 2 (10), e168. doi: 10.1371/journal.pgen.0020168.

(4) Kostka, D. Hubisz, M. J. Siepel, A. Pollard, K. S. The Role of GC-Biased Gene Conversion in Shaping the Fastest Evolving Regions of the Human Genome. Molecular Biology and Evolution 2012, 29 (3), 1047–1057. doi: 10.1093/molbev/msr279.

(5) Levchenko, A. Kanapin, A. Samsonova, A. Gainetdinov, R. R. Human Accelerated Regions and Other Human-Specific Sequence Variations in the Context of Evolution and Their Relevance for Brain Development. Genome Biology and Evolution 2018, 10 (1), 166–188. doi: 10.1093/gbe/evx240.

(6) Green, R. E. Krause, J. Briggs, A. W. Maricic, T. Stenzel, U. Kircher, M. Patterson, N. Li, H. Zhai, W. Fritz, M. H. Y. Hansen, N. F. Durand, E. Y. Malaspinas, A. S. Jensen, J. D. Marques-Bonet, T. Alkan, C. Prufer, K. Meyer, M. Burbano, H. A. Good, J. M. Schultz, R. Aximu-Petri, A. Butthof, A. Hober, B. Hoffner, B. Siegemund, M. Weihmann, A. Nusbaum, C. Lander, E. S. Russ, C. Novod, N. Affourtit, J. Egholm, M. Verna, C. Rudan, P. Brajkovic, D. Kucan, Z. Gusic, I. Doronichev, V. B. Golovanova, L. V. Lalueza-Fox, C. de la Rasilla, M. Fortea, J. Rosas, A. Schmitz, R. W. Johnson, P. L. F. Eichler, E. E. Falush, D. Birney, E. Mullikin, J. C. Slatkin, M. Nielsen, R. Kelso, J. Lachmann, M. Reich, D. Paabo, S. A Draft Sequence of the Neandertal Genome. Science 2010, 328 (5979), 710–722. doi: 10.1126/science.1188021.

(7) Hubisz, M. J. Pollard, K. S. Exploring the Genesis and Functions of Human Accelerated Regions Sheds Light on Their Role in Human Evolution. Current Opinion in Genetics & Development 2014, 29, 15–21. doi: 10.1016/j.gde.2014.07.005.

(8) Krause, J. Pääbo, S. Genetic Time Travel. Genetics 2016, 203 (1), 9–12. doi: 10.1534/genetics.116.187856.

(9) Xu, K. Schadt, E. E. Pollard, K. S. Roussos, P. Dudley, J. T. Genomic and Network Patterns of Schizophrenia Genetic Variation in Human Evolutionary Accelerated Regions. Molecular Biology and Evolution 2015, 32 (5), 1148–1160. doi: 10.1093/molbev/msv031.

(10) Doan, R. N. Bae, B.-I. Cubelos, B. Chang, C. Hossain, A. A. Al-Saad, S. Mukaddes, N. M. Oner, O. Al-Saffar, M. Balkhy, S. Gascon, G. G. Homozygosity Mapping Consortium for Autism Nieto, M. Walsh, C. A. Mutations in Human Accelerated Regions Disrupt Cognition and Social Behavior. Cell 2016, 167 (2), 341-354.e12. doi: 10.1016/j.cell.2016.08.071.

(11) Gallego Romero, I. Pavlovic, B. J. Hernando-Herraez, I. Zhou, X. Ward, M. C. Banovich, N. E. Kagan, C. L. Burnett, J. E. Huang, C. H. Mitrano, A. Chavarria, C. I. Friedrich Ben-Nun, I. Li, Y. Sabatini, K. Leonardo, T. R. Parast, M. Marques-Bonet, T. Laurent, L. C. Loring, J. F. Gilad, Y. A Panel of Induced Pluripotent Stem Cells from Chimpanzees: A Resource for Comparative Functional Genomics. eLife 2015, 4, e07103. doi: 10.7554/eLife.07103.


The Dangers of Hyperadaptationism

The overreliance on adaptationist “just-so stories” in the field of evolutionary biology has been openly criticized since the 1970s. Famously, Gould and Lewontin (1979) compared such thinking to the ideology espoused by Pangloss, the fictional professor from Voltaire’s novel Candide who used just-so stories to prove that we lived in the best of all possible worlds. Unfortunately hyperadaptionalism, or the belief that the vast majority of traits found in an organism (including its DNA) are present due to some selective force, has plagued much of molecular biology as well (Sarkar, 2014). The proclamation that a biochemical activity is equivalent to function (ENCODE Project Consortium et al., 2012) is just another example of this ideology. Using this logic we would state that any transcribed DNA is functional, but would this mean that the transcript (or transcriptional process) is functional by virtue of its mere existence? To resolve this paradox, we would either have to state that (1) although the DNA is functional, its output, the RNA (or the act of transcription) is not or (2) that all RNAs are de facto functional. Obviously both of these nonsensical conclusions have their roots in hyperadaptionalist thinking and an abuse of the concept of biological function. To resolve this, we need to install a more rigorous definition of function. However, this can only be accomplished if we properly define the null hypothesis.


New limits to functional portion of human genome reported

An evolutionary biologist at the University of Houston has published new calculations that indicate no more than 25 percent of the human genome is functional. That is in stark contrast to suggestions by scientists with the ENCODE project that as much as 80 percent of the genome is functional.

In work published online in Genome Biology and Evolution, Dan Graur reports the functional portion of the human genome probably falls between 10 percent and 15 percent, with an upper limit of 25 percent. The rest is so-called junk DNA, or useless but harmless DNA.

Graur, John and Rebecca Moores Professor of Biology and Biochemistry at UH, took a deceptively simple approach to determining how much of the genome is functional, using the deleterious mutation rate - that is, the rate at which harmful mutations occur - and the replacement fertility rate.

Both genome size and the rate of deleterious mutations in functional parts of the genome have previously been determined, and historical data documents human population levels. With that information, Graur developed a model to calculate the decrease in reproductive success induced by harmful mutations, known as the "mutational load," in relation to the portion of the genome that is functional.

The functional portion of the genome is described as that which has a selected-effect function, that is, a function that arose through and is maintained by natural selection. Protein-coding genes, RNA-specifying genes and DNA receptors are examples of selected-effect functions. In his model, only functional portions of the genome can be damaged by deleterious mutations; mutations in nonfunctional portions are neutral since functionless parts can be neither damaged nor improved.

Because of deleterious mutations, each couple in each generation must produce slightly more children than two to maintain a constant population size. Over the past 200,000 years, replacement-level fertility rates have ranged from 2.1 to 3.0 children per couple, he said, noting that global population remained remarkably stable until the beginning of the 19th century, when decreased mortality in newborns resulted in fertility rates exceeding replacement levels.

If 80 percent of the genome were functional, unrealistically high birth rates would be required to sustain the population even if the deleterious mutation rate were at the low end of estimates, Graur found.

"For 80 percent of the human genome to be functional, each couple in the world would have to beget on average 15 children and all but two would have to die or fail to reproduce," he wrote. "If we use the upper bound for the deleterious mutation rate (2 × 10^-8 mutations per nucleotide per generation), then ... the number of children that each couple would have to have to maintain a constant population size would exceed the number of stars in the visible universe by ten orders of magnitude."
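The shape of this argument can be sketched with the classic mutational-load approximation, in which mean fitness at mutation-selection balance is roughly exp(-U), where U is the number of new deleterious mutations per diploid genome per generation. This is a textbook simplification and not necessarily the exact model used in the paper, so it should not be expected to reproduce the figures quoted above. The function name, the haploid genome size of 3.2 billion bases, and the lower example rate are assumptions made for illustration.

```python
import math


def required_fertility(functional_fraction, deleterious_rate_per_site,
                       haploid_genome_size=3.2e9):
    """Children per couple needed for a constant population under exp(-U) mean fitness.

    Assumptions (not from the paper): diploid genome, independent deleterious
    mutations, and mean fitness approximated by exp(-U).
    """
    # U: expected new deleterious mutations per diploid genome per generation.
    u = 2.0 * haploid_genome_size * functional_fraction * deleterious_rate_per_site
    mean_fitness = math.exp(-u)
    # Two surviving, reproducing offspring per couple are needed for replacement.
    return 2.0 / mean_fitness


# The second rate is the upper bound quoted in the text; the first is a hypothetical lower value.
for rate in (4e-10, 2e-8):
    for fraction in (0.10, 0.25, 0.80):
        print(rate, fraction, required_fertility(fraction, rate))
```

Because the functional fraction sits in the exponent, the required fertility grows exponentially with it, which is why large functional fractions quickly demand implausible numbers of children.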

In 2012, the Encyclopedia of DNA Elements (ENCODE) announced that 80 percent of the genome had a biochemical function. Graur said this new study not only puts these claims to rest but hopefully will help to refocus the science of human genomics.

"We need to know the functional fraction of the human genome in order to focus biomedical research on the parts that can be used to prevent and cure disease," he said. "There is no need to sequence everything under the sun. We need only to sequence the sections we know are functional."


Abstract

The ENCyclopedia Of DNA Elements (ENCODE) project is an international research consortium that aims to identify all functional elements in the human genome sequence. The second phase of the project comprised 1640 datasets from 147 different cell types, yielding a set of 30 publications across several journals. These data revealed that 80.4% of the human genome displays some functionality in at least one cell type. Many of these regulatory elements are physically associated with one another and further form a network or three-dimensional conformation to affect gene expression. These elements are also related to sequence variants associated with diseases or traits. All these findings provide us new insights into the organization and regulation of genes and genome, and serve as an expansive resource for understanding human health and disease.


