Fold Coverage of sequence read?

What does N-fold coverage of a sequence read mean? A link with a brief explanation would be much appreciated.

Fold coverage is usually defined with respect to a genomic locus, not a read. In a sequencing experiment, the fold coverage of a genomic locus (a coordinate along a reference assembly) is the number of aligned reads that overlap that position.

N-fold coverage usually refers to the average number of reads spanning a particular region of interest. For example, if you are interested in a gene and want to know how well your data support it, you can look at its coverage.

There is no single agreed definition of how N-fold coverage is calculated. The most common method is to take the arithmetic mean of the per-base depth.
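As a minimal sketch of the per-base average described above, the following Python counts how many aligned reads overlap each position of a region and then takes the mean. The read intervals are made-up, half-open `[start, end)` coordinates, and the function names are illustrative rather than taken from any particular tool.

```python
def per_base_depth(reads, region_start, region_end):
    """Count the reads overlapping each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:  # each read is a half-open interval [start, end)
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def mean_fold_coverage(reads, region_start, region_end):
    """Arithmetic mean of the per-base depth over the region."""
    depth = per_base_depth(reads, region_start, region_end)
    return sum(depth) / len(depth)


# Hypothetical alignments over a 10 bp region.
reads = [(0, 5), (2, 7), (4, 9), (6, 10)]
depth = per_base_depth(reads, 0, 10)
coverage = mean_fold_coverage(reads, 0, 10)
print(depth, coverage)
```

Here the region would be described as having roughly 1.9-fold coverage, even though individual positions are covered between one and three times.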

Deep Sequencing

Deep sequencing refers to sequencing a genomic region multiple times, sometimes hundreds or even thousands of times. This next-generation sequencing (NGS) approach allows researchers to detect rare clonal types, cells, or microbes comprising as little as 1% of the original sample.

Uses of Deep Sequencing

Deep sequencing is useful for studies in cancer, microbiology, and other research involving analysis of rare cell populations. For example, deep sequencing is required to identify mutations within tumors, because normal cell contamination is common in cancer samples, and the tumors themselves likely contain multiple sub-clones of cancer cells.

Factors Affecting Sequencing Depth

The need for deep sequencing depends on a number of factors. For example, in cancer research, the required sequencing depth increases for low purity tumors, highly polyclonal tumors, and applications that require high sensitivity (identifying low frequency clones). Cancer sequencing depth typically ranges from 80× up to thousands-fold coverage.

Factors Impacting Cancer Sequencing Depth

Purity of the tumor

Tumors usually consist of a mixture of normal and tumor tissue. A tumor that contains 50% normal tissue would require double the sequencing depth to detect the tumor mutations with the same confidence as a 100% pure tumor sample.
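The purity argument can be written as a one-line scaling rule. This is a rough sketch that assumes the number of tumor-derived reads is simply proportional to tumor purity; the function name is hypothetical.

```python
def depth_for_purity(depth_for_pure_sample, purity):
    """Depth needed so the tumor-derived read count matches that of a 100% pure sample."""
    if not 0 < purity <= 1:
        raise ValueError("purity must be in the interval (0, 1]")
    return depth_for_pure_sample / purity


# A sample with 50% normal tissue needs twice the depth of a pure tumor sample.
needed = depth_for_purity(100, 0.5)
print(needed)
```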

Heterogeneity of the tumor

Advanced tumors are frequently polyclonal. The more clonal types that are present, the deeper the sequencing needs to be to represent each clonal type properly.

Sensitivity required*

Clones representing 1% of the original tumor have the potential to become the predominant clone during drug-resistant relapse. A 1% clone will only be represented once in 100× coverage, assuming the tumor contains no normal tissue.
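To make the sensitivity point concrete, the sketch below treats each read covering a position as an independent draw that carries the variant with probability equal to the clone fraction. That binomial model is an assumption for illustration: it ignores sequencing error, zygosity, and normal-tissue contamination, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(min_reads, depth, fraction):
    """P(X >= min_reads) for X ~ Binomial(depth, fraction)."""
    below = sum(
        comb(depth, i) * fraction**i * (1 - fraction) ** (depth - i)
        for i in range(min_reads)
    )
    return 1 - below


# At 100x, a 1% clone is expected to contribute one read on average,
# so there is a substantial chance of seeing no supporting read at all.
p_seen_once_100x = prob_at_least(1, 100, 0.01)
# Requiring several supporting reads is far more achievable at 1000x.
p_three_reads_100x = prob_at_least(3, 100, 0.01)
p_three_reads_1000x = prob_at_least(3, 1000, 0.01)
print(p_seen_once_100x, p_three_reads_100x, p_three_reads_1000x)
```

Under these assumptions, a variant caller that requires a handful of supporting reads would rarely detect a 1% clone at 100×, which is one way to read the case for much deeper sequencing.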

Deep Sequencing for Bacterial Drug Resistance Studies

A targeted deep sequencing assay identifies multidrug-resistant tuberculosis strains responsible for silent outbreaks.

Related Solutions

Amplicon Sequencing

Ultra-deep sequencing of PCR amplicons enables analysis of specific genomic regions of interest.

Targeted Gene Sequencing

Targeted gene sequencing panels are useful tools for analyzing specific mutations in a given sample. Both predesigned and custom panels are available.

*i.e. the probability of detecting a mutation at a given allele frequency or abundance level of a tumor clone

During and after translation, individual amino acids may be chemically modified, signal sequences may be appended, and the new protein “folds” into a distinct three-dimensional structure as a result of intramolecular interactions. A signal sequence is a short tail of amino acids that directs a protein to a specific cellular compartment. These sequences at the amino end or the carboxyl end of the protein can be thought of as the protein’s “train ticket” to its ultimate destination. Other cellular factors recognize each signal sequence and help transport the protein from the cytoplasm to its correct compartment. For instance, a specific sequence at the amino terminus will direct a protein to the mitochondria or chloroplasts (in plants). Once the protein reaches its cellular destination, the signal sequence is usually clipped off.

Many proteins fold spontaneously, but some proteins require helper molecules, called chaperones, to prevent them from aggregating during the complicated process of folding. Even if a protein is properly specified by its corresponding mRNA, it could take on a completely dysfunctional shape if abnormal temperature or pH conditions prevent it from folding correctly.


Copy number changes are a useful diagnostic indicator for many diseases, including cancer. The gold standard for genome-wide copy number is array comparative genomic hybridization (array CGH) [1, 2]. More recently, methods have been developed to obtain copy number information from whole-genome sequencing data ([3] reviewed by [4]). For clinical use, sequencing of genome partitions, such as the exome or a set of disease-relevant genes, is often preferred to enrich for regions of interest and sequence them at higher coverage to increase the sensitivity for calling variants [5]. Tools have been developed for copy number analysis of these datasets, as well, including CNVer [6], ExomeCNV [7], exomeCopy [8], CONTRA [9], CoNIFER [10], ExomeDepth [11], VarScan 2 [12], XHMM [13], ngCGH [14], EXCAVATOR [15], CANOES [16], PatternCNV [17], CODEX [18], and recent versions of Control-FREEC [19] and cn.MOPS [20]. However, these approaches do not use the sequencing reads from intergenic and, usually, intronic regions, limiting their potential to infer copy number across the genome.

During the target enrichment, targeted regions are captured by hybridization; however, a significant quantity of off-target DNA remains in the library, and this DNA is sequenced and represents a considerable portion of the reads. Thus, off-target reads provide a very low-coverage sequencing of the whole genome, in addition to the high-coverage sequencing obtained in targeted regions. While the off-target reads alone do not provide enough coverage to call single-nucleotide variants (SNVs) and other small variants, they can provide useful information on copy number at a larger scale, as recently demonstrated by cnvOffSeq [21] and CopywriteR [22].

We developed a computational method for analysis of copy number variants and alterations in targeted DNA sequencing data that we packaged into a software toolkit. This toolkit, called CNVkit, implements a pipeline for CNV detection that takes advantage of both on- and off-target sequencing reads and applies a series of corrections to improve accuracy in copy number calling. We compare binned read depths in on- and off-target regions and find that they provide comparable estimates of copy number, albeit at different resolutions. We evaluate several bias correction algorithms to reduce the variance among binned read counts unlikely to be driven by true copy number changes. Finally, we compare copy ratio estimates by the CNVkit method and two competing CNV callers to those of array CGH, and find that CNVkit most closely agrees with array CGH. In summary, we demonstrate that both on- and off-target reads can be combined to provide highly accurate and reliable copy ratio estimates genome-wide, maximizing the copy number information obtained from targeted sequencing.
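The general idea of turning binned read depths into copy ratios can be sketched as follows. This is not CNVkit's implementation: it is a simplified illustration that assumes strictly positive depths, median-normalizes the sample and a reference profile, and reports a log2 ratio per bin. The function name and all depth values are hypothetical.

```python
import math
from statistics import median


def log2_copy_ratios(sample_depths, reference_depths):
    """Per-bin log2 ratio of median-normalized sample depth to reference depth."""
    sample_median = median(sample_depths)
    reference_median = median(reference_depths)
    ratios = []
    for s, r in zip(sample_depths, reference_depths):
        ratios.append(math.log2((s / sample_median) / (r / reference_median)))
    return ratios


# Hypothetical mean depths for five bins; the fourth bin has doubled depth.
sample = [100, 100, 100, 200, 100]
reference = [100, 100, 100, 100, 100]
ratios = log2_copy_ratios(sample, reference)
print(ratios)
```

Because on-target bins are sequenced far more deeply than off-target bins, a sketch like this would normalize the two bin types separately before combining them; the bias corrections the authors describe are outside its scope.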

Protein Domains & Fold Classification

Detailed analysis of a protein's fold can be used to reveal its function and evolutionary history, which sometimes may be difficult to detect only using information from the amino acid sequence. Study of the relationships between the amino acid sequence and the fold may also provide deeper insights into the fundamental principles of protein structure and may aid in the design of new proteins with predefined structure and activity. For fold assignment we first need to assign the secondary structure, which is usually done by many computer programs. All PDB entries also contain a detailed description of the secondary structure of the protein, including the sequence number and name of the first and last amino acid residues of each helix and β-strand.

Domains in proteins
As shown on the image below, while some proteins only contain a single domain, others may have several domains. Some domains have a clearly defined function associated with them, like the Rossmann fold domain (also called coenzyme-binding domain, see Proteopedia for history and details of the Rossmann fold), discussed earlier. Such domains often “carry” their function with them when they get inserted into different proteins during evolution. Other domains, like the 4-helix bundle, are there probably just for their stability.

Below are examples of a one-domain protein (hemoglobin, on the left) and a 3-domain protein (pyruvate kinase, on the right).

The domains in pyruvate kinase are well separated from each other and have different folds. The top domain on the figure above is a β-sheet domain, while the other two are of the alpha/beta type (see the respective Proteopedia page for details).

In most organisms the functional unit of each of these two proteins is a tetramer (it contains 4 subunits). For hemoglobin, each functional unit therefore contains 4 molecules (and 4 domains), while the functional unit of pyruvate kinase contains 12 domains. The quaternary structure of the proteins is shown below (hemoglobin on the left, pyruvate kinase on the right):

Defining a domain
A domain may be characterized by the following:
1- Spatially separated unit of the protein structure
2- Often has sequence and/or structural resemblance to other protein structures or domains.
3- Often has a specific function associated with it.

Fold classification databases give detailed information on the domain content of each protein and the fold associated with the domains. The procedure followed by CATH (C-class, A-Architecture, T-Topology, H-Homologous superfamily) and SCOP (Structural Classification of Proteins), includes:

  • Assignment of secondary structure
  • Assignment of domains
  • Assignment of a Class to each domain (based on secondary structure content - alpha, beta or alpha/beta types of proteins)
  • Assignment of Architecture (same as Fold, amino acid sequences not necessarily homologous - common evolutionary origin not required)
  • Assignment of Topology (same Fold + common evolutionary origin - homology)
  • Assignment of Homologous superfamily (Superfamily defines a group of proteins that appear to be homologous, even in the absence of significant sequence similarity)

1t5aA01 corresponds to chain A (since there are 4 chains in the PDB entry) and domain number 01 (of 3). If we click on the first domain, we get information about its classification (image below) - Class: Alpha Beta, Architecture: 2-Layer Sandwich, Topology: Pyruvate Kinase C-terminal domain. This information is highly valuable in homology modeling, especially in cases when we need to model different domains using different modeling templates, the so-called multi-template homology modeling (discussed in more detail in the homology modelling tutorials).
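The structure of a domain identifier such as 1t5aA01 can be made explicit with a small parser. This sketch assumes the seven-character layout described above (four-character PDB code, one chain character, two-digit domain number); the class and function names are hypothetical and are not part of any CATH software.

```python
import re
from typing import NamedTuple


class CathDomainId(NamedTuple):
    pdb_id: str
    chain: str
    domain: int


def parse_cath_domain_id(domain_id):
    """Split an identifier like '1t5aA01' into PDB code, chain, and domain number."""
    match = re.fullmatch(r"([0-9][A-Za-z0-9]{3})([A-Za-z0-9])([0-9]{2})", domain_id)
    if match is None:
        raise ValueError(f"not a recognised domain identifier: {domain_id!r}")
    pdb_id, chain, number = match.groups()
    return CathDomainId(pdb_id, chain, int(number))


parsed = parse_cath_domain_id("1t5aA01")
print(parsed)
```

Splitting identifiers this way is convenient when different domains of the same chain need to be paired with different modelling templates.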

We will dive deeper into the CATH database later when we discuss the second homology modelling project (coming soon).
In the next section we will look at the PDB and PDBsum protein databases, both essential for protein structure analysis, for example when planning a homology modelling project.

Single-Molecule Enzymology: Nanomechanical Manipulation and Hybrid Methods

A.H. Laszlo, ... J.H. Gundlach, in Methods in Enzymology, 2017

4 Nanopore Measurements Turned Into SPRNT

With the basic concepts of nanopore sequencing explained, we are now able to describe SPRNT. From nanopore sequencing, it is already clear that a nanopore is a very useful single-molecule tool to study the properties of the enzyme that controls the DNA. Gyarfas et al. demonstrated enzyme functionality at single-nucleotide step resolution (Gyarfas et al., 2009), but their sensitivity was in part limited by using αHL instead of the much more sensitive engineered MspA porin. However, SPRNT is distinct since it goes well beyond the precision of ~6 Å single-nucleotide steps.

4.1 The Ion Current Is a Smooth Function of Position

In nanopore sequencing, the DNA stops temporarily in one-nucleotide intervals. During the pauses of the DNA motion, the current is measured. From one stop to the next, the ion current can change considerably. If the current change for a full nucleotide displacement of ~6 Å is ΔI, would the current change for the DNA being displaced by only 2 Å be ΔI′ = ΔI × (2 Å / 6 Å) = ΔI/3? In other words, can one interpolate between ion current levels to find the DNA's position with subnucleotide resolution? Or more generally, is the level plot the result of sampling a smoothly varying ion current curve as a function of position so that a nonlinear interpolation can be used? The hypothesis of the ion current varying smoothly with position seemed likely to be correct since up to four nucleotides were involved in determining each current level, effectively employing a low-pass filter that smoothes the underlying current function.

The first test of the hypothesis was to measure a piece of DNA that has a level plot with large current changes at different positions within the pore. The level plot was measured at 180 mV and then at 140 mV. The 180 mV potential pulls the DNA with a greater force toward the trans chamber than at 140 mV. Since the DNA acts like a spring, the measurements at 180 mV place the DNA further toward trans than at 140 mV. Fig. 4C shows the level plots for these two voltages fitted with smoothly varying spline interpolations. Both splines look similar in shape with the 180 mV current curve being scaled up because of the higher voltage, but most importantly, the positions of the two curves are shifted by a fraction of a nucleotide relative to each other. In Fig. 4D the 140 mV data points were scaled upward and shifted horizontally by δ = 0.29 nucleotide positions. This demonstration shows that the level plots are sampled from a common smooth curve and that the ion currents through nanopores can resolve positional changes of the DNA that are much smaller than one nucleotide.

Fig. 4. Transduction of current to distance. (A) Regions of high current contrast can be used to measure DNA position precisely. This shows how small uncertainties in measured current translate to small positional uncertainty. (B) Schematic depiction of DNA position within the pore at two different voltages; differences in the applied electric force result in different DNA extensions. (C) Current levels observed for phi29 DNAP-controlled motion of DNA through MspA at 180 and 140 mV of applied potential. A cubic spline interpolant has been applied to each set of current steps. Note that, apart from a scaling factor, the shape of the spline is identical, but the peak of the spline has shifted a distance δ. (D) After a linear scale and offset are applied to the two splines, a horizontal displacement δ = 0.29 nt brings the two splines in line with one another. This experiment has two important results: (1) The current levels observed during single-nucleotide stepping of DNA through MspA lie along an underlying smooth curve that is well approximated by a spline. (2) This spline provides a direct mapping from current to DNA position, and we can use it to measure subnucleotide movement of DNA.

Modified from Derrington, I. M., Craig, J. M., Stava, E., Laszlo, A. H., Ross, B. C., Brinkerhoff, H., … Gundlach, J. H. (2015). Subangstrom single-molecule measurements of motor proteins using a nanopore. Nature Biotechnology, 33(10), 1073–1075.
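The scale-and-shift comparison described for Fig. 4C and 4D can be imitated on synthetic data. The sketch below is not the authors' analysis code: it invents a smooth current curve, samples it at integer nucleotide positions for one "voltage" and at positions offset by 0.29 nt (with a smaller scale) for the other, and then searches for the shift that best aligns the two level plots after a linear scale and offset. A Catmull-Rom cubic interpolant stands in for the cubic spline, and every function name, curve, and parameter is an assumption made for illustration.

```python
import math


def catmull_rom(levels, x):
    """Cubic interpolation of levels sampled at x = 0, 1, 2, ... (for 1 <= x <= len - 2)."""
    i = min(int(math.floor(x)), len(levels) - 3)
    t = x - i
    p0, p1, p2, p3 = levels[i - 1], levels[i], levels[i + 1], levels[i + 2]
    return 0.5 * (2 * p1 + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (3 * p1 - p0 - 3 * p2 + p3) * t ** 3)


def fit_residual(u, v):
    """Mean squared residual after the best linear scale and offset mapping u onto v."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suu = sum((a - mean_u) ** 2 for a in u)
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    scale = suv / suu
    offset = mean_v - scale * mean_u
    return sum((scale * a + offset - b) ** 2 for a, b in zip(u, v)) / n


def estimate_shift(levels_ref, levels_other, max_shift=0.5, step=0.01):
    """Grid-search the horizontal shift (in nt) that best aligns the two level plots."""
    ks = range(2, len(levels_ref) - 2)  # stay where the 4-point stencil fits
    u = [levels_other[k] for k in ks]
    best_delta, best_residual = 0.0, float("inf")
    for s in range(int(round(2 * max_shift / step)) + 1):
        delta = -max_shift + s * step
        v = [catmull_rom(levels_ref, k - delta) for k in ks]
        residual = fit_residual(u, v)
        if residual < best_residual:
            best_delta, best_residual = delta, residual
    return best_delta


def underlying(x):
    """Invented smooth current-versus-position curve, in pA."""
    return 60 + 12 * math.sin(0.5 * x) + 5 * math.cos(0.3 * x + 1.0)


true_shift = 0.29
levels_180 = [underlying(k) for k in range(24)]
levels_140 = [0.78 * underlying(k - true_shift) for k in range(24)]
delta = estimate_shift(levels_180, levels_140)
print(delta)
```

On this noise-free toy data the recovered shift should land close to the 0.29 nt that was built in; the sign convention for the shift is arbitrary here.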

The variance of the ion current levels is typically less than one pA. In some regions with large ion current changes (Fig. 4A) this variance translates to a DNA position uncertainty of ~0.06 nucleotides. With an internucleotide spacing of ~6 Å this means that DNA position changes as small as ~0.4 Å (40 pm) are resolvable with each event. Such extreme position precision seems impossible, given that the DNA, the enzyme position and the pore length are subject to Brownian motion with significantly larger amplitudes. However, these fluctuations happen at much shorter timescales so that the time average (mean position) over typical level durations remains precisely determined.
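The numbers in this paragraph follow from simple error propagation: a current uncertainty divided by the local slope of the current curve gives a position uncertainty. In the sketch below the 1 pA noise level and the 16.7 pA per nucleotide slope are illustrative assumptions chosen to reproduce the quoted 0.06 nt figure; they are not values reported in the text.

```python
NT_SPACING_ANGSTROM = 6.0  # internucleotide spacing quoted in the text


def position_uncertainty_nt(current_noise_pa, slope_pa_per_nt):
    """Position uncertainty (nt) from current noise and local slope of the current curve."""
    return current_noise_pa / abs(slope_pa_per_nt)


# Hypothetical values: 1 pA of noise on a steep part of the curve.
sigma_nt = position_uncertainty_nt(1.0, 16.7)

# Converting the quoted 0.06 nt into a distance.
sigma_angstrom = 0.06 * NT_SPACING_ANGSTROM  # 0.36 Å, i.e. roughly 0.4 Å (about 40 pm)
print(sigma_nt, sigma_angstrom)
```

The same relation explains why resolution degrades where the current curve is flat: as the slope approaches zero, the position uncertainty grows without bound.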

SPRNT's temporal resolution is also remarkable (see discussion in Section 6), with large level transitions being resolvable after ~50 μs. As in all precision measurements, the time resolution and position resolution are anticorrelated. Both the time resolution and position resolution are optimal if the magnitude of current changes is large. In places along the DNA where the underlying smooth current curve is flat, e.g., at minima and maxima, SPRNT's resolution is poorer. Using the quadromer map, discussed earlier, specific DNA constructs can be designed that optimize SPRNT's sensitivity.


The ultimate goal of any sequencing project is to determine every single base-pair of the original set of chromosomes. As we described above, rarely is an assembly program able to reconstruct a single piece of DNA per chromosome, leading to gaps in the reconstruction of the genome. These gaps are filled in through directed sequencing experiments in a process called finishing or gap closure. At this stage in the sequencing project, additional laboratory experiments and extensive manual curation are performed to validate the correctness of the final assembly, leading to a high-quality reconstruction of the original genome.

In our study published in Nature, we demonstrate how artificial intelligence research can drive and accelerate new scientific discoveries. We’ve built a dedicated, interdisciplinary team in hopes of using AI to push basic research forward: bringing together experts from the fields of structural biology, physics, and machine learning to apply cutting-edge techniques to predict the 3D structure of a protein based solely on its genetic sequence.

Our system, AlphaFold – described in peer-reviewed papers now published in Nature and PROTEINS – is the culmination of several years of work, and builds on decades of prior research using large genomic datasets to predict protein structure. The 3D models of proteins that AlphaFold generates are far more accurate than any that have come before - marking significant progress on one of the core challenges in biology. The AlphaFold code used at CASP13 is available on GitHub for anyone interested in learning more or replicating our results. We’re also excited by the fact that this work has already inspired other, independent implementations, including a separately published model and a community-built, open-source implementation.

What is the protein folding problem?

Proteins are large, complex molecules essential to all of life. Nearly every function that our body performs - contracting muscles, sensing light, or turning food into energy - relies on proteins, and how they move and change. What any given protein can do depends on its unique 3D structure. For example, antibody proteins utilised by our immune systems are ‘Y-shaped’, and form unique hooks. By latching on to viruses and bacteria, these antibody proteins are able to detect and tag disease-causing microorganisms for elimination. Collagen proteins are shaped like cords, which transmit tension between cartilage, ligaments, bones, and skin. Other types of proteins include Cas9, which, using CRISPR sequences as a guide, acts like scissors to cut and paste sections of DNA; antifreeze proteins, whose 3D structure allows them to bind to ice crystals and prevent organisms from freezing; and ribosomes, which act like a programmed assembly line, helping to build proteins themselves.

The recipes for those proteins - called genes - are encoded in our DNA. An error in the genetic recipe may result in a malformed protein, which could result in disease or death for an organism. Many diseases, therefore, are fundamentally linked to proteins. But just because you know the genetic recipe for a protein doesn’t mean you automatically know its shape. Proteins are composed of chains of amino acids (also referred to as amino acid residues). But DNA only contains information about the sequence of amino acids - not how they fold into shape. The bigger the protein, the more difficult it is to model, because there are more interactions between amino acids to take into account. As demonstrated by Levinthal’s paradox, it would take longer than the age of the known universe to randomly enumerate all possible configurations of a typical protein before reaching the true 3D structure - yet proteins themselves fold spontaneously, within milliseconds. Predicting how these chains will fold into the intricate 3D structure of a protein is what’s known as the “protein folding problem” - a challenge that scientists have worked on for decades. This unsolved problem has already inspired countless developments, from spurring IBM’s efforts in supercomputing (BlueGene), to novel citizen science efforts (Folding@home and FoldIt), to new engineering realms, such as rational protein design.

Why is protein folding important?

I think that we shall be able to get a more thorough understanding of the nature of disease in general by investigating the molecules that make up the human body, including the abnormal molecules, and that this understanding will permit the problem of disease to be attacked in a more straightforward manner such that new methods of therapy will be developed.

Scientists have long been interested in determining the structures of proteins because a protein’s form is thought to dictate its function. Once a protein’s shape is understood, its role within the cell can be guessed at, and scientists can develop drugs that work with the protein’s unique shape.

Over the past five decades, researchers have been able to determine shapes of proteins in labs using experimental techniques like cryo-electron microscopy, nuclear magnetic resonance and X-ray crystallography, but each method depends on a lot of trial and error, which can take years of work, and cost tens or hundreds of thousands of dollars per protein structure. This is why biologists are turning to AI methods as an alternative to this long and laborious process for difficult proteins. The ability to predict a protein’s shape computationally from its genetic code alone – rather than determining it through costly experimentation – could help accelerate research.

How can AI make a difference?

Fortunately, the field of genomics is quite rich in data thanks to the rapid reduction in the cost of genetic sequencing. As a result, deep learning approaches to the prediction problem that rely on genomic data have become increasingly popular in the last few years. To catalyse research and measure progress on the newest methods for improving the accuracy of predictions, a biennial global competition called CASP (Critical Assessment of protein Structure Prediction) was established in 1994, and has become the gold standard for assessing predictive techniques. We’re indebted to decades of prior work by the CASP organisers, as well as to the thousands of experimentalists whose structures enable this kind of assessment.

DeepMind’s work on this problem resulted in AlphaFold, which we submitted to CASP13. We’re proud to be part of what the CASP organisers have called “unprecedented progress in the ability of computational methods to predict protein structure,” placing first in rankings among the teams that entered (our entry is A7D).

Our team focused specifically on the problem of modelling target shapes from scratch, without using previously solved proteins as templates. We achieved a high degree of accuracy when predicting the physical properties of a protein structure, and then used two distinct methods to construct predictions of full protein structures.

Using neural networks to predict physical properties

Both of these methods relied on deep neural networks that are trained to predict properties of the protein from its genetic sequence. The properties our networks predict are: (a) the distances between pairs of amino acids and (b) the angles between chemical bonds that connect those amino acids. The first development is an advance on commonly used techniques that estimate whether pairs of amino acids are near each other.

We trained a neural network to predict a distribution of distances between every pair of residues in a protein (visualised in Figure 2). These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. We also trained a separate neural network that uses all distances in aggregate to estimate how close the proposed structure is to the right answer.
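The idea of combining predicted distance distributions into a score for a candidate structure can be illustrated with a toy example. This is a simplified sketch and not AlphaFold's scoring function: it assumes each residue pair has a predicted probability for a few distance bins, and it scores a structure by summing the log-probabilities of the bins its actual distances fall into. The bin edges, probabilities, coordinates, and function names are all invented.

```python
import math


def bin_index(distance, edges):
    """Index of the bin containing distance; the last bin is open-ended."""
    for k, edge in enumerate(edges):
        if distance < edge:
            return k
    return len(edges)


def structure_score(coords, predicted, edges):
    """Sum of log-probabilities that the predictions assign to the structure's distances."""
    total = 0.0
    for (i, j), probs in predicted.items():
        distance = math.dist(coords[i], coords[j])
        total += math.log(probs[bin_index(distance, edges)])
    return total


# Invented bins (in Å): [0, 4), [4, 8), [8, 12), [12, inf).
edges = [4.0, 8.0, 12.0]
# Invented predicted distributions for three residue pairs.
predicted = {
    (0, 1): [0.05, 0.80, 0.10, 0.05],
    (0, 2): [0.05, 0.10, 0.80, 0.05],
    (1, 2): [0.05, 0.80, 0.10, 0.05],
}

extended = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
score_extended = structure_score(extended, predicted, edges)
score_compact = structure_score(compact, predicted, edges)
print(score_extended, score_compact)
```

The candidate whose distances agree with the predicted distributions receives the higher score, which is the property a structure search can exploit.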

Figure 2: Two ways of visualising the accuracy of AlphaFold’s predictions. The top figure features the distance matrices for three proteins. The brightness of each pixel represents the distance between the amino acids in the sequence comprising the protein–the brighter the pixel, the closer the pair. Shown in the top row are the real, experimentally determined distances and, in the bottom row, the average of AlphaFold’s predicted distance distributions. Importantly, these match well on both global and local scales. The bottom panels represent the same comparison using 3D models, featuring AlphaFold’s predictions (blue) versus ground-truth data (green) for the same three proteins.

Using these scoring functions, we were able to search the protein landscape to find structures that matched our predictions. Our first method built on techniques commonly used in structural biology, and repeatedly replaced pieces of a protein structure with new protein fragments. We trained a generative neural network to invent new fragments, which were used to continually improve the score of the proposed protein structure.

The second method optimised scores through gradient descent - a mathematical technique commonly used in machine learning for making small, incremental improvements - which resulted in highly accurate structures. This technique was applied to entire protein chains rather than to pieces that must be folded separately before being assembled into a larger structure, to simplify the prediction process.
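Gradient descent on a distance-based score can be shown on a toy problem. The sketch below is not the method used in AlphaFold: it places four points in 2D and repeatedly nudges them along the negative gradient of a squared-error loss between their pairwise distances and a set of target distances. The target (a unit square), the starting coordinates, the learning rate, and the function names are all assumptions for illustration.

```python
import math


def loss_and_grad(coords, target):
    """Squared error between pairwise distances and targets, with its gradient."""
    n = len(coords)
    loss = 0.0
    grad = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            d = math.hypot(dx, dy)
            err = d - target[i][j]
            loss += err ** 2
            if d > 0:
                g = 2 * err / d
                grad[i][0] += g * dx
                grad[i][1] += g * dy
                grad[j][0] -= g * dx
                grad[j][1] -= g * dy
    return loss, grad


def descend(coords, target, learning_rate=0.02, steps=500):
    """Take small steps against the gradient and return the updated coordinates."""
    coords = [list(p) for p in coords]
    for _ in range(steps):
        _, grad = loss_and_grad(coords, target)
        for p, g in zip(coords, grad):
            p[0] -= learning_rate * g[0]
            p[1] -= learning_rate * g[1]
    return coords


# Target distances taken from a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
target = [[math.dist(a, b) for b in square] for a in square]

start = [(0.0, 0.0), (1.5, 0.2), (0.8, 1.4), (-0.3, 0.7)]
initial_loss, _ = loss_and_grad(start, target)
final_loss, _ = loss_and_grad(descend(start, target), target)
print(initial_loss, final_loss)
```

Each step is a small, incremental improvement of the whole configuration at once, which mirrors the point that the technique was applied to entire chains rather than to separately folded pieces.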

The AlphaFold version used at CASP13 is available on Github for anyone interested in learning more, or replicating our protein folding results.

What happens next?

While we’re thrilled by the success of our protein folding model, there’s still much to be done in the realm of protein biology, and we’re excited to continue our efforts in this field. We’re committed to establishing ways that AI can contribute to basic scientific discovery, with the hope of making real-world impact. This approach might serve to ultimately improve our understanding of the body and how it works, enabling scientists to target and design new, effective cures for diseases more efficiently. Scientists have only mapped structures for about half of all the proteins made by human cells. Some rare diseases involve mutations in a single gene, resulting in a malformed protein which can have profound effects on the health of an entire organism. A tool like AlphaFold might help rare disease researchers predict the shape of a protein of interest rapidly and economically. As scientists acquire more knowledge about the shapes of proteins and how they operate through simulations and models, this method may eventually help us contribute to efficient drug discovery, while also reducing the costs associated with experimentation. Our hope is that AI will be useful for disease research, and ultimately improve the quality of life for millions of patients around the world.

But potential benefits aren’t restricted to health alone - understanding protein folding will assist in protein design, which could unlock a tremendous number of benefits. For example, advances in biodegradable enzymes - which can be enabled by protein design - could help manage pollutants like plastic and oil, helping us break down waste in ways that are more friendly to our environment. In fact, researchers have already begun engineering bacteria to secrete proteins that will make waste biodegradable, and easier to process.

The success of our first foray into protein folding is indicative of how machine learning systems can integrate diverse sources of information to help scientists come up with creative solutions to complex problems at speed. Just as we’ve seen how AI can help people master complex games through systems like AlphaGo and AlphaZero, we similarly hope that one day, AI breakthroughs will help serve as a platform to advance our understanding of fundamental scientific problems, too.

It’s exciting to see these early signs of progress in protein folding, demonstrating the utility of AI for scientific discovery. Even though there’s a lot more work to do before we’re able to have a quantifiable impact on treating diseases, managing waste, and more, we know the potential is enormous. With a dedicated team focused on delving into how machine learning can advance the world of science, we’re looking forward to seeing the many ways our technology can make a difference.

This work was done in collaboration with Andrew Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Sandy Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David Jones, David Silver, Koray Kavukcuoglu and Demis Hassabis

Read length

During sequencing, it is possible to specify the number of base pairs that are read at a time. For example, one read might consist of 50 base pairs, 100 base pairs, or more. Longer reads can provide more reliable information about the relative locations of specific base pairs. (This helps to address a common challenge that arises in sequencing because the same read sequences can appear in multiple places within a genome.) However, it is usually more expensive to generate longer reads.
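Read length also feeds directly into the usual back-of-the-envelope estimate of average coverage, which is the number of reads times the read length divided by the genome size. The sketch below applies that relation; the genome size and target coverage are illustrative round numbers, and the function names are hypothetical.

```python
def expected_coverage(num_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads reaching the target average coverage."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


# Illustrative example: 30x average coverage of a 3 Gb genome with 100 bp reads.
n_reads = reads_needed(30, 100, 3_000_000_000)
coverage = expected_coverage(n_reads, 100, 3_000_000_000)
print(n_reads, coverage)
```

Doubling the read length halves the number of reads needed for the same average coverage, although this average says nothing about how evenly the reads are distributed.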

Genomic Resources

Genome Assemblies

Date Released    Release Name   Coverage   Comments
December 2015    Btau_5.0.1     25x        Improved draft assembly combined 19x PacBio data with UMD 3.1 using PBJelly.
July 2012        Btau_4.6.1     7.1x       Draft assembly replaced data in Btau_4.5 with high quality finished data where available.
October 2009     Btau_4.5       7.1x       Draft assembly incorporating additional Whole Genome Shotgun (WGS) contigs from Btau_2.0 into Btau_4.0.
April 2009       Btau_4.2       7.1x       Draft assembly Btau_4.0 data replaced with high quality finished sequence where available.
October 2007     Btau_4.0       7.1x       Draft assembly using WGS reads from small insert clones and BAC sequences. Mapped to chromosomes using refined mapping information.
August 2006      Btau_3.1       7.1x       Draft assembly using WGS reads from small insert clones and BAC sequences.
March 2005       Btau_2.0       6.2x       Preliminary assembly using WGS reads from small insert clones and BAC end sequences.
September 2004   Btau_1.0       3x         Preliminary assembly using WGS reads from small insert clones.

Project History

Sequencing of the bovine (Hereford) genome consumed a large part of 2004-06 BCM-HGSC resources. The project was staged to produce an initial 3x WGS assembly followed by a second 6x WGS assembly to allow gene predictions for preliminary annotation, and a final assembly including BAC sequences for improved local assembly refinement. The 3x assembly was used by Ensembl to test their pipeline on low coverage genome assemblies. The 6x WGS assembly was released and gene predictions from Ensembl and NCBI have been made public. The final assembly has been completed using the Atlas assembler and BAC data from equal numbers of BACs sequenced individually or by the CAPSS clone pooling strategy.

WGS samples from six other breeds were sequenced to identify SNPs for genetic studies. A panel of 10,000 SNPs was mined at the HGSC from this interbreed dataset and genotyping reagents were developed. 227 animals representing 9 breeds were genotyped and analyzed. An expanded set of markers (32,000) was applied to a more extensive group of animals (449). The results form the basis of a Bovine HapMap Project. Affymetrix now markets a bovine genotyping chip based on this work, allowing broader translation of the genome project into applications.

A meeting of sixty bovine researchers in Houston in March 2005 began coordination for overall genome analysis and the future of bovine research. A working group for annotation selected themes for analysis (Global analyses; Muscle, Immune function; Lactation; Energy partitioning, Metabolism, Rumen function; Reproduction, Endocrinology, Sex determination, Development; Imprinted genes, HDACs, Methyl transferases; Bovine models of human diseases; Non-coding RNA; Genetics/Genotyping; Behavior, Maternal nurturing; Prion protein). The Bovine Genome Project leveraged NHGRI funds with funds from other sources. This allowed the utility of the sequence to be enhanced by genetic analysis, outside the scope of the standard basic genome project deliverables.
