Mechanism of Retrotransposition

Although the mechanism of retrotransposition is not completely understood, it is clear that at least two enzymatic activities are utilized. One is an integrase, which is an endonuclease that cleaves at the site of integration to generate a staggered break (Figure 9.17). The other is RNA-dependent DNA polymerase, also called reverse transcriptase. These activities are encoded in some autonomous retrotransposons, including both LTR-retrotransposons such as retroviral proviruses and non-LTR-retrotransposons such as LINE1 elements.

Figure 9.17. Transposition via an RNA intermediate in retrotransposons. LINE1, or L1, repeats are shown as an example.

The RNA transcript of the transposable element interacts with the site of cleavage at the DNA target site. One strand of DNA at the cleaved integration site serves as the primer for reverse transcriptase. This DNA polymerase then copies the RNA into DNA. That cDNA copy of the retrotransposon must be converted to a double stranded product and inserted at a staggered break at the target site. The enzymes required for joining the reverse transcript (first strand of the new copy) to the other end of the staggered break and for second strand synthesis have not yet been established. Perhaps some cellular DNA repair functions are used.

The model shown in Figure 9.17 is consistent with any RNA serving as the template for synthesis of the cDNA from the staggered break. However, LINE1 mRNA is clearly used much more often than other RNAs. The basis for the preference of the retrotransposition machinery for LINE1 mRNA is still being studied. Perhaps the endonuclease and reverse transcriptase stay associated with the mRNA that encodes them after translation has been completed, so that they act in cis with respect to the LINE1 mRNA. Other repeats that have expanded recently, such as Alu repeats in humans, may share sequence determinants with LINE1 mRNA for this cis preference.

Clear evidence that retrotransposons can move via an RNA intermediate came from studies of the yeast Ty-1 elements by Gerald Fink and his colleagues. They placed a particular Ty-1 element, called TyH3, under control of a GAL promoter, so that its transcription (and transposition) could be induced by adding galactose to the media. They also marked TyH3 with an intron. After inducing transcription of TyH3, additional copies were found at new locations in the yeast strain. When these were examined structurally, it was discovered that the intron had been removed. If the RNA transcript is the intermediate in moving the Ty-1 element, it is subject to splicing and the intron can be removed. Hence, these results fit the prediction of an RNA-mediated transposition. They demonstrate that during transposition, the flow of Ty-1 sequence information is from DNA to RNA to DNA.


If yeast Ty-1 moved by the mechanism illustrated for DNA-mediated replicative transposition in Figure 9.13, what would be predicted in the experiment just outlined? Also, would you expect an increase in transposition when transcription is induced?


June 27, 1970 was a significant day for our understanding of both the flow of information in biological systems and the evolution of eukaryotic genomes as this was the day that Nature published back-to-back papers reporting the discovery of an enzyme that copies RNA into DNA. This soon became known as reverse transcriptase and the RNA tumour viruses in which it was detected were renamed retroviruses. The realisation that retroviruses can convert their genomic RNA into DNA provided a route by which they could integrate into the chromosomes of infected cells as Howard Temin and his colleagues had proposed some years earlier. At the time it was thought that the ability to copy RNA into DNA would be confined to retroviruses. One of the more startling outcomes of whole genome DNA sequencing has been the discovery that eukaryotes can have more reverse transcriptase genes than genes coding for any other protein, and that the largest single component of many eukaryotic genomes has been generated by reverse transcription.


Distribution and Evolution

Retrotransposons are found in all eukaryotes but not in prokaryotes. There is a direct correlation between the size of a eukaryotic genome and the abundance but not necessarily the type of retrotransposons. For example, 3% of the small yeast genome is composed of retrotransposons, which are all of the LTR class. The much larger human genome is over 30% retrotransposons, predominantly of the non-LTR class. Finally, 75% of the even larger maize genome is composed of retrotransposons, predominantly of the LTR class.

Retrotransposons usually establish long-term associations with the host genome. This differs from the transposons which are believed to be active for only a short time in any genome and are dependent on horizontal transfers between species for their long-term survival. The predominant vertical (through the germline) inheritance of retrotransposons is most pronounced in the non-LTR elements. L1 elements have been slowly accumulating throughout the 100-million-year history of mammalian genomes. R2 elements have been stable components of arthropod genomes for over 500 million years. These stable relationships of retrotransposons with the host genome are believed to have given rise to specialized insertion strategies. All retrotransposons in yeast insert either into heterochromatin or immediately upstream of tRNA genes, where they do not interfere with the expression of host genes. Similarly, a variety of retrotransposons in arthropods insert at specific locations in the rRNA genes or telomeric sequences of their host.

The long-term relationship between retrotransposons and the host genome raises the question of what controls their copy number, and whether they have positive as well as negative effects on the genome. Mobile elements have been suggested to supply sequence variation which could enable hosts to evolve rapidly. On the other hand, the excessive numbers of these elements in many species suggest a wanton disregard for the well-being of the host genome. A number of eukaryotes have evolved elaborate mechanisms in attempts to eliminate or silence these elements. Much remains to be understood of this ‘molecular arms race.’


The first description of an approximately 6.4 kb long LINE-derived sequence was published by J. Adams et al. in 1980. [10]

Based on structural features and the phylogeny of its key enzyme, the reverse transcriptase (RT), LINEs are grouped into five main groups, called L1, RTE, R2, I and Jockey, which can be subdivided into at least 28 clades. [11] (fig. 1)

In plant genomes, so far only LINEs of the L1 and RTE clade have been reported. [12] [13] [14] Whereas L1 elements diversify into several subclades, RTE-type LINEs are highly conserved, often constituting a single family. [15] [16]

In fungi, Tad, L1, CRE, Deceiver and Inkcap-like elements have been identified, [17] with Tad-like elements appearing exclusively in fungal genomes. [18]

All LINEs encode at least one protein, ORF2, which contains an RT and an endonuclease (EN) domain, either an N-terminal APE or a C-terminal RLE or, rarely, both. A ribonuclease H domain is occasionally present. Except for the evolutionarily ancient R2 and RTE superfamilies, LINEs usually encode another protein named ORF1, which may contain a Gag-knuckle, an L1-like RRM (InterPro: IPR035300), and/or an esterase. LINE elements are relatively rare compared to LTR-retrotransposons in plants, fungi or insects, but are dominant in vertebrates and especially in mammals, where they represent around 20% of the genome. [11] (fig. 1)

L1 element

The LINE-1/L1-element is one of the elements that are still active in the human genome today. It is found in all mammals [19] except megabats. [20]

Other elements

Remnants of L2 and L3 elements are found in the human genome. [8] It is estimated that L2 and L3 elements were active 200-300 million years ago. Unlike L1 elements, L2 elements lack flanking target site duplications. [21] The L2 (and L3) elements are in the same group as the CR1 clade, Jockey. [22]

In human

In the first human genome draft the fraction of LINE elements of the human genome was given as 21% and their copy number as 850,000. Of these, L1, L2 and L3 elements made up 516,000, 315,000 and 37,000 copies, respectively. The non-autonomous SINE elements, which depend on L1 elements for their proliferation, make up 13% of the human genome and have a copy number of around 1.5 million. [8] They probably originated from the RTE family of LINEs. [23] Recent estimates show the typical human genome contains on average 100 L1 elements with potential for mobilization; however, there is considerable variation, and some individuals may carry a larger number of active L1 elements, making them more prone to L1-induced mutagenesis. [24]

Increased L1 copy numbers have also been found in the brains of people with schizophrenia, indicating that LINE elements may play a role in some neuronal diseases. [25]

LINE elements propagate by a so-called target primed reverse transcription mechanism (TPRT), which was first described for the R2 element from the silkworm Bombyx mori.

ORF2 (and ORF1 when present) proteins primarily associate in cis with their encoding mRNA, forming a ribonucleoprotein (RNP) complex, likely composed of two ORF2s and an unknown number of ORF1 trimers. [26] The complex is transported back into the nucleus, where the ORF2 endonuclease domain opens the DNA (at TTAAAA hexanucleotide motifs in mammals [27]). Thus, a 3'OH group is freed for the reverse transcriptase to prime reverse transcription of the LINE RNA transcript. Following reverse transcription, the target strand is cleaved and the newly created cDNA is integrated. [28]

New insertions create short target site duplications (TSDs), and the majority of new inserts are severely 5'-truncated (average insert size of 900 bp in humans) and often inverted (Szak et al., 2002). Because they lack their 5' UTR, most new inserts are nonfunctional.

Host cells have been shown to regulate L1 retrotransposition activity, for example through epigenetic silencing. In addition, small interfering RNAs derived from L1 sequences can suppress L1 retrotransposition through the RNA interference (RNAi) mechanism. [29]

In plant genomes, epigenetic modification of LINEs can lead to expression changes of nearby genes and even to phenotypic changes: In the oil palm genome, methylation of a Karma-type LINE underlies the somaclonal, 'mantled' variant of this plant, responsible for drastic yield loss. [30]

Restriction of LINE-1 elements mediated by human APOBEC3C (A3C) has been reported; it is attributed to the interaction of A3C with ORF1p, which affects reverse transcriptase activity. [31]

A historic example of L1-conferred disease is Haemophilia A, which is caused by insertional mutagenesis. [32] There are nearly 100 examples of known diseases caused by retroelement insertions, including some types of cancer and neurological disorders. [33] Correlation between L1 mobilization and oncogenesis has been reported for epithelial cell cancer (carcinoma). [34] Hypomethylation of LINEs is associated with chromosomal instability and altered gene expression [35] and is found in various cancer cell types in various tissue types. [36] [35] Hypomethylation of a specific L1 located in the MET oncogene is associated with bladder cancer tumorigenesis. [37] Shift work sleep disorder [38] is associated with increased cancer risk because light exposure at night reduces melatonin, a hormone that has been shown to reduce L1-induced genome instability. [39]

Research output : Contribution to journal › Article › peer-review

Retrotransposition mechanisms

Abstract: Recent developments in the area of the transposition mechanisms used by retrotransposons and related retroviral pathways are discussed. In particular, advances in the areas of retrotransposon gene expression, virus-like particle assembly, reverse transcription, and integration are reviewed.

Interfering with retrotransposition by two types of CRISPR effectors: Cas12a and Cas13a

CRISPRs are a promising tool being explored in combating exogenous retroviral pathogens and in disabling endogenous retroviruses for organ transplantation. The Cas12a and Cas13a systems offer novel mechanisms of CRISPR action that have not been evaluated for retrovirus interference. In particular, a recent study revealed that activated Cas13a provides bacterial hosts with a "passive protection" mechanism against DNA phage infection by inducing growth arrest in infected cells, which is especially significant because it allows Cas13a, an RNA-targeting CRISPR effector, to mount a defense against both RNA and DNA invaders. Here, by refitting the long terminal repeat retrotransposon Tf1 as a model system, which shares common features with retroviruses in replication mechanism and life cycle, we repurposed CRISPR-Cas12a and -Cas13a to interfere with Tf1 retrotransposition and evaluated their different mechanisms of action. Cas12a strongly inhibited retrotransposition, allowing only marginal Tf1 transposition that was likely the result of a lasting pool of Tf1 RNA/cDNA intermediates protected within virus-like particles. This residual activity, however, was completely eliminated with new constructs for persistent crRNA targeting. On the other hand, targeting Cas13a to Tf1 RNA intermediates significantly inhibited Tf1 retrotransposition. However, unlike in bacterial hosts, the sustained activation of Cas13a by Tf1 transcripts did not cause cell growth arrest in S. pombe, indicating that virus-activated Cas13a likely acts differently in eukaryotic cells. The study provides insight into the actions of novel CRISPR mechanisms in combating retroviral pathogens and establishes system parameters for developing new strategies for treating retrovirus-related diseases.

Keywords: Cell biology; Molecular biology.

Conflict of interest statement

The authors declare that they have no conflict of interest.


Fig. 1. Implementation of Cas12a editing system in S. pombe and editing of MEL1 gene.

Fig. 2. Design and construction of Tf1-splicing reporter system for retrotransposition.

Fig. 3. Interference of Tf1 retrotransposition by CRISPR-Cas12a.

Fig. 4. Prolonged crRNA targeting eliminated residual Tf1 retrotransposition by Cas12a.

Fig. 5. Interfering with Tf1 retrotransposition by CRISPR-Cas13a via targeting its RNA intermediates.

Biology of Mammalian L1 Retrotransposons

Abstract: L1 retrotransposons comprise 17% of the human genome. Although most L1s are inactive, some elements remain capable of retrotransposition. L1 elements have a long evolutionary history dating to the beginnings of eukaryotic existence. Although many aspects of their retrotransposition mechanism remain poorly understood, they likely integrate into genomic DNA by a process called target primed reverse transcription. L1s have shaped mammalian genomes through a number of mechanisms. First, they have greatly expanded the genome both by their own retrotransposition and by providing the machinery necessary for the retrotransposition of other mobile elements, such as Alus. Second, they have shuffled non-L1 sequence throughout the genome by a process termed transduction. Third, they have affected gene expression by a number of mechanisms. For instance, they occasionally insert into genes and cause disease both in humans and in mice. L1 elements have proven useful as phylogenetic markers and may find other practical applications in gene discovery following insertional mutagenesis in mice and in the delivery of therapeutic genes.

Additional open reading frames in LTR retrotransposons

Although retrotransposon gag and pol genes are believed to be necessary and sufficient for transposition, a number of retrotransposon families with aberrant genomic organizations have now been identified (Figure 3). One frequent structural change is the addition of coding information.

Retrotransposons with 'env-like' genes

One of the main differences between retrotransposons (with a wholly intracellular life-cycle) and their infectious retrovirus cousins is the presence of an envelope (env) gene in the latter, which allows a virus particle to infect another cell. A number of retroelements have an extra ORF in the same position as the env gene found in retrovirus genomes (Figure 3). The best characterized examples of env-containing retroelements are the Drosophila errantiviruses, including gypsy and ZAM [9, 10]. The life-cycle of these elements has been examined in detail, and gypsy has been shown to be infectious [11, 12].

The presence of an env gene within a retroelement is not limited to the errantiviruses; genomic studies have revealed that env-like ORFs are widespread among retrotransposons in both the Pseudoviridae (sireviruses) and Metaviridae (errantiviruses, metaviruses and semotiviruses) [13, 14]. Elements containing an env-like ORF in each of these lineages also originate from diverse host species. The retroelement most recently shown to have an env-like ORF, Boudicca, is a metavirus from a human blood fluke [15]. Other examples of metaviruses include the Athila elements, which represent a large proportion of the retroelements in Arabidopsis [16]. In a related element in barley, Bagy-2, the env-like transcript is spliced, similarly to the env transcripts of retroviruses [17]. Members of the sirevirus group make up half of the approximately 400 Pseudoviridae sequences present in GenBank, and of these, about one third have an env-like ORF (X.G. and D.V., unpublished observation). Semotiviruses (also called BEL retrotransposons) with env-like ORFs have also been described in nematode genomes as well as in pufferfish and Drosophila [18, 19].

Do Env-like proteins enable these diverse retroelements to become infectious? In a few cases, the env-like genes have been shown to be significantly similar in sequence to genes of different viruses, suggesting that they were acquired by retrotransposons through transduction of a viral gene [13]. Except for some errantiviruses, where the Env-like protein has been implicated in infection, the function of the Env-like proteins remains unclear. The amino-acid sequences of these proteins are highly divergent, making it difficult to assess whether or not they have a common function. That said, many Env-like proteins have predicted transmembrane domains (like retroviral Env proteins), although this is not a universal feature. It is possible that retroviral activity has evolved several times in the history of retrotransposons, or that these genes may confer novel function(s), such as movement between tissues of an organism (as suggested for the gypsy elements) or movement within cells (such as between the cytoplasm and the nucleus). Alternatively, the Env-like proteins could serve as chaperone proteins to facilitate replication. Functional studies are required to discern the biological roles of these interesting genes.

Other additional ORFs

Other novel coding regions have also been identified within various retrotransposons, but it is unclear how broadly these coding sequences are conserved. For example, RIRE2 of rice - a metavirus - has a small ORF of unknown function upstream of its gag gene [20]. Some plant retrotransposons carry ORF(s) that are antisense to the genomic RNA transcript (Figure 3), including the metaviruses RIRE2 of rice and Grande1 of maize [21, 22]. The functions of the antisense ORFs are also unknown. In a few cases, retrotransposons have acquired sequences that probably do not have any role in the life cycle of the elements. The Bs1 retrotransposon of maize, for example, has transduced a cellular gene sequence - in this case a part of a gene encoding an ATPase [23, 24].

Materials and methods

Gene retrocopy insertion detection from mapped paired end reads

Paired end reads consist of two DNA sequences flanking an internal unsequenced region. Given the average insert size of a sequencing library, and the locations relative to a reference genome where either end of a paired end fragment maps, a pair of mappings is termed concordant if the sequenced ends are mapped to the reference genome at an interval and orientation compatible with the library construction. Conversely, a pair of mappings is termed discordant if the paired ends are mapped too far apart or in the wrong orientation relative to the reference genome to which they are mapped. Given sufficient read depth and agreement between multiple paired reads, discordant read pairings can contain information about genome rearrangements relative to the reference if the rearrangements bring two pieces of the genome into proximity that are distant from one another in the reference genome. Here, we use discordant read mappings to detect GRIPs by finding multiple discordant mappings that connect exonic sequences to a consistent location distant from the exons. We refer to the genome or genomes from which a sequencing library was generated and analyzed as the query genome. For some region of a chromosome, if the sequence of the query genome matches the sequence of the reference genome, read pairs mapped to that region will be concordant as shown in the normal mapping of Figure 1. Alternatively, if a region in the query genome contains a structural variant (insertion, deletion, inversion, and so on) relative to the reference, some or all of the read pairs mapping to that location may be discordant. Figure 1 also demonstrates the pattern of discordant mappings indicative of a gene retrocopy insertion in the query genome. In order to confidently predict the presence of a gene retrocopy in a query genome or genomes, we require at least eight distinct mappings between the source gene and its insertion location, with at least two mappings spanning each junction.
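The concordant/discordant distinction can be illustrated with a minimal Python sketch. This is not the GRIPper code: the `Mapping` record, the `is_concordant` name, and the `max_span` threshold are hypothetical, and a real pipeline would derive the allowed span from the library's observed insert-size distribution.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    chrom: str        # reference sequence the read maps to
    pos: int          # leftmost reference coordinate of the alignment
    is_reverse: bool  # True if the read aligns to the bottom (-) strand


def is_concordant(first: Mapping, second: Mapping, max_span: int = 1000) -> bool:
    """Return True if a read pair is compatible with the library construction.

    max_span is an assumed cutoff for the distance between the two mates.
    """
    if first.chrom != second.chrom:
        return False
    # In a proper pair the upstream mate is on the top strand and the
    # downstream mate is on the bottom strand.
    left, right = sorted((first, second), key=lambda m: m.pos)
    if left.is_reverse or not right.is_reverse:
        return False
    return (right.pos - left.pos) <= max_span


def is_discordant(first: Mapping, second: Mapping, max_span: int = 1000) -> bool:
    """A pair of mappings is discordant when it is not concordant."""
    return not is_concordant(first, second, max_span)
```

Pairs that land on different chromosomes, too far apart, or in the wrong orientation are all reported as discordant, which is the class of evidence the method relies on.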
Illumina sequencing chemistry yields paired reads where the first read in the pair is sequenced on the top strand and the second read is sequenced on the bottom strand, such that the first read maps to the top (+) strand of the reference genome and the second read maps to the bottom (-) strand of the reference genome. Given this property, the reads mapping to the 5' side of the predicted insertion site must be on the top strand and the reads on the 3' side of the site must be on the bottom strand. Likewise, the mappings of the discordant reads themselves must be consistent with this pattern. We also require that the reads mapping to the source gene must correspond to at least two distinct exons. Additionally, we filter out putative insertion sites where the site is in a region of the genome that contains an annotated or unannotated pseudogene. Unannotated pseudogenes are ascertained by comparing the insertion site +/- 500 bp to the rest of the reference genome using BLAT [76]. This method (GRIPper) was implemented in Python using pysam [77] and is available from github [78]. An archival version of the software is also available as Additional file 4; however, we suggest using the most up-to-date version via github.
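The support criteria described above (at least eight distinct mappings, at least two per junction, strand-consistent anchors, and at least two distinct source exons) can be sketched as a single filter. The `DiscordantPair` record and `passes_filters` function are hypothetical names for illustration, not part of GRIPper, and the pseudogene (BLAT) filter is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscordantPair:
    anchor_pos: int       # position of the mate mapped near the insertion site
    anchor_reverse: bool  # True if that mate is on the bottom (-) strand
    exon_id: str          # source-gene exon hit by the other mate


def passes_filters(pairs, site, min_total=8, min_per_junction=2, min_exons=2):
    """Apply the read-support criteria to a candidate insertion site."""
    distinct = set(pairs)  # count each distinct mapping once
    # 5' junction: anchors upstream of the site must be on the top strand.
    five_prime = [p for p in distinct
                  if p.anchor_pos < site and not p.anchor_reverse]
    # 3' junction: anchors downstream of the site must be on the bottom strand.
    three_prime = [p for p in distinct
                   if p.anchor_pos >= site and p.anchor_reverse]
    supporting = five_prime + three_prime
    exons = {p.exon_id for p in supporting}
    return (len(supporting) >= min_total
            and len(five_prime) >= min_per_junction
            and len(three_prime) >= min_per_junction
            and len(exons) >= min_exons)
```

Pairs whose anchor strand disagrees with its side of the site are simply not counted as support, so a call built mostly from wrongly oriented reads fails the total-support threshold.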

Breakpoint ascertainment from soft-clipped reads

Many of the human samples analyzed in this study were mapped using bwa [79], which allows for part of a read to align as long as the seed sequence meets the minimum mismatch criteria. The unaligned portion of these mappings is marked as soft-clipped. This provides a convenient means to check for breakpoints by looking for consistent break ends corresponding to the 5' and 3' junctions of the inserted gene retrocopy. Target site duplications are ascertained by searching for correspondence between the sequences on either side of the breakpoint.
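As a rough illustration of how soft-clipped alignments point at breakpoints, the following sketch parses a SAM-style CIGAR string and reports the reference coordinates where clipped sequence meets aligned sequence. The function name is hypothetical and positions are assumed to be 0-based leftmost coordinates; it is a sketch of the idea, not the authors' implementation.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_breakpoints(pos, cigar):
    """Return reference coordinates where soft-clipped ends abut aligned bases.

    pos is the 0-based leftmost aligned reference position of the read.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    core = [o for o in ops if o[1] != "H"]  # hard clips carry no sequence
    breakpoints = []
    if core and core[0][1] == "S":
        breakpoints.append(pos)  # clipped on the left: break at alignment start
    # Only these operations consume reference bases.
    ref_len = sum(n for n, op in core if op in "MDN=X")
    if len(core) > 1 and core[-1][1] == "S":
        breakpoints.append(pos + ref_len)  # clipped on the right: break at end
    return breakpoints
```

Tallying these coordinates over all reads near a candidate site (for example with `collections.Counter`) would show whether the clipped ends agree on consistent 5' and 3' junctions.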

Local sequence assembly to identify exon-exon junctions

In order to identify exon-exon junctions that are present in inserted processed gene retrocopy sequences, we employed a two-stage local assembly strategy. First, read pairs that map within 500 bp of a predicted insertion site that are discordant, one-end-anchored (reads where the mate is unmapped), or have at least one read in the pair that is soft-clipped are used as input to a short read assembler. For a first attempt at assembly, we use Velvet [80] with a k-mer size of 31, the shortPaired option to indicate the reads were paired, and an insert length of 300. The resulting contigs are aligned back to the reference genome using BLAT [76] to identify reads that map to exonic sequences corresponding to the source gene and without aligning to the intervening introns (spliced alignments). The majority of junctions are ascertained in this first step using Velvet which utilizes de Bruijn graphs to guide assembly. Secondarily, the discordant, one-end-anchored, and soft-clipped reads corresponding to the remaining insertions for which an exon-exon junction was not apparent were then assembled using PRICE [81], which utilizes a seed-and-extend assembly strategy, and aligned back to the reference to identify spliced junctions. We ran PRICE for 20 cycles using the anchored read pairs (those which map uniquely near the gene retrocopy insertion site) as the seed sequences.
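The spliced-alignment check can be sketched as follows, assuming each contig alignment has already been reduced to ungapped blocks in the style of BLAT output. The tuple layouts and the `exon_exon_junctions` name are assumptions made for this example, and real data would likely need some tolerance at block boundaries rather than exact equality.

```python
def exon_exon_junctions(blocks, exons):
    """Find exon-exon junctions supported by one aligned contig.

    blocks: list of (contig_start, ref_start, length) ungapped alignment
            blocks, sorted by contig_start.
    exons:  list of (ref_start, ref_end) half-open intervals for the source
            gene, sorted by position.
    Returns a list of (i, i + 1) index pairs for exons joined in the contig.
    """
    junctions = []
    for (q1, r1, n1), (q2, r2, n2) in zip(blocks, blocks[1:]):
        if q1 + n1 != q2:
            continue  # the contig is not contiguous across these two blocks
        for i in range(len(exons) - 1):
            # The first block ends exactly at the end of exon i and the next
            # block starts exactly at exon i + 1, so the intron was skipped.
            if exons[i][1] == r1 + n1 and exons[i + 1][0] == r2:
                junctions.append((i, i + 1))
    return junctions
```

A contig from an unspliced genomic copy produces no junctions because its blocks continue into intronic sequence instead of jumping to the next exon.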

Simulation of novel gene retrocopy insertions

Retrogene insertions were simulated by adding insertions of spliced, polyadenylated mRNA transcripts to sample TCGA-60-2711-11 (LUSC-2711 Normal) using bamsurgeon [82]. Bamsurgeon can add structural variants (including insertions) to existing BAM files through local assembly followed by modification of the assembled contig, simulation of paired read coverage (100 paired end base pairs with 300 unsequenced insert base pairs), realignment, and replacement into the original BAM. We added a total of 2,000 insertions from 200 different processed mRNAs (Table S11 in Additional file 1) to LUSC-2711, and downsampled the resultant BAM from 60× average coverage to 40×, 30×, 20×, 10×, and 5× using DownSampleSam, part of the Picard suite of utilities [83]. We used GRIPper to detect the spiked-in processed mRNAs to evaluate the detection characteristics. At 60× coverage we obtained perfect precision and a recall of 0.751 (1,501 true positives and 499 false negatives with no false positives). As expected, recall decreases with decreasing coverage (Table S10 in Additional file 1). In general, false negatives are due to single exon genes (for example, OR7G2) at high coverage and mainly due to insufficient read support at low coverage. Since we combined reads from both tumor and normal genomes for all TCGA samples in this study, which have coverage of 30× or greater, detection of germline insertions was done on samples with an effective coverage of 60× or greater.
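The precision and recall quoted for the 60× simulation follow directly from the reported counts. A short check, using a helper function written for this example:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true/false positive and negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Counts reported for the 60x simulation: 2,000 spiked-in insertions in total.
precision, recall = precision_recall(tp=1501, fp=0, fn=499)
print(f"precision = {precision:.3f}, recall = {recall:.4f}")
```

With no false positives the precision is exactly 1, and 1,501 of 2,000 recovered insertions gives a recall of 0.7505, in line with the reported 0.751.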

Identifying gene retrocopy insertions included in the reference genome assembly

GRIPs in the reference genome that are not present in other individuals will appear as deletions relative to the reference. To detect these, we cross-referenced the deletion data from the 1000 Genomes Project [34, 57] with pseudogene annotations from GENCODE/ENCODE [84] and Yale [1]. Deletions were obtained in variant call format from the 1000 Genomes Project FTP server, and pseudogene annotations were obtained from the UCSC Genome Browser [85], and from human build 65 [3]. To allow for repetitive sequences in gene UTRs we allowed the deletion to span a region up to three times larger than the surrounded pseudogene annotation. We also required homology between the deleted sequence and the source gene of the annotated pseudogene. A list of the GRIPs ascertained in this way is included in Additional file 1 (Table S7 in Additional file 1), two of which correspond to both of the processed pseudogene deletion polymorphisms (pseudocopies of GCSH and ITGB1) mentioned in a previous study [17].

Strategy for low-pass genome sequence data and tumor/normal pairs

In order to ascertain insertion sites from a large collection of genomes sequenced at low (2× to 5×) coverage, or to ensure maximum sensitivity in ascertaining cancer-specific insertions, we combine data across multiple samples. This is accomplished simply by extracting discordant reads where one end maps to an exon and the other end elsewhere in the reference genome from each genome of interest, and analyzing the merged set of discordant reads en masse while keeping track of the sample identifier associated with each discordant pair of mapped reads. When insertions are called, all genomes contributing reads to a call are considered to have the insertion.
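A minimal sketch of the pooling idea is shown below. The tuple layout, the fixed-width position binning, and the `min_support` default are assumptions made for illustration; the source does not specify how discordant reads are clustered.

```python
from collections import defaultdict


def call_pooled_insertions(discordant, window=500, min_support=8):
    """Pool discordant pairs across samples and call insertions on the merged set.

    discordant: iterable of (sample_id, source_gene, chrom, pos) tuples, where
                pos is where the non-exonic mate maps.
    Returns {(source_gene, chrom, bin_index): sorted list of contributing samples}.
    """
    clusters = defaultdict(list)
    for sample_id, gene, chrom, pos in discordant:
        # Crude clustering: group reads that fall in the same fixed-width bin.
        clusters[(gene, chrom, pos // window)].append(sample_id)
    calls = {}
    for key, samples in clusters.items():
        if len(samples) >= min_support:
            # Every genome contributing reads is considered to carry the insertion.
            calls[key] = sorted(set(samples))
    return calls
```

Keeping the sample identifier attached to each read is what allows a call made on the merged data to be traced back to the individual genomes that support it.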

Calculating coverage of gene annotations

In order to test for enrichment or depletion of gene retrocopy insertions relative to gene annotations, we must have an accurate figure for how much of the reference genome assembly is covered by the set of annotations used. For both human and mouse, we used UCSC genes [86]: human version 5 and mouse version 5. From BED formatted versions of these annotation tracks, the bedCoverage tool from the Kent source utilities was used to calculate the fraction of the genome covered. To calculate enrichment, we performed a one-sample proportions test with continuity correction using the prop.test function in R [87].
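For readers working outside R, the one-sample proportions test with continuity correction can be sketched in plain Python. The function below is an illustrative re-implementation of the chi-squared form of the test with one degree of freedom; the function name is made up and the example counts are not data from the study.

```python
import math


def one_sample_prop_test(successes, trials, p0):
    """Two-sided one-sample proportions test with Yates continuity correction.

    successes: e.g. number of insertions falling within annotated genes
    trials:    total number of insertions
    p0:        expected proportion, e.g. fraction of the genome covered by genes
    Returns (chi-squared statistic, p-value).
    """
    expected = trials * p0
    diff = abs(successes - expected)
    yates = min(0.5, diff)  # the correction never exceeds the observed difference
    statistic = (diff - yates) ** 2 / (trials * p0 * (1 - p0))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up example: 60 of 100 insertions in genes against an expected 50%.
stat, p = one_sample_prop_test(60, 100, 0.5)
```

A proportion above `p0` with a small p-value would suggest enrichment, and a proportion below it would suggest depletion.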

Calculating distance between GRIP profiles

The Jaccard distance [52] is defined as:

$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$

where A and B are sets of gene retrocopy insertions for two genomes.
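The Jaccard distance between two insertion sets, one minus the ratio of shared insertions to all distinct insertions, is straightforward to compute. In this sketch the insertion identifiers are arbitrary labels, and treating two empty sets as identical (distance 0) is an assumption, not something the source specifies.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections of insertion identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty profiles are identical
    return 1.0 - len(a & b) / len(union)


# Two hypothetical genomes sharing one of two distinct insertions.
genome_1 = {"GCSH", "ITGB1"}
genome_2 = {"GCSH"}
distance = jaccard_distance(genome_1, genome_2)
```

A distance of 0 means the two genomes carry the same insertions, and a distance of 1 means they share none.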
