Abstract
Transposable Elements (TEs), which comprise almost half of the human genome, can contribute to the evolution of novel transcriptional circuitry. Specific and biologically meaningful interpretation of TE-initiated transcripts has been marred by computational and methodological deficiencies. We developed the software suite LIONS (www.github.com/ababaian/LIONS) to analyze paired-end RNAseq data for detecting and quantifying significant TE-contributions to a transcriptome. The LIONS suite serves as a platform for TE RNA-seq analysis and can be applied to a broad set of data sets for the study of development, stress treatments, aging and cancer.
Background
A major fraction of the human genome is composed of transposable elements (TEs), which along with sequences encoding transposition machinery, contain promoters, enhancers and other cis-regulatory sequences (1). Initially these regulatory sequences are necessary for TE retrotransposition, although over time they can be co-opted or exapted into host gene regulation (1–4). Even as mutations hamstring a TE’s capability to mobilize, the intrinsic promoter in the elements may remain functional, such as the long terminal repeats (LTRs) flanking endogenous retroviruses (ERVs) or the 5’ sense and anti-sense promoters of LINE-1 elements (5,6). Thus TEs can be viewed as a dispersed reservoir of regulatory sequences from which transcriptional innovation may arise.
The percentage of transcripts initiated within repetitive DNA as measured by Cap Analysis Gene Expression (CAGE) is substantial, ranging from ~3-15% depending on the tissue (7). These TE-initiated transcripts are enriched for long non-coding RNAs (lncRNA) (8,9). In human embryonic stem cells (hESCs), ERV transcription in particular is a marker of pluripotency in mice (10). There is also growing evidence that ERV-initiated transcripts are functionally involved in the evolution of the human stem cell transcriptome (11–14).
TEs in the vicinity of protein coding genes may gain function over evolutionary time as alternative tissue-specific promoters, like the THE1D LTR element that drives placental-specific transcription of human IL2RB (15). More interestingly, over the course of cancer evolution, normally dormant TE promoters can be exploited to express a protooncogene. Such “onco-exaptations” have been identified for the expression of CSF1R (16) and IRF5 (17) in Hodgkin Lymphoma, FABP7 (18) in Diffuse Large B-cell Lymphoma and ALK in melanoma (19) amongst others (20). While a number of cases of onco-exaptations have been documented, the mechanisms underlying these oncogenic events remains largely unexplored.
It has been proposed that TE invasions can function as evolutionary accelerants, promoting adaptation and correlating with the radiation of species (21,22) and therefore there is a significant interest in understanding the extent and evolutionary mechanisms by which TEs contribute to a cell’s transcriptome. Previous transcriptome-wide studies designed to detect TE-derived promoters have analyzed annotated mRNAs (23), ESTs (24), assembled transcripts (8,9,25), short Cap Analysis Gene Expression CAGE tags (7), Paired-end ditag sequences (26), paired-end ‘chimeric fragment’ RNA-seq screening (18,27,28), targeted TE events such as ERV9-driven (29) or L1-driven transcripts (30) and loci-gene correlation studies (31). While these methods have proved useful, they have significant limitations.
5’ CAGE is the clearest measure of transcription start sites (TSSs) but provides insufficient information on the resultant transcript structure. RNA-seq assembly methods may not identify the true 5’ end of transcripts or suffer from a high false positive rate due to TE exonization events. The TEexonization problem also creates high false-positive rates in chimeric fragment-based and hybridization-based methods that have gone unaddressed (27–29,32). Moreover, none of the aforementioned studies have attempted to quantify the strength or contribution of the putative TE-initiated isoforms to overall transcript expression when alternative promoters exist. Therefore, effective TE-initiating transcript screens have required extensive human-inspection and have failed to provide a quantitative, genome-wide assessment of TEs initiating biologically significant transcription.
To quantitatively measure and compare the contribution of TE promoters to normal and cancer transcriptomes we developed a tool that incorporates features of previous methods but significantly builds upon them. We were motivated to use paired-end RNA-seq data alone, a broadly available data-type, to rapidly measure TE-initiations and transcriptome contributions. With a defined set of TE-initiated transcripts in each library, commonalities and differences between sets of data (biological replicates) can be determined. Together these analyses have been packaged to give rise to the LIONS suite (Figure 1).
Results
To quantify the contribution of TE promoters to the transcriptome from RNA-seq data alone, we were motivated to develop the LIONS analysis suite. Briefly, RNA-seq data along with a reference genome, gene and repeat annotation are inputs for the classification and annotation of TE-initiated transcripts (Figure 1A). For each RNA-seq library, a standard (.lion) file of TE-initiated transcripts is the output, and then can be grouped into biological categories such as cancer versus normal controls, for comparison (Figure 1B). A detailed outline of the analysis is provided in materials and methods.
TEs intersect exons in three main categories; as initiations at the 5’ end of a transcript; as exonizations either with or without being involved at a splice junction; and at the 3’ end as a termination site for transcripts (Figure 2A). The core LIONS classification segregates the initiations from non-initiation events. This is biologically pertinent in the analysis of TE transcription since non-initiation events outnumber initiation events by three orders of magnitude (Figure 2B). Thus analyses based on chimeric read clusters alone, or TE-transcription levels alone do not necessarily reflect autonomous transcriptional activity of TEs but rather simply correlation or propensity to be transcribed as part of other transcripts. This is non-trivial as TEs have long been known to be enriched at 5’ and 3’ untranslated transcribed regions (UTRs) and within long-noncoding (lnc)RNAs (8,9).
To test the operating characteristics of LIONS, RNA-seq reads based on the ENCODE (33) K562 and H1 embryonic stem cell line transcriptomes were simulated at varying depths as a benchmark. Simulated TE-exon pair clustering of reads plateaus at ~52% sensitivity regardless of further increase in sequencing depth (Figure 3A). This plateau emphasizes the systemic difficulty of accurately determining either 5’ or 3’ ends of transcripts from RNA-seq data alone, but the undetected TE start sites correlate with lower overall expression (Figure 3B). TE promoter analysis is confounded by the basic biological properties of TE TSSs, they are weaker and more biologically irreproducible (have higher cell-cell variation) than their non-TE TSS counterparts in CAGE analyses (Supplementary Figure 1). From the fraction of TE TSSs which are measurable by chimeric fragments, the default LIONS parameters have a sensitivity of 36.35% and specificity of 98.63% (Figure 3C). The relative proportion of each class of TE TSS called largely matches the proportions of TE TSSs of the input transcriptomes, which rules out a systematic bias towards any one class of TE (Figure 3D). Altogether, while the set of TEs read-out by LIONS is not highly sensitive especially for lower expressed transcripts, it is highly specific and accurately reflects the underlying promoter activity of TEs.
To evaluate the accuracy of LIONS-classified TE-initiations, a set of Hodgkin lymphoma-specific and recurrent (relative to B-cell controls) chimeric transcripts were assayed by RT-PCR. In silico predictions were largely in agreement with RNA assayed by RT-PCR at 55.4% and 89.2% sensitivity and specificity respectively (Supplementary Figure 2).
Altogether LIONS is able to detect a specific set of TE-initiated transcripts from RNA-seq data alone. The detected set is enriched for higher expressed transcripts which, in a biological context such as cancer, are expected to be more relevant than the low expression / high variation TE-initiated transcripts.
Discussion
The preceding principles of local RNA sequencing analysis to distinguish TE-derived transcription initiation from exonization or termination can also be seen as a specific-case of the ab initio RNA-seq assembly problem. Local calculations used in LIONS, namely read threading and upstream coverage could be generalized to the entire transcriptome. Further refinement of these methods such as inclusion of aligned-strand bias measures (34), position-aware Hidden Markov Model or machine-learning trained sorting algorithms to detect the molecular signature of TSSs could be used to improve the accuracy of transcript assembly.
LIONS suite is limited in similar ways as other assembly methods are, namely in regions of high transcriptional complexity, especially if non-stranded data is used and there is bi-directional transcription. The coverage around all transcript ends in RNA-seq is reduced relative to interior sequences (34) and confounded by lower overall expression and higher variability of expression of TE TSSs in general (7).
The focus of the LIONS suite on transcriptional initiation is the low-hanging fruit for TE-gene interactions. Additional analysis of chimeric read clusters may quickly yield TE sets which are incorporated into transcripts, such as TE-derived splice acceptors and donors in the newly classified characterized exitrons (35). Anecdotally, one of the largest difficulties in developing LIONS was distinguishing the true initiation events from exitron-like events that occur within a TE. This distinction is also one of the greatest limitations of previous studies looking at TE-derived transcriptomes (27,29,32), which did not make this distinction.
The source code for all LIONS components is available at www.github.com/ababaian/LIONS and all analyses are based on a standard .lions output file. Standardization was performed to encourage users to share down-stream analysis scripts such that graphs and statistics of TEs could be reproducible and applied to different data sets quickly.
Materials & Methods
Initialization, Alignment and Assembly
For an accurate measurement of TE initiated transcripts starting from whole transcriptome sequencing data the East Lion was developed (Figure 1A). The central principle in detecting transcription start sites within TEs is that a local analysis is performed to search for patterns of sequencing reads consistent with transcriptional initiation.
The primary LIONS input is a set of paired-end RNA sequencing data either in fastq or bam format. The datasets can be biologically or technically grouped for later comparisons or individual libraries can be run. Additionally a reference genome (hg19), a RepeatMasker (36) analysis of that genome (hg19 – 2009-04-24), and a set of reference protein-coding genes (UCSC Genes, 2013-06-14) is required.
A workspace for the project is initialized on the system and an alignment is run with the splice-aware aligner tophat2 (v.2.0.13) (37) such that secondary alignments for multi-mapping reads are retained and flagged; tophat2 –report-secondary-alignments. On systems that support qsub parallelization and multiple CPU cores, each library is aligned in parallel with multiple threading allowing for rapid analysis of large datasets.
Following alignment, ab initio transcriptome assembly is performed on each library using repeat-optimized parameters of Cufflinks (v.2.2.1) (38); cufflinks --min-frags-per-transfrag 10 --maxmultiread-fraction 0.99 --trim-3-avgcov-thresh 5 –trim-3-dropoff-frac=0.1 --overlap-radius 50. The use of an assembly substantially reduced false-positive TE-initiation calls relative to using a reference gene set since only transcript isoforms which exist in the data are considered, although it is possible to forego this step and use a reference gene set. The generated alignment and assembly is then processed to generate a bigwig coverage file for visualization and basic statistics for each exon and TE are calculated such as read-coverage and RPKM.
Chimeric Fragment Cluster Analysis
To search the sequencing data for potential TE-exon interactions, each TE-exon pair for which a chimeric fragment cluster exists are considered. Briefly, a chimeric fragment cluster is a set of reads in which one read maps to a TE and its pair maps to an exon from the assembly (Figure 2). These TEexon pairs form the basis for classification into one of three cases; TE-initiation, -exonization or -termination of the transcript (Figure 2).
Classification requires the calculation of a series of values that are then fed into a classification algorithm. First, the relative position of the TE and exon boundaries with respect to the direction of transcription is compared. Only intersection cases in which the TE could initiate transcription are considered (Supplementary Figure 3A). A thread ratio is then calculated, the ratio of read pairs in which one read maps outside of a TE in either the downstream or upstream direction. A high thread ratio distinguishes TE-initiation events from TE-exonizations, that is to say, if a TE initiates transcription then there should exist a strong bias towards the number of read-pairs downstream of the element (Supplementary Figure 3B).
For the detection of TE-initiated transcripts of most biological significance further restrictions are imposed. Single exon contigs are excluded from the analysis to reduce the false positive rate (retained introns, low abundance lincRNA). To quickly discard rare TE-initiated isoforms when an alternative, highly expressed isoform exists, TE contribution was estimated as the peak coverage within the TE divided by the peak coverage of it’s interacting exon (Supplementary Figure 3C). Together these values form the basis on which TE-initiation, -exonization or -termination can be distinguished.
Classification of TE-exon interactions is performed by the sorting algorithm that can be customized (Supplementary Figure 4). The default set of parameters termed, ‘oncoexaptation’ were manually defined by extensive manual inspection of the training ENCODE sequencing data and comparison with supporting ChIP-seq and CAGE data (Supplementary Figure 5). The default parameters are trained to specifically detect high-abundance isoforms of TE-initiated transcripts with a significant contribution to overall gene expression. These are conservative but offer the most biologically relevant results with respect to cancer biology.
TE-initiated transcripts can be further sub-classified by their intersection to a set of protein-coding genes into; chimeric transcripts, TE-initiated transcripts which transcribe in the sense-orientation into a neighbouring protein-coding gene; anti-sense TE-transcripts, non-coding TE-initiated transcripts which run anti-sense to a protein-coding gene; or long intergenic non-coding (linc) TE-transcripts which don’t overlap a known protein-coding gene. Of particular interest to cancer biology are chimeric transcripts that result in the overexpression of oncogenes, such as previously identified in Hodgkin Lymphoma for IRF5 and CSF1R (16,17).
Alternative filtering settings exist and are continually added based on the experimental demand such as; ‘screen’, a sensitive but error-prone (exonizations called as initiations) method or; ‘drivers’, detection of TE-initiated transcripts which are exclusively transcribed from TEs. Each of these settings are customizable and should be tailored towards individual project requirements.
These analyses and filters are applied independent for each RNA-seq library and a standard .lion file is created. Sets of .lion files (that is sets of RNA-seq library analyses) are then grouped into a merged .lions file for set-based comparisons.
Operating Characteristics
To test the performance of the LIONS classification, a simulation of RNA-seq data was generated to benchmark the operating characteristics of the classifier. Starting with aligned RNA-seq from H1 embryonic stem cells and K562 chronic myeloid leukemia cell line, simulated transcriptomes were generated. First the top 20,000 expressed transcripts in gencode (v19) in K562 and the top 20,000 expressed contigs from the H1esc assembly were selected as the basis for the simulated transcriptome. FluxSimulator (39) was then used to generate paired-end fastq at 5, 30, 100 and 200 millions reads in H1esc and 25, 100 and 200 million reads in K562 using a reference hg19 sequence. The simulated data are then processed by LIONS and compared to the expected input transcriptomes are defined as a ‘ground truth’.
Recurrent and Group-specific TE-promoters
Grouping and comparing sets of TE-initiated transcripts is of central importance to understanding the biology of their activity. TE-initiated transcripts are more variable then non-TE transcripts across biological replicates (Supplementary Figure 1) and therefore the signals from individual transcriptomes are noisy. The reasoning then is that grouping TE-initiated transcripts across biological replicates and asking which transcripts are recurrent will enrich for TE-initiated transcripts of consequence. In a similar line of reasoning, comparing one biological group against another can identify TE-initiated transcripts, or even classes of TEs that are more transcriptionally active in one grouping of transcriptomes against another.
To detect recurrent TE-initiated transcripts between libraries with different assemblies, the set TEs which initiate transcription are considered. The recurrence cut-off parameter is the number or proportion of libraries within a biological group that a given TE initiating transcription is required to be present. In contrast, the specificity cut-off is the number or proportion of comparison (control) libraries the initiating TE is present in. Together, TEs which have greater than the recurrent cut-off and less then the specificity parameter cut-off are considered recurrent and specific TE-initiated transcripts for a group (Figure 1B).
A clear case in which recurrent and biological-group specific TE-initiated transcripts is significant is in cancer biology. The onco-exaptation hypothesis (20) predicts that the highly variable TE-initiated transcripts can be selected for during cancer evolution and therefore transcripts recurrent and cancer-specific are enriched for oncogenes or transcripts involved in the biology of the cancer. RNA-seq Data Sets
ENCODE training RNA-seq fastq files was downloaded from the UCSC ENCODE ftp site. Hodgkin Lymphoma cell line and primary B-cell transcriptomes (17,40–43) bam files were converted to fastq for re-analysis by LIONS. Accession and library details are in Supplementary Table 1. Hodgkin lymphoma cell line culture and RT-PCR.
Hodgkin Lymphoma cell culture, RNA isolation and cDNA synthesis was performed as previously described (17). Primers for RT-PCR are listed in Supplementary Table 2.
List of Abbreviations
- CAGE
- Cap Analysis Gene Expression
- ERV
- Endogenous Retrovirus
- LTR
- Long Terminal Repeat
- TE
- Transposable Elements
- TSS
- Transcription Start Site