Abstract
Whole genome duplication (WGD) has been a major evolutionary driver of increased genomic complexity in vertebrates, yet little is known about how selection operates on the resulting gene duplicates. Here, we present a draft genome assembly of a salmonid species, European grayling (Thymallus thymallus) and use comparative genomics and transcriptomics to understand evolutionary consequences of WGD in the genome of salmonid ancestor ~80 million years ago (Ss4R). We find evidence for lineage-specific rates in rediploidization and that ~60% of the Ss4R ohnologs have experienced different types of non-neutral evolution of tissue-specific gene expression regulation. Distinct selective pressures were associated with tissue type, biological function and selection pressure on protein coding sequence. Finally, our results indicate the role of adaptive divergence of Ss4R duplicates in the evolution of salmonid metabolism and identifies loss of purifying selection on one Ss4R ohnolog encoding a key chloride pump linked to the evolution of anadromy.
Introduction
Whole genome duplication (WGD) has played a vital role in the evolution of vertebrate genome complexity. Two rounds of genome duplication events occurred in the ancestor of all vertebrates (1R and 2R events), a third WGD at the base of the teleosts (3R) 225-333 million years ago (MYA) [1]. Several teleost lineages experienced additional WGD events including the salmonid ancestor 88-103 MYA (Ss4R) [2, 3]. Directly following autopolyploidization, duplicated chromosomes pair randomly with any of their homologous counterparts resulting in an increased risk of formation of multivalents and consequently production of non-viable aneuploid gametes. Restoring bivalent chromosome pairing is therefore a critical step towards a functional genome post-WGD [4]. Hence, sequence divergence or structural rearrangements are indispensable for blocking multivalent formation, suppressing recombination and driving the process of returning to a functional diploid state, a process called rediploidization. As the mutational process is stochastic, the resolution of ohnologs (gene duplicates resulting from WGD) is achieved independently for different duplicated chromosomes and hence occurs at different rates in various genomic regions.
The functional redundancy arising from gene duplication is believed to be a source for the evolution of novel traits and adaptations [2]. Duplicate genes that escape loss or pseudogenization are known to acquire novel regulation and expression divergence [5, 6, 7]. Functional genomic studies over the past decade have demonstrated that large-scale duplications lead to the rewiring of regulatory networks through divergence of spatial and temporal expression patterns [8]. As changes in gene regulation are known to be important in the evolution of phenotypic diversity and complex trait variation [9, 10], these post-WGD shifts in expression regulation provide a large substrate for adaptive evolution. Several studies have investigated the genome-wide consequences of WGD on gene expression evolution in vertebrates(e.g. [11, 12, 13, 14, 15, 16, 17] and have revealed that a large proportion of gene duplicates have evolved substantial regulatory divergence, and that in most cases one copy retains ancestral-like regulation (consistent with regulatory neo-functionalization). However, to what extent this divergence in expression is linked to adaptation remains to be understood. A major factor contributing to this knowledge gap is the lack of studies that integrate functional data from multiple species sharing the same WGD [18], which allows us to distinguish neutral divergence in biological function from that maintained by purifying selection [19]. Further, confidently identifying gene duplicates retained from paleopolyploidy events like 2R and 3R dating back to >300-500 million years (MY) is challenging.
Salmonids have emerged as a model for studying functional consequences of autopolyploidization in vertebrates, owing to their relatively young WGD event (<100MYA) and ongoing rediploidization [20, 16]. Recent studies on genome evolution subsequent to Ss4R have shown that the rediploidization process has been temporally overlapping with species radiation, resulting in lineage-specific ohnolog resolution (LORe) that may fuel differentiation of genome structure and function [21, 17]. In fact, only 75% of the duplicated ancestral salmonid genome had rediploidized at the time of the basal split in the Salmonidae family ~60 MYA. Consequently, ~25% of the Ss4R duplicates have undergone rediploidization independently in the Salmoninae and Thymallinae clades. Interestingly, the species within these two clades have also evolved widely different genome structures, ecology, physiology and life history adaptations [22]. The species in the subfamily Salmoninae have fewer and highly derived chromosomes resulting from large-scale chromosomal rearrangements and chromosomal fusions, display extreme phenotypic plasticity, and have evolved the capability of migrating between fresh and salt-water habitats (anadromy) [3]. In contrast, the Thymallinae species (graylings) have a more ancestral genome structure with few or no chromosome fusions [23, 24, 25, 26] (Supplementary Figure S1). Further, grayling species are generally less plastic and have not evolved anadromy. This unique combination of both shared and lineage-specific rediploidization histories and striking difference in genome structure and adaptations provides an intriguing study system for addressing key questions about the evolutionary consequences of WGD.
In order to gain deeper insights into how selection has shaped the evolution of gene duplicates post WGD, we have sequenced, assembled and annotated the draft genome of the European grayling (Thymallus thymallus), a species representative of an early diverging non-anadromous salmonid lineage. We use this novel genomic resource in a comparative phylogenomic framework to gain insights into the consequences of lineage-specific rediploidization and genome-wide selective constraints on gene expression regulation. Our analyses of expression patterns across the two duplicated salmonid genomes (grayling and Atlantic salmon) demonstrate that a large fraction of the duplicates originating from Ss4R have experienced purifying selection to maintain ancestral tissue-specific expression regulation. Moreover, widely diverse biological processes are correlated to differences in evolutionary constraints during the 88-100MY of evolution post-WGD, pointing towards underlying differences in adaptive pressures in non-anadromous grayling and the anadromous Atlantic salmon.
Results
Genome assembly and annotation
We sequenced the genome of a wild-caught male grayling individual sampled from the Norwegian river Glomma using the Illumina Hiseq 2000 platform (Supplementary Table S1 and S2). De novo assembly was performed using ALLPATHS-LG [27], followed by assembly correction using Pilon [28], resulting in 24,343 scaffolds with an N50 of 284 Kbp and a total size of 1.468 Gbp (Table 1). The scaffolds represent approximately 85% of the k-mer based genome size estimate of ~1.8 Gbp. The C-values estimated previously for European grayling are 2.1pg (http://www.genomesize.com/) and 1.9 [25]. To annotate gene structures, we used RNA-seq data from nine tissues extracted from the sequenced individual. Repeat masking with a repeat library constructed using a combination of homology and de novo based methods identified and masked approximately 600Mb (~40%) of the assembly, dominated by class1 DNA transposable elements (Supplementary Table S3 and a repeat landscape in Supplementary Figure S2). Finally, the transcriptome data, the de novo identified repeats along with the UniProt proteins [29] and Atlantic salmon coding sequences [16] were utilized in the MAKER annotation pipeline, predicting a total of 117,944 gene models, of which 48,753 protein coding genes were retained based on AED score (Annotation edit distance), homology with UniProt and Atlantic salmon proteins or presence of known domains. Assembly completeness was assessed at the gene level based on CEGMA and BUSCO. The assembly contains 236 (95.16%) out of 248 conserved eukaryotic genes (CEGs) with 200 (80.65%) complete CEGs. Of the 3,698 BUSCO genes of the class actinopterygii, 3,192 complete (86.3%) and 222 (6%) fragmented genes were found in the assembly (Table 1).
Divergent rediploidization rates among the salmonid lineages
Previous studies have suggested that 25% of the genome of the most recent common salmonid ancestor was still tetraploid when the grayling and Atlantic salmon lineages diverged [16, 17]. To test this hypothesis, we used a phylogenomic approach to characterize rediploidization following Ss4R in grayling.
We inferred 23,782 orthologous gene groups among gene models from Homo sapiens (human), Mus musculus (mouse), Danio rerio (zebrafish), Gasterosteus aculeatus (stickleback), Oryzias latipes (medaka), Esox lucius (Northern pike), Salmo salar (Atlantic salmon), Oncorhynchus mykiss (Rainbow trout) and Oncorhynchus kisutch (coho salmon) (Figure 1). These orthogroups were used to infer gene trees. 20,342 gene trees contained WGD events older than Ss4R (Ts3R or 2R) and were further subdivided into smaller subgroups (i.e. clans, see Methods for details and Supplementary Figure S3). To identify orthogroups with retained Ss4R duplicates, we relied on the high-quality reference genome of Atlantic salmon [16]. A synteny-aware blast approach [16] was first used to identify Ss4R duplicate pairs/ohnolog pairs in the Atlantic salmon genome and this information was used to identify a total of 8,527 gene trees containing high confidence ohnologs originating from Ss4R. Finally, gene trees were classified based on the tree topology into duplicates conforming to LORe and those with ancestrally diverged duplicates and thus following the topology expected under ancestral ohnolog resolution (henceforth referred to as AORe) (Figure 2a). In total, 3,362 gene trees correspond to LORe regions (2,403 with a single copy in grayling) and 5,113 correspond to an AORe-like topology. This data was cross-checked with the LORe coordinates suggested by Robertson et al [17] and cases that did not conform were omitted from further analyses. This final set consisted of 5,340 gene trees containing Ss4R duplicates from both species (4603 AORe, 737 LORe), in addition to 482 ortholog sets containing Ss4R duplicates in Atlantic salmon but not grayling.
To identify regions of ancestral and lineage-specific rediploidization in the grayling genome, we assigned genes from gene trees that contained Ss4R duplicates to genomic positions on the Atlantic salmon chromosomes (Figure 2). In Atlantic salmon, several homeologous chromosome arms (2p-5q, 2q-12qa, 3q-6p, 4p-8q, 7q-17qb, 11qa-26, 16qb-17qa) have previously been described as Ss4R regions under delayed rediploidization [16, 17] (indicated in Figure 2b as red and blue ribbons). Interestingly, the homeologous LORe regions 2q-12qa, 4p-8q and 16qb-17qa had only one grayling ortholog corresponding to two copies in Atlantic salmon, suggesting either loss of large duplicated blocks or sequence assembly collapse. To probe further into these regions, we mapped the grayling Illumina paired end reads that were used for the assembly back to the grayling genome sequence using BWA-MEM [30] and determined the mapped read depth for each of the grayling genes. Single copy grayling genes in LORe regions had consistently double read depth (~100x) compared to the LORe duplicates genes in grayling (Figure 2c and Supplementary Figure S4a), indicating assembly collapse rather than loss of large chromosomal regions. Additionally, the SNP density of the scaffolds in these regions computed using Free-Bayes [31] (quality filter of 30) displayed values on an average twice the background SNP density, albeit with a much wider distribution (Figure 2d and Supplementary Figure S4b). The observed assembly collapse in some Ss4R regions in grayling could be related to a generally slower rediploidization rate compared to the Atlantic salmon lineage. To test for the difference in sequence divergence in the AORe and LORe, we computed the synonymous substitution rates (dS) between duplicate pairs in Atlantic salmon and grayling. Indeed, the difference between Atlantic salmon-pair-dS and grayling-pair-dS was significantly larger in LORe compared to AORe regions (Wilcoxon test, p=4.62e-11, 95% Confidence Interval: -Inf, -0.006), supporting a slower rediploidization rate in grayling LORe regions than in Salmoninae (Supplementary Figure S5).
Selection on gene expression regulation following Ss4R WGD
To investigate how selection has operated on Ss4R ohnologs, we used tissue gene expression data from Atlantic salmon and grayling in a comparative phylogenetic approach. We classified expression evolution fates (EEF) of Ss4R duplicates by first applying hierarchical clustering (see Methods) of tissue expression across duplicate pairs in Atlantic salmon and grayling. Each set of Ss4R duplicate pairs, hereafter referred to as ohnolog-tetrads, was assigned to one out of eight tissue-dominant expression clusters (Supplementary Figure S6). Next, ohnolog-tetrads were classified into groups of five distinct EEF categories each representing differences in past selection pressure on the tissue-regulation of ohnolog pairs. (see Table 2, Figure 3). The conserved divergence (EEF1) category represents expresison divergence among ohnologs that is identical for both species. EEF1 is thus best explained by purifying selection on ancestral ohnolog expression divergence. Fixed-lineage divergence of expression (EEF2) represents cases of conserved expression regulation among duplicates within species. EEF3 and 4 include ohnolog-tetrads with species-specific expression divergence pointing to species specific adaptive divergence or relaxed purifying selection in one duplicate. Lastly, EEF5 are ohnolog-tetrads with all genes expressed in the same tissue, thus pointing to strong purifying selection to maintain ancestral tissue-specificity. In addition to these five categories, there were ohnolog-tetrads where three, or all four of the duplicates were in different tissue-expression clusters. These were grouped into a 6th ‘unclassified’ EEF category assumed to be enriched in ohnolog-tetrads under neutral or nearly neutral evolution, or be a result of low tissue specificity (Table 2). After applying a gene tree topology-based filtering criteria (see Methods), 3,951 ohnolog-tetrads that conformed to expectations of LORe (507) or AORe (3444) gene tree topologies were used in further analyses.
Of the 6 classes of EEFs, unclassified (EEF6, 30-38%) and conserved tissue regulation (EEF5, ~25%) were the most common, followed by species-specific divergence of one duplicate (EEF3 and 4), lineage-specific divergence of both duplicates (EEF2), and conserved divergence (EEF1) (Table 2). Although the relative size of the EEF categories were similar in rank in LORe and AORe regions (Table 2), the EEF category sizes were significantly different (Fisher’s exact test, two sided, p-value < 0.0005, Supplementary Table S4). This difference was caused by enrichment of conserved-diverged expression evolution (EEF1) in AORe tetrads and enrichment for lineage-specific expression divergence (EEF2) among LORe tetrads.
As different tissues are involved in different biological functions, we expect that regulatory evolution of gene expression is shaped by tissue-specific selective pressures [32]. To test this expectation, we evaluated the hypothesis that tissues are disproportionately represented in EEFs 1-5 compared to the tissue distribution across all tetrads. For all but one EEF-class (EEF3), between 2-5 tissue-expression clusters were significantly over- or underrepresented (Fisher tests, two sided, Bonferroni corrected p-value<=0.05), with the conserved tissue regulation class (EEF5) being the most skewed in tissue representation with a bias towards brain-specific expression (Table S4). This finding was supported by high tissue-specificity (Tau score) of genes in ohnolog-tetrads associated with EEF5 (Supplementary Figure S8).
To evaluate if the ohnologs in different EEF classes were associated with distinct biological processes, we applied GO term enrichment tests on genes in EEF 1-5. Only 27 (among 721) overrepresented GO terms (p-adjusted < 0.05) were shared among >=2 of the five groups of expression evolution fates. Further inspection of top 10 GO terms in each EEF classes (Figure 3) shows links between EEF-categories and involvements in biological functions. EEF5 ohnologs under strict evolutionary constraints are highly enriched in brain-specific expression and enriched for GO functions related to behaviour and neural functions. In contrast, EEF1, which represents ohnologs that underwent divergence in gene regulation following WGD, are associated with functions related to lipid metabolism, development, and immune system.
Further, to test whether distinct evolutionary trajectories at the regulatory level (EEFs 1-5) were coupled to distinct patterns of protein-coding sequence evolution, we estimated dN/dS ratios for each duplicate pair within each species and compared the dN/dS distribution in each EEF class with that of the neutral-like (‘unclassified’) regulatory evolution (Figure 4 and Supplementary Figure S9). Low dN/dS (<<1) indicates strong purifying selection pressure. As with gene expression evolution, EEF 1-5 show clear variability in among-ohnolog dN/dS ratio, with conserved divergence (EEF1) having significantly higher dN/dS ratio compared to the neutral-like (‘unclassified’) category (Wilcoxon rank sum, p=0.02) and EEF 2 and 5 having significantly lower dN/dS ratios (Wilcoxon rank sum, p=0.00016 and p=8.405e-05, respectively). The ohnolog pairs showing species-specific expression divergence (EEF 3 and 4) did not have a significantly different dN/dS ratio compared to the neutral-like category (Wilcoxon rank test, p-values=0.27 and 0.58, respectively).
Loss of purifying selection on chloride ion transporter regulation in non-anadromous grayling
The most apparent difference in biology between grayling and Atlantic salmon is the anadromous life history in Atlantic salmon, i.e. the ability to migrate between freshwater and saltwater, a trait that grayling has not evolved. Saltwater acclimation involves changes in switching from ion absorption to ion secretion to maintain osmotic homeostasis. To assess whether key genes associated with the ability to adapt to seawater are under divergent selection for expression regulation in Atlantic salmon and grayling, we probed into EEF 3 and 4 for overrepresented GO terms related to ion-homeostasis (i.e. potassium, sodium or chloride regulation/transport). Interestingly, in EEF4, where a single grayling gene displayed diverged tissue-specific expression regulation, we found that ‘regulation of chloride transport’ was overrepresented. One of the genes associated with this GO term was the classical anadromy-associated salinity induced cystic fibrosis transmembrane conductance regulator (CFTR), which transports chloride ions over cell membranes in the gill. To determine if the grayling CFTR duplicate with diverged expression also had signatures of coding sequence divergence, we computed branch-specific dN/dS. Notably, the grayling CFTR displaying diverged expression regulation also displays a two-fold increase in dN/dS compared to its Ss4R duplicate with the conserved expression regulation, reflecting relaxation of purifying selection pressure in the non-anadromous grayling (Figure 5a and 5b).In Atlantic salmon, both CFTR Ss4R copies have been found to be involved in saltwater adaptations [33]. The combined activity of both CFTR copies might provide a fitness advantage in Atlantic salmon while the divergence of the second copy could simply indicate a different functional fate in freshwater grayling.
Discussion
A major limitation in previous studies of evolution of gene regulation following WGD in vertebrates has been the inability to distinguish between neutral and adaptive novel shifts in expression [18]. Our comparative approach provides valuable insights into the importance of selection in the contrasting processes of maintaining ancestral gene regulation and driving spatial expression divergence of ohnologs following WGD. The most commonly observed non-neutral expression evolution fate of ohnologs following Ss4R is the remarkable conservation of tissue-specific expression, predominantly in brain, across the 60 million years of independently evolving salmonid lineages (Table 2, EEF5). Our results corroborate the observation of biased retention of WGD derived duplicate genes related to nervous system and a strong expression conservation pattern in brain that has been described across vertebrates [34, 35, 36, 37]. Brain-specific genes are typically under strong purifying selection pressure owing to their specialized functions in specific cell types and complex networks of signaling cascades involving high-dimensional protein-protein interactions.
The least common expression evolution fate (EEF1, ~5%) are duplicates that reflect adaptive regulatory divergence followed by strong purifying selection in both grayling and Atlantic salmon. Although rare, duplicates of class EEF1 are particularly interesting as they represent key candidates for salmonid specific adaptive evolution of novel gene regulation enabled by the WGD. Salmonids are believed to have evolved from a pike-like ancestor; a relatively stationary ambush predator [38]. Under this assumption, early salmonid evolution must have involved adaptation to new pelagic and/or riverine habitats. Adaptations to new environments and evolution of different life history strategies are known to be associated with strong selective pressure on immune-related genes (e.g [39, 40]). In line with this, we see an overrepresentation of immune-related genes in the EEF1 class (Figure 3). Furthermore, pikes are generally piscivorous throughout their lifespan, while salmonids depend more on aquatic and terrestrial invertebrate prey with significantly lower input of lipids (especially in early life) [41]. Interestingly, the EEF1 duplicates are also enriched for liver-expressed genes involved in lipid-homeostasis metabolism and energy storage (glycogen)-related functions (Figure 3 and GO test results in Supplementary file 2). Taken together, our results suggest a role of Ss4R ohnologs in adaptive evolution of novel gene expression regulation related to new pathogenic pressures in a new type of habitat, and also optimization of lipid-homeostasis and glycogen metabolism-related functions in response to evolution of a more active pelagic/riverine life with limited lipid resources.
Our comparative analysis of Ss4R duplicates in Atlantic salmon and grayling suggests a difference in the rate of rediploidization between the two species. We find a set of LORe regions, corresponding to whole chromosome arms in Atlantic salmon [17, 16], represented by single copy genes in grayling as a result of assembly collapse. This strongly suggests that sequences are in fact present as near-identical duplicated regions in the grayling genome. In combination with an overall lower neutral sequence divergence observed among Ss4R duplicates in grayling, this finding further supports a lower rate of rediploidization in the grayling genome as compared to that in the Atlantic salmon lineage. The larger chromosome arm-sized regions still being virtually indistinguishable at the sequence level (~10% in total, i.e. blue ribbons in Figure 2b) are likely still recombining or have only ceased to do so in the recent evolutionary past. Large-scale chromosomal rearrangements often follow genome duplication to block or hinder recombination among duplicated regions [42]. The difference we observe in the rediploidization history of the two salmonids is thus likely linked to the distinctly different chromosome evolution in Atlantic salmon and grayling (Supplementary Figure S1) [43].
Evolution of anadromy, the ability to migrate between fresh- and seawater, is a fundamental difference in life history strategies between Atlantic salmon and European grayling. Among the ohnologs with grayling-specific divergence (EEF4), we found overrepresentation of genes associated with “ion homeostasis” functions. One of these ohnolog pairs comprises two CFTR genes, encoding a membrane chloride channel that exports chloride ions out of cells [44]. The grayling CFTR ohnolog with diverged tissue regulation has lost gill-tissue specificity and shows relaxed purifying selection pressure at the protein coding sequence level as well (>2-fold increase in dN/dS, Figure 5). The Ss4R CFTR ohnologs in Atlantic salmon are both primarily expressed in gills (Figure 5), and both are upregulated upon exposure to seawater [45]. It is therefore plausible that maintaining two functional CFTR genes is an adaptive trait in anadromous salmonids in that it improves their ability to remove excess chloride ions and maintain ion homeostasis in the sea. Conversely, in non-anadromous species, there is no selective pressure to maintain both CFTR copies, and this has resulted in the return to a single functional CFTR ohnolog copy in grayling.
In summary, we present the draft genome of European grayling using an efficient and cost effective short read sequencing strategy. The comparative genome and transcriptome analysis between Atlantic salmon and grayling provides novel insights into evolutionary fates of ohnologs subsequent to WGD and into associations between signatures of selection pressures on gene duplicate regulation and the evolution of key traits, including anadromy. Hence, the genome resource of grayling opens up new exciting avenues for utilizing salmonids as a model system to understand the evolutionary consequences of WGD in vertebrates.
Methods
Sampling and sequencing
A male grayling specimen was sampled outside of its spawning season (October 2012) from the River Glomma at Evenstad, Norway. The fish was humanely sacrificed and various tissue samples were immediately extracted and conserved for later DNA and RNA analysis. Fin clips were stored on 96% ethanol for DNA sequencing. Tissues from muscle, gonad, liver, head kidney, spleen, brain, eye, gill and heart were stored in RNALater for RNA extraction. The DNA was extracted from fin clips using a standard high salt DNA extraction protocol. A paired-end library with an insert size ~180 (150 bp read length) and mate pair libraries of insert size ~3kb and 6 kb (100bp read length) were sequenced using the Illumina HiSeq2000 platform (Table S1). Total RNA was extracted from the different tissue samples using the RNeasy mini kit (Qiagen) following the manufacturer’s instructions. The library construction and sequencing was carried out using Illumina TruSeq RNA Preparation kit on Illumina HiSeq2000 (Table S2). All the library preparation and sequencing was performed at the McGill University and Génome Québec Innovation Centre.
Genome assembly and validation
The sequences were checked for their quality and adapter trimming was performed using cutadapt (version 1.0) [46]. A de novo assembly was generated with Allpaths-LG (release R48777) [27] using the 180bp paired-end library and the mate pair (3kb and 6kb) libraries. Assembly polishing was carried out using pilon (version 1.9) [28]. The high copy number of mitochondrial DNA often leads to high read coverage and thus misassembly. The mitochondrial genome sequence in the assembly was thus reassembled by extracting the reads that mapped to the grayling (Thymallus thymallus) mtDNA sequence (GenBank ID: NC_012928), followed by a variant calling step using Genome Analysis Toolkit (GATK) (version 3.4-46) [47]. The consensus mtDNA sequence thus obtained was added back to the assembly.
To identify and correct possibly erroneous grayling scaffolds, we aligned the scaffolds against a repeat masked version of the Atlantic salmon genome [16] using megablast (E-value threshold 1e-250). Stringent filtering of the aligned scaffolds (representing 1.3 Gbp of the 1.4 Gbp assembly) identified 13 likely chimeric scaffolds mapping to two or more salmon chromosomes (Supplementary File 1), which were then selectively ‘broken’ between, apparently, incorrectly linked contigs.
Transcriptome assembly
The RNAseq data from all the tissue samples were quality checked using FastQC (version 0.9.2). The sequences were assembled using the following two methods. Firstly, a de-novo assembly was performed using the Trinity (version 2.0.6) [48] pipeline with default parameters coupled with in-silico normalization. This resulted in 730,471 assembled transcript sequences with a mean length of 713 bases. RSEM protocol based abundance estimation within the Trinity package was performed where the RNA-seq reads were first aligned back to the assembled transcripts using Bowtie2 [49], followed by calculation of various estimates including normalized expression values such as FPKM (Fragments Per Kilobase Million). A script provided with Trinity was then used to filter transcripts based on FPKM, retaining only those transcripts with a FPKM of at least one. Secondly, reference guided RNA assembly was performed by aligning the RNA reads to the genome assembly using STAR (version 2.4.1b) [50]. Cufflinks (version 2.1.1) [50, 51] and TransDecoder [52] were used for transcript prediction and ORF (open reading frame) prediction, respectively. The resulting transcripts were filtered and retained based on homology against zebrafish and stickleback proteins, using BlastP and PFAM (1e-05). The de-novo method resulted in 134,368 transcripts and the reference based approach followed by filtering resulting in 55,346 transcripts.
Genome Annotation
A de novo repeat library was constructed using RepeatModeler with default parameters. Any sequence in the de-novo library matching a known gene was removed using Blastx against the UniProt database. CENSOR and TEclass were used for classification of sequences that were not classified by RepeatModeler. Gene models were predicted using an automatic annotation pipeline involving MAKER (version2.31.8), in a two-pass iterative approach (as described in https://github.com/sujaikumar/assemblage/blob/master/README-annotation.md). Firstly, ab initio gene predictions were generated using GeneMark ES (version 2.3e) [53] and SNAP (version 20131129) [54] trained on core eukaryotic gene dataset (CEGMA). The first round of MAKER was then run using the thus generated ab initio models, with the UniProt database as the protein evidence, the de novo identified repeat library and EST evidences from the transcriptomes assembled using de novo and the reference guided approaches, along with the transcript sequences from the recent Atlantic salmon annotation [16]. The second pass involved additional data from training AUGUSTUS [55] and SNAP models on the generated MAKER predictions. Putative functions were added to the gene models using BlastP against the UniProt database (e-value 1e-5) and the domain annotations were added using InterProScan (version 5.4-47) [56]. Using the MAKER standard filtering approach, the resulting set of genes were first filtered using the threshold of AED (Annotation Edit Distance), retaining gene models with AED score less than 1 and PFAM domain annotation. AED is a quality score given by MAKER that ranges from 0 to 1 and indicates the concordance between predicted gene model and the evidence provided, where an AED of 0 indicates that the gene models completely conforms to the evidence. Further, for the genes with AED score of 1 and no domain annotations, a more conservative Blast search was performed against UniProt proteins and Atlantic salmon proteins with an e-value cut-off of 1e-20. The genes with hits to either of these databases were also retained. The completeness of the annotations was again assessed using CEGMA [57] and BUSCO [58].
Analysis of orthologous groups
We used orthofinder (version 0.2.8, e-value threshold at 1e-05) [59] to identified orthologous gene groups (i.e orthogroup). As input to orthofinder, we used the MAKER-derived T. thymallus gene models as well as protein sequences from three additional salmonid species (Atlantic salmon, Rainbow trout and coho salmon), four non-salmonid teleost species (Esox lucius, Danio rerio, Gasterosteus aculeatus, Oryzias latipes) and two mammalian outgroups (Homo sapiens, Mus musculus). Rainbow trout protein annotations were taken from https://www.genoscope.cns.fr/trout/. Atlantic salmon, Esox lucius data were downloaded from NCBI ftp server (ftp://ftp.ncbi.nih.gov/genomes/, release 100). The transcriptome data for coho salmon was obtained from NCBI (GDQG00000000.1) and translated using TransDecoder. All other annotations were downloaded from ENSEMBL.
Each set of orthogroup proteins were then aligned using MAFFT(v7.130) [60] using default settings and the resulting alignments were then used to infer maximum likelihood gene trees using FastTree (v2.1.8) [61] (Figure 1 a and b). As we were only interested in gene trees containing information on Ss4R duplicates, complex orthogroup gene trees (i.e. containing 2R or 3R duplicates of salmonid genes) were subdivided into the smallest possible subtrees. To this end, we developed an algorithm to extract all clans (defined as unrooted monophyletic clade) from each unrooted tree [62] with two monophyletic salmonid tips as well as non-salmonid outgroups resulting in a final set of 20,342 gene trees. In total, 31,291 grayling genes were assigned to a clan (Figure 1 and Supplementary Figure S2). We then identified homoelogy in the Atlantic salmon genome by integrating allvs-all protein BLAST alignments with a priori information of Ss4R synteny as described in Lien et al. 2016 [16]. Using the homeology information, we inferred a set of high confidence ohnologs originating from Ss4R. The clans were grouped based on the gene tree topology into duplicates representing LORe and those with ancestrally diverged duplicates. The LORe regions were further categorized into two (duplicated or collapsed) based on the number of corresponding T.thymallus orthologs. This data was plotted on Atlantic salmon chromosomes using circos plot generated using OmicCircos (https://bioconductor.org/packages/release/bioc/html/OmicCircos.html).
Expression divergence and conservation
The grayling RNA-seq reads from each of the eight tissues (liver, muscle, spleen, heart, head kidney, eye, brain, gills) were mapped to the genome assembly using STAR (version 2.4.1b). The reads uniquely mapping to the gene features were quantified using htseq-count [63]. The CPM value (counts per million), here used as a proxy for expression, was then calculated using edgeR [64]. Similar CPM datasets were obtained from Atlantic salmon RNA-seq data reported in Lien et al [16]. Filtering of ortholog groups (i.e. clans) was performed prior to analyses of expression evolution of Ss4R ohnologs: 1) we only considered Ss4R duplicates that were retained in both Atlantic salmon and grayling, 2) the Ss4R duplicates were classified into AORe or LORe, based on topologies of the ortholog group gene trees, only genes with non-zero CPM value were considered. This filtering resulted in a set of 5,026 duplicate pairs from both Atlantic salmon and grayling, referred to as ohnolog-tetrads. The gene expression values from the gene duplicates in the ohnolog-tetrads were clustered using hclust function in R, using Pearson correlation into eight tissue dominated clusters. The expression pattern in the eight clusters of the genes in ohnolog-tetrads was used to further classify them into one of the EEF categories. To quantify the breadth of expression (i.e., the number of tissues a gene is expressed in), we calculated the tissue specificity index Tau [65] for all the genes in ohnolog-tetrads, where a Tau value approaching 1 indicates higher tissue specificity while 0 indicates ubiquitous expression.
Sequence evolution
To estimate coding sequence evolution rates, we converted amino acid alignments to codon alignments using pal2nal [66]. The seqinr R package (http://seqinr.r-forge.r-project.org/) was used to calculate pairwise dN and dS values for all sequences in each alignment using the “kaks” function. For in-depth analyses of branch specific sequence evolution of the CFTR genes, we used the codeml model in PAML (version 4.7a) [67]. To assess if sequences in the CFTR gene tree evolved under similar selection pressure we contrasted a fixed dN/dS ratio (1-ratio) model and a free-ratio model of codon evolution. A likelihood ratio test was conducted to assess whether a free ratio model was a significantly better fit to the data. Branch specific dN/dS values were extracted from the ML results for the free ratios model.
The two Pacific salmon genes in the CFTR tree (Figure 5) correspond to a gene from Rainbow trout and another from Coho salmon. A blat search of CFTR gene against the Rainbow trout assembly (https://www.genoscope.cns.fr/trout/) resulted in hits on three different scaffolds, with one complete hit and two other partial hits on unplaced scaffolds. Additionally, Coho salmon data is based on a set of genes inferred from transcriptome data. Therefore, the presence of a single copy in the tree for the two species is likely an assembly artifact.
Gene Ontology (GO) analysis
The gene ontology term (GO) enrichment analysis was performed using elim algorithm of topGO R package (http://www.bioconductor.org/packages/2.12/bioc/html/topGO.html), with a significance threshold of 0.05 against the reference set of all Ss4R duplicates.
Data availability
The Illumina reads have been deposited at ENA under the project accession: PRJEB21333. The genome assembly and annotation data are available at https://doi.org/10.6084/m9.figshare.c.3808162.
Author’s contributions
LAV, KSJ, SJ and SL planned the project and generation of the data. SRS and SV performed all the analyses with help from AJN, OKT and SL. SRS, SV and AJN drafted the manuscript. All authors read and commented on the manuscript.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This research was supported by funding from University of Oslo to the SAK project “Building a marine genome hub” and the Strategic Research Initiative, Center for Computational Inference in Evolutionary Life Science (CELS) to KSJ. We thank Kim M. Bærum for sampling of grayling and Marianne H. S. Hansen for excellent technical assistance. Sample preparation, library contruction and sequecning were carried out at the Norwegian Sequencing Centre (NSC), Norway and McGill University and Génome Québec Innovation Centre, Canada. The computational work was performed on the Abel Supercomputing Cluster (Norwegian Metacenter for High-Performance Computing (NOTUR) and the University of Oslo), operated by the Research Computing Services group at USIT, the University of Oslo IT-department. We greatly appreciate Daniel J. Macqueen, Torgeir R. Hvidsten and Marine S. Brieuc for critical reading of the manuscript.