Giant lungfish genome elucidates the conquest of land by vertebrates

Abstract

Lungfishes belong to lobe-fined fish (Sarcopterygii) that, in the Devonian period, 'conquered' the land and ultimately gave rise to all land vertebrates, including humans^1,2,3. Here we determine the chromosome-quality genome of the Australian lungfish (Neoceratodus forsteri), which is known to have the largest genome of any animal. The vast size of this genome, which is about 14× larger than that of humans, is attributable mostly to huge intergenic regions and introns with high repeat content (around 90%), the components of which resemble those of tetrapods (comprising mainly long interspersed nuclear elements) more than they do those of ray-finned fish. The lungfish genome continues to expand independently (its transposable elements are still active), through mechanisms different to those of the enormous genomes of salamanders. The 17 fully assembled lungfish macrochromosomes maintain synteny to other vertebrate chromosomes, and all microchromosomes maintain conserved ancient homology with the ancestral vertebrate karyotype. Our phylogenomic analyses confirm previous reports that lungfish occupy a key evolutionary position as the closest living relatives to tetrapods^4,5, underscoring the importance of lungfish for understanding innovations associated with terrestrialization. Lungfish preadaptations to living on land include the gain of limb-like expression in developmental genes such as hoxc13 and sall1 in their lobed fins. Increased rates of evolution and the duplication of genes associated with obligate air-breathing, such as lung surfactants and the expansion of odorant receptor gene families (which encode proteins involved in detecting airborne odours), contribute to the tetrapod-like biology of lungfishes. These findings advance our understanding of this major transition during vertebrate evolution.

Main

Lungfish (Dipnoi) share with land-dwelling vertebrates the ability to breathe air though lungs, which are homologous to our own. Since their discovery in the nineteenth century, lungfish have attracted scientific interest and were initially thought to be amphibians^6,7. We now know that they are more closely related to tetrapods than to ray-finned fish. Of the extant lungfish species (of which there are only six), four live in Africa, one in South America and one (N. forsteri) in Australia. Lungfish appeared in the fossil record in the Devonian period, around 400 million years ago (Ma)¹. Some scholarship has discussed lungfish as 'living fossils', because their morphology barely changed over millions of years: for example, >100-million-year-old fossils from Australia strongly resemble the surviving species (which represents one of the oldest known animal genera, discovered exactly 150 years ago)². Owing to the ancestral characters (such as body shape, large scales and paddle-shaped fins) of N. forsteri, it resembles 'archetypal' extinct lungfish much more than the two other lineages of extant lungfish. The South American and, in particular, the African lungfish have almost completely lost their scales secondarily and have simplified their fin morphology into thin filaments, although they do show the alternating gaits that are typical of terrestrial locomotion.

Together with the coelacanths and tetrapods, lungfish are members of the Sarcopterygii (lobe-finned fish); however, owing to the short branch that separates these three ancient lineages it has remained difficult to resolve their relationships. Developments of powerful DNA sequencing and computational methods enable us to now revisit long-standing evolutionary questions regarding these relationships using whole-genome-derived datasets with more robust orthology inferences than have hitherto been possible. Previous analyses using large transcriptomic datasets have tended to support the hypothesis that lungfish are the closest living relatives of tetrapods^4,5. Lungfish are therefore crucial for understanding the evolution and preadaptations that accompanied the transition of vertebrate life from water to land. This major evolutionary event required a number of evolutionary innovations, including in respiration, limbs, posture, the prevention of desiccation, nitrogen excretion, reproduction and olfaction. Lungfish are known to have the largest animal genome (http://www.genomesize.com/search.php), but the mechanisms that led to and maintained their genome sizes are poorly understood. Therefore, the Australian lungfish might provide insights both into tetrapod innovations and evolution, and the structure of giant genomes.

Genome sequencing, assembly and annotation

The largest animal genome sequenced so far is the 32-Gb⁸ genome of the axolotl salamander (Ambystoma mexicanum). To overcome the challenges of sequencing and assembling the even-larger genomes of lungfish, we used long- and ultra-long-read Nanopore technology to generate 1.2 Tb in 3 batches: 601 Gb with an N50 read-length of 9 kb; 532 Gb with an N50 of 27 kb; and 1.5 Gb with an N50 of 46 kb, all from a juvenile Australian lungfish. We assembled these three batches into contigs using the MARVEL assembler⁸ (Extended Data Fig. 1a, Methods). This yielded a 37-Gb assembly with an N50 contig size 1.86 Mb (Supplementary Table 1). To correct for insertions and/or deletions, gaps, single-nucleotide polymorphisms and small local misalignments in the primary assembly, we used 1.4-Tb DNA and 499.8-Gb RNA Illumina reads. The genome-correction DNA data—sequenced at more than 30× coverage—were used to estimate genome size through frequencies of k-mers (Extended Data Fig. 2). We ascertained the high completeness of the 37-Gb assembly by observing that 88.2% of the DNA and 84% of the RNA sequencing (RNA-seq) reads aligned to the genome, which gives an estimated total genome size of 43 Gb (about 30% larger than the axolotl⁸). This matches the k-mer value but is smaller than that predicted by flow cytometry (52 Gb⁹) and Feulgen photometry (75 Gb¹⁰).

Next, we scaffolded the contigs using 271-Gb chromosome conformation capture (Hi-C) Illumina PE250 reads to a chromosome-scale assembly with an N50 of 1.75 Gb (Extended Data Fig. 1d, Methods). We also used Hi-C data to detect misjoins, by binning Hi-C contacts along the diagonal and identifying points that were depleted of contacts (Extended Data Fig. 1e). The largest scaffolds correspond to the 17 macrochromosomes arms of the karyotype of N. forsteri. We also assembled all ten microchromosomes into single scaffolds (Supplementary Information).

We constructed a comprehensive multi-tissue de novo transcriptome assembly (BUSCO score of over 98% core vertebrate genes) using RNA extracted from the same individual lungfish. For annotation of protein-coding genes, we combined evidence from transcript alignments and homology-based gene prediction. This resulted in 31,120 high-fidelity gene models. We assessed the completeness of the genome assembly using the predicted gene set and the BUSCO pipeline, detecting 91.4% of core vertebrate genes (233 genes) and 90.9% of vertebrate conserved genes (2,586 genes) (Supplementary Table 2). We predicted 17,095 noncoding RNAs (ncRNAs), including 1,042 transfer RNAs (tRNAs), 1,771 ribosomal RNAs (rRNAs) and 3,974 microRNAs (Supplementary Table 3, Supplementary Information).

Phylogeny of lungfish, coelacanth and tetrapods

Phylogenetic relationships among coelacanths, lungfishes and tetrapods have been debated^4,5,11. We used Bayesian phylogenomics (Fig. 1) with 697 one-to-one orthologues for 10 vertebrates, with a complex mixture model that can overcome long-branch attraction artefacts⁴ and also used noncoding conserved genomic elements (96,601 aligned sites) (Extended Data Fig. 3a). Both datasets unequivocally support lungfish^4,5 as the closest living relatives of land vertebrates, with which they shared a last common ancestor around 420 Ma (Extended Data Fig. 3b).

**Fig. 1: Bayesian phylogeny based on 697 one-to-one orthologues.**

Synteny conserved of macro- and microchromosomes

Lineage-specific polyploidy events are important evolutionary forces¹² that can also lead to genome expansions in lungfish^9,13. Despite the massive genome expansion in lungfish relative to other animals, the lungfish chromosomal scaffolds strongly resemble the ancestral chordate karyotype (Fig. 2a, Extended Data Fig. 4 a, b). On the basis of 17 chordate linkage groups (CLGs)^14,15 and 6,337 markers mapped onto the lungfish genome, we uncovered conserved syntenic correspondence between lungfish chromosomes and CLGs (Fig. 2a). The ancestor of vertebrates underwent two rounds of whole-genome duplication. Lungfish also retained more ancient CLG chromosomal fusions through these two rounds of vertebrate duplication¹⁵. In lungfish, CLG fusions from before the second round of whole-genome duplications are preserved intact but substantially expanded (Fig. 2b). Almost all additional CLG fusions happened recently, as indicated by sharp syntenic boundaries (Fig. 2b). This, along with the 'vertebrate-typical' gene number of N. forsteri, confirms the diploidy of the genome.

**Fig. 2: Conserved synteny and chromosomal expansion in lungfish.**

All ten lungfish microchromosomes (inferred from karyotype⁹ and our assembly (Extended Data Fig. 4)) could be homologized to the microchromosomes of chicken and gar (Fig. 2c, Extended Data Fig. 4c, d)—and even they mostly retained their co-linearity. This, along with the conservation of some microchromosomes in gar, chicken and green anole^15,16, suggests that microchromosomes may date back to the earliest vertebrates. The complete retention of microchromosomes in the massively expanded lungfish genome suggests that stabilizing selection maintains these ancestral units. In support of this, lungfish microchromosomes show—on average—higher gene densities and a lower density of long interspersed nuclear elements (LINEs), which are the major contributors to genome size (Extended Data Fig. 4b); this also suggests different expansion dynamics of vertebrate micro- and macrochromosomes.

Hallmarks of the giant lungfish genome

A maximum likelihood reconstruction of the ancestral genome sizes of vertebrates shows 2 major independent genome-expansion events in lungfish and salamander lineages (Extended Data Fig. 3c), initially at similar rates in both lineages (161–165 Mb per million years) but subsequently at slower rates in the Australian lungfish (about 39 Mb per million years), but possibly not in the other lineages of extant lungfishes. The genome expansion happened in early lungfishes (around 400–200 Ma), and slowed during the break up of Gondwana (from around 200 Ma to present) (Extended Data Fig. 3c). Independently, genome size increased in salamanders in two independent waves of DNA-repeat expansion (Fig. 3b, Extended Data Figs. 3c, 5). LINEs make up much of the recent genome growth of the lungfish (<15% divergence, around 9% (4 Gb), also in an earlier burst in lungfish but not axolotl) (Extended Data Fig. 5a). Because mobilized transposable elements can interrupt gene function, one might speculate that such bursts of activity of transposable elements might have caused novel gene functions.

**Fig. 3: Composition of repetitive elements in the lungfish genome.**

Although syntenically highly conserved, the lungfish genome has undergone extreme expansion through the accumulation of transposable elements. We performed standard repeat-masking procedures on the 37-Gb genome assembly, which identified 67.3% (24.65 Gb) as repetitive (Fig. 3a, Supplementary Table 4). To our knowledge, this is the highest repetitive DNA content in a genome found in the animal kingdom. We tested whether the remaining 13 Gb of the genome have signatures of repetitiveness that are obscured by genome size by applying a second round of repeat annotation on the hard-masked genome. This revealed an additional 23.92% of repetitive DNA (Fig. 3a), which was mostly classified as 'unknown' (adding 11% to the unknown portion of repetitive DNA) or 'LINE' (8.5%) (Supplementary Tables 5, 6). In total, around 90% of the lungfish genome is repetitive, and it expanded in two waves (Fig. 3a, Extended Data Fig. 5).

To investigate whether transposable elements are still active, we analysed poly(A)-RNA-derived RNA-seq data that probably relates to proteins relevant for transposition activity. All major categories of transposable elements (1,106 out of 1,821 (60.7%)) were expressed (Extended Data Fig. 6a). Transposable element families with higher copy numbers were also highly expressed in all three tissues we tested. This, and the finding of similar copies for many transposable element families, suggests that several types of transposable element remain active and contribute to the ongoing expansion of the lungfish genome. Identification of insertion polymorphisms between two, ideally relatively closely related lungfish species (such as Protopterus from Africa) are necessary to confirm transposable element activity. Apparently, the transposon silencing machinery did not adapt to reduce overabundant transposable elements by copy number expansion or structural changes (Supplementary Table 7).

The repeat landscape (proportions of major classes of transposable element) of lungfish resembles tetrapods (including axolotl), whereas the third extant sarcopterygian lineage (the coelacanths) is more 'fish'-like (Fig. 3b). The two largest animal genomes yet sequenced expanded through different temporal dynamics. Whereas long terminal repeat (LTR) elements are the most abundant class of transposable element (59%) in axolotl⁸, LINEs (25.7%; mostly CR1 and L2 elements) dominate in lungfish (Extended Data Figs. 5, 6). These two retrotransposon classes belong to the same copy-and-paste (and not cut-and-paste) category, but propagate via different mechanisms¹⁷. Although global repeat compositions differ between lungfish and axolotl, the same LTR class affects their genic regions (Extended Data Fig. 6, Supplementary Information).

To further understand genome growth in lungfish, we compared the genome structure of N. forsteri with that of other genomes (Extended Data Figs. 6c, d, 7). Although compact genomes have small introns, intragenic noncoding regions usually increase with genome size¹⁸. The largest intron of the lungfish is 5.8 Mb (in the dmbt1 gene) and average intron size is 50 kb as in axolotl, compared to 1 kb in fugu and 6 kb in human. Introns in the N. forsteri genome comprise about 8 Gb (21% of genome)—a similar proportion to that in human (21%), but half that of fugu (40%). This suggests that similar mechanisms affect the genic and intergenic compartments, following expectations for genome size evolution¹⁹.

In most genes, the first intron typically is the largest. The biological relevance of this remains unclear. The first introns in lungfish and axolotl are also much larger than downstream introns (Extended Data Fig. 7), which indicates that the relatively larger first introns in smaller genomes are probably not due to the space requirements of regulatory or structural motifs²⁰.

It has previously been suggested that the size of intragenic noncoding sequences and the extent of intron expansion are associated with organismal features (such as metabolic rate¹⁸) or functional categories of gene⁸ (for example, developmental or nondevelopmental genes). Similar to axolotl⁸, the introns in developmental genes in lungfish are smaller than in nondevelopmental genes (P = 2.166 × 10⁻⁸, Mann–Whitney U test) (Supplementary Table 8).

Genomic preadaptations in fish–tetrapod transition

Positive selection analysis uncovered 259 genes, many of which are related to oestrogen and categories related to female reproduction (Supplementary Information, Supplementary Table 9). We compared these rate dynamics (16,471 gene families) (Supplementary Tables 10,11), and found that in the lungfish lineage 24 families have contracted and 107 families have expanded—possibly related to evolutionary innovations.

Air breathing and the evolution of lungs

All land-living vertebrates and adult lungfish are air breathers. The pulmonary surfactant protein B family of genes has expanded considerably in the lungfish genome. Surfactants are necessary components of the lipoprotein mixture that covers the lung surface and ensures proper pulmonary function. In lungfish, the number of surfactant genes increased to a number typical for tetrapods (2–3× more than in cartilaginous and bony fish) (Supplementary Table 12). This may indicate an adaptation to air breathing in lungfish. We further investigated the expression of shh, which encodes an important regulator of lung development²¹, during lungfish embryogenesis (Extended Data Fig. 8a). shh is strongly expressed in the developing lungs (embryos at stages 43–48), visualizing the development of the right-sided lung (Neoceratodus has a unilateral lung). This lung develops in a manner notably similar to those of amphibians²². Altogether, this highlights molecular signatures of lungs that were necessary for the conquest of land by sarcopterygians.

Olfaction and evolution of the vomeronasal organ

We also noted expansions of genes involved in olfaction. The gene complement of receptors for airborne odorants (which is large and complex in tetrapods and small in fish) is considerably expanded in lungfish, whereas several receptor classes for waterborne odours have shrunk—in particular, zeta and eta receptors, which abound in teleost fishes (Supplementary Table 13). The vomeronasal organ (VNO) is present in most tetrapods^23,24, being linked to pheromone reception and expressing a large repertoire of vomeronasal receptor genes (particularly in amphibians). In N. forsteri, the vomeronasal receptor gene family—known from fish and even lampreys, although its function in these species is unknown—has expanded considerably. Lungfish possess a 'VNO primordium'²⁵. The notable expansion of the vomeronasal receptor gene family (especially V2R genes) in N. forsteri (Supplementary Table 14) shows that the VNO is a tetrapod innovation, which emerged in the water-to-land transition.

Lobed fins and evolution of terrestrial locomotion

Sarcopterygians have elaborated endochondral skeletons: lobed fins that are distally branched, forming digits that are suitable for substrate-based locomotion. Our analysis indicates sarcopterygian origins for 31 conserved tetrapod limb-enhancer elements²⁶ (Fig. 4a, Extended Data Fig. 8b). The hs72 (refs. ^27,28) enhancer (related to sall1) drives autopodal expression (Fig. 4b). We found sall1 strongly expressed in lungfish embryos, in expression patterns similar to those reported for tetrapods²⁹ (Fig. 4b) but absent during zebrafish fin development³⁰. Similar functions of sall1 during mouse limb development²⁹ suggest that this gene contributed to the acquisition of sarcopterygian lobed fins already in lungfish.

**Fig. 4: Regulatory preadaptation of lobed fin and hoxd gene regulation.**

Hox clusters and te fin-to-limb transition

The 4 clusters of hox genes in Neoceratodus (hoxa, hoxb, hoxc and hoxd) comprise 43 genes (Extended Data Fig. 9); the presence of hoxb10 and hoxa14 in lungfish confirms their loss at the fish-to-tetrapod transition¹¹. Our RNA-seq analysis of the expression of hox genes in the fins of larval Neoceratodus (Extended Data Fig. 8c) showed an unexpected expression of hoxc genes. The expression of hoxc genes in paired fins or limbs has previously been reported only for mammals³¹, related to the nail bed. We observed hoxc13 expression in axolotl limbs (Fig. 4c), but it was absent in the pectoral fins of ray-finned fish (Extended Data Fig. 8d). Transcript localization in Neoceratodus embryos showed expression of hoxc13 in the distal fin (Fig. 4c). This indicates an early gain of hoxc13 expression in sarcopterygians, suggesting co-option of this domain in tetrapods to pattern dermal limb elements (such as nails, hooves and claws). Together with sall1, this demonstrates an early sarcopterygian origin of limb-like gene expression that was ready for tetrapod co-option, facilitating the fin-to-limb transition and colonization of the land.

Hox cluster expansion versus regulation

Consistent with the overall genome expansion, the hox clusters of Neoceratodus are larger than in mouse, chicken and Xenopus, but have an uneven pattern of expansion (Extended Data Fig. 9). The clustering of hoxd genes results in their coregulation by enhancers 3′and 5′ of the cluster, leading to co-expression of hoxd9, hoxd10, hoxd11, hoxd12 and hoxd13 in the distal appendages^32,33,34,35. During fin development in Neoceratodus, expression of hoxd11 is nearly absent from the hoxd13 territory³⁶ (Fig. 4d) whereas in axolotl hoxd9, hoxd10 and hoxd11 are excluded from the hoxd13 digit domain³⁷ (Extended Data Fig. 8e). Such apparent loss of coregulation between hoxd13 and hoxd9, hoxd10 and hoxd11 is similar to that caused by experimentally increased distances in the hoxd cluster³², and suggests a disruption of enhancer sharing caused by the expansion of the intergenic regions between hoxd11 and hoxd13 (Fig. 4e). We performed additional analyses in mouse, Xenopus, lungfish and axolotl, which showed that—despite 5–10× differences in the size of the hoxd cluster—the region comprising hoxd8, hoxd9, hoxd10 and hoxd11 remained fixed at around 25 kb (Fig. 4e). This apparent constraint is probably due to sharing of enhancers located at the 3′ end of the cluster³³. Altogether, this indicates that hoxd expansion has partially disrupted long-range enhancer sharing, but that—conversely—such mechanisms have locally also constrained intergenic distances.

We have sequenced and assembled at the chromosome level (Supplementary Table 15) the largest animal genome, and have substantiated the hypothesis that lungfish are the closest living relatives of tetrapods. Despite the unique genome expansion history of lungfish, genic organization and chromosomal homology is maintained even at the level of microchromosomes. Genomic preadaptations in lungfish for the water-to-land transition of vertebrates include a larger complement of lung-expressed surfactant genes, which might have facilitated the evolution of air-breathing through a lung. In addition, the number of VNO olfactory receptors (as well as other receptor gene families that permit detection of airborne odours) increased in the lineage that led to air-breathing lungfish. The uneven expansion of hox clusters demonstrates the regulatory consequences of, and constraints on, genome expansion. The evolutionary trajectory of limb enhancers shows an early-fish origin of the limb regulatory program, with important changes towards preadaptations for terrestrialization preceding the fin-to-limb transition. Gene expression domains that characterize the tetrapod limb, but which were previously presumed to be absent from fins (such as those of sall1 and hoxc13), appeared in the lobe-finned lineage. Such novelties might have predisposed the sarcopterygians to conquer the land, demonstrating how the lungfish genome can contribute to a better understanding of this major transition in vertebrate evolution.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.

Biological materials

Biopsy material for DNA and RNA isolation was obtained from a juvenile Australian lungfish (N. fosteri) imported from Australia (CITES permit no.: PWS 2017-AU-000242). Owing to the immature status of the gonad, the sex could not be determined. The same specimen was used for genome sequencing (muscle), construction of the Hi-C library (spleen) and transcriptome sequencing of brain, gonad and liver. The second set of reads was generated from lungfish embryos (embryonic stage 52, GenBank accession numbers SRR6297462–6297470)³⁶. Embryos were bred and collected under permit ARA 2009.039 at Macquarie University.

DNA extraction, genome sequencing and assembly

High molecular weight (HMW) and ultra-HMW DNA was prepared by FutureGenomics and Nextomics, and sequenced using Nanopore technology (for statistics, see Supplementary Table 1).

gDNA for genome correction from snap-frozen lungfish muscle tissue (0.3 g) was isolated by a standard gDNA isolation protocol. Library preparation was performed using the Westburg NGS DNA library kit. The final library was excised by Pippin prep with 400-bp DNA size and sequenced (Illumina Nova-seq S2; PE150) at Vienna Bio Center NGS facility.

Hi‐C library was generated as previously described^38,39, with modifications detailed in Supplementary Methods. Final Hi‐C libraries were sequenced (Illumina Nova-seq SP; PE150) at Vienna Bio Center NGS facility.

Genome assembly

Ninety-six million reads comprising 1.2 Tb were assembled using the MARVEL genome assembler⁸. We first aligned 1% of the reads against all other reads. From these 1%-against-all alignments, we derived information on the repetitive elements present in the reads and used transitive transfer to repeat-annotate all reads used in the assembly. Regions were deemed repetitive when the depth of the alignments for a given read exceeded the expected depth fourfold. Given the alignment of the 1% against every other read in the assembly, we then transferred the repeat annotation of the 1% using the alignments to the respective position in the aligned reads. Here, the assumption is that when region (a, b) in read A aligns to (c, d) in read B and for a ≤ rb ≤ re ≤ b (in which rb and re are repetitive elements); this than can be mapped using the alignment to a corresponding region in B, which then can be tagged as repetitive as well. The final repeat-masking track covered 28.7% of the 1.2 Tb.

We then processed with an all-against-all alignment with repeat masking in place, yielding five billion alignments. On the basis of these alignments, we derived read qualities at 100-bp resolution, highlighting low sequencing quality regions in the reads. Using the alignments and the read qualities structural weaknesses (chimeric breaks, high-noise regions and other sequencing artefacts) in the reads were repaired (Supplementary Methods, Extended Data Fig. 10).

Repaired reads were then used for a new round of alignments, again with repeat masking, in place. After alignment, the default MARVEL assembly pipeline proceeded as shown in the included examples of the source distribution (Extended Data Fig. 1).

For the current MARVEL source code repository, see https://github.com/schloi/MARVEL. For sample execution scripts, see https://github.com/schloi/MARVEL/tree/master/examples.

Scaffolding

We used an agglomerative hierarchical-clustering-based scaffolding approach using various normalizations (Extended Data Fig. 1). For details, see Supplementary Methods.

We created initial clusters by selecting the largest contigs with the fewest contacts between them, each contig serving as a single cluster. We then added contigs on the basis of unique assignability to clusters. This was followed by scaffolding the cluster separately, visual inspection of an approximate contact map derived during the scaffolding process and return of wrongly assigned contigs to the set of unassigned contigs. We created contact maps for all clusters and merged or split clusters on the basis of the signal within those. The process of assigning contigs, scaffolding, merging and splitting clusters was repeated until no more useful changes could be made to the clusters (Supplementary Table 15 for comparison of chromosome and scaffold DNA content).

For the public source code repository, see https://github.com/schloi/MARVEL/.

The MARVEL assembler and scaffolder has previously been used to obtain a chromosome-scale axolotl genome assembly, which has been validated in comparison to the previously published chromosome-scale meiotic scaffolding⁴⁰ and is available as previously described⁴¹.

Genome assembly correction

For correction of errors (insertions and/or deletions (indels), base substitutions and small gaps) remaining after the genome assembly, we applied a two-step procedure using DNA-sequencing and RNA-seq reads separately. In brief, we sequenced the same genomic DNA sample and generated 4,693,324,032 high-quality read pairs (2 × 150 bp) (30× coverage). Additionally, we used the RNA-seq reads from the de novo transcriptome assembly to correct indels, but not base substitutions, in transcribed regions (Supplementary Methods, Supplementary Results, Extended Data Fig. 10).

Transcriptome assembly

RNA was isolated from brain, spinal cord, eyes, gut, gonad, liver, jaw, gills, pectoral fin, caudal fin, trunk muscles and larval fin. Libraries were constructed using NEBNext Ultra II Directional RNA library preparation kit (New England Biolabs), Illumina TruSeq RNA sample preparation kit (Illumina) or Lexogen Total RNA-seq Library Prep Kit V2 (Lexogen). Paired-end sequencing, performed with Illumina platforms, yielded approximately 1,150 million raw reads.

Raw reads, filtered and corrected using Trimmomatic v.0.36⁴² and RCorrector v.1.0.2⁴³, were assembled using de novo and reference-guided approaches. For de novo assembly, only reads derived from poly(A)-selected RNA were processed using the Oyster River Protocol (ORP) v.2.2.8⁴⁴. In brief, reads were assembled using Trinity v.2.8.4 (k-mer = 25), SPAdes v.3.13.3⁴⁵ (k-mer = 55), SPAdes (k-mer = 75) and Trans-Abyss v.2.0.1⁴⁶ (k-mer = 32). The four different assemblies were then merged using the OrthoFuser module^47,48 implemented in ORP. Completeness of the de novo-assembled transcriptome was assessed with BUSCO v.3⁴⁹ using core vertebrate genes and Vertebrata genes (vertebrata_odb9 database) in the gVolante webserver⁵⁰. For reference-guided assembly, all reads were aligned to the N. forsteri genome (each sample independently) using the program HISAT2 v.2.1.0⁵¹ (maximum intron length set to 3 Mb). The resulting mapping files were parsed by StringTie v.1.3.6⁵² and transcripts reconstructed from each aligned sample were merged in a single consensus .gtf file.

Repeats and transposable elements annotation

Neoceratodus forsteri repeat sequences were predicted using RepeatMasker (v.4.0.7) with default transposable element Dfam database and a de novo repeat library constructed using RepeatModeler (v.1.0.10), including the RECON (v.1.0.8), RepeatScout (v.1.0.5) and rmblast (v.2.6.0), with default parameters. Transposable elements not classified by RepeatModeler were analysed using PASTEC (https://urgi.versailles.inra.fr/Tools/) and DeepTE⁵³. Repeat sequences of A. mexicanum (AmexG_v3.0.0, https://www.axolotl-omics.org/) were predicted using the same approach. Repetitive sequences of Anolis carolinensis (GenBank accession GCA_000090745.2), Xenopus tropicalis (GCA_000004195.4), Rhinatrema bivittatum (GCA_901001135.1), Latimeria chalumnae (GCA_000325985.2), Lepisosteus oculatus (GCA_000242695.1), Danio rerio (GCA_000002035.4) and Amblyraja radiata (GCF_010909815.1)) were identified using Dfam TE Tools Container (https://github.com/Dfam-consortium/TETools) including RepeatModeler (v.2.0.1) and RepeatMasker (v.4.1.0). To further examine the remaining intergenic sequences, we predicted repetitive sequences again using the same workflow on the genome hard-masked with repeats already predicted by RepeatMasker.

Kimura distance-based distribution analysis and transposable-element-composition principal component analysis

Kimura substitution levels between the repeat consensus to its copies were calculated using a utility script calcDivergenceFromAlign.pl bundled in RepeatMasker. Repeat landscape plots were produced with the R script nf_all_age_plot.R and nf_am_rb_age_plots.R, using the divsum output from calcDivergenceFromAlign.pl. Principal component analysis on repetitive element composition was performed in R (v.3.6) using factoextra package (v.1.0.6). Repetitive element compositions (SINE, LINE, DNA, LTR and unknown) were calculated f...

Search This Blog

Acuarium