Proposal of a new nomenclature for introns in protein-coding genes in fungal mitogenomes

Fungal mitochondrial genes are often invaded by group I or II introns, which represent an ideal marker for understanding fungal evolution. A standard nomenclature of mitochondrial introns is needed to avoid confusion when comparing different fungal mitogenomes. A standard nomenclature already exists for introns in rRNA genes, but none exists for introns in protein-coding genes. In this study, we propose a new nomenclature system for introns in fungal mitochondrial protein-coding genes based on (1) a three-letter abbreviation of the host scientific name, (2) the host gene name, (3) one capital letter: P (for group I introns), S (for group II introns), or U (for introns of unknown type), and (4) the intron insertion site in the host gene, numbered according to the cyclosporin-producing fungus Tolypocladium inflatum. The suggested nomenclature proved feasible when used to name introns in the mitogenomes of 16 fungi from different phyla, including both basal and higher fungal lineages, although minor adjustments are needed to fit certain special conditions. The nomenclature also has the potential to name plant/protist/animal mitochondrial introns. We hope future studies will follow the proposed nomenclature to ensure direct comparison across studies.
Electronic supplementary material: The online version of this article (10.1186/s43008-019-0015-5) contains supplementary material, which is available to authorized users.


INTRODUCTION
Fungi constitute a huge group of highly diverse organisms, with an estimated 2.2-3.8 million species on Earth, of which about 144,000 are currently known (Hawksworth and Lücking 2017; Cannon et al. 2018). They were traditionally divided into four groups (chytridiomycetes, zygomycetes, ascomycetes, and basidiomycetes) according to morphological traits associated with reproduction. Molecular phylogenetics, and more recently phylogenomics, recognize eight phyla in Fungi, namely Microsporidia, Cryptomycota, Blastocladiomycota, Chytridiomycota, Zoopagomycota, Mucoromycota, Ascomycota, and Basidiomycota (Spatafora et al. 2017). Aside from a few early divergent lineages and anaerobic organisms, almost all fungi contain mitochondria and mitogenomes in their cells (Bullerwell and Lang 2005; van der Giezen et al. 2005). In recent years, the mitogenomes of an increasing number of fungal species have been sequenced. As of July 2019, mitogenomes from at least 300 fungal species are available, with representatives from all major fungal groups. Fungal mitogenomes typically contain 15 standard protein-coding genes, two rRNA genes, and a variable number of tRNA genes. The protein-coding genes are atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, and rps3 (Lang 2018), and some of them may be absent from certain fungal mitogenomes (Koszul et al. 2003).
Introns, as mobile elements, are frequently observed in mitochondrial protein-coding and/or rRNA genes of fungi. One gene may also be simultaneously invaded by multiple introns (e.g., four introns in cob and seven introns in cox1 in Isaria cicadae) (Fan et al. 2019). Mitochondrial introns are divided into two groups (I and II) based on their secondary structure and splicing mechanism (Saldanha et al. 1993), with group I introns being abundant in fungal mitogenomes. Different fungal species, or even different individuals of a particular fungus, may differ in the number and insertion positions of mitochondrial introns (Kosa et al. 2006; Zhang et al. 2015; Zhang et al. 2017a; Wang et al. 2018; Fan et al. 2019; Nie et al. 2019). Introns contribute to fungal mitogenome expansion/variability and represent an ideal marker for understanding fungal evolution (Zhang et al. 2015).
A nomenclature already exists for introns in rRNA genes (Johansen and Haugen 2001). According to that nomenclature, introns are often found at a limited number of insertion sites in highly conserved regions of rRNA genes from nuclei, mitochondria, and chloroplasts; a given rRNA sequence can therefore be aligned with the chosen standard rRNA sequences of Escherichia coli to locate and name potential introns. For mitochondrial protein-coding genes, however, high sequence divergence makes it difficult to align their sequences with the corresponding E. coli sequences. In most of the literature, introns in protein-coding genes are named serially according to their order of appearance in a particular host gene (e.g., cox1-i1, cox1-i2, and cox1-i3) (Deng et al. 2016; Zhang et al. 2017b; Zhang et al. 2017c). This naming strategy hampers scientific communication and comparison of introns across mitogenomes, because the same serial name can refer to introns at different insertion sites in different species. A standard nomenclature of mitochondrial introns is needed to avoid confusion when comparing different fungal mitogenomes.
In our previous studies, we designated introns based on their insertion positions, but the reference mitogenome was selected arbitrarily from the species under investigation (Fan et al. 2019). In this study, we aim to propose a standard nomenclature for introns in protein-coding genes in fungal mitogenomes and to test its applicability using fungal species from a broad taxonomic range. To determine whether the suggested nomenclature can apply to "cross-kingdom" mitochondrial introns, some plant/protist/animal introns are also examined.

METHODS
To establish a standard nomenclature for introns in protein-coding genes across the kingdom Fungi, it is necessary to find an appropriate reference mitogenome. After surveying fungal species with available mitogenomes, we chose the mitogenome of the cyclosporin-producing fungus Tolypocladium inflatum ARSEF 3280 (accession number NC_036382) as the reference. The 25,328-bp mitogenome of T. inflatum contains all 15 protein-coding genes typically found in fungal mitogenomes, and none of these protein-coding genes contains an intron (Zhang et al. 2017d). We did not choose the best-understood model fungi: 'baker's yeast' Saccharomyces cerevisiae, the fission yeast Schizosaccharomyces pombe, the opportunistic fungal pathogen Candida albicans, the filamentous euascomycete Neurospora crassa, etc. This is because the yeasts Sa. cerevisiae and Sc. pombe both lack genes coding for NADH dehydrogenases in their mitogenomes (Foury et al. 1998), and C. albicans and N. crassa contain introns in many different protein-coding genes (Borkovich et al. 2004; Bartelli et al. 2013). We also did not choose the human mitochondrial genome, which was selected as the reference to name introns found in nad5 and cox1 in certain metazoans (Emblem et al. 2011). This is because the human mitogenome contains only 13 standard protein-coding genes and lacks atp9 and rps3, both of which are known to harbor introns in fungal mitogenomes.
Both basal and higher fungi may contain introns in their mitogenomes. We randomly selected representative species in each fungal phylum to locate and name possible introns (Table 1). The insertion position of an intron is determined by aligning the sequence of its host gene with the corresponding gene sequence of T. inflatum (Additional file 1). Although many sequence alignment programs are available, we recommend MAFFT (https://mafft.cbrc.jp/alignment/software/), which is fast when aligning long sequences containing many introns and, in our experience, generates satisfactory alignments. The default settings of MAFFT work well in most cases. If exon-intron boundaries are not correctly identified under the default settings (probably due to interference from intron sequences or the presence of short exons), one may consider adjusting the alignment parameters (e.g., try 'Unalignlevel > 0' and possibly 'Leave gappy regions' by selecting the G-INS-1 or G-INS-i alignment strategy) and/or adding sequences from a species closely related to the test species to the alignment. In addition, it is always advisable to refer to known annotation results and/or characteristic nucleotides at splice sites of group I/II introns (Cech 1988) to ensure correct alignment and identification of exon-intron boundaries.
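The step of reading insertion sites off a pairwise alignment can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `insertion_sites` and the `min_gap` threshold are hypothetical, and the sketch assumes that the aligned reference is the intron-free T. inflatum coding sequence beginning at its start codon, so that each intron in the test gene appears as a long run of gap characters in the reference row.

```python
from typing import List


def insertion_sites(ref_aligned: str, query_aligned: str, min_gap: int = 20) -> List[int]:
    """Return reference coordinates immediately upstream of each long insertion.

    ref_aligned   -- aligned reference CDS (gaps written as '-')
    query_aligned -- aligned host gene that may contain introns
    min_gap       -- hypothetical threshold: shorter gap runs are treated as
                     ordinary indels rather than introns
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")

    sites: List[int] = []
    ref_pos = 0   # number of reference nucleotides seen so far (1-based site)
    gap_run = 0   # current run of columns with a gap in the reference only

    for r, q in zip(ref_aligned, query_aligned):
        if r == "-":
            if q != "-":
                gap_run += 1
            continue  # columns that are gaps in both rows are ignored
        # A reference nucleotide ends any open gap run.
        if gap_run >= min_gap:
            sites.append(ref_pos)
        gap_run = 0
        ref_pos += 1

    if gap_run >= min_gap:  # insertion at the very end of the alignment
        sites.append(ref_pos)
    return sites


# Toy example: a 30-nt insertion after the sixth reference nucleotide.
ref = "ATGGCA" + "-" * 30 + "TTT"
qry = "ATGGCA" + "C" * 30 + "TTT"
print(insertion_sites(ref, qry))
```

Boundaries found this way would still need to be checked against known annotations and splice-site nucleotides, as recommended above, because alignment gaps can shift by a few positions around an intron.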

RESULTS AND DISCUSSION
We propose a new nomenclature system for introns in fungal mitochondrial protein-coding genes based on (1) a three-letter abbreviation of the host scientific name, (2) the host gene name, (3) one capital letter: P (for group I introns; a mnemonic for "position" or "primary"), S (for group II introns; a mnemonic for "site" or "secondary"), or U (for introns of unknown type), and (4) the intron insertion site in the host gene, numbered according to T. inflatum (Additional file 1). When there is no ambiguity (e.g., when discussing introns in a single species or in a single host gene of a species), the host scientific name and/or host gene name may be omitted. The letter P/S/U and the insertion site of an intron, however, should never be omitted. Using the nomenclature, previously reported introns can be renamed. Examples are the group II intron Sce.cox1S169 (former aI1) from Saccharomyces cerevisiae cox1 at site 169, and the group I intron Cgl.cox1P240 (former CgCox1.1) from Candida glabrata cox1 at position 240.
Zhang and Zhang IMA Fungus (2019)
We hope future studies will follow the proposed nomenclature to ensure direct comparison across studies. The suggested nomenclature is flexible enough to fit some special conditions. Firstly, although we suggest a three-letter abbreviation of the host scientific name, an abbreviation of four or more letters may be used where three letters cannot discriminate among all species under investigation. An example is the introns at position 717 in nad5 in Candida pseudojiufengensis (Cpse.nad5U717) and Candida psychrophila (Cpsy.nad5P717) (Table 2, lines 11-12). Secondly, twintrons (twin introns) have been described from some fungal mitogenomes, with various combinations of group I or II introns nested inside each other or situated next to each other (Hafez and Hausner 2015; Deng et al. 2016). The internal/external or upstream/downstream members of a twintron can be distinguished by lowercase letter suffixes. An example is the side-by-side twintron in cox3 in Hypomyces aurantius, where two group IA introns are arranged in tandem (Deng et al. 2016). The upstream intron of the twintron can be named Hau.cox3P640a and the downstream one Hau.cox3P640b (Table 2, lines 13-14). Finally, although introns present at an identical insertion site among different strains of a particular species are generally conserved, distantly related introns are sometimes detected among different strains. Introns of this kind can be distinguished by numerical suffixes. For example, Hth.cobP429 in different strains of Hirsutella thompsonii shows length variation (e.g., 2.7 kb in ARSEF 9457 and 4.8 kb in ARSEF 1947) (Wang et al. 2018), and the two variants may be named Hth.cobP429-1 in ARSEF 9457 and Hth.cobP429-2 in ARSEF 1947 (Table 2, lines 15-16).
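The naming rules above, including the optional host and gene parts, the twintron letter suffix, and the numeric variant suffix, can be captured in a small parser. The following is a minimal sketch under stated assumptions: the class `IntronName`, the two functions, and the regular expression are hypothetical helpers written for illustration, and the pattern assumes that host abbreviations are one capital letter followed by two or more lowercase letters and that gene names look like those listed in the Introduction (e.g., cob, cox1, nad4L).

```python
import re
from typing import NamedTuple, Optional


class IntronName(NamedTuple):
    host: Optional[str]     # e.g. "Sce"; may be omitted when unambiguous
    gene: Optional[str]     # e.g. "cox1"; may be omitted when unambiguous
    kind: str               # "P" (group I), "S" (group II), or "U" (unknown)
    site: int               # insertion site relative to T. inflatum
    twin: Optional[str]     # "a", "b", ... for members of a twintron
    variant: Optional[int]  # 1, 2, ... for divergent introns at one site


# Assumed pattern: [Host.][gene]<P|S|U><site>[twin letter][-variant]
_NAME = re.compile(
    r"^(?:(?P<host>[A-Z][a-z]{2,})\.)?"
    r"(?P<gene>[a-z]+[0-9]*L?)?"
    r"(?P<kind>[PSU])(?P<site>[0-9]+)"
    r"(?P<twin>[a-z])?(?:-(?P<variant>[0-9]+))?$"
)


def parse_intron_name(name: str) -> IntronName:
    """Split an intron name such as 'Hau.cox3P640a' into its components."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"not a valid intron name: {name!r}")
    variant = m.group("variant")
    return IntronName(
        host=m.group("host"),
        gene=m.group("gene"),
        kind=m.group("kind"),
        site=int(m.group("site")),
        twin=m.group("twin"),
        variant=int(variant) if variant is not None else None,
    )


def format_intron_name(n: IntronName) -> str:
    """Rebuild the name; the type letter and site are always written."""
    parts = [f"{n.host}." if n.host else "", n.gene or "", n.kind, str(n.site)]
    parts.append(n.twin or "")
    parts.append(f"-{n.variant}" if n.variant is not None else "")
    return "".join(parts)


for example in ("Sce.cox1S169", "Cpse.nad5U717", "Hau.cox3P640b", "Hth.cobP429-2"):
    print(parse_intron_name(example))
```

Because the type letter and site are the only mandatory groups in the pattern, shortened forms such as cobP490 or P490 are accepted, whereas a name lacking either of them is rejected.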
The suggested nomenclature has been successfully applied to name introns in 16 fungi from different phyla, including both basal and higher fungal lineages (Table 3). These fungi contain introns in all protein-coding genes except atp8, nad2, and nad6, and cob and cox1 are most frequently invaded by introns. These introns are mostly group I introns, but we also find a few group II introns as well as a few introns of undetermined type. There are a total of 149 introns at 74 insertion sites in these fungi. Using the suggested nomenclature, intron positions in a particular gene can be directly observed and compared across different species. Some sites are frequently occupied by introns in different species (e.g., cobP490, cox1P386, cox1P720, cox1P1107). From the insertion site number, one can also easily infer the phase of an intron, which is phase 0 when an intron inserts between two codons (e.g., cobP393), and phase 1 or 2 when an intron inserts within a codon (e.g., cox1S205, cox1P386). These introns are often found in highly conserved regions (Additional file 2).
In addition to fungi, plants and protists (and, rarely, animals) also contain group I or II introns in their mitochondrial genes (Oda et al. 1992; Ogawa et al. 2000; Burger et al. 2003; Chi and Johansen 2017). The nomenclature suggested in this study could potentially apply to plant/protist/animal mitochondrial introns (Table 2, lines 17-22; Additional file 2). Plant mitogenomes, however, are also known to encode several intron-containing protein genes (e.g., nad7, ccmC, rps10, rpl2) that are absent from fungal mitogenomes (Sloan et al. 2018). Introns are even found in tRNA-coding genes in plant mitogenomes (Smith et al. 2011). An additional plant reference is necessary to name introns unique to plant mitogenomes.
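The relationship between insertion site number and intron phase described above reduces to a remainder calculation. The sketch below assumes that the site number counts the reference coding-sequence nucleotides (from the first nucleotide of the start codon) that lie upstream of the intron, which is consistent with the examples cobP393, cox1S205, and cox1P386; the function name `intron_phase` is hypothetical.

```python
def intron_phase(site: int) -> int:
    """Return the phase (0, 1, or 2) of an intron from its insertion site.

    Phase 0: the intron lies between two codons.
    Phase 1: the intron follows the first nucleotide of a codon.
    Phase 2: the intron follows the second nucleotide of a codon.
    """
    if site < 0:
        raise ValueError("insertion site must be non-negative")
    return site % 3


for name, site in (("cobP393", 393), ("cox1S205", 205), ("cox1P386", 386)):
    print(name, "phase", intron_phase(site))
```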

CONCLUSIONS
We suggest a standard nomenclature for introns in protein-coding genes in fungal mitogenomes. It proved feasible when used to name introns in the mitogenomes of 16 fungi from a broad taxonomic range, and it also has the potential to name introns in plant/protist/animal mitogenomes. We encourage future studies to follow the proposed nomenclature to ensure direct comparison across studies.

Additional files
Additional file 1: Sequences of protein-coding genes of Tolypocladium inflatum ARSEF 3280 (accession number NC_036382). Insertion sites of group I introns are shown in red, those of group II introns in green, and those of introns of undetermined type with shading. (DOCX 21 kb)
Additional file 2: Intron insertion sites for 22 common introns. Exon sequences of cob, cox1, cox2, nad1, and nad5 of different fungal taxa plus a few non-fungal taxa were aligned by MAFFT, and the aligned sequences were visualized using ESPript 3.0 (Robert and Gouet 2014) under default settings. Refer to Tables 1 and 2 for the organisms represented by accession numbers; the accession numbers of non-fungal taxa are marked in red boxes. Insertion sites of introns are shown using upward arrows. For phase 0 introns, conserved amino acids before and after insertion sites are listed. The amino acid glycine (G) is frequently seen before insertion sites of phase 0 introns. For phase 1 or 2 introns, conserved amino acids at insertion sites are given, and the corresponding triplet codons are marked by a horizontal line. (PPTX 2235 kb)