Catherine Chaison
Biography
Understanding the origin of new genes is a primary goal of evolutionary biology. New genes arise through a variety of processes, one particularly notable route being de novo gene birth. In this mechanism, specific mutations transform originally non-genic DNA into transcribed sequences that encode proteins with a biological function, turning these sequences into protein-coding genes. Because novel proteins are likely to be functionally disruptive, either through toxic aggregation or interference with existing cellular networks, it has been hypothesized that most de novo genes are deleterious and subject to purifying selection, which limits their frequency in genomes. Alternatively, some may provide adaptive benefits that facilitate their retention and fixation over evolutionary timescales, which would help explain why hundreds of these genes have been found across a wide array of species. A key factor in determining the likelihood of fixation for de novo genes is the efficacy of natural selection, which is influenced by effective population size (Nₑ). In species with large Nₑ, selection is more efficient at removing deleterious alleles and fixing beneficial ones, whereas in species with small Nₑ, genetic drift plays a larger role, allowing mildly deleterious mutations to persist. If de novo genes are primarily advantageous, we expect a positive correlation between their fixation rate and Nₑ, as selection would favor their retention in large populations. Conversely, if de novo genes are predominantly deleterious, they should be more common in species with small Nₑ, where weaker selection fails to eliminate them. A lack of correlation between Nₑ and de novo gene prevalence would suggest that de novo genes are largely neutral.
To investigate this relationship, we developed LINGUA (LINeage-Specific Gene Universal Annotator), a bioinformatics pipeline designed to systematically identify and analyze de novo gene emergence across a variety of genomes using both sequence homology and synteny. We aim for LINGUA to support large-scale analysis of both mammalian and plant genomes so that it can be applied broadly in de novo gene research. Using a dataset encompassing a broad range of mammalian species with well-characterized genomes and varying population sizes, we quantified the de novo gene fixation rate in relation to evolutionary time and selection strength by relating LINGUA's output to the dN/dS ratio of each species, which serves as a robust proxy for Nₑ. Our approach integrates comparative genomics, transcriptomic validation, and evolutionary modeling to assess the prevalence, expression patterns, and potential functional roles of these genes.
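The three predictions above can be illustrated with a rank correlation between per-species dN/dS and a per-species de novo gene rate. The following is a minimal sketch and not part of LINGUA: the function names, the toy input values, and the 0.3 cutoff are all hypothetical. It assumes that higher dN/dS reflects less efficient selection and therefore smaller Nₑ, so the sign of the correlation is read in reverse with respect to Nₑ. A real analysis would also use a significance test and account for phylogenetic non-independence among species.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


def interpret(rho_dnds_vs_rate, cutoff=0.3):
    """Map the correlation sign to a prediction (cutoff is an arbitrary, hypothetical choice)."""
    # Assumption: higher dN/dS means smaller effective population size.
    if rho_dnds_vs_rate >= cutoff:
        return "more de novo genes at small Ne: consistent with mostly deleterious"
    if rho_dnds_vs_rate <= -cutoff:
        return "more de novo genes at large Ne: consistent with mostly advantageous"
    return "no clear trend: consistent with mostly neutral"


# Toy, made-up values for five hypothetical species (not real data).
toy_dn_ds = [0.10, 0.15, 0.20, 0.25, 0.30]
toy_de_novo_rate = [1.0, 2.0, 3.0, 4.0, 5.0]

rho = spearman(toy_dn_ds, toy_de_novo_rate)
print(round(rho, 3), "->", interpret(rho))
```

A rank-based measure is used here because it does not assume a linear relationship between dN/dS and gene counts; other choices would be equally reasonable.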
By implementing stringent criteria for de novo gene identification, we are able to identify candidate de novo genes that were previously unannotated across a broad range of species and use them to characterize de novo gene content and the evolutionary dynamics of de novo gene formation, as discussed in the results. Understanding these selective pressures will further clarify the functional significance of de novo genes in the genome.