Towards computational science with phenotype descriptions: From large-scale candidate gene discovery to reproducible data synthesis for morphological trait evolution
As the principal target of natural selection, the phenotype plays a central role in biology and in the evolution of organismic biodiversity. Much of the phenome, whether occurring naturally or resulting from genetic mutations, has been recorded in natural-language descriptions. The result is a vast body of phenotype observations in the literature that has remained largely refractory to computational data science. The reason is that human experts implicitly use their domain knowledge as the context in which to evaluate and relate the meaning of different phenotype descriptions, whereas machines see only opaque strings of letters. In this talk I will highlight a variety of breakthrough developments over the last 10 years in overcoming this challenge, with a focus on the Phenoscape project. Phenoscape was first funded by NSF in 2007 and started as a collaboration among model organism (zebrafish) geneticists, developmental biologists, systematists, and bioinformatics experts. The project has since assembled a large database of evolutionary phenotype knowledge, curated from the literature in a form that allows machines to compute with its semantics based on what we know about vertebrate morphology and systematics. Integrated into this database are similarly curated descriptions of the phenotypes of mutant model organism genes. This has allowed the development of algorithms that use machine reasoning and statistics to relate phenotype descriptions to each other quantitatively by the similarity of their semantics, even though the descriptions come from different fields of inquiry with disparate conventions and divergent terminologies. Our results suggest that these methods can discriminate between candidate and non-candidate genes for evolutionary phenotype change; generate and score candidate gene hypotheses for evolutionary phenotype transitions; and discover taxa whose descendants show trait variation semantically similar to mutant gene phenotypes.
We have also used machine reasoning over the assembled database to infer, on a large scale, presence/absence traits that are implied by, but not expressly asserted in, published phenotype descriptions. I will conclude by describing the next phase of the project, which has just been funded by NSF and which aims to demonstrate how AI-type services based on machine-processable semantics can help address long-standing challenges in the comparative analysis of morphological trait evolution.