Semantic Web
Building an Experimental German User Interface Terminology Linked to SNOMED CT.
Stud Health Technol Inform. 2019 Aug 21;264:153-157
Authors: Hashemian Nik D, Kasáč Z, Goda Z, Semlitsch A, Schulz S
Abstract
We describe the process of creating a User Interface Terminology (UIT) with the goal of generating as many German-language interface terms as possible, each mapped to the reference terminology SNOMED CT. The purpose is to offer high coverage of medical jargon in order to optimise semantic annotation of clinical documents by text mining systems. The first step was the creation of an n-gram table into which words and short phrases from the English SNOMED CT description table were automatically extracted and entered. The second step was to populate the n-gram table with human and machine translations, manually enriched with POS tags; both top-down and bottom-up methods were used for manual terminology population. Grammar rules were formulated and embedded into a term generator, which then created one or more German variants per SNOMED CT description. Currently, the German user interface terminology contains 4,425,948 entries, created from 111,605 German n-grams assigned to 95,298 English n-grams. With 341,105 active concepts and 542,462 (non-FSN) descriptions, this corresponds to an average of 13 interface terms per concept and 8.2 per description. A blinded human assessment of the current quality of this resource indicates equivalence in term understandability compared with a fully automated Web-based translator, which, however, does not yield any synonyms; there are therefore good reasons to further develop this semi-automated terminology engineering method and to recommend it for other language pairs.
PMID: 31437904 [PubMed - in process]
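As a rough illustration of the term-generation step described above, the following Python sketch (with a toy, made-up n-gram table and a placeholder concept id; not the authors' grammar rules) extracts known n-grams from an English description and emits the cartesian product of their curated German translations as interface-term variants.

# A toy sketch (made-up data, not the authors' grammar rules): extract known
# n-grams from an English SNOMED CT description and combine their curated
# German translations into interface-term variants.
from itertools import product

# Stand-ins for the English description table and the curated English-to-
# German n-gram table described in the abstract ("concept-0001" is a
# placeholder, not a real SNOMED CT id).
descriptions = {"concept-0001": "hypothyroidism due to medicament"}
ngram_translations = {
    "hypothyroidism": ["Hypothyreose", "Schilddruesenunterfunktion"],
    "due to": ["durch", "infolge"],
    "medicament": ["Medikament", "Arzneimittel"],
}

def generate_variants(description):
    """Tokenize into known n-grams (longest match first), then combine the
    German translations of each n-gram into full interface terms."""
    words, tokens, i = description.split(), [], 0
    while i < len(words):
        bigram = " ".join(words[i:i + 2])
        if bigram in ngram_translations:
            tokens.append(bigram)
            i += 2
        elif words[i] in ngram_translations:
            tokens.append(words[i])
            i += 1
        else:
            i += 1  # untranslated word: simply skipped in this toy example
    return [" ".join(combo)
            for combo in product(*(ngram_translations[t] for t in tokens))]

for concept_id, description in descriptions.items():
    print(concept_id, generate_variants(description))
# 2 x 2 x 2 = 8 German variants are generated for this single description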
Compatible Data Models at Design Stage of Medical Information Systems: Leveraging Related Data Elements from the MDM Portal.
Stud Health Technol Inform. 2019 Aug 21;264:113-117
Authors: Dugas M, Hegselmann S, Riepenhausen S, Neuhaus P, Greulich L, Meidt A, Varghese J
Abstract
Compatible data models are key for data integration. Data transformation after data collection has many limitations; therefore, compatible data structures should already be addressed during the design of information systems. The portal of Medical Data Models (MDM), which contains 20,000+ models and 495,000+ data items, was enhanced with a web service to identify data elements that are frequently collected together in real information systems. Using Apache Solr, a fast search functionality to identify those elements via their semantic annotations was implemented. This service was integrated into the metadata registry (MDR) component of MDM to make it available to the scientific community. It can be used to build intelligent data model editors, which suggest and import frequent data element definitions according to the current medical context.
PMID: 31437896 [PubMed - in process]
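A hedged sketch of how such a co-occurrence lookup might be consumed: the snippet below queries a local Apache Solr core through its standard select handler. The core name, field names, and ranking field are illustrative assumptions, not the actual MDM web service API.

# Hypothetical sketch: query a local Apache Solr core for data elements that
# are frequently collected together with a given UMLS-annotated element.
# Core and field names below are assumptions for illustration only.
import requests

SOLR_URL = "http://localhost:8983/solr/data_elements/select"  # assumed core

def related_elements(umls_cui, rows=10):
    params = {
        "q": f"cooccurring_cui:{umls_cui}",  # assumed field name
        "sort": "cooccurrence_count desc",   # assumed field name
        "rows": rows,
        "wt": "json",
    }
    response = requests.get(SOLR_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["response"]["docs"]

# Example: elements frequently collected alongside "body weight" (C0005910).
for doc in related_elements("C0005910"):
    print(doc.get("label"), doc.get("cooccurrence_count"))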
Romedi: An Open Data Source About French Drugs on the Semantic Web.
Stud Health Technol Inform. 2019 Aug 21;264:79-82
Authors: Cossin S, Lebrun L, Lobre G, Loustau R, Jouhet V, Griffier R, Mougin F, Diallo G, Thiessard F
Abstract
The W3C "Linking Open Drug Data" (LODD) project linked several publicly available sources of drug data together. Until now, French data, such as marketed drugs and their summaries of product characteristics, were not integrated and remained difficult to query. In this paper, we present Romedi (Référentiel Ouvert du Médicament), an open dataset that links French data on drugs to international resources. The principles and standard recommendations created by the W3C for sharing information were adopted. Romedi was connected to the Unified Medical Language System and DrugBank, two central resources of the LODD project. A SPARQL endpoint is available to query Romedi, and services are provided to annotate textual content with Romedi terms. This paper describes its content, its services, its links to external resources, and expected future developments.
PMID: 31437889 [PubMed - in process]
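To illustrate how such a resource is typically consumed, here is a minimal SPARQL client sketch using SPARQLWrapper; the endpoint URL, graph layout, and property names are assumptions for illustration, not the published Romedi schema.

# Hedged sketch: query a Romedi-style SPARQL endpoint for drugs linked to
# DrugBank. Endpoint URL and properties below are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://www.romedi.fr/sparql")  # assumed URL
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    SELECT ?drug ?label ?external WHERE {
        ?drug rdfs:label ?label ;
              owl:sameAs ?external .      # assumed link to DrugBank/UMLS
        FILTER(CONTAINS(STR(?external), "drugbank"))
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["label"]["value"], "->", row["external"]["value"])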
Closed-loop cycles of experiment design, execution, and learning accelerate systems biology model development in yeast.
Proc Natl Acad Sci U S A. 2019 Aug 16;:
Authors: Coutant A, Roper K, Trejo-Banos D, Bouthinon D, Carpenter M, Grzebyta J, Santini G, Soldano H, Elati M, Ramon J, Rouveirol C, Soldatova LN, King RD
Abstract
One of the most challenging tasks in modern science is the development of systems biology models: Existing models are often very complex but generally have low predictive performance. The construction of high-fidelity models will require hundreds/thousands of cycles of model improvement, yet few current systems biology research studies complete even a single cycle. We combined multiple software tools with integrated laboratory robotics to execute three cycles of model improvement of the prototypical eukaryotic cellular transformation, the yeast (Saccharomyces cerevisiae) diauxic shift. In the first cycle, a model outperforming the best previous diauxic shift model was developed using bioinformatic and systems biology tools. In the second cycle, the model was further improved using automatically planned experiments. In the third cycle, hypothesis-led experiments improved the model to a greater extent than achieved using high-throughput experiments. All of the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for automatic execution, and the results stored on the semantic web for reuse. The final model adds a substantial amount of knowledge about the yeast diauxic shift: 92 genes (+45%), and 1,048 interactions (+147%). This knowledge is also relevant to understanding cancer, the immune system, and aging. We conclude that systems biology software tools can be combined and integrated with laboratory robots in closed-loop cycles.
PMID: 31420515 [PubMed - as supplied by publisher]
Formal Medical Knowledge Representation Supports Deep Learning Algorithms, Bioinformatics Pipelines, Genomics Data Analysis, and Big Data Processes.
Yearb Med Inform. 2019 Aug;28(1):152-155
Authors: Dhombres F, Charlet J, Section Editors for the IMIA Yearbook Section on Knowledge Representation and Management
Abstract
OBJECTIVE: To select, present, and summarize the best papers published in 2018 in the field of Knowledge Representation and Management (KRM).
METHODS: A comprehensive and standardized review of the medical informatics literature was performed to select the most interesting papers published in 2018 in KRM, based on PubMed and ISI Web Of Knowledge queries.
RESULTS: Four best papers were selected from the 962 publications retrieved following the Yearbook review process. The research areas in 2018 were mainly related to ontology-based data integration for phenotype-genotype association mining, the design of ontologies and their application, and the semantic annotation of clinical texts.
CONCLUSION: In the KRM selection for 2018, research on semantic representations demonstrated their added value for enhanced deep learning approaches in text mining and for designing novel bioinformatics pipelines based on graph databases. In addition, the ontology structure can enrich the analyses of whole genome expression data. Finally, semantic representations demonstrated promising results to process phenotypic big data.
PMID: 31419827 [PubMed - in process]
Enhancing Clinical Data and Clinical Research Data with Biomedical Ontologies - Insights from the Knowledge Representation Perspective.
Yearb Med Inform. 2019 Aug;28(1):140-151
Authors: Bona JP, Prior FW, Zozus MN, Brochhausen M
Abstract
OBJECTIVES: There exists a communication gap between the biomedical informatics community on one side and the computer science/artificial intelligence community on the other side regarding the meaning of the terms "semantic integration" and "knowledge representation". This gap leads to approaches that attempt to provide one-to-one mappings between data elements and biomedical ontologies. Our aim is to clarify the representational differences between traditional data management and semantic-web-based data management by providing use cases of clinical data and clinical research data re-representation. We discuss how and why one-to-one mappings limit the advantages of using Semantic Web Technologies (SWTs).
METHODS: We employ commonly used SWTs, such as Resource Description Framework (RDF) and Ontology Web Language (OWL). We reuse pre-existing ontologies and ensure shared ontological commitment by selecting ontologies from a framework that fosters community-driven collaborative ontology development for biomedicine following the same set of principles.
RESULTS: We demonstrate the results of providing SWT-compliant re-representation of data elements from two independent projects managing clinical data and clinical research data. Our results show how one-to-one mappings would hinder the exploitation of the advantages provided by using SWT.
CONCLUSIONS: We conclude that SWT-compliant re-representation is an indispensable step, if using the full potential of SWT is the goal. Rather than providing one-to-one mappings, developers should provide documentation that links data elements to graph structures to specify the re-representation.
PMID: 31419826 [PubMed - in process]
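A small rdflib sketch of the kind of graph-based re-representation argued for above, in contrast to a one-to-one code mapping: a single data element ("systolic blood pressure = 120 mmHg") becomes a small graph with separate nodes for value and unit. The class and property IRIs are placeholders for illustration, not the ontologies used in the paper.

# Illustrative re-representation of one data element as a graph with rdflib.
# Namespaces and class IRIs below are placeholders / generic OBO IRIs.
from rdflib import Graph, Namespace, Literal, RDF, RDFS, XSD

EX  = Namespace("http://example.org/study/")          # hypothetical
OBO = Namespace("http://purl.obolibrary.org/obo/")    # OBO Foundry IRIs

g = Graph()
g.bind("ex", EX)
g.bind("obo", OBO)

# A measurement datum about a patient's systolic blood pressure, with the
# value and unit modeled as separate nodes instead of one mapped code.
g.add((EX.measurement1, RDF.type, EX.SystolicBloodPressureMeasurement))
g.add((EX.measurement1, EX.isAbout, EX.patient42))
g.add((EX.measurement1, EX.hasValue, Literal(120, datatype=XSD.integer)))
g.add((EX.measurement1, EX.hasUnit, EX.millimetreOfMercury))
# Link the local class to IAO's "measurement datum" (illustrative choice).
g.add((EX.SystolicBloodPressureMeasurement, RDFS.subClassOf, OBO.IAO_0000109))

print(g.serialize(format="turtle"))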
Automatic Staging of Cancer Tumors Using AIM Image Annotations and Ontologies.
J Digit Imaging. 2019 Aug 08;:
Authors: Luque EF, Miranda N, Rubin DL, Moreira DA
Abstract
A second opinion about cancer stage is crucial when clinicians assess patient treatment progress. Staging is a process that takes into account the description, location, characteristics, and possible metastasis of tumors in a patient. It should follow standards, such as the TNM Classification of Malignant Tumors. However, in clinical practice, the implementation of this process can be tedious and error prone. In order to alleviate these problems, we intend to assist radiologists by providing a second opinion in the evaluation of cancer stage. To do this, we developed a TNM classifier based on semantic annotations made by radiologists using the ePAD tool. It transforms the annotations (stored in the AIM format) into AIM4-O ontology instances using axioms and rules, and from these it automatically calculates the liver TNM cancer stage. The AIM4-O ontology was developed, as part of this work, to represent annotations in the Web Ontology Language (OWL). A dataset of 51 liver radiology reports with staging data from NCI's Genomic Data Commons (GDC) was used to evaluate our classifier. When compared with the stages attributed by physicians, the classifier's stages had a precision of 85.7% and a recall of 81.0%. In addition, 3 radiologists from 2 different institutions manually reviewed a random sample of 4 of the 51 records and agreed with the tool's staging. AIM4-O was also evaluated, with good results. Our classifier can be integrated into AIM-aware imaging tools, such as ePAD, to offer a second opinion about staging as part of the cancer treatment workflow.
PMID: 31396778 [PubMed - as supplied by publisher]
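The following toy Python sketch shows the general shape of rule-based TNM stage grouping from annotation-derived features; the thresholds and the stage table are illustrative simplifications, not the AIM4-O axioms or actual liver cancer staging criteria.

# Toy, illustrative TNM logic: derive T/N/M categories from annotation
# features and combine them into a stage group. Thresholds and stage table
# are simplified stand-ins, not real staging rules.
def t_category(tumor_size_cm, multiple_tumors):
    if not multiple_tumors and tumor_size_cm <= 2.0:
        return "T1"
    return "T2" if tumor_size_cm <= 5.0 else "T3"

def stage_group(t, n, m):
    if m == "M1":
        return "Stage IV"
    if n == "N1":
        return "Stage III"
    return {"T1": "Stage I", "T2": "Stage II", "T3": "Stage III"}[t]

# Example annotation-derived features for one lesion.
t = t_category(tumor_size_cm=3.4, multiple_tumors=False)
print(t, stage_group(t, n="N0", m="M0"))   # -> T2 Stage II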
The sound of soft alcohol: Crossmodal associations between interjections and liquor.
PLoS One. 2019;14(8):e0220449
Authors: Winter B, Pérez-Sobrino P, Brown L
Abstract
An increasing number of studies reveal crossmodal correspondences between speech sounds and perceptual features such as shape and size. In this study, we show that an interjection Koreans produce when downing a shot of liquor reliably triggers crossmodal associations in American English, German, Spanish, and Chinese listeners who do not speak Korean. Based on how this sound is used in advertising campaigns for the Korean liquor soju, we derive predictions for different crossmodal associations. Our experiments show that the same speech sound is reliably associated with various perceptual, affective, and social meanings. This demonstrates what we call the 'pluripotentiality' of iconicity, that is, the same speech sound is able to trigger a web of interrelated mental associations across different dimensions. We argue that the specific semantic associations evoked by iconic stimuli depend on the task, with iconic meanings having a 'latent' quality that becomes 'actual' in specific semantic contexts. We outline implications for theories of iconicity and advertising.
PMID: 31393912 [PubMed - in process]
SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming.
Database (Oxford). 2019 Jan 01;2019:
Authors: Vogt L, Baum R, Bhatty P, Köhler C, Meid S, Quast B, Grobe P
Abstract
We introduce Semantic Ontology-Controlled application for web Content Management Systems (SOCCOMAS), a development framework for FAIR ('findable', 'accessible', 'interoperable', 'reusable') Semantic Web Content Management Systems (S-WCMSs). Each S-WCMS run by SOCCOMAS has its contents managed through a corresponding knowledge base that stores all data and metadata in the form of semantic knowledge graphs in a Jena tuple store. Automated procedures track provenance, user contributions and detailed change history. Each S-WCMS is accessible via both a graphical user interface (GUI), utilizing the JavaScript framework AngularJS, and a SPARQL endpoint. As a consequence, all data and metadata are maximally findable, accessible, interoperable and reusable and comply with the FAIR Guiding Principles. The source code of SOCCOMAS is written using the Semantic Programming Ontology (SPrO). SPrO consists of commands, attributes and variables with which one can describe an S-WCMS. We used SPrO to describe all the features and workflows typically required by any S-WCMS and documented these descriptions in a SOCCOMAS source code ontology (SC-Basic). SC-Basic specifies a set of default features, such as provenance tracking and a publication life cycle with versioning, which will be available in all S-WCMSs run by SOCCOMAS. All features and workflows specific to a particular S-WCMS, however, must be described within an instance source code ontology (INST-SCO), defining, e.g., the function and composition of the GUI, with all its user interactions, the underlying data schemes and representations, and all its workflow processes. The combination of the descriptions in SC-Basic and a given INST-SCO specifies the behavior of an S-WCMS. SOCCOMAS controls this S-WCMS through the Java-based middleware that accompanies SPrO, which functions as an interpreter. Because of the ontology-controlled design, SOCCOMAS allows easy customization with a minimum of technical programming background required, thereby seamlessly integrating conventional web page technologies with Semantic Web technologies. SOCCOMAS and the Java Interpreter are available from https://github.com/SemanticProgramming.
PMID: 31392324 [PubMed - in process]
Architecture and usability of OntoKeeper, an ontology evaluation tool.
BMC Med Inform Decis Mak. 2019 Aug 08;19(Suppl 4):152
Authors: Amith M, Manion F, Liang C, Harris M, Wang D, He Y, Tao C
Abstract
BACKGROUND: The existing community-wide bodies of biomedical ontologies are known to contain quality and content problems. Past research has revealed various errors related to their semantics and logical structure. Automated tools may help to ease the ontology construction, maintenance, assessment and quality assurance processes. However, relatively few tools exist that can provide this support to knowledge engineers.
METHOD: We introduce OntoKeeper as a web-based tool that can automate quality scoring for ontology developers. We enlisted 5 experienced ontologists to test the tool and then administered the System Usability Scale to measure their assessment.
RESULTS: In this paper, we present usability results from the 5 ontologists revealing high system usability of OntoKeeper, along with use cases that demonstrate its capabilities in previously published biomedical ontology research.
CONCLUSION: To the best of our knowledge, OntoKeeper is among the first ontology evaluation tools to provide this functionality to knowledge engineers with good usability.
PMID: 31391056 [PubMed - in process]
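The System Usability Scale administered in this evaluation has a fixed, well-known scoring formula, sketched below; the example responses are invented.

# SUS scoring: for the ten 1-5 Likert items, odd-numbered items contribute
# (response - 1), even-numbered items contribute (5 - response), and the sum
# is multiplied by 2.5 to yield a 0-100 score. Example responses are made up.
def sus_score(responses):
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))  # i = 0 is item 1 (odd)
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 2]))  # -> 82.5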
Selected articles from the Third International Workshop on Semantics-Powered Data Analytics (SEPDA 2018).
BMC Med Inform Decis Mak. 2019 Aug 08;19(Suppl 4):148
Authors: He Z, Bian J, Tao C, Zhang R
Abstract
In this editorial, we first summarize the Third International Workshop on Semantics-Powered Data Analytics (SEPDA 2018) held on December 3, 2018 in conjunction with the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2018) in Madrid, Spain, and then briefly introduce five research articles included in this supplement issue, covering topics including Data Analytics, Data Visualization, Text Mining, and Ontology Evaluation.
PMID: 31391050 [PubMed - in process]
Using Controlled Vocabularies In Anatomical Terminology: A Case Study With Strumigenys (Hymenoptera: Formicidae).
Arthropod Struct Dev. 2019 Jul 26;:100877
Authors: Silva TSR, Feitosa RM
Abstract
Morphological studies of insects can help us to understand the concomitant or sequential functionality of complex structures and may be used to hypothesize distinct levels of phylogenetic relationship among groups. Traditional morphological works have generally encompassed a set of elements, including descriptions of structures and their respective conditions, literature references and images, all combined in a single document. In the digital era, it is now possible to release this information not only simultaneously but also independently, as data sets linked to the original publication in an external environment. In order to link data from various fields of knowledge and disseminate morphological information in an open environment, it is important to use tools that enhance interoperability. For example, semantic annotations facilitate the dissemination and retrieval of phenotypic data in digital environments. The integration of semantic (i.e. web-based) components with anatomic treatments can be used to generate a traditional description in natural language along with a set of semantic annotations. The ant genus Strumigenys currently comprises about 840 described species distributed worldwide. In the Neotropical region, almost 200 species are currently known, but it is possible that much of the species diversity there remains unexplored and undescribed. The morphological diversity in the genus is high, reflecting an extreme generic reclassification that occurred in the late 20th and early 21st centuries. Here we define the anatomical concepts in this highly diverse group of ants using semantic annotations to enrich the anatomical ontologies available online, focusing on the definition of terms through subjacent conceptualization.
PMID: 31357032 [PubMed - as supplied by publisher]
Corrigendum to: Drug-drug interaction discovery and demystification using Semantic Web technologies.
J Am Med Inform Assoc. 2019 Jul 26;:
PMID: 31348497 [PubMed - as supplied by publisher]
Growth of linked hospital data use in Australia: a systematic review.
Aust Health Rev. 2017 Aug;41(4):394-400
Authors: Tew M, Dalziel KM, Petrie DJ, Clarke PM
Abstract
Objective: The aim of the present study was to quantify and understand the utilisation of linked hospital data for research purposes across Australia over the past two decades.
Methods: A systematic review was undertaken guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2009 checklist. Medline OVID, PsycINFO, Embase, EconLit and Scopus were searched to identify articles published from 1946 to December 2014. Information on publication year, state(s) involved, type of data linkage, disease area and purpose was extracted.
Results: The search identified 3314 articles, of which 606 were included; these generated 629 records of hospital data linkage use across all Australian states and territories. The major contributions were from Western Australia (WA; 51%) and New South Wales (NSW; 32%) with the remaining states and territories having significantly fewer publications (total contribution only 17%). WA's contribution resulted from a steady increase from the late 1990s, whereas NSW's contribution is mostly from a rapid increase from 2010. Current data linkage is primarily used in epidemiological research (73%).
Conclusion: More than 80% of publications were from WA and NSW, whereas other states significantly lag behind. The observable growth in these two states clearly demonstrates the underutilised opportunities for data linkage to add value in health services research in the other states.
What is known about the topic? Linking administrative hospital data to other data has the potential to be a cost-effective method to significantly improve health policy. Over the past two decades, Australia has made significant investments in improving its data linkage capabilities. However, several articles have highlighted the many barriers involved in using linked hospital data.
What does this paper add? This paper quantitatively evaluates the performance across all Australian states in terms of the use of their administrative hospital data for research purposes. The performance of states varies considerably, with WA and NSW the clear stand-out performers and limited outputs currently seen for the other Australian states and territories.
What are the implications for practitioners? Given the significant investments made into data linkage, it is important to continue to evaluate and monitor the performance of the states in terms of translating this investment into outputs. Where the outputs do not match the investment, it is important to identify and overcome those barriers limiting the gains from this investment. More generally, there is a need to think about how we improve the effective and efficient use of data linkage investments in Australia.
PMID: 27444270 [PubMed - indexed for MEDLINE]
Biotea-2-Bioschemas, facilitating structured markup for semantically annotated scholarly publications.
Genomics Inform. 2019 Jun;17(2):e14
Authors: Garcia L, Giraldo O, Garcia A, Rebholz-Schuhmann D
Abstract
The total number of scholarly publications grows day by day, making it necessary to explore and use simple yet effective ways to expose their metadata. Schema.org supports adding structured metadata to web pages via markup, making it easier for data providers as well as search engines to provide the right search results. Bioschemas is based on the standards of schema.org and provides new types, properties and guidelines for metadata, i.e., metadata profiles tailored to the Life Sciences domain. Here we present our proposed contribution to Bioschemas (from the project "Biotea"), which supports metadata contributions for scholarly publications via profiles and web components. Biotea comprises a semantic model to represent publications together with annotated elements recognized from the scientific text; our Biotea model has been mapped to schema.org following Bioschemas standards.
PMID: 31307129 [PubMed]
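A minimal sketch of the kind of structured markup discussed above: schema.org-style JSON-LD for a scholarly article, generated in Python. The property selection is a generic illustration using schema.org types, not the exact Bioschemas profile constraints.

# Minimal sketch: schema.org-style JSON-LD metadata for a scholarly article.
# Embedding the printed block in a <script type="application/ld+json"> tag
# makes the metadata harvestable by search engines.
import json

article_jsonld = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "name": "Biotea-2-Bioschemas, facilitating structured markup for "
            "semantically annotated scholarly publications",
    "author": [{"@type": "Person", "name": "Garcia L"}],
    "isPartOf": {"@type": "Periodical", "name": "Genomics Inform"},
    "identifier": "PMID:31307129",
    "about": [{"@type": "DefinedTerm", "name": "semantic annotation"}],
}

print(json.dumps(article_jsonld, indent=2))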
Fully connecting the Observational Health Data Science and Informatics (OHDSI) initiative with the world of linked open data.
Genomics Inform. 2019 Jun;17(2):e13
Authors: Banda JM
Abstract
The usage of controlled biomedical vocabularies is the cornerstone that enables seamless interoperability when using a common data model across multiple data sites. The Observational Health Data Science and Informatics (OHDSI) initiative combines over 100 controlled vocabularies into its own. However, the OHDSI vocabulary is limited in the sense that it combines multiple terminologies and does not provide a direct way to link them outside of their own self-contained scope. This issue makes the task of enriching feature sets using external resources extremely difficult. In order to address these shortcomings, we have created a linked data version of the OHDSI vocabulary, connecting it with established linked resources such as BioPortal and Bio2RDF, with the ultimate purpose of enabling the interoperability of resources previously foreign to the OHDSI universe.
PMID: 31307128 [PubMed]
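A hedged rdflib sketch of the linking idea: minting a URI for an OHDSI standard concept and asserting an owl:sameAs link to an external terminology resource. The OHDSI URI base is an assumption, and the concept/SNOMED CT code pair is given only as an example, not as the project's actual mapping output.

# Hedged sketch: link an OHDSI concept URI (assumed base URI) to an external
# linked-data resource via owl:sameAs, using rdflib.
from rdflib import Graph, Namespace, URIRef, Literal, RDFS
from rdflib.namespace import OWL

OHDSI = Namespace("http://example.org/ohdsi/concept/")   # assumed base URI

g = Graph()
g.bind("ohdsi", OHDSI)
g.bind("owl", OWL)

# Example: an OHDSI standard concept for type 2 diabetes mellitus linked to
# a BioPortal SNOMED CT resource (illustrative identifiers).
concept = OHDSI["201826"]
g.add((concept, RDFS.label, Literal("Type 2 diabetes mellitus")))
g.add((concept, OWL.sameAs,
       URIRef("http://purl.bioontology.org/ontology/SNOMEDCT/44054006")))

print(g.serialize(format="turtle"))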
Semalytics: a semantic analytics platform for the exploration of distributed and heterogeneous cancer data in translational research.
Database (Oxford). 2019 Jan 01;2019:
Authors: Mignone A, Grand A, Fiori A, Medico E, Bertotti A
Abstract
Each cancer is a complex system with unique molecular features determining its dynamics, such as its prognosis and response to therapies. Understanding the role of these biological traits is fundamental in order to personalize cancer clinical care according to the characteristics of each patient's disease. To achieve this, translational researchers propagate patients' samples through in vivo and in vitro cultures to test different therapies on the same tumor and to compare their outcomes with the molecular profile of the disease. This in turn generates information that can subsequently be translated into the development of predictive biomarkers for clinical use. These large-scale experiments generate huge collections of hierarchical data (i.e. experimental trees) with relative annotations that are extremely difficult to analyze. To address such issues in data analysis, we developed the Semalytics data framework, the core of an analytical platform that processes experimental information through Semantic Web technologies. Semalytics allows (i) the efficient exploration of experimental trees with irregular structures together with their annotations. Moreover, (ii) the platform links its data to a wider open knowledge base (i.e. Wikidata) to add an extended knowledge layer without the need to manage and curate those data locally. Altogether, Semalytics provides augmented perspectives on experimental data, allowing the generation of new hypotheses that were not anticipated by the user a priori. In this work, we present the data core we created for Semalytics, focusing on its semantic nucleus and on how it exploits semantic reasoning and data integration to tackle the issues of this kind of analysis. Finally, we describe a proof-of-concept study based on the examination of several dozen cases of metastatic colorectal cancer in order to illustrate how Semalytics can help researchers generate hypotheses about the role of gene alterations in causing resistance or sensitivity of cancer cells to specific drugs.
PMID: 31287543 [PubMed - in process]
Semantic Integration and Enrichment of Heterogeneous Biological Databases.
Methods Mol Biol. 2019;1910:655-690
Authors: Sima AC, Stockinger K, de Farias TM, Gil M
Abstract
Biological databases are growing at an exponential rate and are currently among the major producers of Big Data, almost on par with commercial generators such as YouTube or Twitter. While biological databases traditionally evolved as independent silos, each purposely built by a different research group to answer specific research questions, more recently significant efforts have been made toward integrating these heterogeneous sources into unified data access systems or interoperable systems following the FAIR principles of data sharing. Semantic Web technologies have been key enablers in this process, opening the path for new insights into the unified data that were not visible at the level of each independent database. In this chapter, we first provide an introduction to two of the most widely used database models for biological data: relational databases and RDF stores. Next, we discuss ontology-based data integration, which serves to unify and enrich heterogeneous data sources. We present an extensive timeline of milestones in data integration based on Semantic Web technologies in the field of life sciences. Finally, we discuss some of the remaining challenges in making ontology-based data access (OBDA) systems easily accessible to a larger audience. In particular, we introduce natural language search interfaces, which alleviate the need for database users to be familiar with technical query languages. We illustrate the main theoretical concepts of data integration through concrete examples, using two well-known biological databases: a gene expression database, Bgee, and an orthology database, OMA.
PMID: 31278681 [PubMed - in process]
FunSet: an open-source software and web server for performing and displaying Gene Ontology enrichment analysis.
BMC Bioinformatics. 2019 Jun 27;20(1):359
Authors: Hale ML, Thapa I, Ghersi D
Abstract
BACKGROUND: Gene Ontology enrichment analysis provides an effective way to extract meaningful information from complex biological datasets. By identifying terms that are significantly overrepresented in a gene set, researchers can uncover biological features shared by genes. In addition to extracting enriched terms, it is also important to visualize the results in a way that is conducive to biological interpretation.
RESULTS: Here we present FunSet, a new web server to perform and visualize enrichment analysis. The web server identifies Gene Ontology terms that are statistically overrepresented in a target set with respect to a background set. The enriched terms are displayed in a 2D plot that captures the semantic similarity between terms, with the option to cluster terms via spectral clustering and identify a representative term for each cluster. FunSet can be used interactively or programmatically and allows users to download the enrichment results in tabular form, in graphical form as SVG files, or as data in JSON or CSV format. To enhance the reproducibility of analyses, users have access to historical data for the ontology and the annotations. The source code for the standalone program and the web server is made available with an open-source license.
PMID: 31248361 [PubMed - in process]
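The statistical core of such an enrichment analysis is a hypergeometric overrepresentation test, sketched below with toy numbers; FunSet's full pipeline additionally handles multiple-testing correction, semantic-similarity layout, and clustering, which are not reproduced here.

# Hypergeometric overrepresentation test for one GO term (toy numbers).
from scipy.stats import hypergeom

def enrichment_pvalue(k, n, K, N):
    """k: target genes annotated with the term, n: target set size,
    K: background genes annotated with the term, N: background size."""
    return hypergeom.sf(k - 1, N, K, n)  # P(X >= k)

# Toy example: 12 of 200 target genes carry the term, vs. 150 of 18000.
print(enrichment_pvalue(k=12, n=200, K=150, N=18000))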
edge2vec: Representation learning using edge semantics for biomedical knowledge discovery.
BMC Bioinformatics. 2019 Jun 10;20(1):306
Authors: Gao Z, Fu G, Ouyang C, Tsutsui S, Liu X, Yang J, Gessner C, Foote B, Wild D, Ding Y, Yu Q
Abstract
BACKGROUND: Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. Since previous graph analytical methods have mostly focused on homogeneous graphs, an important current challenge is extending this methodology for richly heterogeneous graphs and knowledge domains. The biomedical sciences are such a domain, reflecting the complexity of biology, with entities such as genes, proteins, drugs, diseases, and phenotypes, and relationships such as gene co-expression, biochemical regulation, and biomolecular inhibition or activation. Therefore, the semantics of edges and nodes are critical for representation learning and knowledge discovery in real world biomedical problems.
RESULTS: In this paper, we propose the edge2vec model, which represents graphs while taking edge semantics into account. An edge-type transition matrix is trained by an Expectation-Maximization approach, and a stochastic gradient descent model is employed to learn node embeddings on a heterogeneous graph via the trained transition matrix. edge2vec is validated on three biomedical domain tasks: biomedical entity classification, compound-gene bioactivity prediction, and biomedical information retrieval. Results show that by incorporating edge types into node embedding learning in heterogeneous graphs, edge2vec significantly outperforms state-of-the-art models on all three tasks.
CONCLUSIONS: We propose this method for its added value relative to existing graph analytical methodology and for its applicability in the real-world context of biomedical knowledge discovery.
PMID: 31238875 [PubMed - in process]
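A hedged sketch of the underlying idea (not the authors' implementation): random walks over a heterogeneous graph biased by an edge-type transition matrix, followed by Word2Vec over the walks. Here the transition matrix is hand-set; the paper learns it with an Expectation-Maximization procedure, which is not reproduced.

# Sketch: edge-type-biased random walks on a toy heterogeneous graph, then
# node embeddings learned from the walks with gensim's Word2Vec.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.Graph()
G.add_edge("geneA", "diseaseX", etype=0)   # gene-disease edge
G.add_edge("geneA", "drugB", etype=1)      # gene-drug edge
G.add_edge("drugB", "diseaseX", etype=2)   # drug-disease edge
G.add_edge("geneA", "geneC", etype=0)

# trans[i][j]: preference for following an edge of type j after type i
# (hand-set here; learned via EM in the paper).
trans = [[0.2, 0.5, 0.3], [0.4, 0.2, 0.4], [0.3, 0.4, 0.3]]

def walk(start, length=10):
    path, prev_type = [start], 0
    for _ in range(length):
        current = path[-1]
        nbrs = list(G[current])
        weights = [trans[prev_type][G[current][n]["etype"]] for n in nbrs]
        nxt = random.choices(nbrs, weights=weights, k=1)[0]
        prev_type = G[current][nxt]["etype"]
        path.append(nxt)
    return path

walks = [walk(node) for node in G.nodes for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=0, sg=1)
print(model.wv.most_similar("geneA", topn=2))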