Drug-induced Adverse Events

A hybrid model for automatic identification of risk factors for heart disease.
J Biomed Inform. 2015 Dec;58 Suppl:S171-82
Authors: Yang H, Garibaldi JM
Abstract
Coronary artery disease (CAD) is the leading cause of death in both the UK and worldwide. The detection of related risk factors and tracking their progress over time is of great importance for early prevention and treatment of CAD. This paper describes an information extraction system that was developed to automatically identify risk factors for heart disease in medical records while the authors participated in the 2014 i2b2/UTHealth NLP Challenge. Our approaches rely on several natural language processing (NLP) techniques such as machine learning, rule-based methods, and dictionary-based keyword spotting to cope with complicated clinical contexts inherent in a wide variety of risk factors. Our system achieved encouraging performance on the challenge test data with an overall micro-averaged F-measure of 0.915, which was competitive with the best system (F-measure of 0.927) of this challenge task.
PMID: 26375492 [PubMed - indexed for MEDLINE]
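The micro-averaged F-measure quoted in the abstract above pools true positives, false positives and false negatives across all categories before computing precision and recall. The following is a minimal sketch of that calculation; the category names and counts are hypothetical and are not taken from the challenge data.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-category (TP, FP, FN) counts."""
    # Pool the counts over all categories first; this is what "micro" means.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two risk-factor categories (illustration only).
example = {"hypertension": (90, 10, 5), "smoker": (40, 5, 15)}
score = micro_f1(example)
```

Because the counts are pooled, frequent categories dominate the score, which is why a system can post a high micro-average while still struggling on rare categories.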
Coronary artery disease risk assessment from unstructured electronic health records using text mining.
J Biomed Inform. 2015 Dec;58 Suppl:S203-10
Authors: Jonnagaddala J, Liaw ST, Ray P, Kumar M, Chang NW, Dai HJ
Abstract
Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD.
PMID: 26319542 [PubMed - indexed for MEDLINE]
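A rule-based extractor of the kind described in the abstract above can be illustrated with a few regular expressions. This is only a sketch under stated assumptions: the patterns, the field names and the sample note are hypothetical, and the authors' actual rules are not given in the abstract. Fields that are not found stay `None`, which mirrors the missing-data problem the abstract reports.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns; real clinical notes need many more variants.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*(?:of|is|:)?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)
HDL_PATTERN = re.compile(
    r"\bHDL(?:-C)?\s*(?:of|is|:)?\s*(\d{1,3})\b",
    re.IGNORECASE,
)


def extract_risk_factors(note: str) -> Dict[str, Optional[int]]:
    """Return the first blood pressure and HDL values mentioned in a note."""
    result: Dict[str, Optional[int]] = {
        "systolic": None,
        "diastolic": None,
        "hdl": None,
    }
    bp = BP_PATTERN.search(note)
    if bp:
        result["systolic"] = int(bp.group(1))
        result["diastolic"] = int(bp.group(2))
    hdl = HDL_PATTERN.search(note)
    if hdl:
        result["hdl"] = int(hdl.group(1))
    return result


# Hypothetical note text (illustration only).
note = "Pt seen today. BP: 142/88. Labs: HDL 38, smoker."
factors = extract_risk_factors(note)
```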
Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes.
J Biomed Inform. 2015 Dec;58 Suppl:S128-32
Authors: Khalifa A, Meystre S
Abstract
The 2014 i2b2 natural language processing shared task focused on identifying cardiovascular risk factors such as high blood pressure, high cholesterol levels, obesity and smoking status among other factors found in health records of diabetic patients. In addition, the task involved detecting medications, and time information associated with the extracted data. This paper presents the development and evaluation of a natural language processing (NLP) application conceived for this i2b2 shared task. For increased efficiency, the application main components were adapted from two existing NLP tools implemented in the Apache UIMA framework: Textractor (for dictionary-based lookup) and cTAKES (for preprocessing and smoking status detection). The application achieved a final (micro-averaged) F1-measure of 87.5% on the final evaluation test set. Our attempt was mostly based on existing tools adapted with minimal changes and allowed for satisfying performance with limited development efforts.
PMID: 26318122 [PubMed - indexed for MEDLINE]
Mining heart disease risk factors in clinical text with named entity recognition and distributional semantic models.
J Biomed Inform. 2015 Dec;58 Suppl:S143-9
Authors: Urbain J
Abstract
We present the design and analyze the performance of a multi-stage natural language processing system employing named entity recognition, Bayesian statistics, and rule logic to identify and characterize heart disease risk factor events in diabetic patients over time. The system was originally developed for the 2014 i2b2 Challenges in Natural Language in Clinical Data. The system's strengths included a high level of accuracy for identifying named entities associated with heart disease risk factor events. The system's primary weakness was due to inaccuracies when characterizing the attributes of some events, for example, determining the relative time of an event with respect to the record date, deciding whether an event is attributable to the patient's history or the patient's family history, and differentiating between current and prior smoking status. We believe these inaccuracies were due in large part to the lack of an effective approach for integrating context into our event detection model. To address these inaccuracies, we explore the addition of a distributional semantic model for characterizing contextual evidence of heart disease risk factor events. Using this semantic model, we raised our initial 2014 i2b2 Challenges in Natural Language in Clinical Data F1 score of 0.838 to 0.890 and increased precision by 10.3% without use of any lexicons that might bias our results.
PMID: 26305514 [PubMed - indexed for MEDLINE]
Automatic detection of protected health information from clinic narratives.
J Biomed Inform. 2015 Dec;58 Suppl:S30-8
Authors: Yang H, Garibaldi JM
Abstract
This paper presents a natural language processing (NLP) system that was designed to participate in the 2014 i2b2 de-identification challenge. The challenge task aims to identify and classify seven main Protected Health Information (PHI) categories and 25 associated sub-categories. A hybrid model was proposed which combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in PHI categories. Our proposed approaches exploit a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task-specific features and regular expression template patterns to characterize the semantics of various PHI categories. Our system achieved promising accuracy on the challenge test data with an overall micro-averaged F-measure of 93.6%, which was the winner of this de-identification challenge.
PMID: 26231070 [PubMed - indexed for MEDLINE]
Combining knowledge- and data-driven methods for de-identification of clinical narratives.
J Biomed Inform. 2015 Dec;58 Suppl:S53-9
Authors: Dehghan A, Kovacevic A, Karystianis G, Keane JA, Nenadic G
Abstract
A recent promise to access unstructured clinical data from electronic health records on a large scale has revitalized the interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified in the first pass with high confidence, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, mentions of Organizations and Professions were particularly challenging. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus provide a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.
PMID: 26210359 [PubMed - indexed for MEDLINE]
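The two-pass idea in the abstract above can be sketched as follows: a first pass collects names that appear next to a strong contextual clue, and a second pass tags every mention of those names, including mentions with no clue at all. The title-based cue, the function name and the sample notes are hypothetical stand-ins; the authors' first pass uses their own rules and machine-learned models, which the abstract does not detail.

```python
import re
from typing import List, Set, Tuple

# Hypothetical high-confidence cue: a title followed by a capitalised surname.
TITLE_CUE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)")


def two_pass_names(notes: List[str]) -> List[Tuple[int, str]]:
    """Return (note index, name) pairs for one patient's notes."""
    # Pass 1: build a patient-specific run-time dictionary from clued mentions.
    run_time_dictionary: Set[str] = set()
    for note in notes:
        run_time_dictionary.update(TITLE_CUE.findall(note))

    # Pass 2: tag every whole-word mention of a dictionary entry,
    # whether or not a title precedes it.
    mentions: List[Tuple[int, str]] = []
    for index, note in enumerate(notes):
        for name in sorted(run_time_dictionary):
            pattern = r"\b" + re.escape(name) + r"\b"
            for _ in re.finditer(pattern, note):
                mentions.append((index, name))
    return mentions


# Hypothetical longitudinal notes for a single patient (illustration only).
notes = [
    "Seen by Dr. Smith on admission.",
    "Smith reviewed the labs; Mrs. Jones was notified.",
]
found = two_pass_names(notes)
```

In the second note, "Smith" carries no title, so a clue-based first pass alone would miss it; the run-time dictionary recovers it.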
Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.
J Biomed Inform. 2015 Dec;58 Suppl:S120-7
Authors: Cormack J, Nath C, Milward D, Raja K, Jonnalagadda SR
Abstract
This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system.
PMID: 26209007 [PubMed - indexed for MEDLINE]
Using local lexicalized rules to identify heart disease risk factors in clinical notes.
J Biomed Inform. 2015 Dec;58 Suppl:S183-8
Authors: Karystianis G, Dehghan A, Kovacevic A, Keane JA, Nenadic G
Abstract
Heart disease is the leading cause of death globally and a significant part of the human population lives with it. A number of risk factors have been recognized as contributing to the disease, including obesity, coronary artery disease (CAD), hypertension, hyperlipidemia, diabetes, smoking, and family history of premature CAD. This paper describes and evaluates a methodology to extract mentions of such risk factors from diabetic clinical notes, which was a task of the i2b2/UTHealth 2014 Challenge in Natural Language Processing for Clinical Data. The methodology is knowledge-driven and the system implements local lexicalized rules (based on syntactical patterns observed in notes) combined with manually constructed dictionaries that characterize the domain. A part of the task was also to detect the time interval in which the risk factors were present in a patient. The system was applied to an evaluation set of 514 unseen notes and achieved a micro-average F-score of 88% (with 86% precision and 90% recall). While the identification of CAD family history, medication and some of the related disease factors (e.g. hypertension, diabetes, hyperlipidemia) showed quite good results, the identification of CAD-specific indicators proved to be more challenging (F-score of 74%). Overall, the results are encouraging and suggest that automated text mining methods can be used to process clinical notes to identify risk factors and monitor progression of heart disease on a large scale, providing necessary data for clinical and epidemiological studies.
PMID: 26133479 [PubMed - indexed for MEDLINE]
Networks Models of Actin Dynamics during Spermatozoa Postejaculatory Life: A Comparison among Human-Made and Text Mining-Based Models.
Biomed Res Int. 2016;2016:9795409
Authors: Bernabò N, Ordinelli A, Ramal Sanchez M, Mattioli M, Barboni B
Abstract
Here we realized a networks-based model representing the process of actin remodelling that occurs during the acquisition of fertilizing ability of human spermatozoa (HumanMade_ActinSpermNetwork, HM_ASN). Then, we compared it with the networks provided by two different text mining tools: Agilent Literature Search (ALS) and PESCADOR. As a reference, we used the data from the online repository Kyoto Encyclopaedia of Genes and Genomes (KEGG), referred to the actin dynamics in a more general biological context. We found that HM_ASN and the networks from KEGG data shared the same scale-free topology following the Barabasi-Albert model, thus suggesting that the information is spread within the network quickly and efficiently. On the contrary, the networks obtained by ALS and PESCADOR have a scale-free hierarchical architecture, which implies a different pattern of information transmission. Also, the hubs identified within the networks are different: HM_ASN and KEGG networks contain as hubs several molecules known to be involved in actin signalling; ALS was unable to find other hubs than "actin," whereas PESCADOR gave some nonspecific results. This seems to suggest that the human-made information retrieval in the case of a specific event, such as actin dynamics in human spermatozoa, could be a reliable strategy.
PMID: 27642606 [PubMed - in process]
Constructing a molecular interaction network for thyroid cancer via large-scale text mining of gene and pathway events.
BMC Syst Biol. 2015;9 Suppl 6:S5
Authors: Wu C, Schwartz JM, Brabant G, Peng SL, Nenadic G
Abstract
BACKGROUND: Biomedical studies need assistance from automated tools and easily accessible data to address the problem of the rapidly accumulating literature. Text-mining tools and curated databases have been developed to address such needs and they can be applied to improve the understanding of molecular pathogenesis of complex diseases like thyroid cancer.
RESULTS: We have developed a system, PWTEES, which extracts pathway interactions from the literature utilizing an existing event extraction tool (TEES) and pathway named entity recognition (PathNER). We then applied the system on a thyroid cancer corpus and systematically extracted molecular interactions involving either genes or pathways. With the extracted information, we constructed a molecular interaction network taking genes and pathways as nodes. Using curated pathway information and network topological analyses, we highlight key genes and pathways involved in thyroid carcinogenesis.
CONCLUSIONS: Mining events involving genes and pathways from the literature and integrating curated pathway knowledge can help improve the understanding of molecular interactions of complex diseases. The system developed for this study can be applied in studies other than thyroid cancer. The source code is freely available online at https://github.com/chengkun-wu/PWTEES.
PMID: 26679379 [PubMed - indexed for MEDLINE]
Pharmacovigilance through the development of text mining and natural language processing techniques.
J Biomed Inform. 2015 Dec;58:288-91
Authors: Segura-Bedmar I, Martínez P
PMID: 26547007 [PubMed - indexed for MEDLINE]
A research framework for pharmacovigilance in health social media: Identification and evaluation of patient adverse drug event reports.
J Biomed Inform. 2015 Dec;58:268-79
Authors: Liu X, Chen H
Abstract
Social media offer insights into patients' medical problems such as drug side effects and treatment failures. Patient reports of adverse drug events from social media have great potential to improve current practice of pharmacovigilance. However, extracting patient adverse drug event reports from social media continues to be an important challenge for health informatics research. In this study, we develop a research framework with advanced natural language processing techniques for integrated and high-performance patient-reported adverse drug event extraction. The framework consists of medical entity extraction for recognizing patient discussions of drugs and events, adverse drug event extraction with a shortest dependency path kernel based statistical learning method and semantic filtering with information from medical knowledge bases, and report source classification to tease out noise. To evaluate the proposed framework, a series of experiments were conducted on a test bed encompassing about postings from major diabetes and heart disease forums in the United States. The results reveal that each component of the framework significantly contributes to its overall effectiveness. Our framework significantly outperforms prior work.
PMID: 26518315 [PubMed - indexed for MEDLINE]
Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification.
J Biomed Inform. 2015 Dec;58:122-32
Authors: Ben Abacha A, Chowdhury MF, Karanasiou A, Mrabet Y, Lavelli A, Zweigenbaum P
Abstract
Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities related to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem. An essential aspect in PV is to acquire knowledge about Drug-Drug Interactions (DDIs). The shared tasks on DDI-Extraction organized in 2011 and 2013 have pointed out the importance of this issue and provided benchmarks for: Drug Name Recognition, DDI extraction and DDI classification. In this paper, we present our text mining systems for these tasks and evaluate their results on the DDI-Extraction benchmarks. Our systems rely on machine learning techniques using both feature-based and kernel-based methods. The obtained results for drug name recognition are encouraging. For DDI-Extraction, our hybrid system combining a feature-based method and a kernel-based method was ranked second in the DDI-Extraction-2011 challenge, and our two-step system for DDI detection and classification was ranked first in the DDI-Extraction-2013 task at SemEval. We discuss our methods and results and give pointers to future work.
PMID: 26432353 [PubMed - indexed for MEDLINE]
Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology.
Biomed Res Int. 2016;2016:4130861
Authors: Hu Y, Zhou W, Ren J, Dong L, Wang Y, Jin S, Cheng L
Abstract
Increasing evidence indicates that functional annotation of the human genome at the molecular level and the phenotype level is very important for systematic analysis of genes. In this study, we present a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool, Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to a Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of the human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e-16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e-14) in GeneRIFs and GOA shows our annotation resource is very reliable.
PMID: 27635398 [PubMed - in process]
Stacked Ensemble Combined with Fuzzy Matching for Biomedical Named Entity Recognition of Diseases.
J Biomed Inform. 2016 Sep 12;
Authors: Bhasuran B, Murugesan G, Abdulkadhar S, Natarajan J
Abstract
Biomedical Named Entity Recognition (Bio-NER) is the crucial initial step in the information extraction process and a major research focus in biomedical text mining. In the past years, several models and methodologies have been proposed for the recognition of semantic types related to gene, protein, chemical, drug and other biologically relevant named entities. In this paper, we implemented a stacked ensemble approach combined with fuzzy matching for biomedical named entity recognition of disease names. The underlying concept of stacked generalization is to combine the outputs of base-level classifiers using a second-level meta-classifier in an ensemble. We used Conditional Random Fields (CRF) as the underlying classification method, which makes use of a diverse set of features, mostly domain-specific, orthographic and morphological. In addition, we used fuzzy string matching to tag rare disease names from our in-house disease dictionary. For fuzzy matching, we incorporated two of the best fuzzy search algorithms, Rabin-Karp and Tuned Boyer-Moore. Our proposed approach shows promising results, with F-measures of 94.66%, 89.12% and 84.10%, 76.71% on the training and testing sets of the NCBI disease and BioCreative V CDR corpora, respectively.
PMID: 27634494 [PubMed - as supplied by publisher]
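The abstract above names Rabin-Karp as one of the search algorithms used for dictionary matching. The sketch below shows the textbook exact-match Rabin-Karp with a rolling hash; how the authors adapt it for fuzzy matching is not described in the abstract, so this illustrates only the base algorithm. The `base` and `modulus` defaults and the sample text are arbitrary choices made for the example.

```python
from typing import List


def rabin_karp(text: str, pattern: str,
               base: int = 256, modulus: int = 1_000_003) -> List[int]:
    """Return all start offsets where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, modulus)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus
    hits: List[int] = []
    for start in range(n - m + 1):
        # Compare characters only when the hashes agree, to rule out collisions.
        if p_hash == t_hash and text[start:start + m] == pattern:
            hits.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return hits


# Hypothetical sentence and dictionary term (illustration only).
sample = "type 2 diabetes and diabetes insipidus"
offsets = rabin_karp(sample, "diabetes")
```

The rolling update makes each window hash a constant-time step, so scanning a note against one dictionary term costs roughly linear time in the note length.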
Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction.
Database (Oxford). 2016 Jul;2016
Authors: Le HQ, Tran MV, Dang TH, Ha QT, Collier N
Abstract
The BioCreative V chemical-disease relation (CDR) track was proposed to accelerate the progress of text mining in facilitating integrative understanding of chemicals, diseases and their relations. In this article, we describe an extension of our system (namely UET-CAM) that participated in the BioCreative V CDR. The original UET-CAM system's performance was ranked fourth among 18 participating systems by the BioCreative CDR track committee. In the Disease Named Entity Recognition and Normalization (DNER) phase, our system employed joint inference (decoding) with a perceptron-based named entity recognizer (NER) and a back-off model with Semantic Supervised Indexing and Skip-gram for named entity normalization. In the chemical-induced disease (CID) relation extraction phase, we proposed a pipeline that includes a coreference resolution module and a Support Vector Machine relation extraction model. The former module utilized a multi-pass sieve to extend entity recall. In this article, the UET-CAM system was improved by adding a 'silver' CID corpus to train the prediction model. This silver standard corpus of more than 50 thousand sentences was automatically built based on the Comparative Toxicogenomics Database (CTD). We evaluated our method on the CDR test set. Results showed that our system could reach state-of-the-art performance with F1 of 82.44 for the DNER task and 58.90 for the CID task. Analysis demonstrated substantial benefits of both the multi-pass sieve coreference resolution method (F1 +4.13%) and the silver CID corpus (F1 +7.3%). Database URL: SilverCID, the silver-standard corpus for CID relation extraction, is freely available online at: https://zenodo.org/record/34530 (doi:10.5281/zenodo.34530).
PMID: 27630201 [PubMed - as supplied by publisher]
Understanding factors affecting patient and public engagement and recruitment to digital health interventions: a systematic review of qualitative studies.
BMC Med Inform Decis Mak. 2016;16(1):120
Authors: O'Connor S, Hanlon P, O'Donnell CA, Garcia S, Glanville J, Mair FS
Abstract
BACKGROUND: Numerous types of digital health interventions (DHIs) are available to patients and the public but many factors affect their ability to engage and enrol in them. This systematic review aims to identify and synthesise the qualitative literature on barriers and facilitators to engagement and recruitment to DHIs to inform future implementation efforts.
METHODS: PubMed, MEDLINE, CINAHL, Embase, Scopus and the ACM Digital Library were searched for English language qualitative studies from 2000 - 2015 that discussed factors affecting engagement and enrolment in a range of DHIs (e.g. 'telemedicine', 'mobile applications', 'personal health record', 'social networking'). Text mining and additional search strategies were used to identify 1,448 records. Two reviewers independently carried out paper screening, quality assessment, data extraction and analysis. Data was analysed using framework synthesis, informed by Normalization Process Theory, and Burden of Treatment Theory helped conceptualise the interpretation of results.
RESULTS: Nineteen publications were included in the review. Four overarching themes that affect patient and public engagement and enrolment in DHIs emerged: 1) personal agency and motivation; 2) personal life and values; 3) the engagement and recruitment approach; and 4) the quality of the DHI. The review also summarises engagement and recruitment strategies used. A preliminary DIgital Health EnGagement MOdel (DIEGO) was developed to highlight the key processes involved. Existing knowledge gaps are identified and a number of recommendations made for future research. Study limitations include the restriction to English language publications and the exclusion of grey literature.
CONCLUSION: This review summarises and highlights the complexity of digital health engagement and recruitment processes and outlines issues that need to be addressed before patients and the public commit to digital health and it can be implemented effectively. More work is needed to create successful engagement strategies and better quality digital solutions that are personalised where possible and to gain clinical accreditation and endorsement when appropriate. More investment is also needed to improve computer literacy and ensure technologies are accessible and affordable for those who wish to sign up to them.
SYSTEMATIC REVIEW REGISTRATION: International Prospective Register of Systematic Reviews CRD42015029846.
PMID: 27630020 [PubMed - as supplied by publisher]
Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action.
Front Pharmacol. 2016;7:284
Authors: Papamokos G, Silins I
Abstract
There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.
PMID: 27625608 [PubMed]
Weakly supervised learning of biomedical information extraction from curated data.
BMC Bioinformatics. 2016;17 Suppl 1:1
Authors: Jain S, Tumkur KR, Kuo TT, Bhargava S, Lin G, Hsu CN
Abstract
BACKGROUND: Numerous publicly available biomedical databases derive data by curating from literatures. The curated data can be useful as training examples for information extraction, but curated data usually lack the exact mentions and their locations in the text required for supervised machine learning. This paper describes a general approach to information extraction using curated data as training examples. The idea is to formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee of weak classifiers that consider both curated data and the text.
RESULTS: We test the idea on two information extraction tasks of Genome-Wide Association Studies (GWAS). The first task is to extract target phenotypes (diseases or traits) of a study and the second is to extract ethnicity backgrounds of study subjects for different stages (initial or replication). Experimental results show that our approach can achieve 87% of Precision-at-2 (P@2) for disease/trait extraction, and 0.83 of F1-Score for stage-ethnicity extraction, both outperforming their cost-insensitive baseline counterparts.
CONCLUSIONS: The results show that curated biomedical databases can potentially be reused as training examples to train information extractors without expert annotation or refinement, opening an unprecedented opportunity of using "big data" in biomedical text mining.
PMID: 26817711 [PubMed - indexed for MEDLINE]
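The Precision-at-2 (P@2) figure in the abstract above is a ranking metric. The sketch below uses the common information-retrieval definition of precision at k, the fraction of the top k ranked candidates that are correct; this is an assumption, since the abstract does not spell out the exact definition the authors used. The candidate phrases and gold set are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 2) -> float:
    """Fraction of the top-k ranked candidates that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    hits = sum(1 for candidate in top_k if candidate in relevant)
    # Divide by k, not by len(top_k), so short rankings are penalised.
    return hits / k


# Hypothetical ranked phenotype candidates for one study (illustration only).
ranked_candidates = ["asthma", "height", "body mass index"]
gold = {"asthma", "body mass index"}
p_at_2 = precision_at_k(ranked_candidates, gold, k=2)
```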
Cell line name recognition in support of the identification of synthetic lethality in cancer from text.
Bioinformatics. 2016 Jan 15;32(2):276-82
Authors: Kaewphan S, Van Landeghem S, Ohta T, Van de Peer Y, Ginter F, Pyysalo S
Abstract
MOTIVATION: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
RESULTS: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
AVAILABILITY AND IMPLEMENTATION: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/.
CONTACT: sukaew@utu.fi.
PMID: 26428294 [PubMed - indexed for MEDLINE]