Drug-induced Adverse Events

Link Prediction on a Network of Co-occurring MeSH Terms: Towards Literature-based Discovery.
Methods Inf Med. 2016 Jul 20;55(4)
Authors: Kastrin A, Rindflesch TC, Hristovski D
Abstract
OBJECTIVES: Literature-based discovery (LBD) is a text mining methodology for automatically generating research hypotheses from existing knowledge. We mimic the process of LBD as a classification problem on a graph of MeSH terms. We employ unsupervised and supervised link prediction methods for predicting previously unknown connections between biomedical concepts.
METHODS: We evaluate the effectiveness of link prediction through a series of experiments using a MeSH network that contains the history of link formation between biomedical concepts. We performed link prediction using proximity measures, such as common neighbor (CN), Jaccard coefficient (JC), Adamic / Adar index (AA) and preferential attachment (PA). Our approach relies on the assumption that similar nodes are more likely to establish a link in the future.
RESULTS: Applying an unsupervised approach, the AA measure achieved the best performance in terms of area under the ROC curve (AUC = 0.76), followed by CN, JC, and PA. In a supervised approach, we evaluate whether proximity measures can be combined to define a model of link formation across all four predictors. We applied various classifiers, including decision trees, k-nearest neighbors, logistic regression, multilayer perceptron, naïve Bayes, and random forests. The random forest classifier achieved the best performance (AUC = 0.87).
CONCLUSIONS: The link prediction approach proved to be effective for LBD processing. Supervised statistical learning approaches clearly outperform an unsupervised approach to link prediction.
PMID: 27435341 [PubMed - as supplied by publisher]
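The four proximity measures named in the abstract above can be illustrated on a toy graph. The following is a minimal sketch using the standard textbook definitions of each score, not the authors' implementation; the graph, node names, and function names are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

# Toy undirected co-occurrence graph; node names are placeholders, not MeSH terms.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def common_neighbors(u, v):
    # CN: number of neighbours shared by u and v.
    return len(neighbors[u] & neighbors[v])

def jaccard(u, v):
    # JC: shared neighbours divided by the size of the combined neighbourhood.
    union = neighbors[u] | neighbors[v]
    return len(neighbors[u] & neighbors[v]) / len(union) if union else 0.0

def adamic_adar(u, v):
    # AA: shared neighbours weighted by 1 / log(degree), so rare hubs count more.
    # A shared neighbour of two distinct nodes has degree >= 2, so log() > 0.
    return sum(1.0 / math.log(len(neighbors[z])) for z in neighbors[u] & neighbors[v])

def preferential_attachment(u, v):
    # PA: product of the two node degrees.
    return len(neighbors[u]) * len(neighbors[v])

# Score every currently unlinked pair and rank by AA (highest first).
candidates = [(u, v) for u, v in combinations(sorted(neighbors), 2)
              if v not in neighbors[u]]
ranked = sorted(candidates, key=lambda pair: adamic_adar(*pair), reverse=True)
```

In this toy graph the unlinked pair ("A", "D") shares two neighbours and ranks first under AA. A supervised variant would feed the four scores per pair into a classifier as features.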
Comprehensive Map of Molecules Implicated in Obesity.
PLoS One. 2016;11(2):e0146759
Authors: Jagannadham J, Jaiswal HK, Agrawal S, Rawal K
Abstract
Obesity is a global epidemic affecting over 1.5 billion people and is one of the risk factors for several diseases such as type 2 diabetes mellitus and hypertension. We have constructed a comprehensive map of the molecules reported to be implicated in obesity. A deep curation strategy was complemented by a novel semi-automated text mining system in order to screen 1,000 full-length research articles and over 90,000 abstracts that are relevant to obesity. We obtain a scale-free network of 804 nodes and 971 edges, composed of 510 proteins, 115 genes, 62 complexes, 23 RNA molecules, 83 simple molecules, 3 phenotypes and 3 drugs in a "bow-tie" architecture. We classify this network into 5 modules and identify new links between the recently discovered fat mass and obesity associated FTO gene and well-studied examples such as insulin and leptin. We further built an automated docking pipeline to dock orlistat as well as other drugs against the 24,000 proteins in the human structural proteome to explain the therapeutics and side effects at a network level. Based upon our experiments, we propose that the therapeutic effect comes through the binding of one drug with several molecules in the target network, and the binding propensity is both statistically significant and different in comparison with any other part of the human structural proteome.
PMID: 26886906 [PubMed - indexed for MEDLINE]
Detecting themes of public concern: a text mining analysis of the Centers for Disease Control and Prevention's Ebola live Twitter chat.
Am J Infect Control. 2015 Oct 1;43(10):1109-11
Authors: Lazard AJ, Scheinfeld E, Bernhardt JM, Wilcox GB, Suran M
Abstract
A diagnosis of Ebola on US soil triggered widespread panic. In response, the Centers for Disease Control and Prevention held a live Twitter chat to address public concerns. This study applied a textual analytics method to reveal insights from these tweets that can inform communication strategies. User-generated tweets were collected, sorted, and analyzed to reveal major themes. The public was concerned with symptoms and lifespan of the virus, disease transfer and contraction, safe travel, and protection of one's body.
PMID: 26138998 [PubMed - indexed for MEDLINE]
Machine learning classification of surgical pathology reports and chunk recognition for information extraction noise reduction.
Artif Intell Med. 2016 Jun;70:77-83
Authors: Napolitano G, Marshall A, Hamilton P, Gavin AT
Abstract
BACKGROUND AND AIMS: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
MATERIALS AND METHODS: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: 'semi-structured' and 'unstructured'. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
RESULTS: The best result of 99.4% accuracy - which included only one semi-structured report predicted as unstructured - was produced by the layout classifier with the k-nearest neighbour algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports, precision ranged from 0.97 to 0.94 and recall from 0.92 to 0.83, while for unstructured reports precision ranged from 0.91 to 0.64 and recall from 0.68 to 0.41. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured ones.
CONCLUSIONS: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
PMID: 27431038 [PubMed - in process]
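The layout classifier described above combines binary term occurrence vectors, a stopword filter, and a nearest-neighbour vote. The sketch below illustrates that combination in plain Python rather than RapidMiner. The report snippets are invented toy strings, the stopword list is hypothetical, the Jaccard similarity is an assumption (the abstract does not name a distance measure), and pruning is omitted.

```python
from collections import Counter

# Hypothetical stopword list; the study's actual filter is not given in the abstract.
STOPWORDS = {"the", "a", "an", "and", "of", "with", "no", "were", "was", "is", "in", "about"}

def binary_terms(text):
    # Binary term occurrence: a term is either present or absent, so a set suffices.
    tokens = (tok.strip(".,:;()") for tok in text.lower().split())
    return {tok for tok in tokens if tok and tok not in STOPWORDS}

def jaccard_similarity(a, b):
    # Assumed similarity measure between two binary term sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(train, text, k=3):
    # Rank training reports by similarity to the query and take a majority vote.
    query = binary_terms(text)
    ranked = sorted(train,
                    key=lambda item: jaccard_similarity(query, binary_terms(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented examples for illustration only; not taken from any registry data.
train = [
    ("Tumour size: 22 mm. Nodes positive: 2. ER status: positive.", "semi-structured"),
    ("Tumour size: 15 mm. Nodes positive: 0. ER status: negative.", "semi-structured"),
    ("Morphology: ductal carcinoma. Tumour size: 30 mm.", "semi-structured"),
    ("Sections show an invasive ductal carcinoma measuring about 22 mm with two involved nodes.", "unstructured"),
    ("The specimen shows a lobular carcinoma and no involved nodes were seen.", "unstructured"),
    ("Sections show a small invasive carcinoma with clear margins.", "unstructured"),
]

layout_a = knn_predict(train, "Tumour size: 18 mm. ER status: positive.")
layout_b = knn_predict(train, "Sections show an invasive lobular carcinoma with involved nodes.")
```

Here `layout_a` comes out as "semi-structured" and `layout_b` as "unstructured", because field-style labels and narrative wording pull the nearest neighbours in opposite directions.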
Ebola virus disease and social media: A systematic review.
Am J Infect Control. 2016 Jul 14;
Authors: Fung IC, Duke CH, Finch KC, Snook KR, Tseng PL, Hernandez AC, Gambhir M, Fu KW, Tse ZT
Abstract
OBJECTIVES: We systematically reviewed existing research pertinent to Ebola virus disease and social media, especially to identify the research questions and the methods used to collect and analyze social media.
METHODS: We searched 6 databases for research articles pertinent to Ebola virus disease and social media. We extracted the data using a standardized form. We evaluated the quality of the included articles.
RESULTS: Twelve articles were included in the main analysis: 7 from Twitter with 1 also including Weibo, 1 from Facebook, 3 from YouTube, and 1 from Instagram and Flickr. All the studies were cross-sectional. Eleven of the 12 articles studied ≥ 1 of these 3 elements of social media and their relationships: themes or topics of social media contents, meta-data of social media posts (such as frequency of original posts and reposts, and impressions) and characteristics of the social media accounts that made these posts (such as whether they are individuals or institutions). One article studied how news videos influenced Twitter traffic. Twitter content analysis methods included text mining (n = 3) and manual coding (n = 1). Two studies involved mathematical modeling. All 3 YouTube studies and the Instagram/Flickr study used manual coding of videos and images, respectively.
CONCLUSIONS: Published Ebola virus disease-related social media research focused on Twitter and YouTube. The utility of social media research to public health practitioners is warranted.
PMID: 27425009 [PubMed - as supplied by publisher]
Overview of major molecular alterations during progression from Barrett's esophagus to esophageal adenocarcinoma.
Ann N Y Acad Sci. 2016 Jul 14;
Authors: Kalatskaya I
Abstract
Esophageal adenocarcinoma (EAC) develops in the sequential transformation of normal epithelium into metaplastic epithelium, called Barrett's esophagus (BE), then to dysplasia, and finally cancer. BE is a common condition in which normal stratified squamous epithelium of the esophagus is replaced with an intestine-like columnar epithelium, and it is the most prominent risk factor for EAC. This review aims to impartially systemize the knowledge from a large number of publications that describe the molecular and biochemical alterations occurring over this progression sequence. In order to provide an unbiased extraction of the knowledge from the literature, a text-mining methodology was used to select genes that are involved in the BE progression, with the top candidate genes found to be TP53, CDKN2A, CTNNB1, CDH1, GPX3, and NOX5. In addition, sample frequencies across analyzed patient cohorts at each stage of disease progression are summarized. All six genes are altered in the majority of EAC patients, and accumulation of alterations correlates well with the sequential progression of BE to cancer, indicating that the text-mining method is a valid approach for gene prioritization. This review discusses how, besides being cancer drivers, these genes are functionally interconnected and might collectively be considered a central hub of BE progression.
PMID: 27415609 [PubMed - as supplied by publisher]
Muscle Logic: New Knowledge Resource for Anatomy Enables Comprehensive Searches of the Literature on the Feeding Muscles of Mammals.
PLoS One. 2016;11(2):e0149102
Authors: Druzinsky RE, Balhoff JP, Crompton AW, Done J, German RZ, Haendel MA, Herrel A, Herring SW, Lapp H, Mabee PM, Muller HM, Mungall CJ, Sternberg PW, Van Auken K, Vinyard CJ, Williams SH, Wall CE
Abstract
BACKGROUND: In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required.
DEVELOPMENT AND TESTING OF THE ONTOLOGY: Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable, definition of each muscle, which includes its attachments and innervation, is true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO is developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly-available, online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions). We compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar.
RESULTS AND SIGNIFICANCE: Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, but PubMed and Google Scholar are not. Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.
PMID: 26870952 [PubMed - indexed for MEDLINE]
dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins.
Nucleic Acids Res. 2016 Jan 4;44(D1):D435-46
Authors: Huang KY, Su MG, Kao HJ, Hsieh YC, Jhong JH, Cheng KH, Huang HD, Lee TY
Abstract
Owing to the importance of the post-translational modifications (PTMs) of proteins in regulating biological processes, the dbPTM (http://dbPTM.mbc.nctu.edu.tw/) was developed as a comprehensive database of experimentally verified PTMs from several databases with annotations of potential PTMs for all UniProtKB protein entries. For this 10th anniversary of dbPTM, the updated resource provides not only a comprehensive dataset of experimentally verified PTMs, supported by the literature, but also an integrative interface for accessing all available databases and tools that are associated with PTM analysis. As well as collecting experimental PTM data from 14 public databases, this update manually curates over 12 000 modified peptides, including the emerging S-nitrosylation, S-glutathionylation and succinylation, from approximately 500 research articles, which were retrieved by text mining. As the number of available PTM prediction methods increases, this work compiles a non-homologous benchmark dataset to evaluate the predictive power of online PTM prediction tools. An increasing interest in the structural investigation of PTM substrate sites motivated the mapping of all experimental PTM peptides to protein entries of Protein Data Bank (PDB) based on database identifier and sequence identity, which enables users to examine spatially neighboring amino acids, solvent-accessible surface area and side-chain orientations for PTM substrate sites on tertiary structures. Since drug binding in PDB is annotated, this update identified over 1100 PTM sites that are associated with drug binding. The update also integrates metabolic pathways and protein-protein interactions to support the PTM network analysis for a group of proteins. Finally, the web interface is redesigned and enhanced to facilitate access to this resource.
PMID: 26578568 [PubMed - indexed for MEDLINE]
BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language.
Database (Oxford). 2016;2016
Authors: Rinaldi F, Ellendorff TR, Madan S, Clematide S, van der Lek A, Mevissen T, Fluck J
Abstract
Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text.
PMID: 27402677 [PubMed - in process]
Extracting Information from Electronic Medical Records to Identify the Obesity Status of a Patient Based on Comorbidities and Bodyweight Measures.
J Med Syst. 2016 Aug;40(8):191
Authors: Figueroa RL, Flores CA
Abstract
Obesity is a chronic disease with an increasing impact on the world's population. In this work, we present a method of identifying obesity automatically using text mining techniques and information related to body weight measures and obesity comorbidities. We used a dataset of 3015 de-identified medical records that contain labels for two classification problems. The first classification problem distinguishes between obesity, overweight, normal weight, and underweight. The second classification problem differentiates between obesity types: super obesity, morbid obesity, severe obesity and moderate obesity. We used a Bag of Words approach to represent the records together with unigram and bigram representations of the features. We implemented two approaches: a hierarchical method and a nonhierarchical one. We used Support Vector Machine and Naïve Bayes together with ten-fold cross validation to evaluate and compare performances. Our results indicate that the hierarchical approach does not work as well as the nonhierarchical one. In general, our results show that Support Vector Machine obtains better performances than Naïve Bayes for both classification problems. We also observed that bigram representation improves performance compared with unigram representation.
PMID: 27402260 [PubMed - in process]
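The Bag of Words representation with unigram and bigram features mentioned above can be sketched in a few lines. This is an illustrative example only: whitespace tokenisation, the function names, and the sample sentence are assumptions, not details taken from the study.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined back into a single feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, max_n=2):
    # Count unigrams up to max_n-grams; word order survives only inside each n-gram.
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag

# Hypothetical sentence, used only to show which features are produced.
features = bag_of_words("Patient denies morbid obesity")
```

The bigram "morbid obesity" becomes a feature of its own, which a unigram-only representation cannot express. That is one plausible reason bigrams can help in this kind of task.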
Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.
Springerplus. 2016;5(1):942
Authors: Agnihotri D, Verma K, Tripathi P
Abstract
The contiguous sequences of the terms (N-grams) in the documents are symmetrically distributed among different classes. The symmetrical distribution of the N-grams raises uncertainty about which class an N-gram belongs to. In this paper, we focus on the selection of the most discriminating N-grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method named the symmetrical strength of the N-grams (SSNG) is proposed using a two-pass filtering based feature selection (TPF) approach. Initially, in the first pass of the TPF, the SSNG method chooses various informative N-grams from all the extracted N-grams of the corpus. Subsequently, in the second pass, the well-known Chi Square (χ²) method is used to select a few of the most informative N-grams. Further, to classify the documents, two standard classifiers, Multinomial Naive Bayes and Linear Support Vector Machine, have been applied to ten standard text data sets. On most of the datasets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach are superior to those of the state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ².
PMID: 27386386 [PubMed - as supplied by publisher]
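The second-pass Chi Square scoring mentioned above can be illustrated with the standard 2×2 contingency formulation for one N-gram and one class. This sketch covers only that textbook statistic; the SSNG score of the first pass is not defined in the abstract and is not reproduced here. The function name and the document counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of one N-gram for one class from document counts.

    a: documents in the class that contain the N-gram
    b: documents outside the class that contain the N-gram
    c: documents in the class that lack the N-gram
    d: documents outside the class that lack the N-gram
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the N-gram or the class never varies
    return n * (a * d - c * b) ** 2 / denominator

# Hypothetical counts: an N-gram concentrated in one class versus one spread evenly.
skewed = chi_square(40, 10, 10, 40)
even = chi_square(25, 25, 25, 25)
```

An N-gram spread evenly across classes scores zero, while one concentrated in a single class scores high, so ranking by this score keeps the more class-specific N-grams.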
Text Mining the History of Medicine.
PLoS One. 2016;11(1):e0144717
Authors: Thompson P, Batista-Navarro RT, Kontonatsios G, Carter J, Toon E, McNaught J, Timmermann C, Worboys M, Ananiadou S
Abstract
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
PMID: 26734936 [PubMed - indexed for MEDLINE]
"Initial investigation into computer scoring of candidate essays for personnel selection": Correction to Campion et al. (2016).
J Appl Psychol. 2016 Jul;101(7):975
Authors:
Abstract
Reports an error in "Initial Investigation Into Computer Scoring of Candidate Essays for Personnel Selection" by Michael C. Campion, Michael A. Campion, Emily D. Campion and Matthew H. Reider (Journal of Applied Psychology, Advanced Online Publication, Apr 14, 2016, np). In the article the affiliations for Emily D. Campion and Matthew H. Reider were originally incorrect. All versions of this article have been corrected. (The following abstract of the original article appeared in record 2016-18130-001.) Emerging advancements including the exponentially growing availability of computer-collected data and increasingly sophisticated statistical software have led to a "Big Data Movement" wherein organizations have begun attempting to use large-scale data analysis to improve their effectiveness. Yet, little is known regarding how organizations can leverage these advancements to develop more effective personnel selection procedures, especially when the data are unstructured (text-based). Drawing on literature on natural language processing, we critically examine the possibility of leveraging advances in text mining and predictive modeling computer software programs as a surrogate for human raters in a selection context. We explain how to "train" a computer program to emulate a human rater when scoring accomplishment records. We then examine the reliability of the computer's scores, provide preliminary evidence of their construct validity, demonstrate that this practice does not produce scores that disadvantage minority groups, illustrate the positive financial impact of adopting this practice in an organization (N ∼ 46,000 candidates), and discuss implementation issues. Finally, we discuss the potential implications of using computer scoring to address the adverse impact-validity dilemma. We suggest that it may provide a cost-effective means of using predictors that have comparable validity but have previously been too expensive for large-scale screening. 
PMID: 27379396 [PubMed - as supplied by publisher]
Transfer Learning for Class Imbalance Problems with Inadequate Data.
Knowl Inf Syst. 2016 Jul;48(1):201-228
Authors: Al-Stouhi S, Reddy CK
Abstract
A fundamental problem in data mining is to effectively build robust classifiers in the presence of skewed data distributions. Class imbalance classifiers are trained specifically for skewed distribution datasets. Existing methods assume an ample supply of training examples as a fundamental prerequisite for constructing an effective classifier. However, when sufficient data is not readily available, the development of a representative classification algorithm becomes even more difficult due to the unequal distribution between classes. We provide a unified framework that will potentially take advantage of auxiliary data using a transfer learning mechanism and simultaneously build a robust classifier to tackle this imbalance issue in the presence of few training samples in a particular target domain of interest. Transfer learning methods use auxiliary data to augment learning when training examples are not sufficient, and in this paper we develop a method that is optimized to simultaneously augment the training data and induce balance into skewed datasets. We propose a novel boosting-based instance-transfer classifier with a label-dependent update mechanism that simultaneously compensates for class imbalance and incorporates samples from an auxiliary domain to improve classification. We provide theoretical and empirical validation of our method and apply it to healthcare and text classification applications.
PMID: 27378821 [PubMed - as supplied by publisher]
Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics.
Biomed Inform Insights. 2016;8(Suppl 1):1-11
Authors: Torii M, Tilak SS, Doan S, Zisook DS, Fan JW
Abstract
In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.
PMID: 27375358 [PubMed]
Coreference resolution improves extraction of Biological Expression Language statements from texts.
Database (Oxford). 2016;2016
Authors: Choi M, Liu H, Baumgartner W, Zobel J, Verspoor K
Abstract
We describe a system that automatically extracts biological events from biomedical journal articles, and translates those events into Biological Expression Language (BEL) statements. The system incorporates existing text mining components for coreference resolution, biological event extraction and a previously formally untested strategy for BEL statement generation. Although addressing the BEL track (Track 4) at BioCreative V (2015), we also investigate how incorporating coreference resolution might impact event extraction in the biomedical domain. In this paper, we report that our system achieved the best performance, with F-scores of 20.2 and 35.2 at the full BEL statement level in stage 1 and in stage 2 using provided gold standard entities, respectively. We also report that our results evaluated on the training dataset show a benefit from integrating coreference resolution with event extraction.
PMID: 27374122 [PubMed - as supplied by publisher]
Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts.
J Biomed Inform. 2016 Jun 27;
Authors: Korkontzelos I, Nikfarjam A, Shardlow M, Sarker A, Ananiadou S, Gonzalez GH
Abstract
OBJECTIVE: The abundance of text available in social media and health related forums along with the rich expression of public opinion have recently attracted the interest of the public health community to use these sources for pharmacovigilance. Based on the intuition that patients post about Adverse Drug Reactions (ADRs) expressing negative sentiments, we investigate the effect of sentiment analysis features in locating ADR mentions.
METHODS: We enrich the feature space of a state-of-the-art ADR identification method with sentiment analysis features. Using a corpus of posts from the DailyStrength forum and tweets annotated for ADR and indication mentions, we evaluate the extent to which sentiment analysis features help in locating ADR mentions and distinguishing them from indication mentions.
RESULTS: Evaluation results show that sentiment analysis features marginally improve ADR identification in tweets and health related forum posts. Adding sentiment analysis features achieved a statistically significant F-measure increase from 72.14% to 73.22% in the Twitter part of an existing corpus using its original train/test split. Using stratified 10 × 10-fold cross-validation, statistically significant F-measure increases were shown in the DailyStrength part of the corpus, from 79.57% to 80.14%, and in the Twitter part of the corpus, from 66.91% to 69.16%. Moreover, sentiment analysis features are shown to reduce the number of ADRs being recognised as indications.
CONCLUSION: This study shows that adding sentiment analysis features can marginally improve the performance of even a state-of-the-art ADR identification method. This improvement can be of use to pharmacovigilance practice, due to the rapidly increasing popularity of social media and health forums.
PMID: 27363901 [PubMed - as supplied by publisher]
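The stratified 10 × 10-fold cross-validation mentioned above is read here as ten repetitions of stratified 10-fold splitting. The sketch below shows one way such splits could be generated in plain Python. The round-robin assignment, function names, seed, and the 30/70 label mix are all assumptions for illustration and do not reflect the study's corpus or tooling.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    # Shuffle the indices of each class, then deal them round-robin into k folds
    # so every fold keeps roughly the overall class proportions.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def repeated_stratified_cv(labels, k=10, repeats=10, seed=0):
    # Yield (train_indices, test_indices) for every fold of every repetition.
    rng = random.Random(seed)
    for _ in range(repeats):
        for test_idx in stratified_folds(labels, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield train_idx, sorted(test_idx)

# Hypothetical label distribution: 30 posts with an ADR mention, 70 without.
labels = ["ADR"] * 30 + ["none"] * 70
splits = list(repeated_stratified_cv(labels))
```

Each of the 100 resulting splits holds out 10 items with the same 3:7 class ratio as the full set. An F-measure averaged over them is therefore less sensitive to one lucky or unlucky partition than a single train/test split.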
Protein-protein interaction identification using a hybrid model.
Artif Intell Med. 2015 Jul;64(3):185-93
Authors: Niu Y, Wang Y
Abstract
BACKGROUND: Most existing systems that identify protein-protein interaction (PPI) in literature make decisions solely on evidence within a single sentence and ignore the rich context of PPI descriptions in large corpora. Moreover, they often suffer from the heavy burden of manual annotation.
METHODS: To address these problems, a new relational-similarity (RS)-based approach exploiting context in large-scale text is proposed. A basic RS model is first established to make initial predictions. Then word similarity matrices that are sensitive to the PPI identification task are constructed using a corpus-based approach. Finally, a hybrid model is developed to integrate the word similarity model with the basic RS model.
RESULTS: The experimental results show that the basic RS model achieves F-scores much higher than a baseline of random guessing on interactions (from 50.6% to 75.0%) and non-interactions (from 49.4% to 74.2%). The hybrid model further improves F-score by about 2% on interactions and 3% on non-interactions.
CONCLUSION: The experimental evaluations conducted with PPIs in well-known databases showed the effectiveness of our approach that explores context information in PPI identification. This investigation confirmed that within the framework of relational similarity, the word similarity model relieves the data sparseness problem in similarity calculation.
PMID: 26054427 [PubMed - indexed for MEDLINE]
Dense Annotation of Free-Text Critical Care Discharge Summaries from an Indian Hospital and Associated Performance of a Clinical NLP Annotator.
J Med Syst. 2016 Aug;40(8):187
Authors: Ramanan SV, Radhakrishna K, Waghmare A, Raj T, Nathan SP, Sreerama SM, Sampath S
Abstract
Electronic Health Record (EHR) use in India is generally poor, and structured clinical information is mostly lacking. This work is the first attempt aimed at evaluating unstructured text mining for extracting relevant clinical information from Indian clinical records. We annotated a corpus of 250 discharge summaries from an Intensive Care Unit (ICU) in India, with markups for diseases, procedures, and lab parameters, their attributes, as well as key demographic information and administrative variables such as patient outcomes. In this process, we have constructed guidelines for an annotation scheme useful to clinicians in the Indian context. We evaluated the performance of an NLP engine, Cocoa, on a cohort of these Indian clinical records. We have produced an annotated corpus of roughly 90 thousand words, which to our knowledge is the first tagged clinical corpus from India. Cocoa was evaluated on a test corpus of 50 documents. The overlap F-scores across the major categories, namely disease/symptoms, procedures, laboratory parameters and outcomes, are 0.856, 0.834, 0.961 and 0.872 respectively. These results are competitive with results from recent shared tasks based on US records. The annotated corpus and associated results from the Cocoa engine indicate that unstructured text mining is a viable method for cohort analysis in the Indian clinical context, where structured EHR records are largely absent.
PMID: 27342107 [PubMed - in process]
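Illustrative sketch (not the Cocoa evaluation code): the abstract above reports overlap F-scores per annotation category. The Python below shows one common way to compute such a score, assuming that a predicted span counts as correct when it shares at least one character with a gold span of the same category. That matching criterion, the `(start, end, category)` span format, and the function names are assumptions; the abstract does not define them.

```python
def overlaps(a, b):
    """True if half-open character spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_f_score(gold, predicted):
    """Overlap-based F-score.

    gold, predicted: lists of (start, end, category) tuples, end exclusive.
    A span is matched if some span on the other side has the same category
    and overlaps it (assumed criterion).
    """
    def matched(span, others):
        return any(span[2] == o[2] and overlaps(span, o) for o in others)

    pred_hits = sum(matched(p, gold) for p in predicted)
    gold_hits = sum(matched(g, predicted) for g in gold)
    precision = pred_hits / len(predicted) if predicted else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [(0, 10, "disease"), (20, 30, "procedure"), (40, 50, "lab")]
    predicted = [(5, 12, "disease"), (20, 30, "lab"), (41, 49, "lab")]
    print(round(overlap_f_score(gold, predicted), 3))
```

Overlap matching is more lenient than exact-boundary matching, so scores computed this way are generally not directly comparable with exact-match results.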
Identification of a thienopyrimidine derivatives target by a kinome and chemical biology approach.
Arch Pharm Res. 2015 Sep;38(9):1575-81
Authors: Lee C, Yang JS, Han G
Abstract
Target identification through chemical biology has been considered one of the most efficient approaches for drug discovery. Thienopyrimidine derivatives were designed to discover potent IκB kinase β (IKKβ) inhibitors based on a known IKKβ inhibitor library. Most of the thienopyrimidine derivatives inhibited nitric oxide and tumor necrosis factor alpha, which are downstream of the NF-κB signaling pathway, but not IKKβ. To identify the appropriate targets of thienopyrimidine analogues, chemical biology approaches, including text mining and a subsequent kinase panel assay from the kinome profiling, were used. Based on the results, Fms-like tyrosine kinase 3 was found to be the target of the thienopyrimidine derivatives, which were confirmed to be potent inhibitors for acute myeloid leukemia.
PMID: 26186885 [PubMed - indexed for MEDLINE]