Drug-induced Adverse Events

Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts.
BMC Med Inform Decis Mak. 2016;16 Suppl 1:68
Authors: Verspoor KM, Heo GE, Kang KY, Song M
Abstract
BACKGROUND: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems.
METHODS: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus.
RESULTS: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78 to 0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations.
CONCLUSIONS: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
PMID: 27454860 [PubMed - in process]
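The co-occurrence baseline that the analysis above compares against can be sketched in a few lines: every pair of gold entities appearing in the same sentence is predicted to be related. The sentence, entity mentions and types below are illustrative placeholders, not annotations from the Variome corpus:

```python
# Hypothetical sketch of a sentence-level co-occurrence baseline for
# relation extraction: any pair of gold entities in the same sentence
# is predicted to stand in an (untyped) relation.
from itertools import combinations

def cooccurrence_baseline(sentences):
    """sentences: list of (text, entities) pairs, where entities is a
    list of (mention, entity_type) tuples annotated in that sentence."""
    predictions = []
    for text, entities in sentences:
        # Predict a relation for every unordered entity pair.
        for (m1, t1), (m2, t2) in combinations(entities, 2):
            predictions.append((m1, m2))
    return predictions

sents = [
    ("BRAF V600E is associated with colorectal cancer.",
     [("BRAF", "gene"), ("V600E", "mutation"),
      ("colorectal cancer", "disease")]),
]
pairs = cooccurrence_baseline(sents)
# Three entities in one sentence yield three candidate pairs.
```

Such a baseline has perfect Recall on sentence-internal relations but poor Precision, which is why the abstract stresses the value of simple semantic constraints on which entity-type pairs may be related.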
Protein-protein interaction extraction with feature selection by evaluating contribution levels of groups consisting of related features.
BMC Bioinformatics. 2016;17 Suppl 7:246
Authors: Thuy Phan TT, Ohkawa T
Abstract
BACKGROUND: Protein-protein interaction (PPI) extraction from published scientific articles is one key issue in biological research due to its importance in grasping biological processes. Despite considerable advances of recent research in automatic PPI extraction from articles, demand remains to enhance the performance of the existing methods.
RESULTS: Our feature-based method incorporates the strength of many kinds of diverse features, such as lexical and word context features derived from sentences, syntactic features derived from parse trees, and features using existing patterns to extract PPIs automatically from articles. Among these abundant features, we assemble the related features into four groups and define the contribution level (CL) for each group, which consists of related features. Our method consists of two steps. First, we divide the training set into subsets based on the structure of the sentence and the existence of significant keywords (SKs) and apply the sentence patterns given in advance to each subset. Second, we automatically perform feature selection based on the CL values of the four groups that consist of related features and the k-nearest neighbor algorithm (k-NN) through three approaches: (1) focusing on the group with the best contribution level (BEST1G); (2) unoptimized combination of three groups with the best contribution levels (U3G); (3) optimized combination of two groups with the best contribution levels (O2G).
CONCLUSIONS: Our method outperforms other state-of-the-art PPI extraction systems in terms of F-score on the HPRD50 corpus and achieves promising results that are comparable with these systems on the other corpora. Moreover, on all the corpora our method obtains a higher F-score than k-NN alone, which does not exploit the CLs of the groups of related features.
PMID: 27454611 [PubMed - in process]
CLASH: Complementary Linkage with Anchoring and Scoring for Heterogeneous biomolecular and clinical data.
BMC Med Inform Decis Mak. 2016;16 Suppl 3:72
Authors: Nam Y, Kim M, Lee K, Shin H
Abstract
BACKGROUND: Disease-disease association has been increasingly viewed and analyzed as a network, in which the connections between diseases are configured using source information on interactome maps of biomolecules such as genes, proteins, metabolites, etc. Although an abundance of source information leads to tighter connections between diseases in the network, for certain groups of diseases, such as metabolic diseases, few connections arise because the source information is insufficient; a large proportion of their associated genes are still unknown. One way to circumvent the lack of source information is to integrate available external information using one of the up-to-date integration or fusion methods. However, if one wants a disease network that places strong emphasis on the original source of data and uses external sources only to complement it, integration may not be pertinent: interpretation of the integrated network would be ambiguous, as the meaning conferred on edges would be vague due to the fused information.
METHODS: In this study, we propose a network based algorithm that complements the original network by utilizing external information while preserving the network's originality. The proposed algorithm links the disconnected node to the disease network by using complementary information from external data source through four steps: anchoring, connecting, scoring, and stopping.
RESULTS: When applied to the network of metabolic diseases sourced from protein-protein interaction data, the proposed algorithm recovered 97% of the connections and improved the AUC performance to 0.71 (lifted from 0.55) by using external information derived from text mining of the PubMed comorbidity literature. Experimental results also show that the proposed algorithm is robust to noisy external information.
CONCLUSION: The novelty of this research is that the proposed algorithm preserves the network's originality while at the same time complementing it with external information. Furthermore, it can be used both to recover original associations and to discover novel associations in the disease network.
PMID: 27454118 [PubMed - in process]
Reflection of successful anticancer drug development processes in the literature.
Drug Discov Today. 2016 Jul 18;
Authors: Heinemann F, Huber T, Meisel C, Bundschus M, Leser U
Abstract
The development of cancer drugs is time-consuming and expensive. In particular, failures in late-stage clinical trials are a major cost driver for pharmaceutical companies. This puts a high demand on methods that provide insights into the success chances of new potential medicines. In this study, we systematically analyze publication patterns emerging along the drug discovery process of targeted cancer therapies, from basic research to drug approval or failure. We find clear differences in the patterns of approved drugs compared with those that failed in Phase II/III. Feeding these features into a machine learning classifier allows us to predict the approval or failure of a targeted cancer drug significantly better than educated guessing. We believe that these findings could lead to novel measures for supporting decision making in drug development.
PMID: 27443674 [PubMed - as supplied by publisher]
A literature-driven method to calculate similarities among diseases.
Comput Methods Programs Biomed. 2015 Nov;122(2):108-22
Authors: Kim H, Yoon Y, Ahn J, Park S
Abstract
BACKGROUND: "Our lives are connected by a thousand invisible threads and along these sympathetic fibers, our actions run as causes and return to us as results." This famous quote from Herman Melville describes the connections among human lives. To paraphrase Melville: diseases are connected by many functional threads, and along these sympathetic fibers, diseases run as causes and return as results. The quote captures the motivation for research on disease-disease similarity and disease networks. Measuring similarities between diseases and constructing disease networks can play an important role in disease function research and disease treatment. To estimate disease-disease similarities, we propose a novel literature-based method.
METHODS AND RESULTS: The proposed method extracted disease-gene relations and disease-drug relations from literature and used the frequencies of occurrence of the relations as features to calculate similarities among diseases. We also constructed disease network with top-ranking disease pairs from our method. The proposed method discovered a larger number of answer disease pairs than other comparable methods and showed the lowest p-value.
CONCLUSIONS: We presume that our method performed well because it uses literature data, uses all possible gene symbols and drug names as features of a disease, and determines feature values from the co-occurrence frequencies of two entities. The disease-disease similarities produced by the proposed method can be used in computational biology research that relies on similarities among diseases.
PMID: 26212477 [PubMed - indexed for MEDLINE]
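The core similarity computation described above can be sketched as cosine similarity between sparse frequency vectors, where each disease is represented by its co-occurrence counts with genes and drugs in the literature. The diseases, features and counts below are invented for illustration, not values from the study:

```python
# Minimal sketch of literature-driven disease similarity: each disease
# is a sparse vector of co-occurrence frequencies with genes/drugs, and
# similarity is the cosine between vectors. All counts are illustrative.
import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Feature values: frequency of co-occurrence with each gene/drug in text.
type2_diabetes = {"INS": 40, "PPARG": 12, "metformin": 30}
obesity        = {"INS": 25, "FTO": 18, "LEP": 22, "metformin": 5}
asthma         = {"IL13": 9, "salbutamol": 14}

sim = cosine(type2_diabetes, obesity)  # shared INS/metformin -> nonzero
```

Diseases sharing no literature features (here, type 2 diabetes and asthma) get similarity 0, while overlapping gene and drug profiles yield a score between 0 and 1 that can rank candidate disease pairs for the network.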
Expert-Guided Generative Topographical Modeling with Visual to Parametric Interaction.
PLoS One. 2016;11(2):e0129122
Authors: Han C, House L, Leman SC
Abstract
Introduced by Bishop et al. in 1996, Generative Topographic Mapping (GTM) is a powerful nonlinear latent variable modeling approach for visualizing high-dimensional data. It has proven useful when typical linear methods fail. However, GTM still suffers from drawbacks: its complex parameterization of data makes GTM hard to fit and sensitive to slight changes in the model. For this reason, we extend GTM to a visual analytics framework so that users may guide the parameterization and assess the data from multiple GTM perspectives. Specifically, we develop the theory and methods for Visual to Parametric Interaction (V2PI) with data using GTM visualizations. The result is a dynamic version of GTM that fosters data exploration. We refer to the new version as V2PI-GTM. In this paper, we develop V2PI-GTM in stages and demonstrate its benefits within the context of a text mining case study.
PMID: 26905728 [PubMed - indexed for MEDLINE]
DESM: portal for microbial knowledge exploration systems.
Nucleic Acids Res. 2016 Jan 4;44(D1):D624-33
Authors: Salhi A, Essack M, Radovanovic A, Marchand B, Bougouffa S, Antunes A, Simoes MF, Lafi FF, Motwalli OA, Bokhari A, Malas T, Amoudi SA, Othum G, Allam I, Mineta K, Gao X, Hoehndorf R, C Archer JA, Gojobori T, Bajic VB
Abstract
Microorganisms produce an enormous variety of chemical compounds. It is of general interest for microbiology and biotechnology researchers to have a means to explore information about the molecular and genetic basis of the functioning of different microorganisms and their capacity for bioproduction. To enable such exploration, we compiled 45 topic-specific knowledgebases (KBs) accessible through the DESM portal (www.cbrc.kaust.edu.sa/desm). The KBs contain information derived through text-mining of PubMed records and complemented by information data-mined from various other resources (e.g. ChEBI, Entrez Gene, GO, KOBAS, KEGG, UniPathways, BioGrid). All PubMed records were indexed using 4,538,278 concepts from 29 dictionaries, with 1,638,986 records utilized in the KBs. Concepts used are normalized whenever possible. Most of the KBs focus on a particular type of microbial activity, such as production of biocatalysts or nutraceuticals. Others are focused on specific categories of microorganisms, e.g. Streptomyces or cyanobacteria. The KBs are all structured in a uniform manner and have a standardized user interface. Information exploration is enabled through various searches. Users can explore the statistically most significant concepts or pairs of concepts, generate hypotheses, create interactive networks of associated concepts and export results. We believe DESM will be a useful complement to existing resources, to the benefit of microbiology and biotechnology research.
PMID: 26546514 [PubMed - indexed for MEDLINE]
Link Prediction on a Network of Co-occurring MeSH Terms: Towards Literature-based Discovery.
Methods Inf Med. 2016 Jul 20;55(4)
Authors: Kastrin A, Rindflesch TC, Hristovski D
Abstract
OBJECTIVES: Literature-based discovery (LBD) is a text mining methodology for automatically generating research hypotheses from existing knowledge. We mimic the process of LBD as a classification problem on a graph of MeSH terms. We employ unsupervised and supervised link prediction methods for predicting previously unknown connections between biomedical concepts.
METHODS: We evaluate the effectiveness of link prediction through a series of experiments using a MeSH network that contains the history of link formation between biomedical concepts. We performed link prediction using proximity measures, such as common neighbor (CN), Jaccard coefficient (JC), Adamic/Adar index (AA) and preferential attachment (PA). Our approach relies on the assumption that similar nodes are more likely to establish a link in the future.
RESULTS: With the unsupervised approach, the AA measure achieved the best performance in terms of area under the ROC curve (AUC = 0.76), followed by CN, JC, and PA. In the supervised approach, we evaluated whether the proximity measures can be combined into a model of link formation across all four predictors. We applied various classifiers, including decision trees, k-nearest neighbors, logistic regression, multilayer perceptron, naïve Bayes, and random forests. The random forest classifier achieved the best performance (AUC = 0.87).
CONCLUSIONS: The link prediction approach proved to be effective for LBD processing. Supervised statistical learning approaches clearly outperform an unsupervised approach to link prediction.
PMID: 27435341 [PubMed - as supplied by publisher]
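The four unsupervised proximity measures named above are each a small set computation over node neighborhoods. A sketch on a toy undirected graph follows; the node labels are illustrative MeSH-like terms, not the actual network from the study:

```python
# Sketch of the four proximity measures (CN, JC, AA, PA) on a toy
# undirected graph stored as an adjacency dict of neighbor sets.
import math

graph = {
    "Migraine":   {"Serotonin", "Magnesium"},
    "Serotonin":  {"Migraine", "Depression", "Magnesium"},
    "Magnesium":  {"Migraine", "Serotonin", "Depression"},
    "Depression": {"Serotonin", "Magnesium"},
}

def common_neighbors(g, u, v):           # CN: shared-neighbor count
    return len(g[u] & g[v])

def jaccard(g, u, v):                    # JC: overlap / union
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g, u, v):                # AA: down-weight high-degree neighbors
    return sum(1.0 / math.log(len(g[z]))
               for z in g[u] & g[v] if len(g[z]) > 1)

def preferential_attachment(g, u, v):    # PA: product of degrees
    return len(g[u]) * len(g[v])

cn = common_neighbors(graph, "Migraine", "Depression")  # 2 shared neighbors
```

Ranking all currently unlinked pairs by one of these scores gives the unsupervised predictor; stacking all four scores as features for a classifier gives the supervised variant described in the abstract.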
Comprehensive Map of Molecules Implicated in Obesity.
PLoS One. 2016;11(2):e0146759
Authors: Jagannadham J, Jaiswal HK, Agrawal S, Rawal K
Abstract
Obesity is a global epidemic affecting over 1.5 billion people and is one of the risk factors for several diseases such as type 2 diabetes mellitus and hypertension. We have constructed a comprehensive map of the molecules reported to be implicated in obesity. A deep curation strategy was complemented by a novel semi-automated text mining system in order to screen 1,000 full-length research articles and over 90,000 abstracts that are relevant to obesity. We obtain a scale-free network of 804 nodes and 971 edges, composed of 510 proteins, 115 genes, 62 complexes, 23 RNA molecules, 83 simple molecules, 3 phenotypes and 3 drugs, in a "bow-tie" architecture. We classify this network into 5 modules and identify new links between the recently discovered fat mass and obesity-associated gene (FTO) and well-studied examples such as insulin and leptin. We further built an automated docking pipeline to dock orlistat as well as other drugs against the 24,000 proteins in the human structural proteome to explain therapeutics and side effects at a network level. Based upon our experiments, we propose that the therapeutic effect comes through the binding of one drug to several molecules in the target network, and that this binding propensity is both statistically significant and different in comparison with any other part of the human structural proteome.
PMID: 26886906 [PubMed - indexed for MEDLINE]
Detecting themes of public concern: a text mining analysis of the Centers for Disease Control and Prevention's Ebola live Twitter chat.
Am J Infect Control. 2015 Oct 1;43(10):1109-11
Authors: Lazard AJ, Scheinfeld E, Bernhardt JM, Wilcox GB, Suran M
Abstract
A diagnosis of Ebola on US soil triggered widespread panic. In response, the Centers for Disease Control and Prevention held a live Twitter chat to address public concerns. This study applied a textual analytics method to reveal insights from these tweets that can inform communication strategies. User-generated tweets were collected, sorted, and analyzed to reveal major themes. The public was concerned with symptoms and lifespan of the virus, disease transfer and contraction, safe travel, and protection of one's body.
PMID: 26138998 [PubMed - indexed for MEDLINE]
Machine learning classification of surgical pathology reports and chunk recognition for information extraction noise reduction.
Artif Intell Med. 2016 Jun;70:77-83
Authors: Napolitano G, Marshall A, Hamilton P, Gavin AT
Abstract
BACKGROUND AND AIMS: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
MATERIALS AND METHODS: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: 'semi-structured' and 'unstructured'. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
RESULTS: The best result of 99.4% accuracy - which included only one semi-structured report predicted as unstructured - was produced by the layout classifier with the k-nearest neighbours algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports the performance ranged from 0.97 to 0.94 in precision and from 0.92 to 0.83 in recall, while for unstructured reports performance ranged from 0.91 to 0.64 in precision and from 0.68 to 0.41 in recall. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured ones.
CONCLUSIONS: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
PMID: 27431038 [PubMed - in process]
Ebola virus disease and social media: A systematic review.
Am J Infect Control. 2016 Jul 14;
Authors: Fung IC, Duke CH, Finch KC, Snook KR, Tseng PL, Hernandez AC, Gambhir M, Fu KW, Tse ZT
Abstract
OBJECTIVES: We systematically reviewed existing research pertinent to Ebola virus disease and social media, especially to identify the research questions and the methods used to collect and analyze social media.
METHODS: We searched 6 databases for research articles pertinent to Ebola virus disease and social media. We extracted the data using a standardized form. We evaluated the quality of the included articles.
RESULTS: Twelve articles were included in the main analysis: 7 from Twitter (with 1 also including Weibo), 1 from Facebook, 3 from YouTube, and 1 from Instagram and Flickr. All the studies were cross-sectional. Eleven of the 12 articles studied ≥1 of these 3 elements of social media and their relationships: themes or topics of social media content, metadata of social media posts (such as frequency of original posts and reposts, and impressions) and characteristics of the social media accounts that made these posts (such as whether they are individuals or institutions). One article studied how news videos influenced Twitter traffic. Twitter content analysis methods included text mining (n = 3) and manual coding (n = 1). Two studies involved mathematical modeling. All 3 YouTube studies and the Instagram/Flickr study used manual coding of videos and images, respectively.
CONCLUSIONS: Published Ebola virus disease-related social media research focused on Twitter and YouTube. Further assessment of the utility of social media research for public health practitioners is warranted.
PMID: 27425009 [PubMed - as supplied by publisher]
Overview of major molecular alterations during progression from Barrett's esophagus to esophageal adenocarcinoma.
Ann N Y Acad Sci. 2016 Jul 14;
Authors: Kalatskaya I
Abstract
Esophageal adenocarcinoma (EAC) develops in the sequential transformation of normal epithelium into metaplastic epithelium, called Barrett's esophagus (BE), then to dysplasia, and finally cancer. BE is a common condition in which normal stratified squamous epithelium of the esophagus is replaced with an intestine-like columnar epithelium, and it is the most prominent risk factor for EAC. This review aims to impartially systemize the knowledge from a large number of publications that describe the molecular and biochemical alterations occurring over this progression sequence. In order to provide an unbiased extraction of the knowledge from the literature, a text-mining methodology was used to select genes that are involved in the BE progression, with the top candidate genes found to be TP53, CDKN2A, CTNNB1, CDH1, GPX3, and NOX5. In addition, sample frequencies across analyzed patient cohorts at each stage of disease progression are summarized. All six genes are altered in the majority of EAC patients, and accumulation of alterations correlates well with the sequential progression of BE to cancer, indicating that the text-mining method is a valid approach for gene prioritization. This review discusses how, besides being cancer drivers, these genes are functionally interconnected and might collectively be considered a central hub of BE progression.
PMID: 27415609 [PubMed - as supplied by publisher]
Muscle Logic: New Knowledge Resource for Anatomy Enables Comprehensive Searches of the Literature on the Feeding Muscles of Mammals.
PLoS One. 2016;11(2):e0149102
Authors: Druzinsky RE, Balhoff JP, Crompton AW, Done J, German RZ, Haendel MA, Herrel A, Herring SW, Lapp H, Mabee PM, Muller HM, Mungall CJ, Sternberg PW, Van Auken K, Vinyard CJ, Williams SH, Wall CE
Abstract
BACKGROUND: In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required.
DEVELOPMENT AND TESTING OF THE ONTOLOGY: Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable, definition of each muscle, which includes its attachments and innervation, is true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO is developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly-available, online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions). We compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar.
RESULTS AND SIGNIFICANCE: Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, but PubMed and Google Scholar are not. Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.
PMID: 26870952 [PubMed - indexed for MEDLINE]
dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins.
Nucleic Acids Res. 2016 Jan 4;44(D1):D435-46
Authors: Huang KY, Su MG, Kao HJ, Hsieh YC, Jhong JH, Cheng KH, Huang HD, Lee TY
Abstract
Owing to the importance of the post-translational modifications (PTMs) of proteins in regulating biological processes, the dbPTM (http://dbPTM.mbc.nctu.edu.tw/) was developed as a comprehensive database of experimentally verified PTMs from several databases with annotations of potential PTMs for all UniProtKB protein entries. For this 10th anniversary of dbPTM, the updated resource provides not only a comprehensive dataset of experimentally verified PTMs, supported by the literature, but also an integrative interface for accessing all available databases and tools that are associated with PTM analysis. As well as collecting experimental PTM data from 14 public databases, this update manually curates over 12,000 modified peptides, including the emerging S-nitrosylation, S-glutathionylation and succinylation, from approximately 500 research articles, which were retrieved by text mining. As the number of available PTM prediction methods increases, this work compiles a non-homologous benchmark dataset to evaluate the predictive power of online PTM prediction tools. An increasing interest in the structural investigation of PTM substrate sites motivated the mapping of all experimental PTM peptides to protein entries of Protein Data Bank (PDB) based on database identifier and sequence identity, which enables users to examine spatially neighboring amino acids, solvent-accessible surface area and side-chain orientations for PTM substrate sites on tertiary structures. Since drug binding in PDB is annotated, this update identified over 1,100 PTM sites that are associated with drug binding. The update also integrates metabolic pathways and protein-protein interactions to support the PTM network analysis for a group of proteins. Finally, the web interface is redesigned and enhanced to facilitate access to this resource.
PMID: 26578568 [PubMed - indexed for MEDLINE]
BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language.
Database (Oxford). 2016;2016
Authors: Rinaldi F, Ellendorff TR, Madan S, Clematide S, van der Lek A, Mevissen T, Fluck J
Abstract
Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text.
PMID: 27402677 [PubMed - in process]
Extracting Information from Electronic Medical Records to Identify the Obesity Status of a Patient Based on Comorbidities and Bodyweight Measures.
J Med Syst. 2016 Aug;40(8):191
Authors: Figueroa RL, Flores CA
Abstract
Obesity is a chronic disease with an increasing impact on the world's population. In this work, we present a method of identifying obesity automatically using text mining techniques and information related to body weight measures and obesity comorbidities. We used a dataset of 3015 de-identified medical records that contain labels for two classification problems. The first classification problem distinguishes between obesity, overweight, normal weight, and underweight. The second classification problem differentiates between obesity types: super obesity, morbid obesity, severe obesity and moderate obesity. We used a Bag of Words approach to represent the records together with unigram and bigram representations of the features. We implemented two approaches: a hierarchical method and a nonhierarchical one. We used Support Vector Machine and Naïve Bayes together with ten-fold cross validation to evaluate and compare performances. Our results indicate that the hierarchical approach does not work as well as the nonhierarchical one. In general, our results show that Support Vector Machine obtains better performances than Naïve Bayes for both classification problems. We also observed that bigram representation improves performance compared with unigram representation.
PMID: 27402260 [PubMed - in process]
Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.
Springerplus. 2016;5(1):942
Authors: Agnihotri D, Verma K, Tripathi P
Abstract
Contiguous sequences of terms (N-grams) in documents are often distributed symmetrically among different classes. This symmetrical distribution raises uncertainty about which class an N-gram belongs to. In this paper, we focus on selecting the most discriminating N-grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method, named symmetrical strength of the N-grams (SSNG), is proposed using a two pass filtering based feature selection (TPF) approach. In the first pass of TPF, the SSNG method chooses various informative N-grams from all N-grams extracted from the corpus. In the second pass, the well-known Chi Square (χ²) method is used to select the few most informative of these N-grams. To classify the documents, two standard classifiers, Multinomial Naive Bayes and Linear Support Vector Machine, were applied to ten standard text data sets. For most of the datasets, the experimental results show that the performance and success rate of the SSNG method with the TPF approach are superior to state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ².
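The second-pass χ² filter can be illustrated directly. The sketch below scores each N-gram's association with one class via a 2×2 contingency table; the toy corpus is hypothetical, and the first-pass SSNG score itself is not reproduced here.

```python
def chi_square(ngram, docs, labels, positive):
    """One-degree-of-freedom chi-square between an N-gram's presence
    and membership in the `positive` class (2x2 contingency table)."""
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        present = ngram in doc
        if y == positive:
            a += present; c += not present
        else:
            b += present; d += not present
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Hypothetical toy corpus: documents pre-tokenised into unigram/bigram sets.
docs = [{"kick", "goal", "kick goal"}, {"goal", "match"},
        {"vote", "poll", "vote poll"}, {"poll", "election"}]
labels = ["sport", "sport", "politics", "politics"]

scores = {g: chi_square(g, docs, labels, "sport")
          for g in set().union(*docs)}
ranked = sorted(scores, key=scores.get, reverse=True)
```

N-grams that appear in every document of one class and none of the other (here "goal") score highest, while those split across classes score near zero, which is the uncertainty the SSNG first pass is designed to reduce.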
PMID: 27386386 [PubMed - as supplied by publisher]
Text Mining the History of Medicine.
PLoS One. 2016;11(1):e0144717
Authors: Thompson P, Batista-Navarro RT, Kontonatsios G, Carter J, Toon E, McNaught J, Timmermann C, Worboys M, Ananiadou S
Abstract
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly accessible thanks to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data efficiently. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms for user-entered query terms, exploration of different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and changes in vocabulary, terminology, language structure and style compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and the relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for a publicly accessible, semantically oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
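At its simplest, the synonym-suggestion functionality described above reduces to expanding query terms against a variant lexicon. The entries below are illustrative only, not drawn from the article's resources.

```python
# Hypothetical variant lexicon of the kind built from historical sources.
SYNONYMS = {
    "consumption": {"phthisis", "tuberculosis"},
    "dropsy": {"oedema", "edema"},
}

def expand_query(terms):
    """Expand each query term with its known historical variants."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded.update(SYNONYMS.get(term.lower(), ()))
    return expanded

q = expand_query(["consumption", "fever"])
```

A search system would then match documents against the expanded set, so a query for "consumption" also retrieves texts that use "phthisis".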
PMID: 26734936 [PubMed - indexed for MEDLINE]
"Initial investigation into computer scoring of candidate essays for personnel selection": Correction to Campion et al. (2016).
J Appl Psychol. 2016 Jul;101(7):975
Authors:
Abstract
Reports an error in "Initial Investigation Into Computer Scoring of Candidate Essays for Personnel Selection" by Michael C. Campion, Michael A. Campion, Emily D. Campion and Matthew H. Reider (Journal of Applied Psychology, Advanced Online Publication, Apr 14, 2016, np). In the article the affiliations for Emily D. Campion and Matthew H. Reider were originally incorrect. All versions of this article have been corrected. (The following abstract of the original article appeared in record 2016-18130-001.) Emerging advancements including the exponentially growing availability of computer-collected data and increasingly sophisticated statistical software have led to a "Big Data Movement" wherein organizations have begun attempting to use large-scale data analysis to improve their effectiveness. Yet, little is known regarding how organizations can leverage these advancements to develop more effective personnel selection procedures, especially when the data are unstructured (text-based). Drawing on literature on natural language processing, we critically examine the possibility of leveraging advances in text mining and predictive modeling computer software programs as a surrogate for human raters in a selection context. We explain how to "train" a computer program to emulate a human rater when scoring accomplishment records. We then examine the reliability of the computer's scores, provide preliminary evidence of their construct validity, demonstrate that this practice does not produce scores that disadvantage minority groups, illustrate the positive financial impact of adopting this practice in an organization (N ∼ 46,000 candidates), and discuss implementation issues. Finally, we discuss the potential implications of using computer scoring to address the adverse impact-validity dilemma. We suggest that it may provide a cost-effective means of using predictors that have comparable validity but have previously been too expensive for large-scale screening. 
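The idea of training a program to emulate human raters can be reduced to its simplest form: fit a model to human-scored examples, then apply it to new text. The sketch below uses ordinary least squares on a single hypothetical feature (word count); the article's actual approach relies on far richer text-mining features.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical accomplishment-record snippets with human ratings.
train = [("too brief", 1.0),
         ("mentions a single accomplishment", 2.0),
         ("describes two accomplishments with supporting detail", 3.0)]
slope, intercept = fit_line([len(t.split()) for t, _ in train],
                            [s for _, s in train])

def computer_score(essay):
    """Score new text with the model fitted to human ratings."""
    return slope * len(essay.split()) + intercept

pred = computer_score("a detailed record of several accomplishments and outcomes")
```

Reliability and construct validity, as the article discusses, would then be assessed by comparing such computed scores against held-out human ratings.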
PMID: 27379396 [PubMed - as supplied by publisher]