Drug-induced Adverse Events

Data Mining of Web-Based Documents on Social Networking Sites That Included Suicide-Related Words Among Korean Adolescents.
J Adolesc Health. 2016 Sep 29;:
Authors: Song J, Song TM, Seo DC, Jin JH
Abstract
PURPOSE: To investigate online search activity of suicide-related words in South Korean adolescents through data mining of social media Web sites as the suicide rate in South Korea is one of the highest in the world.
METHODS: Out of more than 2.35 billion posts for 2 years from January 1, 2011 to December 31, 2012 on 163 social media Web sites in South Korea, 99,693 suicide-related documents were retrieved by Crawler and analyzed using text mining and opinion mining. These data were further combined with monthly employment rate, monthly rental prices index, monthly youth suicide rate, and monthly number of reported bully victims to fit multilevel models as well as structural equation models.
RESULTS: The link from grade pressure to suicide risk showed the largest standardized path coefficient (beta = .357, p < .001) in structural models and a significant random effect (p < .01) in multilevel models. Depression was a partial mediator between suicide risk and grade pressure, low body image, victims of bullying, and concerns about disease. The largest total effect was observed in the grade pressure to depression to suicide risk. The multilevel models indicate about 27% of the variance in the daily suicide-related word search activity is explained by month-to-month variations. A lower employment rate, a higher rental prices index, and more bullying were associated with an increased suicide-related word search activity.
CONCLUSIONS: Academic pressure appears to be the biggest contributor to Korean adolescents' suicide risk. A real-time system for monitoring and responding to suicide-related word search activity needs to be developed.
PMID: 27693129 [PubMed - as supplied by publisher]
Sentiment prediction by text mining medical documents using optimized swarm search-based feature selection.
Comput Med Imaging Graph. 2016 Aug 5;:
Authors: Zeng D, Peng J, Fong S, Qiu Y, Wong R, Mon YJ
Abstract
Sentiment prediction has emerged as an important machine learning topic for gaining insights from unstructured text, and has recently gained popularity in the health-care industry. Text mining has long been a fundamental data analytic technique for sentiment prediction. A popular pre-processing step in text mining is transforming text strings to word vectors, which form a high-dimensional sparse matrix. This sparse matrix poses computational challenges to the induction of an accurate sentiment prediction model. Feature selection has been a popular dimensionality reduction technique that finds a subset of features from all the original features of the sparse matrix, in order to enhance the accuracy of the prediction model. In this paper, a new feature selection method called Optimized Swarm Search-based Feature Selection (OSS-FS) is applied. OSS-FS is a swarm-type searching function that selects an ideal subset of features for enhanced classification accuracy. The swarm search in OSS-FS is optimized by a simple feature evaluation technique called Clustering-by-Coefficient-of-Variation (CCV). The proposed scheme is applied and verified via a case scenario involving 279 medical articles related to 'meaningful use functionalities on health care quality, safety, and efficiency', drawn from a systematic review of the health IT literature from January 2010 to August 2013. Multiple sentiment classes (positive, mixed-positive, neutral, and negative) have to be recognized from the document contents by computer using text mining. The results show the superiority of OSS-FS over traditional feature selection methods. The proposed sentiment prediction model will be useful for estimating readers' sentiments toward medical literature. Authors may gauge the potential sentiments of their articles before they are published.
PMID: 27693005 [PubMed - as supplied by publisher]
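The pre-processing step described in this abstract (text strings turned into sparse word vectors) and the coefficient of variation that CCV is named after can be illustrated with a minimal sketch. This is not OSS-FS or CCV itself: the swarm search and the clustering step are not reproduced, and the function names and sample documents below are made up for illustration.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Turn documents into sparse word-count vectors over a shared vocabulary."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*vectors))
    return vocab, vectors

def coefficient_of_variation(vocab, vectors):
    """Population standard deviation divided by the mean, per vocabulary term."""
    n = len(vectors)
    scores = {}
    for term in vocab:
        counts = [vec.get(term, 0) for vec in vectors]
        mean = sum(counts) / n  # > 0 because every vocabulary term occurs at least once
        variance = sum((c - mean) ** 2 for c in counts) / n
        scores[term] = math.sqrt(variance) / mean
    return scores

# Made-up documents, not taken from the 279 articles in the study.
docs = [
    "system improved safety",
    "system reduced safety",
    "system improved efficiency",
]
vocab, vectors = bag_of_words(docs)
cv = coefficient_of_variation(vocab, vectors)
```

A term that occurs equally often in every document, such as "system" here, has a coefficient of variation of zero and carries no information for telling the documents apart, which is the intuition behind using the statistic to evaluate features.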
SparkText: Biomedical Text Mining on Big Data Framework.
PLoS One. 2016;11(9):e0162721
Authors: Ye Z, Tafti AP, He KY, Wang K, He MM
Abstract
BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.
RESULTS: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.
CONCLUSIONS: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.
PMID: 27685652 [PubMed - as supplied by publisher]
Life priorities in the HIV-positive Asians: a text-mining analysis in young vs. old generation.
AIDS Care. 2016 Aug 12;:1-4
Authors: Chen WT, Barbour R
Abstract
HIV/AIDS is one of the most urgent and challenging public health issues, especially since it is now considered a chronic disease. In this project, we used text mining techniques to extract meaningful words and word patterns from 45 transcribed in-depth interviews of people living with HIV/AIDS (PLWHA) conducted in Taipei, Beijing, Shanghai, and San Francisco from 2006 to 2013. Text mining analysis can predict whether an emerging field will become a long-lasting source of academic interest or whether it is simply a passing source of interest that will soon disappear. The data were analyzed by age group (45 and older vs. 44 and younger). The highest ranking fragments in order of frequency were: "care", "daughter", "disease", "family", "HIV", "hospital", "husband", "medicines", "money", "people", "son", "tell/disclosure", "thought", "want", and "years". Participants in the 44-year-old and younger group were focused mainly on disease disclosure, their families, and their financial condition. In older PLWHA, social support was one of the main concerns. In this study, we learned that different age groups perceive the disease differently. Therefore, when designing interventions, researchers should consider tailoring them to a specific population and helping PLWHA achieve a better quality of life. Promoting self-management can be an effective strategy for every encounter with HIV-positive individuals.
PMID: 27684610 [PubMed - as supplied by publisher]
The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.
J Trauma Stress. 2015 Dec;28(6):505-14
Authors: Hammond KW, Ben-Ari AY, Laundry RJ, Boyko EJ, Samore MH
Abstract
Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research.
PMID: 26579624 [PubMed - indexed for MEDLINE]
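The odds ratios in this abstract are reported per unit of ACE score. As a worked illustration of what "per unit" means in a logistic regression, the sketch below compounds a per-unit odds ratio over several units. It assumes the log-odds are linear in the score across the whole range, which the abstract does not state, so the resulting figures are arithmetic illustrations and not results from the study. The function name is hypothetical.

```python
def odds_ratio_for_increase(or_per_unit, units):
    """Odds ratio implied by a `units`-point increase when log-odds are linear in the score."""
    return or_per_unit ** units

# Per-unit odds ratios quoted in the abstract for suicide attempts and serious mental disorders.
low, high = 1.84, 1.97

# Implied odds ratios for a three-point difference in ACE score (assumption: linearity holds).
three_point_range = (
    odds_ratio_for_increase(low, 3),
    odds_ratio_for_increase(high, 3),
)
```

Under that linearity assumption, a three-point difference in ACE score corresponds to an odds ratio of roughly 6.2 to 7.6.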
Automatic semantic classification of scientific literature according to the hallmarks of cancer.
Bioinformatics. 2016 Feb 1;32(3):432-40
Authors: Baker S, Silins I, Guo Y, Ali I, Högberg J, Stenius U, Korhonen A
Abstract
MOTIVATION: The hallmarks of cancer have become highly influential in cancer research. They reduce the complexity of cancer into 10 principles (e.g. resisting cell death and sustaining proliferative signaling) that explain the biological capabilities acquired during the development of human tumors. Since new research depends crucially on existing knowledge, technology for semantic classification of scientific literature according to the hallmarks of cancer could greatly support literature review, knowledge discovery and applications in cancer research.
RESULTS: We present the first step toward the development of such technology. We introduce a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer. We use this corpus to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining. We report good performance in both intrinsic and extrinsic evaluations, demonstrating both the accuracy of the methodology and its potential in supporting practical cancer research. We discuss how this approach could be developed and applied further in the future.
AVAILABILITY AND IMPLEMENTATION: The corpus of hallmark-annotated PubMed abstracts and the software for classification are available at: http://www.cl.cam.ac.uk/~sb895/HoC.html.
CONTACT: simon.baker@cl.cam.ac.uk.
PMID: 26454282 [PubMed - indexed for MEDLINE]
Expansion of medical vocabularies using distributional semantics on Japanese patient blogs.
J Biomed Semantics. 2016;7(1):58
Authors: Ahltorp M, Skeppstedt M, Kitajima S, Henriksson A, Rzepka R, Araki K
Abstract
BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in other languages than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthographics, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.
METHODS: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.
RESULTS: Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding.
CONCLUSIONS: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
PMID: 27671202 [PubMed - as supplied by publisher]
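Random indexing with a 1+1 context window, as named in the METHODS of this abstract, can be sketched in a few lines. This is a generic illustration of the technique, not the authors' implementation: the dimensionality, the number of non-zero entries, the seed, and the English sample sentences are all assumptions, tokens are taken to be pre-segmented (Japanese text would need a morphological analyzer first), and the clustering step is omitted.

```python
import math
import random
from collections import defaultdict

def index_vector(rng, dim, nonzero):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def random_indexing(sentences, dim=64, nonzero=4, window=1, seed=0):
    """Build context vectors by summing the index vectors of neighbouring tokens."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(rng, dim, nonzero)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tok] = [a + b for a, b in zip(context[tok], index[tokens[j]])]
    return dict(context)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up, pre-tokenized sentences for illustration only.
sentences = [
    ["took", "aspirin", "daily"],
    ["took", "ibuprofen", "daily"],
    ["left", "knee", "hurts"],
]
vectors = random_indexing(sentences)
drug_similarity = cosine(vectors["aspirin"], vectors["ibuprofen"])
cross_similarity = cosine(vectors["aspirin"], vectors["knee"])
```

Because "aspirin" and "ibuprofen" occur between the same neighbours, their context vectors coincide, while "knee" accumulates different index vectors. Terms whose context vectors lie close to the centroid of a seed-term cluster would be the candidates for vocabulary expansion.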
Using Text Analytics of AJPE Article Titles to Reveal Trends In Pharmacy Education Over the Past Two Decades.
Am J Pharm Educ. 2016 Aug 25;80(6):104
Authors: Pedrami F, Asenso P, Devi S
Abstract
Objective. To identify trends in pharmacy education during the last two decades using text mining. Methods. Articles published in the American Journal of Pharmaceutical Education (AJPE) in the past two decades were compiled in a database. Custom text analytics software was written using the Visual Basic programming language in the Visual Basic for Applications (VBA) editor of Excel 2007. Frequency of words appearing in article titles was calculated using the custom VBA software. Data were analyzed to identify the emerging trends in pharmacy education. Results. Three educational trends emerged: active learning, interprofessional, and cultural competency. Conclusion. The text analytics program successfully identified trends in article topics and may be a useful compass to predict the future course of pharmacy education.
PMID: 27667841 [PubMed - in process]
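The word-frequency calculation described in this abstract can be sketched as follows. The study used custom VBA software in Excel 2007; this sketch uses Python instead, and the stopword list and titles are invented for illustration and are not AJPE data.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; the study's own list is not given).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def title_word_frequencies(titles):
    """Count how often each non-stopword appears across a list of article titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts

# Made-up titles for illustration only.
titles = [
    "Active Learning in a Pharmacy Course",
    "Interprofessional Education and Active Learning",
    "Assessing Cultural Competency in Pharmacy Students",
]
freq = title_word_frequencies(titles)
top_terms = freq.most_common(3)
```

Grouping the titles by publication year before counting would give the per-period frequencies needed to see a term rise or fall over time.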
The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation.
J Biomed Semantics. 2016;7(1):57
Authors: Buttigieg PL, Pafilis E, Lewis SE, Schildhauer MP, Walls RL, Mungall CJ
Abstract
BACKGROUND: The Environment Ontology (ENVO; http://www.environmentontology.org/ ), first described in 2013, is a resource and research target for the semantically controlled description of environmental entities. The ontology's initial aim was the representation of the biomes, environmental features, and environmental materials pertinent to genomic and microbiome-related investigations. However, the need for environmental semantics is common to a multitude of fields, and ENVO's use has steadily grown since its initial description. We have thus expanded, enhanced, and generalised the ontology to support its increasingly diverse applications.
METHODS: We have updated our development suite to promote expressivity, consistency, and speed: we now develop ENVO in the Web Ontology Language (OWL) and employ templating methods to accelerate class creation. We have also taken steps to better align ENVO with the Open Biological and Biomedical Ontologies (OBO) Foundry principles and interoperate with existing OBO ontologies. Further, we applied text-mining approaches to extract habitat information from the Encyclopedia of Life and automatically create experimental habitat classes within ENVO.
RESULTS: Relative to its state in 2013, ENVO's content, scope, and implementation have been enhanced and much of its existing content revised for improved semantic representation. ENVO now offers representations of habitats, environmental processes, anthropogenic environments, and entities relevant to environmental health initiatives and the global Sustainable Development Agenda for 2030. Several branches of ENVO have been used to incubate and seed new ontologies in previously unrepresented domains such as food and agronomy. The current release version of the ontology, in OWL format, is available at http://purl.obolibrary.org/obo/envo.owl .
CONCLUSIONS: ENVO has been shaped into an ontology which bridges multiple domains including biomedicine, natural and anthropogenic ecology, 'omics, and socioeconomic development. Through continued interactions with our users and partners, particularly those performing data archiving and synthesis, we anticipate that ENVO's growth will accelerate in 2017. As always, we invite further contributions and collaboration to advance the semantic representation of the environment, ranging from geographic features and environmental materials, across habitats and ecosystems, to everyday objects in household settings.
PMID: 27664130 [PubMed - as supplied by publisher]
Identifying the Uncertainty in Physician Practice Location through Spatial Analytics and Text Mining.
Int J Environ Res Public Health. 2016;13(9)
Authors: Shi X, Xue B, Xierali IM
Abstract
In response to the widespread concern about the adequacy, distribution, and disparity of access to a health care workforce, the correct identification of physicians' practice locations is critical to access public health services. In prior literature, little effort has been made to detect and resolve the uncertainty about whether the address provided by a physician in the survey is a practice address or a home address. This paper introduces how to identify the uncertainty in a physician's practice location through spatial analytics, text mining, and visual examination. While land use and zoning code, embedded within the parcel datasets, help to differentiate resident areas from other types, spatial analytics may have certain limitations in matching and comparing physician and parcel datasets with different uncertainty issues, which may lead to unforeseen results. Handling and matching the string components between physicians' addresses and the addresses of the parcels could identify the spatial uncertainty and instability to derive a more reasonable relationship between different datasets. Visual analytics and examination further help to clarify the undetectable patterns. This research will have a broader impact over federal and state initiatives and policies to address both insufficiency and maldistribution of a health care workforce to improve the accessibility to public health services.
PMID: 27657100 [PubMed - as supplied by publisher]
Extracting kinetic information from literature with KineticRE.
J Integr Bioinform. 2015;12(4):282
Authors: Freitas AA, Costa H, Rocha M, Rocha I
Abstract
To better understand the dynamic behavior of metabolic networks in a wide variety of conditions, the field of Systems Biology has increased its interest in the use of kinetic models. The databases currently available do not contain enough data regarding this topic. Given that a significant part of the relevant information for the development of such models is still widely dispersed in the literature, it becomes essential to develop specific and powerful text mining tools to collect these data. In this context, the main objective of this work is the development of a text mining tool to extract, from scientific literature, kinetic parameters, their respective values and their relations with enzymes and metabolites. The proposed approach integrates the development of a novel plug-in over the text mining framework @Note2. Finally, the developed pipeline was validated with a case study on Kluyveromyces lactis, spanning the analysis and results of 20 full text documents.
PMID: 26673933 [PubMed - indexed for MEDLINE]
RetroMine, or how to provide in-depth retrospective studies from Medline in a glance: the hepcidin use-case.
J Integr Bioinform. 2015;12(3):275
Authors: Ameline de Cadeville B, Loréal O, Moussouni-Marzolf F
Abstract
The rapid expansion of biomedical literature has provoked an increased development of advanced text mining tools to rapidly extract relevant events from the continuously increasing amount of knowledge published periodically in PubMed. However, bioinvestigators are still reluctant to use these tools for two reasons: i) a large volume of events is often extracted upon a query, and this volume is hard to manage, and ii) background events dominate search results and overshadow more pertinent published information, especially for domain experts. In this paper, we propose an approach that incorporates the temporal dimension of published events into the process of information extraction to improve data selection and prioritize more pertinent periodically published knowledge for scientists. Indeed, instead of providing the total knowledge associated with a PubMed query, which is usually a mix of trivial background information and non-background information, we propose a method that incorporates time and selects non-background and highly relevant biological entities and events published over time for bioinvestigators. Before excluding background events from the total knowledge extracted, a quantification of their amount is also provided. This work is illustrated by a case study regarding Hepcidin gene publications over a decade, a duration that is sufficiently long to generate alternative views on the overall data extracted.
PMID: 26673791 [PubMed - indexed for MEDLINE]
A corpus for plant-chemical relationships in the biomedical domain.
BMC Bioinformatics. 2016;17(1):386
Authors: Choi W, Kim B, Cho H, Lee D, Lee H
Abstract
BACKGROUND: Plants are natural products that humans consume in various ways including food and medicine. They have a long empirical history of treating diseases with relatively few side effects. Based on these strengths, many studies have been performed to verify the effectiveness of plants in treating diseases. It is crucial to understand the chemicals contained in plants because these chemicals can regulate activities of proteins that are key factors in causing diseases. With the accumulation of a large volume of biomedical literature in various databases such as PubMed, it is possible to automatically extract relationships between plants and chemicals in a large-scale way if we apply a text mining approach. A cornerstone of achieving this task is a corpus of relationships between plants and chemicals.
RESULTS: In this study, we first constructed a corpus for plant and chemical entities and for the relationships between them. The corpus contains 267 plant entities, 475 chemical entities, and 1,007 plant-chemical relationships (550 and 457 positive and negative relationships, respectively), which are drawn from 377 sentences in 245 PubMed abstracts. Inter-annotator agreement scores for the corpus among three annotators were measured. The simple percent agreement scores for entities and trigger words for the relationships were 99.6 and 94.8 %, respectively, and the overall kappa score for the classification of positive and negative relationships was 79.8 %. We also developed a rule-based model to automatically extract such plant-chemical relationships. When we evaluated the rule-based model using the corpus and randomly selected biomedical articles, overall F-scores of 68.0 and 61.8 % were achieved, respectively.
CONCLUSION: We expect that the corpus for plant-chemical relationships will be a useful resource for enhancing plant research. The corpus is available at http://combio.gist.ac.kr/plantchemicalcorpus .
PMID: 27650402 [PubMed - as supplied by publisher]
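The two agreement measures quoted in this abstract, simple percent agreement and kappa, differ in that kappa discounts the agreement expected by chance. The sketch below computes both for two annotators using Cohen's kappa. That choice is an assumption: the study had three annotators and the abstract does not say which kappa variant was used. The labels are invented.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Agreement corrected for the chance agreement implied by each annotator's label rates."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Made-up positive/negative relationship labels from two hypothetical annotators.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items, but since each labels half the items positive, half of that agreement is expected by chance, and kappa comes out at 0.6. This is why a kappa score is typically lower than the corresponding percent agreement.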
A hybrid model for automatic identification of risk factors for heart disease.
J Biomed Inform. 2015 Dec;58 Suppl:S171-82
Authors: Yang H, Garibaldi JM
Abstract
Coronary artery disease (CAD) is the leading cause of death in both the UK and worldwide. The detection of related risk factors and tracking their progress over time is of great importance for early prevention and treatment of CAD. This paper describes an information extraction system that was developed to automatically identify risk factors for heart disease in medical records while the authors participated in the 2014 i2b2/UTHealth NLP Challenge. Our approaches rely on several natural language processing (NLP) techniques such as machine learning, rule-based methods, and dictionary-based keyword spotting to cope with complicated clinical contexts inherent in a wide variety of risk factors. Our system achieved encouraging performance on the challenge test data with an overall micro-averaged F-measure of 0.915, which was competitive with the best system (F-measure of 0.927) of this challenge task.
PMID: 26375492 [PubMed - indexed for MEDLINE]
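Several of the i2b2 entries in this list report a micro-averaged F-measure. A minimal sketch of what micro-averaging does, next to macro-averaging for contrast, is given below. The class names and counts are invented and are not taken from the challenge data.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)

def micro_f1(counts):
    """Pool the counts over all classes, then compute a single F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)

def macro_f1(counts):
    """Compute F1 per class, then take the unweighted mean."""
    scores = [f1(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    return sum(scores) / len(scores)

# Made-up per-class counts: one frequent class, one rare class.
counts = {
    "hypertension": {"tp": 90, "fp": 10, "fn": 10},
    "smoker": {"tp": 5, "fp": 5, "fn": 5},
}
micro = micro_f1(counts)
macro = macro_f1(counts)
```

With these counts the micro-average is about 0.86 and the macro-average is 0.70: pooling lets the frequent class dominate, so a micro-averaged score says little about performance on rare risk factors.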
Coronary artery disease risk assessment from unstructured electronic health records using text mining.
J Biomed Inform. 2015 Dec;58 Suppl:S203-10
Authors: Jonnagaddala J, Liaw ST, Ray P, Kumar M, Chang NW, Dai HJ
Abstract
Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD.
PMID: 26319542 [PubMed - indexed for MEDLINE]
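The rule-based extraction described in this abstract, and the missing-data problem it ran into, can be illustrated with a small sketch. The regular expressions, field names, and notes below are hypothetical and far simpler than what real clinical narratives require; they are not the rules used in the study, and age, gender, diabetes history and smoking history are left out.

```python
import re

# Hypothetical patterns for illustration; real clinical notes vary far more than this.
PATTERNS = {
    "total_cholesterol": re.compile(r"total cholesterol\D{0,15}(\d{2,3})", re.IGNORECASE),
    "hdl_c": re.compile(r"\bHDL(?:-C)?\D{0,15}(\d{2,3})", re.IGNORECASE),
    "systolic_bp": re.compile(r"\bBP\D{0,10}(\d{2,3})\s*/\s*\d{2,3}", re.IGNORECASE),
}

def extract_risk_factors(note):
    """Return each numeric factor as an int, or None when the note does not state it."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(note)
        found[name] = int(match.group(1)) if match else None
    return found

def missing_factors(found):
    """Names of the factors that would have to be imputed before scoring."""
    return sorted(name for name, value in found.items() if value is None)

# Made-up notes for illustration only.
note_a = "68 y/o male. BP 142/88. Total cholesterol: 213 mg/dL, HDL 41 mg/dL."
note_b = "Follow-up visit. BP 130/80. Lipid panel pending."

factors_a = extract_risk_factors(note_a)
factors_b = extract_risk_factors(note_b)
gaps_b = missing_factors(factors_b)
```

The second note yields a blood pressure but no lipid values, which is the situation the abstract describes: the extraction itself can be reliable while the inputs needed for a Framingham score are simply absent from the text.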
Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes.
J Biomed Inform. 2015 Dec;58 Suppl:S128-32
Authors: Khalifa A, Meystre S
Abstract
The 2014 i2b2 natural language processing shared task focused on identifying cardiovascular risk factors such as high blood pressure, high cholesterol levels, obesity and smoking status among other factors found in health records of diabetic patients. In addition, the task involved detecting medications and time information associated with the extracted data. This paper presents the development and evaluation of a natural language processing (NLP) application conceived for this i2b2 shared task. For increased efficiency, the application's main components were adapted from two existing NLP tools implemented in the Apache UIMA framework: Textractor (for dictionary-based lookup) and cTAKES (for preprocessing and smoking status detection). The application achieved a final (micro-averaged) F1-measure of 87.5% on the final evaluation test set. Our attempt was mostly based on existing tools adapted with minimal changes and achieved satisfactory performance with limited development effort.
PMID: 26318122 [PubMed - indexed for MEDLINE]
Mining heart disease risk factors in clinical text with named entity recognition and distributional semantic models.
J Biomed Inform. 2015 Dec;58 Suppl:S143-9
Authors: Urbain J
Abstract
We present the design and analyze the performance of a multi-stage natural language processing system employing named entity recognition, Bayesian statistics, and rule logic to identify and characterize heart disease risk factor events in diabetic patients over time. The system was originally developed for the 2014 i2b2 Challenges in Natural Language in Clinical Data. The system's strengths included a high level of accuracy for identifying named entities associated with heart disease risk factor events. The system's primary weakness was due to inaccuracies when characterizing the attributes of some events. Examples include determining the relative time of an event with respect to the record date, deciding whether an event is attributable to the patient's history or the patient's family history, and differentiating between current and prior smoking status. We believe these inaccuracies were due in large part to the lack of an effective approach for integrating context into our event detection model. To address these inaccuracies, we explore the addition of a distributional semantic model for characterizing contextual evidence of heart disease risk factor events. Using this semantic model, we raised our initial 2014 i2b2 Challenges in Natural Language in Clinical Data F1 score of 0.838 to 0.890 and increased precision by 10.3% without use of any lexicons that might bias our results.
PMID: 26305514 [PubMed - indexed for MEDLINE]
Automatic detection of protected health information from clinic narratives.
J Biomed Inform. 2015 Dec;58 Suppl:S30-8
Authors: Yang H, Garibaldi JM
Abstract
This paper presents a natural language processing (NLP) system that was designed to participate in the 2014 i2b2 de-identification challenge. The challenge task aims to identify and classify seven main Protected Health Information (PHI) categories and 25 associated sub-categories. A hybrid model was proposed which combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in PHI categories. Our proposed approaches exploit a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task-specific features and regular expression template patterns to characterize the semantics of various PHI categories. Our system achieved promising accuracy on the challenge test data with an overall micro-averaged F-measure of 93.6%, making it the winner of this de-identification challenge.
PMID: 26231070 [PubMed - indexed for MEDLINE]
Combining knowledge- and data-driven methods for de-identification of clinical narratives.
J Biomed Inform. 2015 Dec;58 Suppl:S53-9
Authors: Dehghan A, Kovacevic A, Karystianis G, Keane JA, Nenadic G
Abstract
A recent promise to access unstructured clinical data from electronic health records on a large scale has revitalized interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified with high confidence in the first pass, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, mentions of Organizations and Professions were particularly challenging. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus provide a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.
PMID: 26210359 [PubMed - indexed for MEDLINE]
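The two-pass idea described in this abstract (collect high-confidence PHI in a first pass, then reuse it as a patient-specific dictionary to catch mentions that lack cues) can be sketched as below. The first pass here is a single hypothetical regular expression standing in for the authors' combination of dictionaries, rules and machine learning, and the notes are invented.

```python
import re

# Hypothetical high-confidence cue: a title followed by a capitalized surname.
TITLE_NAME = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)")

def two_pass_names(notes):
    """Return a list of (start, end, text) name spans for each of one patient's notes."""
    # Pass 1: collect names that appear with an explicit cue.
    dictionary = set()
    for note in notes:
        dictionary.update(TITLE_NAME.findall(note))
    if not dictionary:
        return [[] for _ in notes]
    # Pass 2: tag every mention of a collected name, with or without a cue.
    alternatives = "|".join(re.escape(name) for name in sorted(dictionary))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [
        [(m.start(), m.end(), m.group()) for m in pattern.finditer(note)]
        for note in notes
    ]

# Made-up notes belonging to one hypothetical patient.
notes = [
    "Seen by Dr. Alvarez for follow-up.",
    "Alvarez advised the patient to return in two weeks.",
]
spans = two_pass_names(notes)
```

The bare "Alvarez" in the second note carries no title and would be missed by the first-pass pattern alone. It is found only because the first note contributed the name to the run-time dictionary, which is the benefit the abstract attributes to the second pass.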
Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.
J Biomed Inform. 2015 Dec;58 Suppl:S120-7
Authors: Cormack J, Nath C, Milward D, Raja K, Jonnalagadda SR
Abstract
This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system.
PMID: 26209007 [PubMed - indexed for MEDLINE]