Literature Watch
SparkText: Biomedical Text Mining on Big Data Framework.
PLoS One. 2016;11(9):e0162721
Authors: Ye Z, Tafti AP, He KY, Wang K, He MM
Abstract
BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.
RESULTS: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.
CONCLUSIONS: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.
PMID: 27685652 [PubMed - as supplied by publisher]
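The classification step described in the SparkText abstract (predicting a cancer type from article text) can be illustrated with a very small multinomial Naïve Bayes classifier. This is only a sketch under stated assumptions: it is plain Python rather than Apache Spark, it is not the authors' implementation, and the toy documents, labels, and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count documents per class and word occurrences per class.

    docs: list of (label, text) pairs. Tokenization is a plain
    whitespace split, which is an assumption made for brevity.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus; real input would be article text from PubMed.
TOY_DOCS = [
    ("breast", "mammography screening detected breast tumor"),
    ("breast", "brca1 mutation breast carcinoma"),
    ("lung", "smoking history lung nodule"),
    ("lung", "lung adenocarcinoma egfr mutation"),
    ("prostate", "psa level prostate biopsy"),
    ("prostate", "androgen deprivation prostate tumor"),
]
model = train_nb(TOY_DOCS)
print(predict_nb(model, "brca1 breast screening"))  # breast
```

At the scale reported in the abstract, the same idea would normally be run on a distributed framework; the sketch only shows the scoring logic.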
Life priorities in the HIV-positive Asians: a text-mining analysis in young vs. old generation.
AIDS Care. 2016 Aug 12;:1-4
Authors: Chen WT, Barbour R
Abstract
HIV/AIDS is one of the most urgent and challenging public health issues, especially since it is now considered a chronic disease. In this project, we used text mining techniques to extract meaningful words and word patterns from 45 transcribed in-depth interviews of people living with HIV/AIDS (PLWHA) conducted in Taipei, Beijing, Shanghai, and San Francisco from 2006 to 2013. Text mining analysis can predict whether an emerging field will become a long-lasting source of academic interest or whether it is simply a passing source of interest that will soon disappear. The data were analyzed by age group (45 and older vs. 44 and younger). The highest ranking fragments in the order of frequency were: "care", "daughter", "disease", "family", "HIV", "hospital", "husband", "medicines", "money", "people", "son", "tell/disclosure", "thought", "want", and "years". Participants in the 44-year-old and younger group were focused mainly on disease disclosure, their families, and their financial condition. In older PLWHA, social support was one of the main concerns. In this study, we learned that different age groups perceive the disease differently. Therefore, when designing an intervention, researchers should consider tailoring it to a specific population to help PLWHA achieve a better quality of life. Promoting self-management can be an effective strategy for every encounter with HIV-positive individuals.
PMID: 27684610 [PubMed - as supplied by publisher]
The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.
J Trauma Stress. 2015 Dec;28(6):505-14
Authors: Hammond KW, Ben-Ari AY, Laundry RJ, Boyko EJ, Samore MH
Abstract
Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research.
PMID: 26579624 [PubMed - indexed for MEDLINE]
Automatic semantic classification of scientific literature according to the hallmarks of cancer.
Bioinformatics. 2016 Feb 1;32(3):432-40
Authors: Baker S, Silins I, Guo Y, Ali I, Högberg J, Stenius U, Korhonen A
Abstract
MOTIVATION: The hallmarks of cancer have become highly influential in cancer research. They reduce the complexity of cancer into 10 principles (e.g. resisting cell death and sustaining proliferative signaling) that explain the biological capabilities acquired during the development of human tumors. Since new research depends crucially on existing knowledge, technology for semantic classification of scientific literature according to the hallmarks of cancer could greatly support literature review, knowledge discovery and applications in cancer research.
RESULTS: We present the first step toward the development of such technology. We introduce a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer. We use this corpus to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining. We report good performance in both intrinsic and extrinsic evaluations, demonstrating both the accuracy of the methodology and its potential in supporting practical cancer research. We discuss how this approach could be developed and applied further in the future.
AVAILABILITY AND IMPLEMENTATION: The corpus of hallmark-annotated PubMed abstracts and the software for classification are available at: http://www.cl.cam.ac.uk/~sb895/HoC.html.
CONTACT: simon.baker@cl.cam.ac.uk.
PMID: 26454282 [PubMed - indexed for MEDLINE]
("orphan disease" OR "rare disease" OR "orphan diseases" OR "rare diseases"); +22 new citations
22 new pubmed citations were retrieved for your search. Click on the search hyperlink below to display the complete search results:
("orphan disease" OR "rare disease" OR "orphan diseases" OR "rare diseases")
These pubmed results were generated on 2016/09/28
PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.
"Cystic Fibrosis"; +6 new citations
6 new pubmed citations were retrieved for your search. Click on the search hyperlink below to display the complete search results:
These pubmed results were generated on 2016/09/28
"Systems Biology"[Title/Abstract] AND ("2005/01/01"[PDAT] : "3000"[PDAT]); +12 new citations
12 new pubmed citations were retrieved for your search. Click on the search hyperlink below to display the complete search results:
"Systems Biology"[Title/Abstract] AND ("2005/01/01"[PDAT] : "3000"[PDAT])
These pubmed results were generated on 2016/09/28
In search for geroprotectors: in silico screening and in vitro validation of signalome-level mimetics of young healthy state.
Aging (Albany NY). 2016 Sep 24;
Authors: Aliper A, Belikov AV, Garazha A, Jellen L, Artemov A, Suntsova M, Ivanova A, Venkova L, Borisov N, Buzdin A, Mamoshina P, Putin E, Swick AG, Moskalev A, Zhavoronkov A
Abstract
Populations in developed nations throughout the world are rapidly aging, and the search for geroprotectors, or anti-aging interventions, has never been more important. Yet while hundreds of geroprotectors have extended lifespan in animal models, none have yet been approved for widespread use in humans. GeroScope is a computational tool that can aid prediction of novel geroprotectors from existing human gene expression data. GeroScope maps expression differences between samples from young and old subjects to aging-related signaling pathways, then profiles pathway activation strength (PAS) for each condition. Known substances are then screened and ranked for those most likely to target differential pathways and mimic the young signalome. Here we used GeroScope and shortlisted ten substances, all of which have lifespan-extending effects in animal models, and tested 6 of them for geroprotective effects in senescent human fibroblast cultures. PD-98059, a highly selective MEK1 inhibitor, showed both life-prolonging and rejuvenating effects. Natural compounds like N-acetyl-L-cysteine, Myricetin and Epigallocatechin gallate also improved several senescence-associated properties and were further investigated with pathway analysis. This work not only highlights several potential geroprotectors for further study, but also serves as a proof-of-concept for GeroScope, Oncofinder and other PAS-based methods in streamlining drug prediction, repurposing and personalized medicine.
PMID: 27677171 [PubMed - as supplied by publisher]
Targeting PI4K for Radiosensitization: A Potential Model of Drug Repositioning.
Int J Radiat Oncol Biol Phys. 2016 Oct 1;96(2S):E558
Authors: Kim IA, Kwon J, Park YH, Kim DH, Park JM
PMID: 27675011 [PubMed - as supplied by publisher]
Polypharmacology in Precision Oncology: Current Applications and Future Prospects.
Curr Pharm Des. 2016 Sep 23;
Authors: Antolin AA, Workman P, Mestres J, Al-Lazikani B
Abstract
Over the past decade, a more comprehensive, large-scale approach to studying cancer genetics and biology has revealed the challenges of tumor heterogeneity, adaption, evolution and drug resistance, while systems-based pharmacology and chemical biology strategies have uncovered a much more complex interaction between drugs and the human proteome than was previously anticipated. In this mini-review we assess the progress and potential of drug polypharmacology in biomarker-driven precision oncology. Polypharmacology not only provides great opportunities for drug repurposing to exploit off-target effects in a new single-target indication but through simultaneous blockade of multiple targets or pathways offers exciting opportunities to slow, overcome or even prevent inherent or adaptive drug resistance. We highlight the many challenges associated with exploiting known or desired polypharmacology in drug design and development, and assess computational and experimental methods to uncover unknown polypharmacology. A comprehensive understanding of the intricate links between polypharmacology, efficacy and safety is urgently needed if we are to tackle the enduring challenge of cancer drug resistance and to fully exploit polypharmacology for the ultimate benefit of cancer patients.
PMID: 27669965 [PubMed - as supplied by publisher]
Busting the billion-dollar myth: how to slash the cost of drug development.
Nature. 2016 Aug 25;536(7617):388-90
Authors: Maxmen A
PMID: 27558048 [PubMed - indexed for MEDLINE]
Repurposed therapeutic agents targeting the Ebola virus: a protocol for a systematic review.
Syst Rev. 2015;4:171
Authors: Sweiti H, Ekwunife O, Jaschinski T, Lhachimi SK
Abstract
BACKGROUND: The recent Ebola epidemic in western Africa developed into an acute public health emergency of unprecedented level in modern times. The treatment provided in most cases has been limited to supportive care, as no approved therapies are available to date. Several established, licenced drugs have been suggested as potential repurposed therapeutic agents for Ebola. However, scientific data on their efficacy in treating Ebola is limited. The purpose of this review is to systematically assess scientific evidence on potential drugs targeting Ebola. In specific, we aim to (1) identify drug library screens involving therapeutic agents targeting the Ebola virus, (2) list potential approved drugs identified from drug screens and review their mechanism of action against the Ebola virus and (3) summarise the outcome of preclinical and clinical trials investigating approved drugs targeting the Ebola virus.
METHODS/DESIGN: We will develop comprehensive systematic search strategies and will perform a systematic literature search in MEDLINE, Embase and Cochrane Central Register of Controlled Trials (CENTRAL). Two authors will independently screen the titles, abstracts and the references of all selected articles on the basis of inclusion criteria. These include any available drug screening, preclinical studies and clinical studies examining the efficacy of approved therapeutic agents targeting the Ebola virus. There will be no restrictions on the type of participants, the type of comparator, time or setting. Data extraction and quality assessment will be undertaken by two review authors working independently.
DISCUSSION: This systematic review will provide systematic knowledge on potential repurposed therapeutic agents targeting Ebola. It aims to help guide future investigations on repurposed drugs and avoid repetitive studies.
SYSTEMATIC REVIEW REGISTRATION: PROSPERO CRD42015024349.
PMID: 26607658 [PubMed - indexed for MEDLINE]
Moving toward personalized medicine in rheumatoid arthritis: SNPs in methotrexate intracellular pathways are associated with methotrexate therapeutic outcome.
Pharmacogenomics. 2016 Sep 27;
Authors: Lima A, Bernardes M, Azevedo R, Seabra V, Medeiros R
Abstract
AIM: Evaluate the potential of selected SNPs as predictors of methotrexate (MTX) therapeutic outcome.
PATIENTS & METHODS: In total, 35 SNPs in 14 genes involved in MTX intracellular pathways and Phase II reactions were genotyped in 233 rheumatoid arthritis (RA) patients treated with MTX. Binary logistic regressions were performed by genotype/haplotype-based approaches. Non-Response- and Toxicity-Genetic Risk Indexes (Non-RespGRI and ToxGRI) were created.
RESULTS: MTX nonresponse was associated to eight genotypes and three haplotypes: MTHFR rs1801131 AA and rs1801133 TT; MS rs1805087 AA; MTRR rs1801394 A carriers; ATIC rs2372536 C carriers, rs4673993 T carriers, rs7563206 T carriers and rs12995526 T carriers; CC for GGH rs3758149 and rs12681874; CGTTT for ATIC combination 1; and CTTTC for ATIC combination 2. From overall Non-RespGRI patients with indexes 6-8 had more than sixfold increased risk for MTX nonresponse than those patients with indexes 0-5. MTX-related toxicity was associated to five genotypes and two haplotypes: ATIC rs2372536 G carriers, rs3821353 T carriers, rs7563206 CC and rs12995526 CC; ADORA2A rs2267076 T; CTTCC for ATIC combination 1; and TC for ADORA2A rs2267076 and rs2298383. From overall ToxGRI, patients with indexes 3-4 had more than sevenfold increased risk for MTX-related toxicity than those patients with indexes 1-2.
CONCLUSION: Genotyping may be helpful to identify which RA patients will not benefit from MTX treatment and, consequently, important to personalized medicine in RA. Nevertheless, further studies are required to validate these findings.
PMID: 27676277 [PubMed - as supplied by publisher]
Pharmacogenomic approaches to lipid-regulating trials.
Curr Opin Lipidol. 2016 Sep 26;
Authors: Bertrand MJ, Dubé MP, Tardif JC
Abstract
PURPOSE OF REVIEW: Randomized clinical outcome trials are costly, long, and often yield neutral or modestly positive results, and these issues have impeded cardiovascular drug development in the past decade. Despite the significant reduction of cardiovascular morbidity and mortality with statins, substantial residual risk of major cardiovascular events remains. This could be because of the difficulty of demonstrating benefits of new drugs in addition to the current standard of care in unselected populations as well as the interindividual variability in drug response. Pharmacogenomics is a promising avenue for the development of novel or failed drugs and for the repurposing of other medications.
RECENT FINDINGS: Several variants were identified in genes that were associated with the effects of statins on plasma lipids. Genomic studies of mutations in genes that encode drug targets have the potential to inform on the link between drug therapy acting on those targets and clinical outcomes. Recently, ADCY9 gene variants were shown to be significantly associated with responses to dalcetrapib in terms of clinical outcomes, atherosclerosis imaging, cholesterol efflux, and inflammation, which provided support for the conduct of a new prospective clinical trial in a genetically determined population.
SUMMARY: Pharmacogenomics holds great potential in future lipid trials to decrease failure rates in drug development and to identify patients who will respond with greater benefits and smaller risk.
PMID: 27676198 [PubMed - as supplied by publisher]
Integrating Pharmacovigilance and Pharmacogenomics: Croatian Experience.
Clin Ther. 2016 Oct 6;38(10S):e23
Authors: Bozina N, Mirosevic Skvrce N, Ganoci L, Mas P, Klarica Domjanovic I, Simic I
PMID: 27673639 [PubMed - as supplied by publisher]
Genes differentially expressed by methylprednisolone in vivo in CD4 T lymphocytes from multiple sclerosis patients: potential biomarkers.
Pharmacogenomics J. 2016 Sep 27;
Authors: De Andres C, García MI, Goicoechea H, Martínez-Ginés ML, García-Domínguez JM, Martín ML, Romero-Delgado F, Benguría A, Sanjurjo M, López-Fernández LA
Abstract
Intravenous methylprednisolone (IVMP) is the gold standard treatment in acute relapses of multiple sclerosis. Knowing the response to IVMP in advance could facilitate earlier selection of patients for subsequent courses of therapy. However, molecular mechanisms and changes in gene expression induced by methylprednisolone remain unknown. The aim of the study was to identify in vivo differentially expressed genes in relapsing-remitting multiple sclerosis patients after 3-6 days of treatment with IVMP. For this purpose, whole-genome transcription profiling of CD4+ T lymphocytes was performed before and after treatment with IVMP in 8 relapsing-remitting multiple sclerosis patients during relapse using Human GE 4x44K v2 microarrays. Differentially expressed genes were identified using a paired t test on GeneSpring v13.0 software. A P-value <0.001 and a twofold change were considered significant. Microarray data were confirmed using real-time PCR. Microarray revealed changes in gene expression: four genes were downregulated (B3GNT3, ZNF683, IFNG and TNF) and seven upregulated (DEFA4, CTSG, DEFA8P, AZU1, MPO, ELANE and PRTN3). Pathway analysis revealed the transforming growth factor-β signaling pathway to be affected. Comparison with previously published data on in vitro methylprednisolone-regulated genes showed that SMAD7, TNF and CHI3L1 were also downregulated in vivo in relapsing-remitting multiple sclerosis patients. In summary, we performed the first in vivo transcriptome analysis in CD4+ T lymphocytes before and after the treatment with IVMP in patients with multiple sclerosis. Identification of differentially expressed genes in patients receiving IVMP could improve our understanding of the molecular mechanisms underlying the therapeutic effects of IVMP and highlight potential biomarkers of the response to IVMP. The Pharmacogenomics Journal advance online publication, 27 September 2016; doi:10.1038/tpj.2016.71.
PMID: 27670768 [PubMed - as supplied by publisher]
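The significance rule quoted in the abstract above (P-value <0.001 together with a twofold change) can be expressed as a simple filter. The sketch below is an illustration only: it assumes the P-values have already been computed by a paired test, it treats "twofold" as "at least twofold in either direction", and the gene labels and numbers are hypothetical placeholders, not data from the study.

```python
import math

def is_significant(p_value, before, after, p_cutoff=0.001, fold=2.0):
    """Apply a P-value cutoff and a minimum fold change in either direction.

    before/after are mean expression levels on a linear scale (assumed > 0).
    """
    log2_fold_change = math.log2(after / before)
    return p_value < p_cutoff and abs(log2_fold_change) >= math.log2(fold)

# Hypothetical rows: (label, p_value, mean_before, mean_after)
rows = [
    ("GENE_A", 0.0005, 100.0, 250.0),  # up 2.5-fold, passes both criteria
    ("GENE_B", 0.0005, 100.0, 150.0),  # only 1.5-fold, fails the fold criterion
    ("GENE_C", 0.0100, 100.0, 400.0),  # 4-fold but fails the P-value cutoff
    ("GENE_D", 0.0001, 200.0, 100.0),  # down 2-fold, passes both criteria
]
selected = [label for label, p, b, a in rows if is_significant(p, b, a)]
```

Working on the log2 scale makes up- and downregulation symmetric, so a halving and a doubling are treated alike.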
Genome-wide association study identifies pharmacogenomic loci linked with specific antihypertensive drug treatment and new-onset diabetes.
Pharmacogenomics J. 2016 Sep 27;
Authors: Chang SW, McDonough CW, Gong Y, Johnson TA, Tsunoda T, Gamazon ER, Perera MA, Takahashi A, Tanaka T, Kubo M, Pepine CJ, Johnson JA, Cooper-DeHoff RM
Abstract
We conducted a discovery genome-wide association study with expression quantitative trait loci (eQTL) annotation of new-onset diabetes (NOD) among European Americans, who were exposed to a calcium channel blocker-based strategy (CCB strategy) or a β-blocker-based strategy (β-blocker strategy) in the INternational VErapamil SR Trandolapril STudy. Replication of the top signal from the SNP*treatment interaction analysis was attempted in Hispanic and African Americans, and a joint meta-analysis was performed (total 334 NOD cases and 806 matched controls). PLEKHH2 rs11124945 at 2p21 interacted with antihypertensive exposure for NOD (meta-analysis P=5.3 × 10^-8). rs11124945 G allele carriers had lower odds for NOD when exposed to the β-blocker strategy compared with the CCB strategy (odds ratio (OR)=0.38 (0.24-0.60), P=4.0 × 10^-5), whereas A/A homozygotes exposed to the β-blocker strategy had increased odds for NOD compared with the CCB strategy (OR=2.02 (1.39-2.92), P=2.0 × 10^-4). eQTL annotation of the 2p21 locus provides functional support for regulating gene expression. The Pharmacogenomics Journal advance online publication, 27 September 2016; doi:10.1038/tpj.2016.67.
PMID: 27670767 [PubMed - as supplied by publisher]
IL17RA gene variants and anti-TNF response among psoriasis patients.
Pharmacogenomics J. 2016 Sep 27;
Authors: Batalla A, Coto E, Gómez J, Eirís N, González-Fernández D, Gómez-De Castro C, Daudén E, Llamas-Velasco M, Prieto-Perez R, Abad-Santos F, Carretero G, García FS, Godoy YB, Cardo LF, Alonso B, Iglesias S, Coto-Segura P
Abstract
Polymorphisms at genes encoding proteins involved in the pathogenesis of psoriasis (Psor) or in the mechanism of action of biological drugs could influence the treatment response. Because the interleukin (IL)-17 family has a central role in the pathogenesis of Psor, we hypothesized that IL17RA variants could influence the response to anti-TNF drugs among Psor patients. To address this issue we performed a cross-sectional study of Psor patients who received the biological treatments for the first time, with a follow-up of at least 6 months. All of the patients were Caucasian, older than 18 years, with chronic plaque Psor, and had completed at least 24 weeks of anti-TNF therapy (adalimumab, etanercept or infliximab). The treatment response to anti-TNF agents was evaluated according to the achievement of PASI50 and PASI75 at weeks 12 and 24. Those who achieved PASI75 at week 24 were considered good responders. All patients were genotyped for the selected single-nucleotide polymorphisms (SNPs) in the IL17RA gene. A total of 238 patients were included (57% male, mean age 46 years). One hundred and five patients received adalimumab, 91 patients etanercept and 42 infliximab. The rs4819554 promoter SNP allele A was significantly more common among responders at weeks 12 (P=0.01) and 24 (P=0.04). We found a higher frequency of AA versus AG+GG among responders, but the difference was only significant at week 12 (P=0.03, odds ratio=1.86, 95% confidence interval=1.05-3.27). Thus, in the study population, the SNP rs4819554 in the promoter region of IL17RA significantly influences the response to anti-TNF drugs at week 12. The Pharmacogenomics Journal advance online publication, 27 September 2016; doi:10.1038/tpj.2016.70.
PMID: 27670766 [PubMed - as supplied by publisher]
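The abstract above reports an odds ratio with a 95% confidence interval for genotype versus response. As a rough illustration of how such a figure can be obtained from a 2×2 table, the sketch below uses the standard log-odds (Woolf) approximation. The abstract does not say which method the authors used, so this choice is an assumption, and the counts are hypothetical because the abstract does not give the underlying table.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed responders      b: exposed non-responders
    c: unexposed responders    d: unexposed non-responders
    All four counts must be positive for this approximation.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, chosen only to exercise the function.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

An interval whose lower bound stays above 1 corresponds to the kind of result the abstract describes as significant.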
New polymorphisms associated with response to anti-TNF drugs in patients with moderate-to-severe plaque psoriasis.
Pharmacogenomics J. 2016 Sep 27;
Authors: Prieto-Pérez R, Solano-López G, Cabaleiro T, Román M, Ochoa D, Talegón M, Baniandrés O, López-Estebaranz JL, de la Cueva P, Daudén E, Abad-Santos F
Abstract
Anti-tumor necrosis factor (anti-TNF) drugs are effective against psoriasis, although 20-30% of patients are nonresponders. Few pharmacogenomic studies have been performed to predict the response to anti-TNF drugs in psoriasis. We studied 173 polymorphisms to establish an association with the response to anti-TNF drugs in patients with moderate-to-severe plaque psoriasis (N=144). We evaluated the response using PASI75 at 3, 6 and 12 months. The results of the multivariate analysis showed an association between polymorphisms in PGLYR4, ZNF816A, CTNNA2, IL12B, MAP3K1 and HLA-C genes and the response at 3 months. In addition, the results for polymorphisms in IL12B and MAP3K1 were replicated at 6 months. We also obtained significant results for IL12B polymorphism at 1 year. Moreover, polymorphisms in FCGR2A, HTR2A and CDKAL1 were significant at 6 months. This is the first study to show an association with these polymorphisms. However, these biomarkers should be validated in large-scale studies before implementation in clinical practice. The Pharmacogenomics Journal advance online publication, 27 September 2016; doi:10.1038/tpj.2016.64.
PMID: 27670765 [PubMed - as supplied by publisher]
Expansion of medical vocabularies using distributional semantics on Japanese patient blogs.
J Biomed Semantics. 2016;7(1):58
Authors: Ahltorp M, Skeppstedt M, Kitajima S, Henriksson A, Rzepka R, Araki K
Abstract
BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in other languages than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthographics, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.
METHODS: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.
RESULTS: Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding.
CONCLUSIONS: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
PMID: 27671202 [PubMed - as supplied by publisher]
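The evaluation in the abstract above reports recall for candidate lists of length n and 10n, where n is the number of held-out vocabulary terms. A minimal sketch of that measure follows; the function name `recall_at_k` and the placeholder candidate strings are hypothetical, and the ranking itself (random indexing plus clustering in the study) is assumed to have been produced elsewhere.

```python
def recall_at_k(ranked_candidates, held_out_terms, k):
    """Fraction of held-out terms found among the top k ranked candidates."""
    held_out = set(held_out_terms)
    if not held_out:
        raise ValueError("held_out_terms must not be empty")
    hits = held_out.intersection(ranked_candidates[:k])
    return len(hits) / len(held_out)

# Hypothetical ranked list of 40 candidates and 4 held-out terms,
# two of which never appear among the candidates.
ranked = [f"cand_{i:02d}" for i in range(1, 41)]
held_out = ["cand_02", "cand_15", "cand_90", "cand_95"]
n = len(held_out)

recall_top_n = recall_at_k(ranked, held_out, n)         # top n candidates
recall_top_10n = recall_at_k(ranked, held_out, 10 * n)  # top 10n candidates
```

Recall can only stay the same or rise as the list grows from n to 10n, which matches the pattern of the figures reported in the abstract.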
