Deep learning
Deep learning based heat transfer simulation of the casting process
Sci Rep. 2024 Nov 23;14(1):29068. doi: 10.1038/s41598-024-80515-x.
ABSTRACT
To avoid the necessity of constitutional models, computational intensity, and the time-consuming nature inherent in numerical simulations, a pioneering approach utilizing deep learning techniques has been adopted to swiftly predict temperature fields during the solidification phase of casting processes. This methodology involves the development of rapid prediction models based on modified U-net network architectures, augmented by the integration of Inception and CBAM (Convolutional Block Attention Module) modules. The training set was constructed from 200 diverse geometric models, each containing three kinds of components (casting, mold, and chill), where the temperature fields at a specific time, ti, were the input data, while those at the subsequent time point, ti+1, served as the corresponding labels. The geometric models were generated by eroding arbitrary 2D shapes, and their associated temperature fields were then obtained via FDM-based numerical simulation. The trained deep learning models can promptly forecast temperature fields during the solidification process for arbitrarily shaped castings at different times. The average accuracy of the predicted outcomes reaches 94.5% when the absolute temperature error tolerance is set to 7 ℃, and a prediction takes just one second per time step. Notably, these models can handle multiple components made of different materials within a single geometric model, such as the casting, chill, and mold of an intricate casting process.
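The pairing of a temperature field at ti with the field at ti+1, and the tolerance-based accuracy figure, can be illustrated with a small sketch. It is not the authors' simulator: it assumes a single material with uniform thermal diffusivity `alpha`, a square grid with fixed boundary values, and an explicit finite-difference update. The function names `fdm_step` and `accuracy_within` are hypothetical.

```python
def fdm_step(T, alpha, dx, dt):
    """Advance a 2D temperature grid by one explicit finite-difference step.

    Assumptions (not from the paper): uniform diffusivity alpha, equal grid
    spacing dx in both directions, and boundary cells held at fixed values.
    The explicit scheme is only stable when alpha * dt / dx**2 <= 0.25.
    """
    rows, cols = len(T), len(T[0])
    r = alpha * dt / (dx * dx)
    new = [row[:] for row in T]  # copy so boundary values carry over unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            laplacian = (T[i - 1][j] + T[i + 1][j] + T[i][j - 1] + T[i][j + 1]
                         - 4 * T[i][j])
            new[i][j] = T[i][j] + r * laplacian
    return new


def accuracy_within(pred, target, tol=7.0):
    """Fraction of grid cells whose absolute error is at most tol degrees."""
    total = 0
    hits = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            if abs(p - t) <= tol:
                hits += 1
    return hits / total


# One (input, label) training pair: the field at t_i and the field at t_{i+1}.
T_i = [[0.0, 0.0, 0.0],
       [0.0, 100.0, 0.0],
       [0.0, 0.0, 0.0]]
T_next = fdm_step(T_i, alpha=1.0, dx=1.0, dt=0.1)
```

Under these assumptions, a network trained on many such pairs would be scored by calling `accuracy_within(prediction, T_next, tol=7.0)` on held-out geometries.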
PMID:39580492 | DOI:10.1038/s41598-024-80515-x
Integration of the bulk transcriptome and single-cell transcriptome reveals efferocytosis features in lung adenocarcinoma prognosis and immunotherapy by combining deep learning
Cancer Cell Int. 2024 Nov 23;24(1):388. doi: 10.1186/s12935-024-03571-3.
ABSTRACT
BACKGROUND: Efferocytosis (ER) refers to the process of phagocytic clearance of programmed dead cells, and studies have shown that it is closely related to tumor immune escape.
METHODS: This study was based on a comprehensive analysis of TCGA, GEO and CTRP databases. ER-related genes were collected from previous literature, univariate Cox regression was performed and consensus clustering was used to categorize lung adenocarcinoma (LUAD) patients into two subgroups. Lasso regression and multivariate Cox regression analyses were used to construct ER-related prognostic features, and multiple immune infiltration algorithms were used to assess the correlation between the efferocytosis-related gene risk score (ERGRS) and tumor microenvironment (TME). The key gene HAVCR1 was identified using deep learning and other methods. Finally, pan-cancer analysis of the key genes was performed and in vitro experiments were conducted to verify the promotional effect of HAVCR1 on LUAD progression.
RESULTS: A total of 33 ER-related genes associated with the prognosis of LUAD were identified, and the prognostic signature of ERGRS was successfully constructed to predict the overall survival (OS) and treatment response of LUAD patients. The high-risk group was highly enriched in some oncogenic pathways, while the low-ERGRS group was highly enriched in some immune-related pathways. In addition, the high-ERGRS group had higher TMB, TNB and TIDE scores and lower immune scores. The low-risk group had a better immunotherapeutic response and a lower likelihood of immune escape. Drug sensitivity analysis revealed that BRD-K92856060, monensin and hexaminolevulinate may be potential therapeutic agents for the high-risk group. ERGRS was also validated in several cohorts. In addition, HAVCR1 is one of the key genes, and knockdown of HAVCR1 in vitro significantly reduced the proliferation, migration and invasion ability of lung adenocarcinoma cells.
CONCLUSION: Our study developed a novel prognostic signature of efferocytosis-related genes. This prognostic signature accurately predicted survival prognosis as well as treatment outcome in LUAD patients and explored the role of HAVCR1 in lung adenocarcinoma progression.
PMID:39580462 | DOI:10.1186/s12935-024-03571-3
Supervised multiple kernel learning approaches for multi-omics data integration
BioData Min. 2024 Nov 23;17(1):53. doi: 10.1186/s13040-024-00406-9.
ABSTRACT
BACKGROUND: Advances in high-throughput technologies have led to an ever-increasing availability of omics datasets. The integration of multiple heterogeneous data sources is currently an issue for biology and bioinformatics. Multiple kernel learning (MKL) has been shown to be a flexible and valid approach to consider the diverse nature of multi-omics inputs, despite being an underused tool in genomic data mining.
RESULTS: We provide novel MKL approaches based on different kernel fusion strategies. To learn from the meta-kernel of input kernels, we adapted unsupervised integration algorithms for supervised tasks with support vector machines. We also tested deep learning architectures for kernel fusion and classification. The results show that MKL-based models can outperform more complex, state-of-the-art, supervised multi-omics integrative approaches.
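The idea of learning from a meta-kernel of input kernels can be sketched in a few lines. This is a simplified illustration, not the paper's fusion strategies: it assumes one linear kernel per omics block and a fixed, user-chosen weighted sum, whereas the authors adapt unsupervised integration algorithms to obtain the combination. The names `linear_kernel` and `fuse_kernels` are hypothetical.

```python
def linear_kernel(X):
    """Gram matrix of dot products between all pairs of samples (rows of X)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def fuse_kernels(kernels, weights):
    """Weighted sum of same-sized kernel matrices, giving one meta-kernel.

    Assumption: the weights are supplied by hand; in practice they would be
    learned or derived from the data.
    """
    n = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)]
            for i in range(n)]


# Two toy "omics" blocks measured on the same two samples.
expression = [[1.0, 0.0], [0.0, 1.0]]
methylation = [[2.0], [3.0]]

meta_kernel = fuse_kernels(
    [linear_kernel(expression), linear_kernel(methylation)],
    weights=[0.5, 0.5],
)
```

A fused matrix of this kind could then be handed to any support vector machine implementation that accepts a precomputed kernel.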
CONCLUSION: Multiple kernel learning offers a natural framework for predictive models in multi-omics data. It proved to provide a fast and reliable solution that can compete with and outperform more complex architectures. Our results offer a direction for bio-data mining research, biomarker discovery and further development of methods for heterogeneous data integration.
PMID:39580456 | DOI:10.1186/s13040-024-00406-9
AI-powered detection and quantification of post-harvest physiological deterioration (PPD) in cassava using YOLO foundation models and K-means clustering
Plant Methods. 2024 Nov 23;20(1):178. doi: 10.1186/s13007-024-01309-w.
ABSTRACT
BACKGROUND: Post-harvest physiological deterioration (PPD) poses a significant challenge to the cassava industry, leading to substantial economic losses. This study aims to address this issue by developing a comprehensive framework in collaboration with cassava breeders.
RESULTS: Advanced deep learning (DL) techniques, such as the Segment Anything Model (SAM) and YOLO foundation models (YOLOv7, YOLOv8, YOLOv9, and YOLO-NAS), were used to accurately categorize PPD severity from RGB images captured by cameras or cell phones. YOLOv8 achieved the highest overall mean Average Precision (mAP) of 80.4%, demonstrating superior performance in detecting and classifying different PPD levels across all three models. Although YOLO-NAS had some instability during training, it demonstrated stronger performance in detecting the PPD_0 class, with a mAP of 91.3%. YOLOv7 exhibited the lowest performance across all classes, with an overall mAP of 75.5%. Despite challenges with similar color intensities in the image data, the combination of SAM, image processing techniques such as RGB color filtering, and machine learning (ML) algorithms was effective in removing yellow and gray color sections, significantly reducing the Mean Absolute Error (MAE) in PPD estimation from 20.01 to 15.50. Moreover, Artificial Intelligence (AI)-based algorithms allow for efficient analysis of large datasets, enabling rapid screening of cassava roots for PPD symptoms. This approach is much faster and more streamlined than labor-intensive and time-consuming manual visual scoring.
CONCLUSION: These results highlight the significant advancements in PPD detection and quantification in cassava samples using cutting-edge AI techniques. The integration of YOLO foundation models, alongside SAM and image processing methods, has demonstrated promising precision even in scenarios where experts struggle to differentiate closely related classes. This AI-powered model not only effectively streamlines the PPD assessment in the pre-breeding pipeline but also enhances the overall effectiveness of cassava breeding programs, facilitating the selection of PPD-resistant varieties through controlled screening. By improving the precision of PPD assessments, this research contributes to the broader goal of enhancing cassava productivity, quality, and resilience, ultimately supporting global food security efforts.
PMID:39580444 | DOI:10.1186/s13007-024-01309-w
Screening for severe coronary stenosis in patients with apparently normal electrocardiograms based on deep learning
BMC Med Inform Decis Mak. 2024 Nov 22;24(1):355. doi: 10.1186/s12911-024-02764-0.
ABSTRACT
BACKGROUND: Patients with severe coronary artery stenosis may present with apparently normal electrocardiograms (ECGs), making it difficult to detect adverse health conditions during routine screenings or physical examinations. Consequently, these patients might miss the optimal window for treatment.
METHODS: We aimed to develop an effective model to distinguish severe coronary stenosis from no or mild coronary stenosis in patients with apparently normal ECGs. A total of 392 patients, including 138 with severe stenosis, were selected for the study. Deep learning (DL) models were trained from scratch and using pre-trained parameters via transfer learning. These models were evaluated based on ECG data alone and in combination with clinical information, including age, sex, hypertension, diabetes, dyslipidemia and smoking status.
RESULTS: We found that DL models trained from scratch using ECG data alone achieved a specificity of 74.6% but exhibited low sensitivity (54.5%), comparable to the performance of logistic regression using clinical data. Adding clinical information to the ECG DL model trained from scratch improved sensitivity (90.9%) but reduced specificity (42.3%). The best performance was achieved by combining clinical information with the ECG transfer learning model, resulting in an area under the receiver operating characteristic curve (AUC) of 0.847, with 84.8% sensitivity and 70.4% specificity.
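The sensitivity and specificity figures reported above follow the standard confusion-matrix definitions. A minimal sketch, assuming binary labels where 1 marks severe stenosis and 0 marks no or mild stenosis, and assuming both classes are present in the data (the function name is hypothetical):

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0 and 1.

    Sensitivity = TP / (TP + FN): share of true positives that are detected.
    Specificity = TN / (TN + FP): share of true negatives that are cleared.
    Assumes at least one positive and one negative case in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels, not data from the study.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

The trade-off in the abstract (90.9% sensitivity with 42.3% specificity for one model) is the usual effect of a classifier that labels more cases positive: fewer missed positives, more false alarms.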
CONCLUSIONS: The findings demonstrate the effectiveness of DL models in identifying severe coronary stenosis in patients with apparently normal ECGs and validate an efficient approach utilizing existing ECG models. By employing transfer learning techniques, we can extract "deep features" that summarize the inherent information of ECGs with relatively low computational expense.
PMID:39578851 | DOI:10.1186/s12911-024-02764-0
Multimodal machine learning for language and speech markers identification in mental health
BMC Med Inform Decis Mak. 2024 Nov 22;24(1):354. doi: 10.1186/s12911-024-02772-0.
ABSTRACT
BACKGROUND: There are numerous papers focusing on diagnosing mental health disorders using unimodal and multimodal approaches. However, our literature review shows that the majority of these studies either use unimodal approaches to diagnose a variety of mental disorders or employ multimodal approaches to diagnose a single mental disorder instead. In this research we combine these approaches by first identifying and compiling an extensive list of mental health disorder markers for a wide range of mental illnesses which have been used for both unimodal and multimodal methods, which is subsequently used for determining whether the multimodal approach can outperform the unimodal approaches.
METHODS: For this study we used the well known and robust multimodal DAIC-WOZ dataset derived from clinical interviews. Here we focus on the modalities text and audio. First, we constructed two unimodal models to analyze text and audio data, respectively, using feature extraction, based on the extensive list of mental disorder markers that has been identified and compiled by us using related and earlier studies. For our unimodal text model, we also propose an initial pragmatic binary label creation process. Then, we employed an early fusion strategy to combine our text and audio features before model processing. Our fused feature set was then given as input to various baseline machine and deep learning algorithms, including Support Vector Machines, Logistic Regressions, Random Forests, and fully connected neural network classifiers (Dense Layers). Ultimately, the performance of our models was evaluated using accuracy, AUC-ROC score, and two F1 metrics: one for the prediction of positive cases and one for the prediction of negative cases.
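Early fusion and the two per-class F1 scores can be sketched as follows. This is an illustrative reading of the description, assuming early fusion means concatenating each participant's text and audio feature vectors before classification; the helper names and toy values are hypothetical and not taken from the study.

```python
def early_fusion(text_features, audio_features):
    """Concatenate per-sample text and audio feature vectors."""
    if len(text_features) != len(audio_features):
        raise ValueError("both modalities must cover the same samples")
    return [list(t) + list(a) for t, a in zip(text_features, audio_features)]


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two samples with two text features and one audio feature each.
fused = early_fusion([[0.1, 0.2], [0.3, 0.4]], [[5.0], [6.0]])

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
f1_positive = f1_for_class(y_true, y_pred, cls=1)  # marker present
f1_negative = f1_for_class(y_true, y_pred, cls=0)  # marker absent
```

Reporting the two F1 values separately shows whether a model that looks accurate overall is in fact missing most of the positive cases.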
RESULTS: Overall, the unimodal text models achieved an accuracy ranging from 78% to 87% and an AUC-ROC score between 85% and 93%, while the unimodal audio models attained an accuracy of 64% to 72% and AUC-ROC scores of 53% to 75%. The experimental results indicated that our multimodal models achieved comparable accuracy (ranging from 80% to 87%) and AUC-ROC scores (between 84% and 93%) to those of the unimodal text models. However, the majority of the multimodal models managed to outperform the unimodal models in F1 scores, particularly in the F1 score of the positive class (F1 of 1s), which reflects how well the models perform in identifying the presence of a marker.
CONCLUSIONS: In conclusion, by refining the binary label creation process and by improving the feature engineering process of the unimodal acoustic model, we argue that the multimodal model can outperform both unimodal approaches. This study underscores the importance of multimodal integration in the field of mental health diagnostics and sets the stage for future research to explore more sophisticated fusion techniques and deeper learning models.
PMID:39578814 | DOI:10.1186/s12911-024-02772-0
miRStart 2.0: enhancing miRNA regulatory insights through deep learning-based TSS identification
Nucleic Acids Res. 2024 Nov 23:gkae1086. doi: 10.1093/nar/gkae1086. Online ahead of print.
ABSTRACT
MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression by binding to the 3'-untranslated regions of target mRNAs, influencing various biological processes at the post-transcriptional level. Identifying miRNA transcription start sites (TSSs) and transcription factors' (TFs) regulatory roles is crucial for elucidating miRNA function and transcriptional regulation. miRStart 2.0 integrates over 4500 high-throughput datasets across five data types, utilizing a multi-modal approach to annotate 28 828 putative TSSs for 1745 human and 1181 mouse miRNAs, supported by sequencing-based signals. Over 6 million tissue-specific TF-miRNA interactions, integrated from ChIP-seq data, are supplemented by DNase hypersensitivity and UCSC conservation data, with network visualizations. Our deep learning-based model outperforms existing tools in miRNA TSS prediction, achieving the most overlaps with both cell-specific and non-cell-specific validated TSSs. The user-friendly web interface and visualization tools make miRStart 2.0 easily accessible to researchers, enabling efficient identification of miRNA upstream regulatory elements in relation to their TSSs. This updated database provides systems-level insights into gene regulation and disease mechanisms, offering a valuable resource for translational research, facilitating the discovery of novel therapeutic targets and precision medicine strategies. miRStart 2.0 is now accessible at https://awi.cuhk.edu.cn/~miRStart2.
PMID:39578697 | DOI:10.1093/nar/gkae1086
Human essential gene identification based on feature fusion and feature screening
IET Syst Biol. 2024 Nov 22. doi: 10.1049/syb2.12105. Online ahead of print.
ABSTRACT
Essential genes are necessary to sustain the life of a species under adequate nutritional conditions. These genes have attracted significant attention for their potential as drug targets, especially in developing broad-spectrum antibacterial drugs. However, studying essential genes remains challenging due to their variability in specific environmental conditions. In this study, the authors aim to develop a powerful prediction model for identifying essential genes in humans. The authors first obtained the essential gene data from human cancer cell lines and characterised gene sequences using 7 feature encoding methods such as Kmer, the Composition of K-spaced Nucleic Acid Pairs, and Z-curve. Subsequently, feature fusion and feature optimisation strategies were employed to select the impactful features. Finally, machine learning algorithms were applied to construct the prediction models and evaluate their performance. The single-feature-based model achieved the highest area under the Receiver Operating Characteristic curve (AUC) of 0.830. After fusing and filtering these features, the classical machine learning models achieved the highest AUC at 0.823 while the deep learning model reached 0.860. Results obtained by the authors show that compared to using individual features, feature fusion and feature optimisation strategies significantly improved model performance. Moreover, the study provided an advantageous method for essential gene identification compared to other methods.
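Of the encodings named above, Kmer is the simplest to illustrate. The sketch below computes normalised k-mer frequencies and fuses two encodings by concatenation. It assumes a DNA alphabet of A, C, G and T, skips windows containing other characters, and is not the authors' pipeline; the function name is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return the frequency of every possible DNA k-mer in seq.

    The output has 4**k entries in lexicographic order (AA, AC, AG, ...).
    Windows containing characters other than A, C, G, T are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k].upper()
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


# Feature fusion in its simplest form: concatenate several encodings.
sequence = "ACGT"
fused_features = kmer_frequencies(sequence, k=1) + kmer_frequencies(sequence, k=2)
```

Fusing encodings this way grows the feature vector quickly (4 + 16 values here, 4 + 16 + 64 with 3-mers), which is one reason a feature screening step usually follows.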
PMID:39578676 | DOI:10.1049/syb2.12105
Crop classification in the middle reaches of the Hei River based on model transfer
Sci Rep. 2024 Nov 22;14(1):28964. doi: 10.1038/s41598-024-80327-z.
ABSTRACT
Crop classification using remote sensing technology is highly important for monitoring agricultural resources and managing water usage, especially in water-scarce regions like the Hei River. Crop classification requires a substantial number of labeled samples, but the collection of labeled samples demands significant resources and sample data may not be available for some years. To classify crops in sample-free years in the middle reaches of the Hei River, we generated multisource spectral data (MSSD) based on a spectral library and sample data. We pre-trained a model using labeled samples, followed by fine-tuning the model with MSSD to complete the crop classification for the years without samples. We conduct experiments using three CNN-based deep learning models and a machine learning model (RF). The experimental results indicate that in the model transfer experiments, using a fine-tuned model yields accurate classification results, with overall accuracy exceeding 90%. When the amount of labeled sample data is limited, fine-tuning the model based on MSSD can enhance the accuracy of crop classification. Overall, fine-tuning models based on MSSD can significantly enhance the accuracy of model transfer and reduce the reliance of deep learning models on large-scale sample datasets. The method to classify crops in the middle reaches of the Hei River can provide data support for local resource utilization and policy formulation.
PMID:39578651 | DOI:10.1038/s41598-024-80327-z
Heterogeneous virus classification using a functional deep learning model based on transmission electron microscopy images
Sci Rep. 2024 Nov 22;14(1):28954. doi: 10.1038/s41598-024-80013-0.
ABSTRACT
Viruses are submicroscopic agents that can infect other lifeforms and use their hosts' cells to replicate themselves. Despite having simplistic genetic structures among all living beings, viruses are highly adaptable, resilient, and capable of causing severe complications in their hosts' bodies. Due to their multiple transmission pathways, high contagion rate, and lethality, viruses pose the biggest biological threat both animal and plant species face. It is often challenging to promptly detect a virus in a host and accurately determine its type using manual examination techniques. However, computer-based automatic diagnosis methods, especially the ones using Transmission Electron Microscopy (TEM) images, have proven effective in instant virus identification. Using TEM images collected from a recent dataset, this article proposes a deep learning-based classification model to identify the virus type within those images. The methodology of this study includes two coherent image processing techniques to reduce the noise present in raw microscopy images and a functional Convolutional Neural Network (CNN) model for classification. Experimental results show that it can differentiate among 14 types of viruses with a maximum of 97.44% classification accuracy and F1-score, which asserts the effectiveness and reliability of the proposed method. Implementing this scheme will impart a fast and dependable virus identification scheme subsidiary to the thorough diagnostic procedures.
PMID:39578636 | DOI:10.1038/s41598-024-80013-0
CelloType: a unified model for segmentation and classification of tissue images
Nat Methods. 2024 Nov 22. doi: 10.1038/s41592-024-02513-1. Online ahead of print.
ABSTRACT
Cell segmentation and classification are critical tasks in spatial omics data analysis. Here we introduce CelloType, an end-to-end model designed for cell segmentation and classification for image-based spatial omics data. Unlike the traditional two-stage approach of segmentation followed by classification, CelloType adopts a multitask learning strategy that integrates these tasks, simultaneously enhancing the performance of both. CelloType leverages transformer-based deep learning techniques for improved accuracy in object detection, segmentation and classification. It outperforms existing segmentation methods on a variety of multiplexed fluorescence and spatial transcriptomic images. In terms of cell type classification, CelloType surpasses a model composed of state-of-the-art methods for individual tasks and a high-performance instance segmentation model. Using multiplexed tissue images, we further demonstrate the utility of CelloType for multiscale segmentation and classification of both cellular and noncellular elements in a tissue. The enhanced accuracy and multitask learning ability of CelloType facilitate automated annotation of rapidly growing spatial omics data.
PMID:39578628 | DOI:10.1038/s41592-024-02513-1
Attention-based multi-residual network for lung segmentation in diseased lungs with custom data augmentation
Sci Rep. 2024 Nov 22;14(1):28983. doi: 10.1038/s41598-024-79494-w.
ABSTRACT
Lung disease analysis in chest X-rays (CXR) using deep learning presents significant challenges due to the wide variation in lung appearance caused by disease progression and differing X-ray settings. While deep learning models have shown remarkable success in segmenting lungs from CXR images with normal or mildly abnormal findings, their performance declines when faced with complex structures, such as pulmonary opacifications. In this study, we propose AMRU++, an attention-based multi-residual UNet++ network designed for robust and accurate lung segmentation in CXR images with both normal and severe abnormalities. The model incorporates attention modules to capture relevant spatial information and multi-residual blocks to extract rich contextual and discriminative features of lung regions. To further enhance segmentation performance, we introduce a data augmentation technique that simulates the features and characteristics of CXR pathologies, addressing the issue of limited annotated data. Extensive experiments on public and private datasets comprising 350 cases of pneumoconiosis, COVID-19, and tuberculosis validate the effectiveness of our proposed framework and data augmentation technique.
PMID:39578613 | DOI:10.1038/s41598-024-79494-w
HDBind: encoding of molecular structure with hyperdimensional binary representations
Sci Rep. 2024 Nov 23;14(1):29025. doi: 10.1038/s41598-024-80009-w.
ABSTRACT
Traditional methods for identifying "hit" molecules from a large collection of potential drug-like candidates rely on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. These approaches have a significant limitation in that they require exceptional computing capabilities for even relatively small collections of molecules. Increasingly large and complex state-of-the-art deep learning approaches have gained popularity with the promise to improve the productivity of drug design, notorious for its numerous failures. However, as deep learning models increase in their size and complexity, their acceleration at the hardware level becomes more challenging. Hyperdimensional Computing (HDC) has recently gained attention in the computer hardware community due to its algorithmic simplicity relative to deep learning approaches. The HDC learning paradigm, which represents data with high-dimension binary vectors, allows the use of low-precision binary vector arithmetic to create models of the data that can be learned without the need for the gradient-based optimization required in many conventional machine learning and deep learning methods. This algorithmic simplicity allows for acceleration in hardware that has been previously demonstrated in a range of application areas (computer vision, bioinformatics, mass spectrometry, remote sensing, edge devices, etc.). To the best of our knowledge, our work is the first to consider HDC for the task of fast and efficient screening of modern drug-like compound libraries. We also propose the first HDC graph-based encoding methods for molecular data, demonstrating consistent and substantial improvement over previous work. We compare our approaches to alternative approaches on the well-studied MoleculeNet dataset and the recently proposed LIT-PCBA dataset derived from high quality PubChem assays.
We demonstrate our methods on multiple target hardware platforms, including Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), showing at least an order of magnitude improvement in energy efficiency versus even our smallest neural network baseline model with a single hidden layer. Our work thus motivates further investigation into molecular representation learning to develop ultra-efficient pre-screening tools. We make our code publicly available at https://github.com/LLNL/hdbind.
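The binary-vector arithmetic that HDC relies on can be shown with three generic operations. This is a textbook-style sketch of hyperdimensional computing primitives, not HDBind's molecular encoders: binding is taken to be element-wise XOR, bundling a bitwise majority vote (ties resolved to 0), and similarity one minus the normalised Hamming distance. All function names are hypothetical.

```python
import random


def random_hv(dim, rng):
    """Draw a random binary hypervector of length dim."""
    return [rng.randint(0, 1) for _ in range(dim)]


def bind(a, b):
    """Associate two hypervectors with element-wise XOR (self-inverse)."""
    return [x ^ y for x, y in zip(a, b)]


def bundle(vectors):
    """Superimpose hypervectors by bitwise majority vote; ties give 0."""
    n = len(vectors)
    return [1 if 2 * sum(bits) > n else 0 for bits in zip(*vectors)]


def hamming_similarity(a, b):
    """1.0 for identical vectors, about 0.5 for unrelated random ones."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return 1.0 - mismatches / len(a)


rng = random.Random(0)
dim = 10_000
examples = [random_hv(dim, rng) for _ in range(3)]

# A class prototype is the bundle of its encoded training examples; no
# gradient-based optimisation is involved.
prototype = bundle(examples)
member_score = hamming_similarity(prototype, examples[0])
outsider_score = hamming_similarity(prototype, random_hv(dim, rng))
```

With three bundled examples, a member is expected to match the prototype on roughly 75% of bits while an unrelated vector matches on about 50%, and classifying by nearest prototype exploits that gap. Every step is a bitwise operation or a count, which is what makes the approach attractive for GPU and FPGA acceleration.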
PMID:39578580 | DOI:10.1038/s41598-024-80009-w
The development of an attention mechanism enhanced deep learning model and its application for body composition assessment with L3 CT images
Sci Rep. 2024 Nov 22;14(1):28953. doi: 10.1038/s41598-024-79915-w.
ABSTRACT
Body composition assessment is very useful for evaluating a patient's status in the clinic, but recognizing, labeling, and calculating the body compositions would be burdensome. This study aims to develop a web-based service that could automate calculating the areas of skeletal muscle (SM), visceral adipose tissue (VAT), and subcutaneous adipose tissue (SAT) according to L3 computed tomography (CT) images. 1500 L3 CT images were gathered from Xuzhou Central Hospital. Of these, 70% were used as the training dataset, while the remaining 30% were used as the validating dataset. The UNet framework was combined with attention gate (AG), Squeeze and Excitation block (SEblock), and Atrous Spatial Pyramid Pooling (ASPP) modules to construct the segmentation deep learning model. The model's efficacy was externally validated using two other test datasets with multiple metrics, the consistency test and manual result checking. A graphical user interface was also created and deployed using the Streamlit Python package. The custom deep learning model named L3 Body Composition Segmentation Model (L3BCSM) was constructed. The model's median Dice is 0.954 (0.930, 0.963) (SATA), 0.849 (0.774, 0.901) (VATA), and 0.920 (0.901, 0.936) (SMA), which is equal to or better than classic models, including UNETR and AHNet. L3BCSM also achieved satisfactory metrics in two external test datasets, consistent with the qualified label. An internet-based application was developed using L3BCSM, which has four functional modules: population analysis, time series analysis, consistency analysis, and manual result checking. The body composition assessment application was well developed, which would benefit the clinical practice and related research.
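The Dice values quoted above measure overlap between a predicted mask and a reference mask. A minimal sketch of the standard definition, assuming binary masks stored as nested lists of 0 and 1 and treating two empty masks as a perfect match (that convention and the function name are assumptions):

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    intersection = 0
    pred_area = 0
    truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_area += p
            truth_area += t
            intersection += p * t
    denom = pred_area + truth_area
    return 2 * intersection / denom if denom else 1.0


# Toy 2x2 masks: the prediction covers two pixels, the reference covers one.
score = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Because the tissue areas are computed from the same masks, a Dice close to 1 also implies that the automatically measured area is close to the manually labeled one.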
PMID:39578556 | DOI:10.1038/s41598-024-79915-w
A multi-perspective deep learning framework for enhancer characterization and identification
Comput Biol Chem. 2024 Nov 19;114:108284. doi: 10.1016/j.compbiolchem.2024.108284. Online ahead of print.
ABSTRACT
Enhancers are vital elements in the genome that boost the transcriptional activity of neighboring genes and are essential in regulating cell-specific gene expression. Therefore, accurately identifying and characterizing enhancers is essential for comprehending gene regulatory networks and the development of related diseases. This study introduces MPDL-Enhancer, a novel multi-perspective deep learning framework aimed at enhancer characterization and identification. In this study, enhancer sequences are encoded using the dna2vec model along with features derived from the structural properties of DNA sequences. Subsequently, these representations are processed through a novel dual-scale deep neural network designed to discern subtle correlations and extended interactions embedded within the semantic content of DNA. The predictive phase of our methodology employs a Support Vector Machine classifier to render the final classification. To rigorously assess the efficacy of our approach, a comprehensive evaluation was executed utilizing an independent test dataset, thereby substantiating the robustness and accuracy of our model. Our methodology demonstrated superior performance over existing computational techniques, with an accuracy (ACC) of 81.00%, a sensitivity (SN) of 79.00%, and specificity (SP) of 83.00%. The innovative dual-scale deep neural network and the unique feature representation strategy contributed to this performance improvement. MPDL-Enhancer has effectively characterized enhancer sequences and achieved excellent predictive performance. Building upon this foundation, we conducted an interpretability analysis of the model, which can assist researchers in identifying key features and patterns that affect the functionality of enhancers, thereby promoting a deeper understanding of gene regulatory networks.
PMID:39577030 | DOI:10.1016/j.compbiolchem.2024.108284
Improved Prediction of Ligand-Protein Binding Affinities by Meta-modeling
J Chem Inf Model. 2024 Nov 22. doi: 10.1021/acs.jcim.4c01116. Online ahead of print.
ABSTRACT
The accurate screening of candidate drug ligands against target proteins through computational approaches is of prime interest to drug development efforts. Such virtual screening depends in part on methods to predict the binding affinity between ligands and proteins. Many computational models for binding affinity prediction have been developed, but with varying results across targets. Given that ensembling or meta-modeling approaches have shown great promise in reducing model-specific biases, we develop a framework to integrate published force-field-based empirical docking and sequence-based deep learning models. In building this framework, we evaluate many combinations of individual base models, training databases, and several meta-modeling approaches. We show that many of our meta-models significantly improve affinity predictions over base models. Our best meta-models achieve comparable performance to state-of-the-art deep learning tools exclusively based on 3D structures while allowing for improved database scalability and flexibility through the explicit inclusion of features such as physicochemical properties or molecular descriptors. We further demonstrate improved generalization capability by our models using a large-scale benchmark of affinity prediction as well as a virtual screening application benchmark. Overall, we demonstrate that diverse modeling approaches can be ensembled together to gain meaningful improvement in binding affinity prediction.
PMID:39576762 | DOI:10.1021/acs.jcim.4c01116
Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction
Bioinformatics. 2024 Nov 22:btae621. doi: 10.1093/bioinformatics/btae621. Online ahead of print.
ABSTRACT
MOTIVATION: Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model.
RESULTS: To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect, MAVE, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48±0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting all mutational landscapes of all proteins in proteomes such as Homo sapiens or Drosophila melanogaster in under 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
AVAILABILITY: VespaG is available freely at https://github.com/jschlensok/vespag. The associated training data and predictions are available at https://doi.org/10.5281/zenodo.11085958.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PMID:39576695 | DOI:10.1093/bioinformatics/btae621
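The Spearman correlation reported above measures how well predicted scores preserve the rank order of measured effects. A minimal sketch of the computation follows, assuming the standard definition (Pearson correlation of average ranks, with ties sharing their mean rank). The function names and toy values are illustrative and are not taken from VespaG or ProteinGym.

```python
# Minimal sketch of Spearman rank correlation (Pearson correlation of ranks).
# Function names and data are illustrative assumptions.

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# A monotone but nonlinear relationship still has perfect rank agreement.
measured = [0.1, 0.4, 0.5, 0.9]
predicted = [1.0, 16.0, 25.0, 81.0]
rho = spearman(measured, predicted)
```

Because only ranks enter the calculation, the metric rewards getting the ordering of variants right rather than matching the scale of the assay readout.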
Computed tomography-based radiomics and body composition model for predicting hepatic decompensation
Oncotarget. 2024 Nov 22;15:809-813. doi: 10.18632/oncotarget.28673.
ABSTRACT
Primary sclerosing cholangitis (PSC) is a chronic liver disease characterized by inflammation and scarring of the bile ducts, which can lead to cirrhosis and hepatic decompensation. The study aimed to explore the potential value of computational radiomics, a field that extracts quantitative features from medical images, in predicting whether or not PSC patients had hepatic decompensation. We used an in-house developed deep learning model called the body composition model, which quantifies body composition from computed tomography (CT) into four compartments: subcutaneous adipose tissue (SAT), skeletal muscle (SKM), visceral adipose tissue (VAT), and intermuscular adipose tissue (IMAT). We extracted radiomics features from all four body composition compartments and used them to build a predictive model in the training cohort. The predictive model demonstrated good performance in validation cohorts for predicting hepatic decompensation, with an accuracy score of 0.97, a precision score of 1.0, and an area under the curve (AUC) score of 0.97. Computational radiomics using CT images shows promise in predicting hepatic decompensation in primary sclerosing cholangitis patients. Our model achieved high accuracy, but predicting future events remains challenging. Further research is needed to validate clinical utility and limitations.
PMID:39576671 | DOI:10.18632/oncotarget.28673
MCNN-AAPT: accurate classification and functional prediction of amino acid and peptide transporters in secondary active transporters using protein language models and multi-window deep learning
J Biomol Struct Dyn. 2024 Nov 22:1-10. doi: 10.1080/07391102.2024.2431664. Online ahead of print.
ABSTRACT
Secondary active transporters play a crucial role in cellular physiology by facilitating the movement of molecules across cell membranes. Identifying the functional classes of these transporters, particularly amino acid and peptide transporters, is essential for understanding their involvement in various physiological processes and disease pathways, including cancer. This study aims to develop a robust computational framework that integrates pre-trained protein language models and deep learning techniques to classify amino acid and peptide transporters within the secondary active transporter (SAT) family and predict their functional association with solute carrier (SLC) proteins. The study leverages a comprehensive dataset of 448 secondary active transporters, including 36 solute carrier proteins, obtained from UniProt and the Transporter Classification Database (TCDB). Three state-of-the-art protein language models, ProtTrans, ESM-1b, and ESM-2, are evaluated within a deep learning neural network architecture that employs a multi-window scanning technique to capture local and global sequence patterns. The ProtTrans-based feature set demonstrates exceptional performance, achieving a classification accuracy of 98.21% with 87.32% sensitivity and 99.76% specificity for distinguishing amino acid and peptide transporters from other SATs. Furthermore, the model maintains strong predictive ability for SLC proteins, with an overall accuracy of 88.89% and a Matthews Correlation Coefficient (MCC) of 0.7750. This study showcases the power of integrating pre-trained protein language models and deep learning techniques for the functional classification of secondary active transporters and the prediction of associated solute carrier proteins. The findings have significant implications for drug development, disease research, and the broader understanding of cellular transport mechanisms.
PMID:39576667 | DOI:10.1080/07391102.2024.2431664
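The abstract above reports accuracy, sensitivity, specificity, and the Matthews Correlation Coefficient (MCC). All four follow from the confusion matrix, as this minimal sketch shows. The counts are hypothetical and chosen only to show how, on an imbalanced dataset, accuracy can sit well above sensitivity; they are not the study's data.

```python
# Minimal sketch of binary classification metrics from confusion-matrix counts.
# The counts used below are hypothetical, not taken from the study.
import math

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    """True positive rate (recall)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate."""
    return tn / (tn + fp)

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced test set: 10 positives, 90 negatives.
tp, fn, tn, fp = 8, 2, 85, 5
acc = accuracy(tp, fp, tn, fn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
score = mcc(tp, fp, tn, fn)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the classes are unevenly sized.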
Predicting mortality in hospitalized influenza patients: integration of deep learning-based chest X-ray severity score (FluDeep-XR) and clinical variables
J Am Med Inform Assoc. 2024 Nov 22:ocae286. doi: 10.1093/jamia/ocae286. Online ahead of print.
ABSTRACT
OBJECTIVES: To develop the first artificial intelligence system integrating radiological and objective clinical data, simulating the clinical reasoning process, for the early prediction of high-risk influenza patients.
MATERIALS AND METHODS: Our system was developed using a cohort from National Taiwan University Hospital in Taiwan, with external validation data from ASST Grande Ospedale Metropolitano Niguarda in Italy. Convolutional neural networks pretrained on ImageNet were regressively trained using a 5-point scale to develop the influenza chest X-ray (CXR) severity scoring model, FluDeep-XR. Early, late, and joint fusion structures, incorporating varying weights of CXR severity with clinical data, were designed to predict 30-day mortality and compared with models using only CXR or clinical data. The best-performing model was designated as FluDeep. The explainability of FluDeep-XR and FluDeep was illustrated through activation maps and SHapley Additive exPlanations (SHAP).
RESULTS: The Xception-based model, FluDeep-XR, achieved a mean square error of 0.738 in the external validation dataset. The Random Forest-based late fusion model, FluDeep, outperformed all the other models, achieving an area under the receiver operating characteristic curve of 0.818 and a sensitivity of 0.706 in the external dataset. Activation maps highlighted clear lung fields. SHAP identified age, C-reactive protein, hematocrit, heart rate, and respiratory rate as the top 5 important clinical features.
DISCUSSION: The integration of medical imaging with objective clinical data outperformed single-modality models in predicting 30-day mortality in influenza patients. We ensured the explainability of our models aligned with clinical knowledge and validated their applicability across foreign institutions.
CONCLUSION: FluDeep highlights the potential of combining radiological and clinical information in a late fusion design, enhancing diagnostic accuracy and offering an explainable and generalizable decision support system.
PMID:39576664 | DOI:10.1093/jamia/ocae286
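Late fusion, as described above, combines the outputs of separately trained models rather than their raw inputs. The sketch below illustrates the general idea only: it substitutes a fixed weighted average for the Random Forest-based fusion the study used, and the probabilities, labels, and the `w_image` parameter are invented for illustration. The area under the ROC curve is computed as the fraction of positive/negative pairs ranked correctly.

```python
# Minimal sketch of late fusion plus ROC AUC.
# Assumptions: a weighted average replaces the study's Random Forest fusion;
# all probabilities and labels are toy values.

def late_fusion(p_image, p_clinical, w_image=0.5):
    """Blend per-patient risk estimates from an imaging and a clinical model."""
    return [w_image * a + (1.0 - w_image) * b for a, b in zip(p_image, p_clinical)]

def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]                 # 1 = died within 30 days (toy labels)
p_image = [0.8, 0.4, 0.5, 0.2, 0.1]      # hypothetical imaging-model risks
p_clinical = [0.3, 0.7, 0.2, 0.4, 0.1]   # hypothetical clinical-model risks

fused = late_fusion(p_image, p_clinical)
auc_image = roc_auc(p_image, labels)
auc_clinical = roc_auc(p_clinical, labels)
auc_fused = roc_auc(fused, labels)
```

In this constructed example each single-modality model misranks one patient pair and the fused score misranks none. Real data need not behave this way, which is why the study compared early, late, and joint fusion against single-modality baselines.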