Deep learning
Recovering time-varying networks from single-cell data
Bioinformatics. 2025 Jul 1;41(Supplement_1):i628-i636. doi: 10.1093/bioinformatics/btaf210.
ABSTRACT
MOTIVATION: Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of these data for reconstructing such networks.
RESULTS: Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments.
AVAILABILITY AND IMPLEMENTATION: The code used to train Marlene is available at https://github.com/euxhenh/Marlene.
PMID:40662830 | DOI:10.1093/bioinformatics/btaf210
Harnessing deep learning for proteome-scale detection of amyloid signaling motifs
Bioinformatics. 2025 Jul 1;41(Supplement_1):i420-i428. doi: 10.1093/bioinformatics/btaf200.
ABSTRACT
MOTIVATION: Amyloid signaling sequences adopt the cross-β fold that is capable of self-replication in the templating process. Propagation of the amyloid fold from the receptor to the effector protein is used for signal transduction in the immune response pathways in animals, fungi, and bacteria. So far, a dozen families of amyloid signaling motifs (ASMs) have been classified. Unfortunately, due to the wide variety of ASMs, it is difficult to identify them in the large protein databases available, which limits the possibility of conducting experimental studies. To date, various deep learning (DL) models have been applied across a range of protein-related tasks, including domain family classification and the prediction of protein structure and protein-protein interactions.
RESULTS: In this study, we develop tailor-made bidirectional LSTM and BERT-based architectures to model ASMs, and compare their performance against a state-of-the-art machine learning grammatical model. Our research is focused on developing a discriminative model of generalized ASMs, capable of detecting ASMs in large datasets. The DL-based models are trained on a diverse set of motif families and a global negative set, and used to identify ASMs from remotely related families. We analyze how both models represent the data and demonstrate that the DL-based approaches effectively detect ASMs, including novel motifs, even at the genome scale.
AVAILABILITY AND IMPLEMENTATION: The models are provided as a Python package, asmscan-bilstm, and a Docker image at https://github.com/chrispysz/asmscan-proteinbert-run. The source code can be accessed at https://github.com/jakub-galazka/asmscan-bilstm and https://github.com/chrispysz/asmscan-proteinbert. Data and results are at https://github.com/wdyrka-pwr/ASMscan.
PMID:40662825 | DOI:10.1093/bioinformatics/btaf200
DivPro: diverse protein sequence design with direct structure recovery guidance
Bioinformatics. 2025 Jul 1;41(Supplement_1):i382-i390. doi: 10.1093/bioinformatics/btaf258.
ABSTRACT
MOTIVATION: Structure-based protein design, which aims to generate sequences that fold into desired structures, is crucial for designing proteins with novel structures and functions. Current deep learning-based methods primarily focus on training and evaluating models using sequence recovery-based metrics. However, this approach overlooks the inherent ambiguity in the relationship between protein sequences and structures. Relying solely on sequence recovery as a training objective limits the models' ability to produce diverse sequences that maintain similar structures. These limitations become more pronounced when dealing with remote homologous proteins, which share functional and structural similarities despite low sequence identity.
RESULTS: Here, we present DivPro, a model that learns to design diverse sequences that can fold into similar structures. To improve sequence diversity, instead of learning a single fixed sequence representation for an input structure as in existing methods, DivPro learns a probabilistic sequence space from which diverse sequences could be sampled. We leverage the recent advancements in in silico protein structure prediction. By incorporating structure prediction results as training guidance, DivPro ensures that sequences sampled from this learned space reliably fold into the target structure. We conducted extensive experiments on three sequence design benchmarks and evaluated the structures of designed sequences using structure prediction models including AlphaFold2. Results show that DivPro can maintain high structure recovery while significantly improving the sequence diversity.
AVAILABILITY AND IMPLEMENTATION: The source code and datasets are available at https://github.com/veghen/DivPro.
PMID:40662823 | DOI:10.1093/bioinformatics/btaf258
Accurate PROTAC-targeted degradation prediction with DegradeMaster
Bioinformatics. 2025 Jul 1;41(Supplement_1):i342-i351. doi: 10.1093/bioinformatics/btaf191.
ABSTRACT
MOTIVATION: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade "undruggable" proteins of interest by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs by disregarding crucial 3D spatial information or exclusively rely on limited labels for supervised learning without considering the abundant information from unlabeled data. Nevertheless, considering the potential to accelerate drug discovery, it is critical to develop more accurate computational methods for PROTAC-targeted protein degradation prediction.
RESULTS: This study proposes DegradeMaster, a semisupervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilizes a memory-based pseudolabeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semisupervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, with a substantial improvement in AUROC of 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecular graph demonstrates that DegradeMaster recognizes linking and binding regions of warhead and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation.
AVAILABILITY AND IMPLEMENTATION: The source code and datasets are available at https://github.com/ABILiLab/DegradeMaster and https://zenodo.org/records/14715718.
PMID:40662822 | DOI:10.1093/bioinformatics/btaf191
ADME-drug-likeness: enriching molecular foundation models via pharmacokinetics-guided multi-task learning for drug-likeness prediction
Bioinformatics. 2025 Jul 1;41(Supplement_1):i352-i361. doi: 10.1093/bioinformatics/btaf259.
ABSTRACT
SUMMARY: Recent breakthroughs in AI-driven generative models enable the rapid design of extensive molecular libraries, creating an urgent need for fast and accurate drug-likeness evaluation. Traditional approaches, however, rely heavily on structural descriptors and overlook pharmacokinetic (PK) factors such as absorption, distribution, metabolism, and excretion (ADME). Furthermore, existing deep-learning models neglect the complex interdependencies among ADME tasks, which play a pivotal role in determining clinical viability. We introduce ADME-DL (drug likeness), a novel two-step pipeline that first enhances a diverse range of molecular foundation models (MFMs) via sequential ADME multi-task learning. By enforcing an A→D→M→E flow, grounded in a data-driven task dependency analysis that aligns with established PK principles, our method more accurately encodes PK information into the learned embedding space. In Step 2, the resulting ADME-informed embeddings are leveraged for drug-likeness classification, distinguishing approved drugs from negative sets drawn from chemical libraries. Through comprehensive experiments, our sequential ADME multi-task learning achieves up to +2.4% improvement over state-of-the-art baselines and enhances performance across tested MFMs by up to +18.2%. Case studies with clinically annotated drugs validate that respecting the PK hierarchy produces more relevant predictions, reflecting drug discovery phases. These findings underscore the potential of ADME-DL to significantly enhance the early-stage filtering of candidate molecules, bridging the gap between purely structural screening methods and PK-aware modeling.
AVAILABILITY AND IMPLEMENTATION: The source code for ADME-DL is available at https://github.com/eugenebang/ADME-DL.
PMID:40662819 | DOI:10.1093/bioinformatics/btaf259
Soffritto: a deep learning model for predicting high-resolution replication timing
Bioinformatics. 2025 Jul 1;41(Supplement_1):i580-i589. doi: 10.1093/bioinformatics/btaf231.
ABSTRACT
MOTIVATION: Replication timing (RT) refers to the order in which DNA loci are replicated during S phase. RT is cell-type specific and implicated in cellular processes including transcription, differentiation, and disease. RT is typically quantified genome-wide using two-fraction assays (e.g. Repli-Seq), which sort cells into early and late S phase fractions followed by DNA sequencing, yielding a ratio as the RT signal. While two-fraction RT data are widely available in multiple cell lines, they are limited in their ability to capture high-resolution RT features. To address this, high-resolution Repli-Seq, which quantifies RT across 16 fractions, was developed, but it is costly and technically challenging, with very limited data generated to date.
RESULTS: Here, we developed Soffritto, a deep learning model that predicts high-resolution RT data using two-fraction RT data, histone ChIP-seq data, GC content, and gene density as input. Soffritto is composed of a Long Short-Term Memory (LSTM) module and a prediction module. The LSTM module learns long- and short-range interactions between genomic bins, while the prediction module is composed of a fully connected layer that outputs a 16-fraction probability vector for each bin using the LSTM module's embeddings as input. By performing both within cell line and cross-cell line training and testing for five human and mouse cell lines, we show that Soffritto is able to capture experimental 16-fraction RT signals with high accuracy, and the predicted signals allow detection of high-resolution RT patterns.
AVAILABILITY AND IMPLEMENTATION: Soffritto is available at https://github.com/ay-lab/Soffritto.
PMID:40662815 | DOI:10.1093/bioinformatics/btaf231
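The prediction module described in the Soffritto abstract above (a fully connected layer that turns each bin's LSTM embedding into a 16-fraction probability vector) can be illustrated with a minimal standard-library sketch. The softmax normalization, the 8-dimensional embedding, and the random weights are assumptions made for illustration; the abstract does not state how the probabilities are normalized, and this is not the authors' implementation.

```python
import math
import random

NUM_FRACTIONS = 16  # high-resolution Repli-Seq quantifies RT across 16 fractions


def softmax(logits):
    """Convert raw scores into a probability vector that sums to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_fractions(embedding, weights, bias):
    """Fully connected layer followed by softmax (the softmax is an assumption)."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical 8-dimensional embedding for one genomic bin, with random weights.
rng = random.Random(0)
embedding = [rng.uniform(-1, 1) for _ in range(8)]
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(NUM_FRACTIONS)]
bias = [0.0] * NUM_FRACTIONS

probs = predict_fractions(embedding, weights, bias)
```

In a trained model the embedding would come from the LSTM module and the weights would be learned; here they only demonstrate the shape of the output.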
Generation of synthetic tomographic images from biplanar X-ray: a narrative review of history, methods, and the state of the art
J Neurosurg Sci. 2025 Aug;69(4):350-361. doi: 10.23736/S0390-5616.25.06506-3.
ABSTRACT
This narrative review presents deep learning-based strategies for generating synthetic 3D CT-like images from biplanar or multiplanar 2D X-ray data. Current limitations of conventional CT imaging are discussed, emphasizing the potential of synthetic CT reconstruction as an alternative technique in certain scenarios. Previous non-deep-learning approaches for 3D reconstruction from 2D X-rays are presented, indicating their weaknesses and thus pointing out the potential benefits of deep learning techniques. Convolutional neural networks (CNNs), generative adversarial networks (GANs), and conditional diffusion processing (CDP) are introduced, as they have demonstrated great potential for synthetic CT generation in multiple studies over the last few years. The review further presents the potential clinical applications, existing challenges, and latest research advancements of deep learning strategies for 3D reconstruction from 2D X-rays.
PMID:40662246 | DOI:10.23736/S0390-5616.25.06506-3
Enhanced Leaf Disease Segmentation Using U-Net Architecture for Precision Agriculture: A Deep Learning Approach
Food Sci Nutr. 2025 Jul 14;13(7):e70594. doi: 10.1002/fsn3.70594. eCollection 2025 Jul.
ABSTRACT
This study presents a deep learning-based image segmentation approach for leaf disease identification using the U-Net architecture. Convolutional neural networks (CNNs), particularly U-Net, are effective for precise segmentation tasks and were trained and validated on a high-quality "Leaf Disease Segmentation" dataset. Each image contains annotated regions of unhealthy leaf tissue, enabling the model to distinguish between healthy and infected areas. Image preprocessing and augmentation further enhanced model performance and robustness. The U-Net model, composed of an encoder for context extraction and a decoder for precise segmentation, was trained to accurately identify diseased regions at the pixel level. Regularization techniques such as dropout, batch normalization, and ReLU activation were used to prevent overfitting and improve learning. Furthermore, the Adam optimizer was employed with a learning rate of 0.001. The model demonstrated strong generalization by accurately segmenting disease regions in unseen validation images. It effectively captured complex patterns in both healthy and diseased leaf sections, outperforming traditional image processing techniques. Trained on 7056 images for 40 epochs, the model achieved 99.70% training accuracy, 0.062 training loss, and 98.99% validation accuracy. These results highlight the model's high accuracy, efficient learning, and robustness, making it suitable for real-world applications in precision agriculture.
PMID:40661811 | PMC:PMC12257497 | DOI:10.1002/fsn3.70594
Assessment of prostate cancer aggressiveness through the combined analysis of prostate MRI and 2.5D deep learning models
Front Oncol. 2025 Jun 30;15:1539537. doi: 10.3389/fonc.2025.1539537. eCollection 2025.
ABSTRACT
OBJECTIVE: Prostate cancer is prevalent among older men. Although this malignancy has a relatively low mortality rate, its aggressiveness is critical in determining patient prognosis and treatment options. This study therefore aimed to evaluate the effectiveness of a 2.5D deep learning model based on prostate MRI to assess prostate cancer aggressiveness.
MATERIALS AND METHODS: This study included 335 patients with pathologically confirmed prostate cancer from a tertiary medical center between January 2022 and December 2023. Of these, 266 cases were classified as aggressive and 69 as non-aggressive, using a Gleason score ≥7 as the cutoff. The subjects were automatically divided into a test set and validation set in a 7:3 ratio. Before pathological biopsy, all patients underwent biparametric MRI, including T2-weighted imaging, diffusion-weighted imaging, and apparent diffusion coefficient scans. Two radiologists, blinded to pathology results, segmented the lesions using ITK-SNAP software, extracting the minimal bounding rectangle of the largest ROI layer, along with the corresponding ROIs from adjacent layers above and below it. Subsequently, radiomic features were extracted using the pyradiomics tool, while deep learning features from each cross-section were derived using the Inception_v3 neural network. To ensure consistency in feature extraction, intraclass correlation coefficient (ICC) analysis was performed on features extracted by radiologists, followed by feature normalization using the mean and standard deviation of the training set. Highly correlated features were removed using t-tests and Pearson correlation tests, and redundant features were ultimately screened with least absolute shrinkage and selection operator (Lasso). Models were constructed using the LightGBM algorithm: a radiomic feature model, a deep learning feature model, and a combined model integrating radiomic and deep learning features. Further, a clinical feature model (Clinic-LightGBM) was constructed using LightGBM to include clinical information. The optimal feature model was then combined with Clinic-LightGBM to establish a nomogram. The Grad-CAM technique was employed to explain the deep learning feature extraction process, supported by tree model visualization techniques to illustrate the decision-making process of the LightGBM model. Model classification performance in the test set was evaluated using the area under the receiver operating characteristic curve (AUC).
RESULTS: In the test set, the nomogram demonstrated the highest predictive ability for prostate cancer aggressiveness (AUC = 0.919, 95% CI: 0.8107-1.0000), with a sensitivity of 0.966 and specificity of 0.833. The DLR-LightGBM model (AUC = 0.872) outperformed the DL-LightGBM (AUC = 0.818) and Rad-LightGBM (AUC = 0.758) models, indicating the benefit of combining deep learning and radiomic features.
CONCLUSION: Our 2.5D deep learning model based on prostate MRI showed efficacy in identifying clinically significant prostate cancer, providing valuable references for clinical treatment and enhancing patient net benefit.
PMID:40661774 | PMC:PMC12256241 | DOI:10.3389/fonc.2025.1539537
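The abstract above states that features were normalized using the mean and standard deviation of the training set. A minimal sketch of that idea follows: the statistics are fitted on training values only and then reused unchanged for held-out values, which avoids leaking information from the evaluation data. The function names and toy feature values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate mean and standard deviation from the training set only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)  # population SD; an assumption for this sketch
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant features


def transform(values, mu, sigma):
    """Apply z-score normalization with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Hypothetical values of a single radiomic feature.
train_feature = [2.0, 4.0, 6.0, 8.0]
test_feature = [5.0, 11.0]

mu, sigma = fit_scaler(train_feature)
train_scaled = transform(train_feature, mu, sigma)
test_scaled = transform(test_feature, mu, sigma)  # reuses training statistics
```

After scaling, the training values have zero mean and unit variance, while the held-out values are expressed relative to the training distribution.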
Sustainable deep vision systems for date fruit quality assessment using attention-enhanced deep learning models
Front Plant Sci. 2025 Jun 30;16:1521508. doi: 10.3389/fpls.2025.1521508. eCollection 2025.
ABSTRACT
INTRODUCTION: Accurate and automated fruit classification plays a vital role in modern agriculture but remains challenging due to the wide variability in fruit appearances.
METHODS: In this study, we propose a novel approach to image classification by integrating a DenseNet121 model pre-trained on ImageNet with a Squeeze-and-Excitation (SE) Attention block to enhance feature representation. The model leverages data augmentation to improve generalization and avoid overfitting. The enhancement includes attention mechanisms and Nadam optimization, specifically tailored for the classification of date fruit images. Unlike traditional DenseNet variants, the proposed model incorporates SE attention layers to focus on critical image features, significantly improving performance. Multiple deep learning models, including DenseNet121+SE and YOLOv8n, were evaluated for date fruit classification under varying conditions.
RESULTS: The proposed approach demonstrated outstanding performance, achieving 98.25% accuracy, 98.02% precision, 97.02% recall, and a 97.49% F1-score with DenseNet121+SE. In comparison, YOLOv8n achieved 96.04% accuracy, 99.76% precision, 99.7% recall, and a 99.73% F1-score.
DISCUSSION: These results underscore the effectiveness of the proposed method compared to widely used architectures, providing a robust and practical solution for automating fruit classification and quality control in the food industry.
PMID:40661755 | PMC:PMC12256447 | DOI:10.3389/fpls.2025.1521508
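For readers who want to relate the precision, recall, and F1 figures reported above, the F1-score is the harmonic mean of precision and recall. The sketch below applies that standard formula to the rounded aggregate values from the abstract. For YOLOv8n it reproduces the reported 99.73%; for DenseNet121+SE it gives roughly 97.52% rather than the reported 97.49%, a small gap that would be consistent with per-class averaging, although the abstract does not say how the score was aggregated.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded aggregate values taken from the abstract.
yolo_f1 = f1_score(0.9976, 0.9970)
densenet_f1 = f1_score(0.9802, 0.9702)
```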
Neurofusionnet: a comprehensive framework for accurate epileptic seizure prediction from EEG data with hybrid meta-heuristic optimization algorithm
Cogn Neurodyn. 2025 Dec;19(1):113. doi: 10.1007/s11571-025-10293-3. Epub 2025 Jul 12.
ABSTRACT
This work uses cutting-edge electroencephalogram (EEG) data processing techniques to present a complete paradigm for epileptic seizure prediction. The methodology is a multi-step procedure that includes pre-processing, feature extraction, feature selection, and a new detection model based on deep learning for enhanced durability and accuracy. Bandpass filtering is used to reduce noise during the pre-processing phase, which improves the signal-to-noise ratio. EEG data quality is further improved using Independent Component Analysis, which finds and removes artifacts. Splitting continuous EEG data into fixed-duration segments, known as epoching, facilitates the investigation of discrete temporal patterns. Standard amplitude values are guaranteed by Z-score normalization, and seizure-related patterns are more sensitively detected when channels are selected using Common Spatial Patterns. The first step of feature extraction involves statistical and time-domain features. For spectral information, which is essential to recognizing seizures, frequency-domain features such as power spectral density are extracted using the Fourier transform. A full representation is obtained by extracting time-frequency domain features with the wavelet transform. Predictive power is increased by the efficient selection of discriminative features through a hybrid optimization model, the Hybrid Chimp Enhanced Fox Optimization algorithm, which combines optimization methods inspired by FOX and Chimp. The suggested NeuroFusionNet-based detection model combines Improved ShuffleNet V2, SqueezeNet, EfficientNet V2, and Multi Head Attention (MHA) based GhostNet V2, which captures complex patterns linked to epileptic episodes.
PMID:40661693 | PMC:PMC12255646 | DOI:10.1007/s11571-025-10293-3
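Two of the pre-processing steps named in the abstract above, epoching and Z-score normalization, can be sketched with the standard library alone. The sampling rate, epoch length, toy signal, and the choice to normalize each epoch separately are assumptions made for illustration; the abstract does not give these details, and a real pipeline would more likely rely on a dedicated EEG toolkit.

```python
from statistics import mean, pstdev


def epoch_signal(samples, sampling_rate_hz, epoch_seconds):
    """Split a continuous signal into non-overlapping fixed-duration segments.

    A trailing partial segment is dropped.
    """
    epoch_len = int(sampling_rate_hz * epoch_seconds)
    if epoch_len <= 0:
        raise ValueError("epoch length must be at least one sample")
    n_epochs = len(samples) // epoch_len
    return [samples[i * epoch_len:(i + 1) * epoch_len] for i in range(n_epochs)]


def zscore(segment):
    """Rescale a segment to zero mean and unit standard deviation."""
    mu = mean(segment)
    sigma = pstdev(segment)
    if sigma == 0:
        return [0.0] * len(segment)  # constant segment: nothing to scale
    return [(x - mu) / sigma for x in segment]


# Toy single-channel signal: 1050 samples at a hypothetical 256 Hz.
signal = [float(i % 7) for i in range(1050)]
epochs = epoch_signal(signal, sampling_rate_hz=256, epoch_seconds=2)
normalized = [zscore(e) for e in epochs]
```

With 2-second epochs at 256 Hz, each segment holds 512 samples, so the 1050-sample toy signal yields two full epochs and the remaining 26 samples are discarded.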
Integrating structural homology with deep learning to achieve highly accurate protein-protein interface prediction for the human interactome
bioRxiv [Preprint]. 2025 Jun 12:2025.06.09.658393. doi: 10.1101/2025.06.09.658393.
ABSTRACT
A significant portion of disease-causing mutations occur at protein-protein interfaces; however, the number of structurally resolved multi-protein complexes is extremely small. Here we present a computational pipeline, PIONEER2.0, that integrates 3D structural similarity with geometric deep learning to accurately predict protein binding partner-specific interfacial residues for all experimentally observed human binary protein-protein interactions. We estimate that AlphaFold3 fails to produce high-quality structural models for about half of the human interactome; for these challenging cases, PIONEER2.0 significantly outperforms AlphaFold3 in predicting their interface residues, making PIONEER2.0 an excellent alternative and complementary tool in real-world applications. We further systematically validated PIONEER2.0 predictions experimentally by generating 1,866 mutations and testing their impact on 5,010 mutation-interaction pairs, confirming that PIONEER-predicted interfaces are comparable in accuracy to experimentally determined interfaces using PDB co-complex structures. We then used PIONEER2.0 to create a comprehensive multiscale structurally informed human interactome encompassing all 352,124 experimentally determined binary human protein interactions in the literature. We find that PIONEER2.0-predicted interfaces are instrumental in prioritizing disease-associated mutations and thus provide insight into their underlying molecular mechanisms. Overall, our PIONEER2.0 framework offers researchers a valuable tool at an unprecedented scale for studying disease etiology and advancing personalized medicine.
PMID:40661495 | PMC:PMC12259034 | DOI:10.1101/2025.06.09.658393
Spatial multi-omics and deep learning reveal fingerprints of immunotherapy response and resistance in hepatocellular carcinoma
bioRxiv [Preprint]. 2025 Jun 12:2025.06.11.656869. doi: 10.1101/2025.06.11.656869.
ABSTRACT
Despite advances in immunotherapy treatment, nonresponse rates remain high, and mechanisms of resistance to checkpoint inhibition remain unclear. To address this gap, we performed spatial transcriptomic and proteomic profiling on human hepatocellular carcinoma tissues collected before and after immunotherapy. We developed an interpretable, multimodal deep learning framework to extract key cellular and molecular signatures from these data. Our graph neural network approach based on spatial proteomic inputs achieved outstanding performance (ROC-AUC > 0.9) in predicting patient treatment response. Key predictive features and associated spatial transcriptomic profiles revealed the multi-omic landscape of immunotherapy response and resistance. One such feature was an interface niche expressing restrictive extracellular matrix factors that physically separates tumor tissue and lymphoid aggregates in nonresponders. We integrate this and other spatially-resolved signatures into SPARC, a multi-omic "fingerprint" comprising scores for immunotherapy response and resistance mechanisms. This study lays groundwork for future patient stratification and treatment strategies in cancer immunotherapy.
PMID:40661489 | PMC:PMC12259099 | DOI:10.1101/2025.06.11.656869
VNC-Dist: A machine learning-based semi-automated pipeline for quantification of neuronal positioning in the C. elegans ventral nerve cord
bioRxiv [Preprint]. 2025 Jun 11:2024.11.16.623955. doi: 10.1101/2024.11.16.623955.
ABSTRACT
The C. elegans ventral nerve cord (VNC) provides a simple model for investigating the developmental mechanisms involved in neuronal positioning and organization. The VNC of newly hatched larvae contains a set of 22 motoneurons organized into three distinct classes (DD, DA, and DB) that show consistent positioning and arrangement. This organization arises from the action of multiple convergent genetic pathways, which are poorly understood. To better understand these pathways, accurate and efficient methods for quantifying motoneuron cell body positions within large microscopy datasets are required. Here, we present VNC-Dist (Ventral Nerve Cord Distances), a software toolkit that replaces manual measurements with a faster and more accurate computer-assisted approach, combining machine learning and other tools, to quantify neuron cell body positions in the VNC. The VNC-Dist pipeline integrates several components: manual neuron cell body localization using Fiji's multipoint tool, deep learning-based worm segmentation with modified Segment Anything Model (SAM), accurate spline-based measurements of neuronal distances along the VNC, and built-in tools for statistical analysis and graphing. To demonstrate the robustness and versatility of VNC-Dist, we applied it to several genetic mutants known to disrupt neuronal positioning in the VNC. This toolbox will enable batch acquisition and analysis of large datasets across genotypes, thereby advancing investigations into the cellular and molecular mechanisms that govern VNC neuronal positioning and arrangement.
PMID:40661438 | PMC:PMC12258885 | DOI:10.1101/2024.11.16.623955
DeepFace: A High-Precision and Scalable Deep Learning Pipeline for Predicting Large-Scale Brain Activity from Facial Dynamics in Mice
bioRxiv [Preprint]. 2025 Jun 15:2025.06.10.658952. doi: 10.1101/2025.06.10.658952.
ABSTRACT
We present DeepFace, a next-generation facial analysis pipeline that enhances orofacial tracking and cortical activity prediction in mice. Rather than replacing existing tools, DeepFace builds upon DeepLabCut and Facemap to address scalability bottlenecks and improve behavioral quantification. It offers high precision, keypoint customization, and robust performance across GCaMP6s, GCaMP6f, and jGCaMP8m lines. With scalable batch processing and high-performance computing compatibility, DeepFace enables high-throughput brain-behavior analysis in large-scale preclinical neuroscience.
PMID:40661434 | PMC:PMC12259132 | DOI:10.1101/2025.06.10.658952
Simpatico: accurate and ultra-fast virtual drug screening with atomic embeddings
bioRxiv [Preprint]. 2025 Jun 8:2025.06.08.658499. doi: 10.1101/2025.06.08.658499.
ABSTRACT
Building on established methods for molecular docking, structure-based deep learning has recently yielded important advances in virtual drug screening. We present simpatico, a method that follows an alternate approach, based on the field of Representation Learning, to dramatically speed the process of accurate drug screening. Simpatico employs graph neural networks to produce high-dimensional embeddings for the atoms of proteins and small molecules, and uses these embeddings to rapidly produce accurate predictions of the interaction potential for drug candidates with target protein pockets. Simpatico can search a database containing 600 million drugs for good binding candidates to a single protein pocket in 2.5 hours on a single GPU. Despite being >1000x faster than state-of-the-art docking and diffusion-based methods, simpatico is competitive with the most accurate of those methods. We also observe that simpatico embeddings can be used to explore toxicity risk and to identify proteins with similar binding potential. Simpatico is open source software; all code, weights, and data may be accessed at https://github.com/TravisWheelerLab/Simpatico.
PMID:40661404 | PMC:PMC12259003 | DOI:10.1101/2025.06.08.658499
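The screening speed quoted in the Simpatico abstract above (600 million compounds against one pocket in 2.5 hours on a single GPU) can be converted into a per-compound throughput with simple arithmetic. The sketch below uses only the two figures from the abstract and assumes that the whole 2.5 hours is spent scoring compounds, which the abstract does not specify.

```python
LIBRARY_SIZE = 600_000_000   # compounds searched, from the abstract
WALL_CLOCK_HOURS = 2.5       # reported search time on a single GPU

seconds = WALL_CLOCK_HOURS * 3600
compounds_per_second = LIBRARY_SIZE / seconds
microseconds_per_compound = 1e6 / compounds_per_second
```

Under that assumption the figures correspond to roughly 66,667 compounds per second, or about 15 microseconds per compound.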
Evaluating the representational power of pre-trained DNA language models for regulatory genomics
Genome Biol. 2025 Jul 14;26(1):203. doi: 10.1186/s13059-025-03674-8.
ABSTRACT
BACKGROUND: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.
RESULTS: Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Nevertheless, highly tuned supervised models trained from scratch using one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study.
DISCUSSION: This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
PMID:40660356 | DOI:10.1186/s13059-025-03674-8
MSCMLCIDTI: Drug-Target Interaction Prediction Based on Multiscale Feature Extraction and Deep Interactive Attention Fusion Mechanisms
J Comput Chem. 2025 Jul 15;46(19):e70170. doi: 10.1002/jcc.70170.
ABSTRACT
Drug-target interaction prediction serves as a crucial component in accelerating drug discovery. To overcome current limitations in deep learning approaches, specifically the inadequate representation of local features and insufficient modeling of drug and target information interactions, we propose a multiscale feature extraction coupled multilayer cross-interaction network (MSCMLCIDTI). The model uses multiscale convolutional blocks to extract structural fingerprints of drug compounds and amino acid sequences at different scales for multigranularity pattern recognition across spatial domains, followed by gated attention to obtain multidimensional features. This multidimensional feature extraction enhances the model's capability to identify critical binding sites between pharmacological compounds and their biological targets. Furthermore, we implement a deep cross-interaction mechanism utilizing multilayer attention-based interactions to model complex relationships between distinct drug substructures and protein fragments. This design empowers accurate identification of sophisticated interaction signatures in pharmaceutical target complexes. Comprehensive validation across four open-access benchmark datasets reveals our framework's superior predictive accuracy compared to existing leading-edge models.
PMID:40660331 | DOI:10.1002/jcc.70170
Mortality and antibiotic timing in deep learning-derived surviving sepsis campaign risk groups: a multicenter study
Crit Care. 2025 Jul 14;29(1):302. doi: 10.1186/s13054-025-05493-6.
ABSTRACT
BACKGROUND: The current Surviving Sepsis Campaign (SSC) guidelines provide recommendations on timing of administering antibiotics in sepsis patients based on probability of sepsis and presence of shock. However, there have been minimal efforts to stratify patients objectively into these groups and describe patient outcomes as a function of antibiotic timing recommendations based on risk stratification using this approach.
METHODS: We conducted an observational cohort study using prospectively applied patient data from two large health systems using patient encounters between 2016 and 2024. At the time of clinical suspicion of sepsis, two deep learning (DL) models were used to stratify patients objectively into groups analogous to the SSC risk groups, based on a patient's likelihood of having sepsis and likelihood of developing shock. These risk groups were: (1) shock likely to develop and sepsis probable, (2) shock likely to develop and sepsis possible, (3) shock unlikely to develop and sepsis probable, and (4) shock unlikely to develop and sepsis possible. The primary outcome was short-term mortality, a composite of in-hospital mortality and transition to hospice care, across each risk group.
RESULTS: We identified 34,087 adult patients with potential sepsis. At the development site, risk group mortality rates (%) and median time to antibiotics [IQR] were as follows: (1) 23.2%, 1.7 [1.0-3.1] hours; (2) 17.7%, 3.0 [1.7-6.2] hours; (3) 5.0%, 2.8 [1.5-5.1] hours; and (4) 1.9%, 4.6 [2.7-8.0] hours. Results from the validation site were similar. Mortality rates were similar for patients with possible sepsis unlikely to develop shock regardless of antibiotic administration within 1, 3 or more hours from triage. For patients with probable sepsis at the development site, regardless of risk of shock, mortality was significantly lower if antibiotics were administered within the first hour from triage.
CONCLUSIONS: Our data suggest that patients with possible sepsis who are at low risk of developing shock had similar mortality rates in the 1-hour vs. >1-hour and 3-hour vs. >3-hour time-to-antibiotic groups. Thus, a more lenient time to antibiotic administration could allow for more detailed evaluations and judicious administration of antibiotics without impacting patient mortality. Patients with probable sepsis had lower mortality if antibiotics were administered within 1 hour from triage, regardless of risk of shock. Additional prospective studies are required to validate these findings and guide optimal antibiotic timing in patients with suspected sepsis.
PMID:40660326 | DOI:10.1186/s13054-025-05493-6
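The four-group stratification described in the sepsis abstract above can be sketched as a simple mapping from two model outputs to a group number. This is only an illustration of the grouping logic: the abstract does not report how the two DL model outputs were thresholded, so the probability inputs, the function name ssc_risk_group, and the 0.5 cut-offs are hypothetical.

```python
# Illustrative mapping from two model outputs to the four risk groups.
# The cut-off values are hypothetical; the abstract does not state them.

def ssc_risk_group(p_sepsis, p_shock, sepsis_cut=0.5, shock_cut=0.5):
    """Return the risk group number (1 to 4) as listed in the abstract."""
    shock_likely = p_shock >= shock_cut
    sepsis_probable = p_sepsis >= sepsis_cut
    if shock_likely and sepsis_probable:
        return 1  # shock likely to develop, sepsis probable
    if shock_likely:
        return 2  # shock likely to develop, sepsis possible
    if sepsis_probable:
        return 3  # shock unlikely to develop, sepsis probable
    return 4      # shock unlikely to develop, sepsis possible


print(ssc_risk_group(p_sepsis=0.8, p_shock=0.2))
```

In this sketch, a patient with a high sepsis score and a low shock score falls into group 3, the group in which the abstract reports 5.0% mortality at the development site.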
BertADP: a fine-tuned protein language model for anti-diabetic peptide prediction
BMC Biol. 2025 Jul 15;23(1):210. doi: 10.1186/s12915-025-02312-w.
ABSTRACT
BACKGROUND: Diabetes is a global metabolic disease that urgently calls for the development of new and effective therapeutic agents. Anti-diabetic peptides (ADPs) have emerged as a research hotspot due to their therapeutic potential and natural safety, representing a promising class of functional peptides for diabetes management. However, conventional computational approaches for ADP prediction mainly rely on manually extracted sequence features. These methods often lack generalizability and perform poorly on short peptides, thereby hindering effective ADP discovery.
RESULTS: In this study, we introduce a fine-tuning strategy for large-scale pre-trained protein language models (PLMs) for ADP prediction, enabling automated extraction of discriminative semantic representations. We established the most comprehensive ADP dataset to date, comprising 899 rigorously curated non-redundant ADPs and 67 newly collected potential candidates. Based on three model construction strategies, we developed 11 candidate models. Among them, BertADP (a fine-tuned ProtBert model) demonstrated superior performance in the independent test set, outperforming existing ADP prediction tools with an overall accuracy of 0.955, sensitivity of 1.000, and specificity of 0.910. Notably, BertADP exhibited remarkable sequence length adaptability, maintaining stable performance across both standard and short peptide sequences.
CONCLUSIONS: BertADP represents the first PLM-based intelligent prediction tool for ADPs, whose exceptional identification capability will significantly accelerate anti-diabetic drug development and facilitate personalized therapeutic strategies, thereby enhancing precision diabetes management. Furthermore, the proposed approach provides a generalizable framework that can be extended to other bioactive peptide discovery studies, offering an innovative solution for bioactive peptide mining.
PMID:40660290 | DOI:10.1186/s12915-025-02312-w
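The three metrics reported in the BertADP abstract above are linked by a simple identity: overall accuracy is the average of sensitivity and specificity, weighted by the numbers of positive and negative test samples. The sketch below works through that arithmetic. The abstract does not give the composition of the independent test set, so the equal class sizes used here (100 positives, 100 negatives) are an assumption chosen only because they are consistent with the reported figures.

```python
# Worked arithmetic linking accuracy, sensitivity, and specificity.
# The class counts are hypothetical; the abstract does not report them.

def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Accuracy = (true positives + true negatives) / all samples."""
    true_pos = sensitivity * n_pos
    true_neg = specificity * n_neg
    return (true_pos + true_neg) / (n_pos + n_neg)


acc = accuracy_from_rates(sensitivity=1.000, specificity=0.910,
                          n_pos=100, n_neg=100)
print(round(acc, 3))
```

Under the balanced-set assumption the result matches the reported accuracy of 0.955. With an unbalanced test set the same sensitivity and specificity would yield a different overall accuracy.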