Systems Biology
ADTnorm: robust integration of single-cell protein measurement across CITE-seq datasets
Nat Commun. 2025 Jul 1;16(1):5852. doi: 10.1038/s41467-025-61023-6.
ABSTRACT
Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) enables paired measurement of surface protein and mRNA expression in single cells using antibodies conjugated to oligonucleotide tags. Due to the high copy number of surface protein molecules, sequencing antibody-derived tags (ADTs) allows for robust protein detection, improving cell-type identification. However, variability in antibody staining leads to batch effects in the ADT expression, obscuring biological variation, reducing interpretability, and obstructing cross-study analyses. Here, we present ADTnorm, a normalization and integration method designed explicitly for ADT abundance. Benchmarking against 14 existing scaling and normalization methods, we show that ADTnorm accurately aligns populations with negative- and positive-expression of surface protein markers across 13 public datasets, effectively removing technical variation across batches and improving cell-type separation. ADTnorm enables efficient integration of public CITE-seq datasets, each with unique experimental designs, paving the way for atlas-level analyses. Beyond normalization, ADTnorm includes built-in utilities to aid in automated threshold-gating as well as assessment of antibody staining quality for titration optimization and antibody panel selection. Applying ADTnorm to an antibody titration study, a published COVID-19 CITE-seq dataset, and a human hematopoietic progenitors study allowed for identifying previously undetected phenotype-associated markers, illustrating a broad utility in biological applications.
PMID:40595741 | DOI:10.1038/s41467-025-61023-6
Genomic landscape of virus-associated cancers
Nat Commun. 2025 Jul 1;16(1):5887. doi: 10.1038/s41467-025-60836-9.
ABSTRACT
It has been estimated that 15%-20% of human cancers are attributable to infections, mostly by carcinogenic viruses. The incidence varies worldwide, with a majority affecting developing countries. Here, we conduct a comparative analysis of virus-positive and virus-negative tumors in nine cancers linked to five viruses. We observe a higher frequency of virus-positive tumors in males, with notable geographic differences in incidence. Our genomic analysis of 1971 tumors reveals a lower somatic burden, distinct mutation signatures, and driver gene mutations in virus-positive tumors. Compared to virus-negative cases, virus-positive cases have fewer mutations of TP53, CDKN2A, and deletions of 9p21.3/CDKN2A-CDKN1A while exhibiting more mutations in RNA helicases DDX3X and EIF4A1. Furthermore, an analysis of clinical trials of PD-(L)1 inhibitors suggests an association of virus-positivity with higher treatment response rate, particularly evident in gastric cancer and head and neck squamous cell carcinoma. Both cancer types also show evidence of increased CD8+ T cell infiltration and T cell receptor clonal selection in virus-positive tumors. These results illustrate the epidemiological, genetic, and therapeutic trends across virus-associated malignancies.
PMID:40595559 | DOI:10.1038/s41467-025-60836-9
Model-free photon analysis of diffusion-based single-molecule FRET experiments
Nat Commun. 2025 Jul 1;16(1):5537. doi: 10.1038/s41467-025-60764-8.
ABSTRACT
Photon-by-photon analysis tools for diffusion-based single-molecule Förster resonance energy transfer (smFRET) experiments often describe protein dynamics with Markov models. However, FRET efficiencies are only projections of the conformational space such that the measured dynamics can appear non-Markovian. Model-free methods to quantify FRET efficiency fluctuations would be desirable in this case. Here, we present such an approach. We determine FRET efficiency correlation functions free of artifacts from the finite length of photon trajectories or the diffusion of molecules through the confocal volume. We show that these functions capture the dynamics of proteins from nano- to milliseconds both in simulation and experiment, which provides a rigorous validation of current model-based analysis approaches.
PMID:40595536 | DOI:10.1038/s41467-025-60764-8
The postbiotic ReFerm® versus standard nutritional support in advanced alcohol-related liver disease (GALA-POSTBIO): a randomized controlled phase 2 trial
Nat Commun. 2025 Jul 1;16(1):5969. doi: 10.1038/s41467-025-60755-9.
ABSTRACT
Impaired gut barrier function may lead to progression of liver fibrosis in people with alcohol-related liver disease. The postbiotic ReFerm® can lower gut barrier permeability and may thereby reduce fibrosis formation. Here, we report the results from an open-labelled, single centre randomized controlled trial where 56 patients with advanced, compensated, alcohol-related liver disease were assigned 1:1 to receive either ReFerm® (n = 28) or standard nutritional support (Fresubin®, n = 28) for 24 weeks. The primary outcome was a ≥ 10% reduction of the fibrosis formation marker alpha-smooth muscle actin in liver biopsies, assessed by a blinded pathologist using automated digital imaging analysis. Paired liver biopsies meeting quality criteria for the primary outcome were available for 40 participants (ReFerm®, n = 21 and Fresubin®, n = 19). This reduction was observed in 29% of patients receiving ReFerm®, compared to 14% with Fresubin® (OR = 2.40; 95% CI 0.63 to 9.16; p = 0.200). No treatment-related serious adverse events occurred. Our findings suggest that ReFerm® may reduce liver fibrosis by enhancing gut barrier function, potentially preventing the progression of alcohol-related liver disease.
PMID:40595534 | DOI:10.1038/s41467-025-60755-9
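The odds ratio quoted in the GALA-POSTBIO abstract above can be checked with a short worked calculation. This is a minimal sketch, not the trial's analysis code: the responder counts (8 of 28 and 4 of 28) are assumptions back-calculated from the rounded 29% and 14% and the 28 patients randomized per arm, and the function name `odds_ratio_wald` is hypothetical. It uses the standard Wald interval on the log-odds scale.

```python
import math

def odds_ratio_wald(a, n1, b, n2, z=1.959964):
    """Odds ratio of group 1 vs group 2 with a Wald 95% CI on the log scale.

    a, b   : responders in group 1 and group 2
    n1, n2 : group sizes
    """
    c, d = n1 - a, n2 - b              # non-responders in each group
    or_ = (a / c) / (b / d)            # ratio of the two odds
    se = math.sqrt(1 / a + 1 / c + 1 / b + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# ASSUMED counts (not stated in the abstract): 8/28 ~ 29%, 4/28 ~ 14%
or_, lo, hi = odds_ratio_wald(8, 28, 4, 28)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Under these assumed counts the result agrees with the reported OR of 2.40 and its interval, which crosses 1 and is consistent with the non-significant p-value.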
Effective treatment of systemic candidiasis by synergistic targeting of cell wall synthesis
Nat Commun. 2025 Jul 1;16(1):5532. doi: 10.1038/s41467-025-60684-7.
ABSTRACT
Fungal infections pose a serious threat to global human health, fueled by the increase in immunosuppressive therapies, medical implants, and transplantation. The emergence of multidrug resistance, together with the limited range of current antifungal drugs, is a further constraint. There is thus a clear and unmet need to identify therapeutic targets and develop alternative classes of antifungal agents. Here, we hypothesize that dual targeting of key regulatory genes of fungal cell wall synthesis (FKS1 encoding β-1,3-glucan synthase and CHS3 encoding chitin synthase) can synergistically inhibit fungal growth. Based on iterative designs, we generate a small library of fungal-targeted nanoconstructs, and identify a lead construct (FTNx) that shows preferential accumulation in fungal cells over mammalian cells and leads to prominent antifungal effects in vitro. We further show that FTNx is highly effective in a mouse model of disseminated candidiasis, demonstrating diminished fungal growth and enhanced survival rate. This strategy appears promising as an effective treatment for fungal infections in mammalian hosts.
PMID:40595501 | DOI:10.1038/s41467-025-60684-7
Publisher Correction: Metabolic adaptations direct cell fate during tissue regeneration
Nature. 2025 Jul 1. doi: 10.1038/s41586-025-09294-3. Online ahead of print.
NO ABSTRACT
PMID:40595366 | DOI:10.1038/s41586-025-09294-3
Pyrenees as the southernmost European refugium of glacial relict land snails
Sci Rep. 2025 Jul 2;15(1):23076. doi: 10.1038/s41598-025-07531-3.
ABSTRACT
Biogeographical relicts, particularly glacial relicts, are species that have survived postglacial climatic shifts in isolated refugia. In temperate Europe, such species are commonly found in high-altitude mountain ranges, including the Alps, Carpathians, and Pyrenees. While glacial relict land snails are well-documented in the Alps and Carpathians, their occurrence in the Pyrenees remains largely unexplored. In this study, we report the first records of Columella columella in the Iberian Peninsula, found in alpine rocky tundra and alkaline spring fen habitats, far south of its known distribution. Additionally, we report the first presence of Pyramidula saxatilis in Spain, a rock-dwelling species with a distinct Pyrenean haplotype, suggesting its long-term isolation. Our findings also challenge previous records of Vertigo genesii in the Pyrenees, which seem to represent Vertigo hoppii (syn. V. arctica). Furthermore, we document Vertigo alpestris for the first time in Spain, revealing a unique haplotype shared with an Icelandic population. These findings highlight the Pyrenees as a potential southern refugium for glacial relict snails and emphasize the need for further research and conservation measures to protect these highly isolated populations from habitat degradation, particularly due to overgrazing.
PMID:40595249 | DOI:10.1038/s41598-025-07531-3
Single-cell RNA-seq analysis of mouse carotid artery under disturbed flow and human carotid plaques identifies key cell populations in atherosclerosis development
Sci Rep. 2025 Jul 1;15(1):20747. doi: 10.1038/s41598-025-07395-7.
ABSTRACT
Atherosclerosis tends to occur in regions of disturbed blood flow. This study explored how disturbed flow aggravates atherosclerosis using single-cell RNA-seq (scRNA-seq) datasets from mouse carotid arteries under disturbed flow and human carotid artery plaques. The scRNA-seq datasets were obtained from the GEO (GSE159677, GSE43292, GSE163154, and GSE41571) and SRA (PRJNA722117) databases and were processed using Seurat. Functional enrichment analysis was conducted using Gene Set Enrichment Analysis (GSEA) and Gene Ontology (GO). Single-cell Flux Estimation Analysis (scFEA) was used to analyze cell type-specific changes in metabolism, and "transcriptomic noise analysis" was used to examine senescence. GWAS-significant cardiovascular disease (CVD) risk genes were used to calculate risk gene scores for main cell populations. CellChat and Cytosig were used to analyze cell communication and cytokines. scRNA-seq identified seven cell clusters in mouse arteries: endothelial cells (ECs), vascular smooth muscle cells (VSMCs), fibroblasts, pericytes, macrophages, neutrophils, and T cells. Fibroblasts showed the most pronounced changes, particularly in inflammation and TGF-β signaling pathways. ECs, VSMCs, and fibroblasts had the highest enrichment of CVD risk gene scores, with fibroblasts showing the most significant increases in gene risk scores after disturbed flow stimulation. A distinct fibroblast subgroup displayed high enrichment in inflammation and ossification-related pathways. CD36+ ECs exhibited significant senescence phenotypes following disturbed flow stimulation. Notable increases in VEGFA+ macrophages were discovered in the disturbed flow stimulation group, displaying a pronounced M1 pro-inflammatory phenotype associated with the severity of atherosclerosis and plaque stability. This study systematically elucidated functional changes of cell populations under disturbed flow. CD36+ ECs, VEGFA+ macrophages, and adventitial fibroblasts play critical roles in atherosclerosis.
PMID:40594772 | DOI:10.1038/s41598-025-07395-7
Proteomic risk scores for predicting common diseases using linear and neural network models in the UK biobank
Sci Rep. 2025 Jul 1;15(1):20520. doi: 10.1038/s41598-025-06232-1.
ABSTRACT
Plasma proteomics provides a unique opportunity to enhance disease prediction by capturing protein expression patterns linked to diverse pathological processes. Leveraging data from 2,923 proteins measured in 53,030 UK Biobank participants, we developed proteomic risk scores for 27 common outcomes over 5- and 15-year follow-up periods using two approaches: a linear ElasticNet regression model and a deep learning neural network (NN) model. Using Cox regression, we assessed the discrimination of proteomic risk scores either in isolation or as incremental improvements over clinical risk factors. We also studied the shared and unique protein predictors across conditions. Proteomic risk scores demonstrated strong discrimination for most outcomes, with a C-index > 0.80 for 12 diseases. NN models outperformed linear models for 11 outcomes, particularly for diseases such as Parkinson's disease (C-index 0.84) and pulmonary embolism (C-index 0.83), where nonlinear relationships contributed significantly to prediction. Across all outcomes, the addition of proteomic scores to clinical models improved predictive accuracy (ΔC-index 0.03), with the greatest gains observed in 9 diseases (ΔC-index > 0.1), including end-stage renal disease, pulmonary embolism, and Parkinson's disease. Analysis of protein contributions revealed shared predictors across multiple diseases, such as growth differentiation factor 15 (GDF15), as well as unique predictors like PAEP for endometriosis. While NN models may capture complex relationships, linear models provided value through simplicity and interpretability. These findings underscore the importance of tailoring predictive approaches to specific diseases and demonstrate the pivotal potential of proteomics in advancing risk stratification and early detection.
PMID:40594723 | DOI:10.1038/s41598-025-06232-1
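The C-index values quoted in the proteomic risk score abstract above measure how well a score ranks participants by time to event. The sketch below illustrates one common definition, Harrell's C for right-censored data; whether the study used exactly this estimator is an assumption, and the function name and toy data are hypothetical and not taken from the UK Biobank.

```python
def concordance_index(times, events, scores):
    """Harrell's C: among comparable pairs, the share in which the
    participant with the earlier observed event has the higher risk score.
    Tied scores count as half-concordant."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an event strictly before
            # j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy follow-up data (made up for illustration only)
times = [2, 4, 6, 8, 10]              # years to event or censoring
events = [1, 1, 0, 1, 0]              # 1 = event observed, 0 = censored
scores = [0.9, 0.4, 0.7, 0.5, 0.1]    # hypothetical risk scores

c_index = concordance_index(times, events, scores)
print(f"C-index = {c_index:.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; a ΔC-index is simply the difference between two such values computed for different scores on the same participants.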
Single cell sequencing and computational findings reveal anti-estrogen receptor-positive breast cancer roles of formononetin
Sci Rep. 2025 Jul 1;15(1):20892. doi: 10.1038/s41598-025-06133-3.
ABSTRACT
Breast cancer is the second leading cancer type in women, accounting for 11.6% of all cancers. Recently, its incidence has increased in younger individuals. Estrogen receptors (ERs) play important roles in the development and progression of breast cancer by controlling hormone signaling. Therefore, targeting ERs is one of the most promising therapeutic strategies for treating ER-positive breast cancer. High heterogeneity contributes to the tumorigenicity, metastatic ability, and therapeutic resistance of breast cancer. However, drug discovery studies at the single-cell level are lacking. In the present study, we used single-cell sequencing data analysis, together with a network pharmacology approach, to determine the targets and molecular mechanisms of formononetin in ER-positive breast cancer. Comparative single-cell sequencing analysis identified 3899, 3395, 333, 398, 398, and 17 differentially expressed genes in stromal cells, epithelial cells, fibroblasts, neutrophils, eosinophils, and macrophages, respectively, of ER-positive breast cancer compared with normal breast tissues. Further network pharmacology analysis highlighted the importance of formononetin targets in biological functions and signaling pathways related to immune and inflammatory responses, metastatic ability, metabolism, cell proliferation, and gland development in different ER-positive breast cancer cell types. For the first time, we used a systems biology approach to investigate the targets of formononetin and its anti-ER-positive breast cancer mechanisms at the single-cell level.
PMID:40594685 | DOI:10.1038/s41598-025-06133-3
A directionally evolved genomic feature in BRSK2 harbors divergent alleles in neurocognitive disorders
Sci Rep. 2025 Jul 1;15(1):21888. doi: 10.1038/s41598-025-07803-y.
ABSTRACT
The human Brain-specific Serine/Threonine Kinase 2 (BRSK2), alternatively known as Synapses of Amphids Defective (SAD)-A, is mainly expressed in the brain, and required for neuronal polarization and differentiation. This gene contains the longest 5' untranslated region (5' UTR) pentanucleotide short tandem repeat (STR), (CGGCT)6, in humans. We hypothesized that this exceptional length may confer selective advantage in cognitive functioning in humans. The region spanning (CGGCT)6 was sequenced in a sample of 339 unrelated individuals, consisting of cases affected by late-onset neurocognitive disorder (NCD) (N = 163) and matched controls (N = 176). We then mapped CGGCT motifs and STRs across the human genome and obtained the phylogenetic tree of the BRSK2 sequence spanning the CGGCT STR in 19 species belonging to several orders of mammals, including Rodents, Carnivora, Artiodactyls, Perissodactyla, and Primates. We found that (CGGCT)6 was part of a complex island of 17 consecutive CGGCT motifs/STRs, ranging from 1 to 6-repeats, stretching the BRSK2 core promoter and 5' UTR. Across the human genome, the CGGCT island was unique with respect to density, complexity, and repeat length of CGGCT motifs and repeats. This island was flanked downstream by a 5' UTR CGG STR. The evolution of the CGGCT island mainly coincided with the phylogenetic distance of the species studied, and the CGG STR was primate-specific, suggesting directional, rather than random, evolution of this complex sequence. While (CGGCT)6 was strictly monomorphic in the human samples studied, a 7-repeat of this motif was detected in the controls only. In another CGGCT repeat inside the CGGCT island, there was a significant excess of homozygosity for a long allele (4-repeat) in the controls (Mid-P = 0.02). At the same locus, a 3-repeat allele was detected in the NCD group only. Additionally, alleles were detected at the extreme short and long lengths of the CGG STR in the NCD group only. Probable diagnoses in the patients harboring divergent genotypes spanned Alzheimer's disease and vascular dementia. We report a novel genomic feature, consisting of a CGGCT motif/STR island, and a CGG STR in BRSK2 that coincide with directional evolution of several orders of mammals. Several polymorphic and rare alleles were divergently distributed in the NCD and control groups across this region, which may reflect a possible link with cognitive functions in humans.
PMID:40594678 | DOI:10.1038/s41598-025-07803-y
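The BRSK2 abstract above describes mapping CGGCT motifs and tandem repeats along a sequence. The following is a minimal sketch of how such a scan could be done with a regular expression. It is not the authors' pipeline: the function name is hypothetical and the input is a synthetic toy string, not the BRSK2 sequence.

```python
import re

def find_tandem_repeats(seq, motif="CGGCT", min_copies=1):
    """Return (start, copies) for each uninterrupted run of `motif` in `seq`.

    `start` is the 0-based position of the run; `copies` is the number of
    consecutive motif units in it.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    hits = []
    for match in pattern.finditer(seq.upper()):
        copies = (match.end() - match.start()) // len(motif)
        if copies >= min_copies:
            hits.append((match.start(), copies))
    return hits

# Synthetic example: a 6-copy run, a single motif, a 3-copy run, then a
# CGG repeat that the CGGCT scan should ignore.
toy = "AT" + "CGGCT" * 6 + "GA" + "CGGCT" + "TTT" + "CGGCT" * 3 + "CGGCGGCGG"

hits = find_tandem_repeats(toy)
for start, copies in hits:
    print(f"position {start}: (CGGCT){copies}")
```

Counting the entries in `hits` and taking the largest `copies` value gives the density and maximum repeat length of a region, the kind of summary the abstract uses to describe the CGGCT island.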
Structural and functional insights of AmpG in muropeptide transport and multiple β-lactam antibiotics resistance
Nat Commun. 2025 Jul 1;16(1):5744. doi: 10.1038/s41467-025-61169-3.
ABSTRACT
Anhydromuropeptide permease (AmpG) is a transporter protein located in the inner membrane of certain Gram-negative bacteria, involved in peptidoglycan (PG) recycling and β-lactamase induction. Decreased AmpG function reduces resistance of antibiotic-resistant bacteria to β-lactam antibiotics. Therefore, AmpG-targeting inhibitors are promising 'antibiotic adjuvants'. However, as the tertiary structure of AmpG has not yet been identified, the development of targeted inhibitors remains challenging. We present four cryo-electron microscopy (cryo-EM) structures: the apo-inward and apo-outward state structures and the inward-occluded and outward states complexed with the substrate GlcNAc-1,6-anhMurNAc. Through functional analysis and molecular dynamics (MD) simulations, we identified motif A, which stabilizes the outward state, substrate-binding pocket, and protonation-related residues. Based on the structure of AmpG and our experimental results, we propose a muropeptide transport mechanism for AmpG. A deeper understanding of its structure and transport mechanism provides a foundation for the development of antibiotic adjuvants.
PMID:40593790 | DOI:10.1038/s41467-025-61169-3
An orthogonal transcription mutation system generating all transition mutations for accelerated protein evolution in vivo
Nat Commun. 2025 Jul 1;16(1):6041. doi: 10.1038/s41467-025-61354-4.
ABSTRACT
Targeted mutagenesis systems are critical for protein evolution. Current deaminase-T7 RNA polymerase fusion systems enable gene-specific mutagenesis but remain limited to certain model organisms. Here, we develop an orthogonal transcription mutation system for in vivo hypermutation in both non-model organism Halomonas bluephagenesis and E. coli, achieving >1,500,000-fold increased mutation rates. By fusing deaminases with three phage RNA polymerases, this system uniformly introduces C:G to T:A and A:T to G:C mutations across target genes. The system demonstrates high specificity, minimal off-target effects, and high orthogonality between phage polymerases. We apply this system to rapidly evolve fluorescent proteins, chromoproteins, cytoskeletal proteins, cell division-related proteins, global sigma factor, and the LysE exporter within a single day of the mutagenesis process. Overall, the orthogonal transcription mutation system is a modular and versatile platform that accelerates protein evolution in the shortest period reported so far.
PMID:40593783 | DOI:10.1038/s41467-025-61354-4
Elevated nitric oxide during colitis restrains GM-CSF production in ILC3 cells via suppressing an AhR-Cyp4f13-NF-κB axis
Nat Commun. 2025 Jul 1;16(1):5654. doi: 10.1038/s41467-025-60969-x.
ABSTRACT
Inflammatory bowel disease (IBD) presents a significant clinical challenge, yet the way bioactive gases are implicated remains elusive. We detect elevated colonic Nos2 levels in both IBD patients and mice undergoing diverse colitis. Additionally, Nos2 deficiency significantly aggravates anti-CD40-induced colitis, along with an increase in GM-CSF production by ILC3s. We identified a previously unappreciated role of the crucial ILC3 regulator, AhR, in promoting Cyp4f13 expression to allow ILC3s to bind with externally derived nitric oxide (NO). This further restrains Cyp4f13-catalyzed ROS generation and thereby diminishes NF-κB activation strictly necessary for GM-CSF production. Accordingly, the exacerbated anti-CD40-induced colitis due to defective NO generation in Nos2 deficient mice is efficiently recovered by a Cyp4f13 inhibitor, HET0016. Importantly, IBD patients with elevated NO binding to colonic ILC3s show decreased disease activity. Thus, our findings uncover a crucial regulatory mechanism for restraining colitogenic GM-CSF production in ILC3s and underscore its implication in IBD therapy.
PMID:40593752 | DOI:10.1038/s41467-025-60969-x
Author Correction: Base-excision repair pathway shapes 5-methylcytosine deamination signatures in pan-cancer genomes
Nat Commun. 2025 Jul 1;16(1):6029. doi: 10.1038/s41467-025-61578-4.
NO ABSTRACT
PMID:40593733 | DOI:10.1038/s41467-025-61578-4
Geno4ME Study: implementation of whole genome sequencing for population screening in a large healthcare system
NPJ Genom Med. 2025 Jul 1;10(1):50. doi: 10.1038/s41525-025-00508-1.
ABSTRACT
The Genomic Medicine for Everyone (Geno4ME) study was established across the seven-state Providence Health system to enable genomics research and genome-guided care across patients' lifetimes. We included multi-lingual outreach to underrepresented groups, a novel electronic informed consent and education platform, and whole genome sequencing with clinical return of results and electronic health record integration for 78 hereditary disease genes and four pharmacogenes. Whole genome sequences were banked for research and variant reanalysis. The program provided genetic counseling, pharmacist support, and guideline-based clinical recommendations for patients and their providers. Over 30,800 potential participants were initially contacted, with 2716 consenting and 2017 having results returned (47.5% racial and ethnic minority individuals). Overall, 432 (21.4%) had test results with one or more management recommendations related to hereditary disease(s) and/or pharmacogenomics. We propose Geno4ME as a framework to integrate population health genomics into routine healthcare.
PMID:40593689 | DOI:10.1038/s41525-025-00508-1
Spatial patterns of hepatocyte glucose flux revealed by stable isotope tracing and multi-scale microscopy
Nat Commun. 2025 Jul 1;16(1):5850. doi: 10.1038/s41467-025-60994-w.
ABSTRACT
Metabolic homeostasis requires engagement of catabolic and anabolic pathways consuming nutrients that generate and consume energy and biomass. Our current understanding of cell homeostasis and metabolism, including how cells utilize nutrients, comes largely from tissue and cell models analyzed after fractionation, which fail to reveal the spatial characteristics of cell metabolism and how these aspects relate to the location of cells and organelles within tissue microenvironments. Here we show the application of multi-scale microscopy, machine learning-based image segmentation, and spatial analysis tools to quantitatively map the fate of nutrient-derived 13C atoms across spatiotemporal scales. This approach reveals the cellular and organellar features underlying the spatial pattern of glucose 13C flux in hepatocytes in situ, including the timeline of mitochondria-ER contact dynamics in response to changes in blood glucose levels, and the discovery of the ultrastructural relationship between glycogenesis and lipid droplets.
PMID:40593680 | DOI:10.1038/s41467-025-60994-w
Pancreatic islet β-cell subtypes are derived from biochemically-distinct and nutritionally-regulated islet progenitors
Nat Commun. 2025 Jul 1;16(1):5758. doi: 10.1038/s41467-025-60831-0.
ABSTRACT
Endocrine islet β cells comprise heterogenous subtypes with different gene expression and function levels. Here we study when/how this heterogeneity is induced and how long each subtype maintains its characteristic properties. We show that islet progenitors with distinct gene expression and DNA methylation patterns produce β-cell subtypes of different secretory function, proliferation rate, and viability in male and female mice. These subtypes have differential gene expression that regulates insulin vesicle production or stimulation-secretion coupling and differential DNA methylation in the putative enhancers of these genes. Maternal obesity, a major diabetes risk factor, reduces the proportion of the β-cell subtype with higher levels of glucose responsiveness. The gene signature that defines mouse β-cell subtypes can reliably divide human cells into two sub-populations, with the one having higher predicted glucose responsiveness reduced in diabetic donors. These results suggest that β-cell subtypes can be derived from islet progenitor subsets modulated by maternal nutrition.
PMID:40593675 | DOI:10.1038/s41467-025-60831-0
Microbial bioremediation of persistent organic pollutants in plant tissues provides crop growth promoting liquid fertilizer
Nat Commun. 2025 Jul 1;16(1):5768. doi: 10.1038/s41467-025-60918-8.
ABSTRACT
Constructed wetlands are used to clean domestic wastewater via phytoremediation, commonly involving the use of reeds. The process results in the production of large amounts of polluted plant tissues, which are then considered unusable waste products. In this study, the reusability of reeds and nettle-polluted tissues is investigated. Fermenting contaminated plant tissues to produce liquid fertilizer is a sustainable means to remove 87-95% of persistent organic pollutants. A multiomics approach combining metabolomics and amplicon metagenomics is used to analyze the mechanisms that occur during fertilizer production from polluted plant tissues and identify the microbes that are likely key for this transformation. A consortium of bacteria and fungi with cellulolytic activity is identified. In addition, the obtained liquid fertilizer positively impacts plant growth in the presence of pathogens and therefore exhibits potential application in farming. This approach may be a simple, commercially attractive solution for the management of contaminated plant tissues originating from constructed wetlands, which are currently considered problematic, useless waste products.
PMID:40593605 | DOI:10.1038/s41467-025-60918-8
Data-driven protease engineering by DNA-recording and epistasis-aware machine learning
Nat Commun. 2025 Jul 1;16(1):5466. doi: 10.1038/s41467-025-60622-7.
ABSTRACT
Protein engineering has recently seen tremendous transformation due to machine learning (ML) tools that predict structure from sequence at unprecedented precision. Predicting catalytic activity, however, remains challenging, restricting our capabilities to design protein sequences with desired catalytic function in silico. This predicament is mainly rooted in a lack of experimental methods capable of recording sequence-activity data in quantities sufficient for data-intensive ML techniques, and the inefficiency of searches in the enormous sequence spaces inherent to proteins. Herein, we address both limitations in the context of engineering proteases with tailored substrate specificity. We introduce a DNA recorder for deep specificity profiling of proteases in Escherichia coli, as we demonstrate by testing 29,716 candidate proteases against up to 134 substrates in parallel. The resulting sequence-activity data on approximately 600,000 protease-substrate pairs not only reveals key sequence determinants governing protease specificity, but also allows us to build a data-efficient deep learning model that accurately predicts protease sequences with desired on- and off-target activities. Moreover, we present epistasis-aware training set design as a generalizable strategy to streamline searches within enormous sequence spaces, which strongly increases model accuracy at given experimental efforts and is thus likely to have implications for protein engineering far beyond proteases.
PMID:40593579 | DOI:10.1038/s41467-025-60622-7