
Therefore, to enable open, performant and extensible processing of high complexity DIA data, we propose a processing framework that builds on current developments in deep learning. Our algorithms view a DIA experiment as a high-dimensional snapshot of the peptide spectrum space. This representation is amenable to DIA methods on all major instrument platforms and naturally covers simple DIA methods, as well as ion mobility, variable windows, sliding quadrupole windows and yet-to-be-developed acquisition modes. Integral to this generalized representation, the data are processed without a reduction in retention time or mobility resolution. Instead, our feature-free approach performs machine learning directly on the raw signal, combining all available information before making discrete identifications. Furthermore, we propose a DIA transfer learning strategy based on our recently published alphaPeptDeep library. Transfer learning adapts the peptide library directly to the instrument and sample workflow36. This closer coupling of deep learning beyond library prediction may become characteristic of the next generation of search engines37. We showcase performance and versatility by extending DIA to arbitrary peptide PTMs, closing the gap between the versatility of DDA and the performance of DIA.
We present alphaDIA, a modular, open-source framework for DIA search. It builds on the scientific Python stack and the alphaPept ecosystem, allowing flexible search strategies and default workflows accessible through a Python API, Jupyter notebooks, a command line interface or an easily installable graphical user interface (Fig. 1a and Methods). AlphaDIA covers the entire workflow from raw files to reporting protein quantities and can process files and proprietary formats from all major vendors. It was designed for ‘one-stop processing’ of large cohorts, running natively on Windows, Linux and Mac or in a distributed fashion in the cloud with Slurm or Docker.
Apart from state-of-the-art DIA processing, the impetus for alphaDIA was the shift toward fast, sensitive and stochastic TOF detectors, presenting novel algorithmic challenges and opportunities. AlphaDIA’s feature-free and peptide-centric search is illustrated by the identification of the peptide LLELTSSYSPDVSDYK from timsTOF Ultra dia-PASEF (parallel accumulation serial fragmentation) data (Extended Data Fig. 1). First, we select all MS1 and MS2 spectra that contribute evidence for this precursor (Fig. 1b). A dense representation of the spectrum space is used to score potential peak group candidates, which does not involve feature building or centroiding (Fig. 1c,d). Instead, signals are aggregated across retention time, ion mobility and fragments using learned convolution kernels. Discrete peak groups are determined only after all this evidence has been collected (Fig. 1e). In this way, noisy TOF data in which individual fragment signals are not distinguishable from background can still be processed (Extended Data Fig. 2). The agreement with the predicted spectrum gives evidence for a confident identification only when the signal in the peak groups is integrated into a spectrum of matched fragments (Fig. 1f).
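The idea of collecting evidence before committing to a discrete peak group can be illustrated with a minimal sketch. It assumes a dense array `dense[f][m][t]` holding the raw intensity of fragment `f` at ion mobility scan `m` and cycle `t`, and it uses fixed Gaussian kernels as a stand-in for the learned convolution kernels; the function names, the kernel widths and the log-sum aggregation are illustrative assumptions, not alphaDIA's implementation.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_1d(values, kernel):
    """Convolve a 1-D signal with a symmetric kernel, zero-padded at the edges."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(values):
                acc += w * values[j]
        out.append(acc)
    return out

def smooth_2d(plane, kernel_mobility, kernel_rt):
    """Separable smoothing of plane[m][t]: first along t, then along m."""
    along_rt = [smooth_1d(row, kernel_rt) for row in plane]
    along_mobility = [smooth_1d(list(col), kernel_mobility) for col in zip(*along_rt)]
    return [list(row) for row in zip(*along_mobility)]  # transpose back to [m][t]

def best_peak_group(dense, sigma_mobility=1.0, sigma_rt=1.0, radius=2):
    """Return (mobility index, cycle index, score) of the strongest joint signal.

    Assumption: evidence is combined as the sum over fragments of
    log(1 + smoothed intensity), which favors positions where several
    fragments co-elute over a single intense but isolated spike.
    """
    k_m = gaussian_kernel(sigma_mobility, radius)
    k_t = gaussian_kernel(sigma_rt, radius)
    n_m, n_t = len(dense[0]), len(dense[0][0])
    score = [[0.0] * n_t for _ in range(n_m)]
    for plane in dense:
        smoothed = smooth_2d(plane, k_m, k_t)
        for m in range(n_m):
            for t in range(n_t):
                score[m][t] += math.log1p(smoothed[m][t])
    best = max((score[m][t], m, t) for m in range(n_m) for t in range(n_t))
    return best[1], best[2], best[0]
```

In this toy setting, three fragments that each show a weak signal at the same position outscore a much more intense spike that appears in only one fragment, which is the behavior that makes aggregation before peak picking attractive for noisy TOF data.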
AlphaDIA uses deep-learning-based target-decoy competition and iterative calibration to search complex proteomes with spectral libraries. For each target precursor entry with a given sequence and charge state, a paired decoy peptide is created using a mutation pattern (Methods). Each peak group is scored by a collection of up to 47 features using a fully connected neural network (NN) (Fig. 2a). False precursor identifications are controlled using a count-based false discovery rate (FDR), calculated from the probabilities predicted by the NN (Fig. 2b,c). Measured properties such as retention time, ion mobility and m/z ratios are iteratively calibrated to the observed data on a high-confidence subset of precursors, using nonlinear locally estimated scatterplot smoothing (LOESS) regression with polynomial basis functions (Fig. 2d-f and Supplementary Fig. 1). AlphaDIA uses spectrum-centric fragment competition to ensure that fragment information is only used for single-precursor identification, even when multiple library entries match the same observed signal (Methods). To assess the performance of this algorithm, we performed a library-based search using a previously published spectral library from fractionated HeLa lysate that was searched with MSFragger. On a 21-min gradient with 60 samples per day (SPD) of HeLa cell lysate measured on a timsTOF Ultra with dia-PASEF, our algorithm identified more than 73,000 precursors with unique sequence and charge, corresponding to almost 6,800 protein groups (Fig. 2g-i). For label-free quantification (LFQ), we integrated the recently developed directLFQ algorithm, which resulted in a median coefficient of variation (CV) of 7.7% for protein groups and a Pearson R > 0.99 across replicates (Fig. 2j,k). This suggests that alphaDIA can search and quantify complex protein mixtures with excellent depth and quantitative precision.
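A count-based FDR from target-decoy competition can be sketched as follows. The sketch assumes one classifier score per precursor (higher is more confident) and a flag marking decoys, and it estimates the FDR at each score threshold as the ratio of decoys to targets above that threshold. The function name and the exact estimator are assumptions for illustration; the estimator used by alphaDIA is described in its Methods and may differ in detail.

```python
def q_values(scores, is_decoy):
    """Return count-based q-values, aligned with the input order.

    scores   : classifier outputs, higher means more confident
    is_decoy : True for decoy entries, False for targets
    """
    # Rank all entries from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # FDR estimate at each rank: decoys seen so far / targets seen so far.
    fdr = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdr.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at which an entry would still be accepted,
    # obtained as a running minimum from the least confident entry upward.
    q = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[rank])
        q[order[rank]] = running_min
    return q
```

Targets with a q-value at or below 0.01 would then be reported at 1% FDR.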
Recently, DIA has been coupled to sophisticated data acquisition schemes where the quadrupole isolation window scans nearly continuously through the m/z or m/z and ion mobility space. The methods, termed synchro-PASEF or midia-PASEF, hold the promise of much improved precursor specificity and quantitative accuracy; however, this has been difficult to realize because of a lack of flexible algorithms handling the thousands of individual isolation windows per DIA cycle. AlphaDIA’s processing algorithm and alphaRaw’s efficient data handling allow using all synchro scans that contribute signal for a given precursor, considering its isotope distribution as a prior (Fig. 3a). Using the masses and abundance of the precursor isotopes, we model the behavior of the quadrupole, resulting in a template with the expected intensity distribution across synchro scan observations (Fig. 3b). This template includes the slicing of the isotope distribution by the quadrupole, which must be recapitulated in the intensity profiles of the fragments (Fig. 3c). This comparison of the fragment profile with the template contributes to our deep-learning-based identification score and enables the analysis of complex proteomes (Fig. 3d and Extended Data Fig. 3). This first processing algorithm for sliding quadrupole data could be extended from synchro-PASEF to similar acquisition schemes such as midia-PASEF or scanning SWATH (sequential window acquisition of all theoretical fragment ions).
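The template idea can be made concrete with a small sketch. It assumes an idealized quadrupole with rectangular transmission, so that each synchro scan passes exactly those precursor isotopes whose m/z falls inside its isolation window, and it compares a fragment's intensity profile to the resulting template with a Pearson correlation. The function names, the rectangular transmission model and the use of a plain correlation as the comparison are assumptions for illustration only.

```python
import math

C13_C12_DELTA = 1.00335  # approximate mass difference between 13C and 12C in Da

def isotope_mz(mono_mz, charge, n_isotopes):
    """m/z values of the first n isotope peaks of a precursor."""
    return [mono_mz + i * C13_C12_DELTA / charge for i in range(n_isotopes)]

def quadrupole_template(iso_mz, iso_abundance, windows):
    """Expected relative intensity per synchro scan.

    windows is a list of (lower m/z, upper m/z) isolation limits, one per scan.
    Assumption: transmission is 1 inside the window and 0 outside.
    """
    return [
        sum(a for mz, a in zip(iso_mz, iso_abundance) if lo <= mz < hi)
        for lo, hi in windows
    ]

def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

A fragment whose intensity across synchro scans follows the template yields a correlation close to 1, whereas an interfering signal from a precursor with a different mass is sliced differently by the quadrupole and correlates poorly.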
Next, we wanted to extend the reach of alphaDIA to other proteomic platforms and methods. For instance, our algorithms adapted naturally to fixed-window and variable-window DIA data from quadrupole Orbitrap analyzers. The absence of ion mobility reduces the search space to a one-dimensional search across retention time while still using all valid MS2 observations for a given precursor (Fig. 3e). As before, after discrete peak group candidates have been identified (Fig. 3f), the spectrum-centric view allows detailed scoring using alphaPeptDeep-predicted spectra (Fig. 3g). Additionally, alphaDIA can process Orbitrap and Orbitrap Astral data with wide, narrow, variable or overlapping DIA windows. It can likewise process Sciex SWATH data (Extended Data Fig. 4).
Having established the ability of alphaDIA for in-depth analysis of complex proteomes and its adaptability to diverse platforms, we next wanted to directly benchmark its performance against other common DIA search engines. To avoid potential bias, we built upon a recently published benchmarking study from the Shui group, in which mouse brain membrane isolates were spiked into a complex background of yeast proteins in varying ratios and measured on a quadrupole Orbitrap (QE-HF) and a timsTOF. The authors generated empirical libraries with MSFragger and optimized search parameters for DIA-NN, Spectronaut and MaxDIA (Fig. 4a).
On the basis of the provided libraries, alphaDIA identified up to 50,600 mouse peptides in the QE data across all samples and up to 81,500 on the timsTOF (Extended Data Fig. 5). Inferring proteins from uniquely identified peptides involves considerations that can influence the number of reported protein groups. AlphaDIA allows strict (maximum parsimony) or commonly used heuristic grouping (Methods). With the latter, we identified 5,366 (QE-HF) and 7,649 (timsTOF) protein groups across all samples, matching and even exceeding the other algorithms (Fig. 4b,c). This is also reflected across replicates for single conditions. AlphaDIA quantified the most protein groups in at least three of five replicates for most ratios while maintaining comparable CVs and accuracy as judged by the proteome mixing ratios (Fig. 4d and Supplementary Figs. 2 and 3).
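To illustrate why the inference strategy changes the reported numbers, the following sketch implements a greedy set-cover approximation of maximum parsimony: proteins are selected one at a time, always taking the protein that explains the most peptides not yet explained. The function name, the input layout and the alphabetical tie-breaking are assumptions for illustration; the grouping procedures actually used by alphaDIA are described in its Methods.

```python
def greedy_parsimony(protein_to_peptides):
    """Select a small set of proteins that explains all identified peptides.

    protein_to_peptides : dict mapping a protein ID to the set of identified
                          peptide sequences that map to it
    Returns a dict mapping each selected protein to the peptides assigned to it.
    """
    remaining = set().union(*protein_to_peptides.values())
    selected = {}
    while remaining:
        # Take the protein covering the most unexplained peptides;
        # iterating over sorted IDs makes ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda protein: len(protein_to_peptides[protein] & remaining),
        )
        explained = protein_to_peptides[best] & remaining
        selected[best] = explained
        remaining -= explained
    return selected
```

Proteins whose peptides are all shared with an already selected protein are never reported under this scheme, which is why parsimony-based counts are lower than counts obtained with more permissive grouping.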
To prevent over-reporting by sophisticated DIA database searching strategies based on internal target-decoy FDR estimates, results can be externally validated by including additional proteome databases from species not present in the sample. As in the benchmarking study, we performed an entrapment search with an Arabidopsis library added in increasing proportions to the target library. On both MS platforms, even for 100% entrapment, Arabidopsis identifications matched the chosen target FDR of 1% at the protein level (Fig. 4e,f). At this protein FDR, false-positive precursors are even less likely, appearing only at 0.1% globally. This contrasted with some of the other tested tools, which reported up to threefold more false-positive Arabidopsis identifications than intended at the chosen FDR target (Supplementary Fig. 4). The increased library size only minimally decreased overall identifications for alphaDIA. We conclude that, for library-based search, alphaDIA provides at least competitive performance with common search engines while maintaining a reliable and conservative FDR.
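The external check described here reduces to counting entrapment hits among the reported identifications. The sketch below assumes that each reported identification carries a species label and uses the simplest possible estimate, the fraction of reported identifications that come from the entrapment species. The function name and the `"ARATH"` label are hypothetical, and more elaborate estimators that correct for the relative sizes of the target and entrapment libraries exist.

```python
def entrapment_rate(reported_species, entrapment_species="ARATH"):
    """Fraction of reported identifications assigned to the entrapment species.

    reported_species : one species label per identification that passed
                       the internal FDR threshold
    """
    if not reported_species:
        return 0.0
    hits = sum(1 for species in reported_species if species == entrapment_species)
    return hits / len(reported_species)
```

If this fraction stays at or below the chosen internal FDR threshold, the internal target-decoy estimate is not contradicted by the external control.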
While empirical libraries benefit from implicitly capturing instrument- and workflow-specific properties, the key advantage of deep-learning-predicted libraries of the entire proteome database is that they eliminate cumbersome library measurement altogether. We recently introduced alphaPeptDeep, an open-source, transformer-based deep learning framework for predicting all MS-relevant peptide properties from their sequences.
With these state-of-the-art predicted libraries, we devised a two-step search workflow in alphaDIA consisting of library refinement and quantification (Fig. 5a). Furthermore, we reasoned that our feature-free search should adapt well to the high-sensitivity TOF data generated by the Orbitrap Astral MS instrument. For benchmarking, we acquired and searched bulk HeLa samples with an alphaPeptDeep-predicted library containing 3.6 million tryptic precursors. AlphaDIA identified on average more than 120,000 precursors, matching or exceeding the performance of the other tested search engines (Fig. 5b). As comparison of inferred protein numbers in bottom-up proteomics depends on the chosen algorithm, which is not public for the other tools, we wanted to provide an upper and lower limit with heuristic grouping and more conservative maximum-parsimony-based inference (Methods). Remarkably, in the 60-SPD method (21 min) this corresponded to the identification of 9,800 protein groups with heuristic grouping and close to 8,600 proteins without grouping (Fig. 5d). The great depth of proteome characterization was also reflected in the data completeness across replicates (Extended Data Fig. 6). We validated the FDR control of this more complex two-step workflow by appending the Arabidopsis library, which externally confirmed rigorous control of false-positive identifications (1.08% at protein level and 0.2% at precursor level; Fig. 5f). While searches of fully predicted tryptic libraries are usually faster than acquisition for non-ion-mobility data (Fig. 5e), the explicit modeling of the ion mobility dimension leads to increased processing times (>1 h per file) for large libraries and will need improvement in future versions of alphaDIA.
To compare identified proteins across search engines, we mapped peptide sequences to the UniProt reference proteome. Reassuringly, more than 78,000 peptides and 8,100 proteins (counting only nonambiguous matches) were jointly identified by all tested tools (Fig. 5g). AlphaDIA had the highest number of uniquely identified peptides among search engines, manifesting in high sequence coverage (median of eight peptides per protein; Fig. 5h) and few proteins with only single-peptide evidence across the tested search engines (Extended Data Fig. 7).
To assess the accuracy of LFQ, we used the established strategy of three species proteomes mixed in defined ratios, acquired on the Orbitrap Astral. Fully predicted library search combined with directLFQ recapitulated the expected ratios with excellent precision and accuracy (Fig. 5i and Extended Data Fig. 8).
Multiplexed DIA has recently shown great potential to increase throughput and depth. To analyze such data, identifications must be transferred between the channels, which involves an additional channel FDR. We benchmarked this channel FDR on a DIA dataset in which HeLa cells were labeled as heavy and light using stable isotope labeling by amino acids in cell culture (SILAC) and analyzed on a QE-HFX (Extended Data Fig. 9). Proportions of identifications in ‘light only’, ‘heavy only’ and ‘light and heavy’ were very similar to the previous DDA and DIA results, validating our channel FDR. Interestingly, on the same data, the absolute number of identified peptides was threefold higher than in the original paper, reflecting advances in DIA search over the last years in general and specifically in alphaDIA.
To date, fully predicted libraries address many of the needs of DIA workflows but their pretrained prediction models are still best suited to the sample and instrument types that were used in training. This makes it necessary to train custom models for different situations (for example, PTMs), as they generally change retention and fragmentation behavior compared to the unmodified peptide. We reasoned that close integration of prediction by deep learning and the search engine might have the potential to learn to adapt to such differences, an approach that we call DIA transfer learning. Precursors confidently identified in an alphaDIA search, together with their spectra, were first collected into a training dataset. The general pretrained models for retention time, fragmentation spectra and charge were then fine-tuned on this experiment-specific training dataset (Fig. 6a,b). This resulted in a custom model, reflecting the behavior of peptides on the individual LC-MS setup. A hold-out validation and test dataset ensured generalization and prevented overfitting.
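One way to build the hold-out datasets is sketched below. The text states only that a validation and a test dataset were held out; how the split is made is not specified here, so the sketch assumes a deterministic split by peptide sequence, which keeps all charge states of a peptide in the same partition and avoids leakage between training and evaluation. The function names and the 10% fractions are hypothetical.

```python
import hashlib

def assign_split(sequence, val_fraction=0.1, test_fraction=0.1):
    """Map a peptide sequence to 'train', 'validation' or 'test'.

    The assignment depends only on the sequence, so it is reproducible
    across runs and identical for all charge states of a peptide.
    """
    digest = hashlib.sha256(sequence.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in the interval [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_fraction:
        return "test"
    if u < test_fraction + val_fraction:
        return "validation"
    return "train"

def split_precursors(precursors):
    """Partition (sequence, charge) pairs into the three datasets."""
    splits = {"train": [], "validation": [], "test": []}
    for sequence, charge in precursors:
        splits[assign_split(sequence)].append((sequence, charge))
    return splits
```

Fine-tuning would then use only the `train` partition, with `validation` guiding early stopping and `test` reserved for the final assessment of the custom model.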
To assess the potential of transfer learning, we first applied it to a dataset of dimethylated HeLa peptides, an example of a modification that is known to alter retention times and fragmentation behavior (Methods and Fig. 6c). We found that transfer learning accurately modeled the effects of the lysine and N-terminal dimethylation on retention time behavior, improving R from 0.69 to 0.99 (Fig. 6d-i).
Using the transfer learned model resulted in a total of 96,000 unique precursor and 8,613 protein identifications, a 48% increase over the 65,000 precursors identified without transfer learning and a 25% increase in protein groups (Fig. 6d,e and Supplementary Fig. 5). This gain in identifications is driven additively by both improved predictions of retention times from a median prediction error of 317 s down to only 11 s and an increase in the median correlation to predicted spectra from 0.5 to 0.85 (Fig. 6g,h).
Given these large improvements, we wished to ascertain that they were not the result of overfitting, despite the use of a hold-out validation and test dataset. Similarly to before, we used entrapment with the Arabidopsis proteome library followed by transfer learning with all precursors, including false-positive Arabidopsis hits (Extended Data Fig. 10a). Remarkably, even successive rounds of transfer learning led to more confident precursor identifications and <0.5% false Arabidopsis identifications at 1% FDR (Extended Data Fig. 10b-d). Upon inspection, we found that predictions of target hits showed substantially improved agreement with observed data, whereas the opposite was true of false-positive Arabidopsis hits (Extended Data Fig. 10e-g). This implies that end-to-end transfer learning generalizes to the peptide behavior in the actual experiment, improving identifications and control of false discoveries at the same time.

