MacCoss - Barcoding Manuscript

Massively parallel assessment of designed protein solution properties using mass spectrometry and peptide barcoding
Data License: CC BY 4.0 | ProteomeXchange: PXD059224 | doi: https://doi.org/10.6069/ms7b-w526
  • Organism: Escherichia coli BL21(DE3)
  • Instrument: Orbitrap Fusion Lumos, Orbitrap Exploris 480
  • SpikeIn: No
  • Keywords: barcoding, peptide, size exclusion chromatography, SEC, protein design, RFDiffusion, generative design, deep learning, machine learning, DL, ML
  • Lab head: David Baker
  • Submitter: Jeremiah Sims
Abstract
Library screening and selection methods can determine the binding activities of individual members of large protein libraries given a physical link between protein and nucleotide sequence, which enables identification of functional molecules by DNA sequencing. However, the solution properties of individual protein molecules cannot be probed using such approaches because they are completely altered by DNA attachment. Mass spectrometry enables parallel evaluation of protein properties amenable to physical fractionation such as solubility and oligomeric state, but current approaches are limited to libraries of 1,000 or fewer proteins. Here, we improved mass spectrometry barcoding by co-synthesizing proteins with barcodes optimized to be highly multiplexable and minimally perturbative, scaling to libraries of >5,000 proteins. We use these barcodes together with mass spectrometry to assay the solution behavior of libraries of de novo-designed monomeric scaffolds, oligomers, binding proteins and nanocages, rapidly identifying design failure modes and successes.
Experiment Description
Rapid improvements in computational methods have enabled the design of proteins with increasingly sophisticated structure and function (Watson et al. 2023; Dauparas et al. 2022; Baek et al. 2021). Nevertheless, for many tasks the success rate remains low, and the deficiencies in computational models are unclear. Library screens address this gap via efficient pooled testing of thousands of designs, identifying candidates for further development and providing feedback to improve the design process. A wide variety of enrichment methods have been applied to protein libraries, including display methods to measure binding affinity (Boder and Wittrup 1997) and protease stability (Tsuboyama et al. 2023), fluorescent substrate turnover for enzymatic activity (Harris et al. 2000), and split GFP or enzymatic reporters as proxies for solubility (Cabantous and Waldo 2006; Zutz et al. 2021). However, these methods rely on physical linkage between the protein and the DNA sequence which encodes it, and this requirement has precluded directly screening proteins in solution for key biophysical properties such as oligomeric state, solubility, expression yield, and functional delivery (Egloff et al. 2019; Rhym et al. 2023). Mass spectrometry, by contrast, enables parallel evaluation of protein properties such as aggregation propensity and resistance to chemical or thermal denaturation in solution. Physical fractionation methods such as size exclusion chromatography or bead-based pulldown of substrate-bound protein can be performed on a sample containing many proteins of interest, and the identities of the proteins in each fraction determined by mass spectrometry. However, direct application of shotgun proteomics to large libraries of designed proteins is frequently limited by high sequence similarity among designs. 
To circumvent sequence similarity, mass spectrometry multiplexing via peptide barcodes was recently developed to evaluate nanobody solution binding, monomericity, and expression (Egloff et al. 2019). However, this method stochastically linked peptide barcodes with randomized sequences to designs, and was thus limited to a pre-enriched pool of 1,000 nanobodies attached to ~12,000 total barcodes. We reasoned that an approach combining mass spectrometry with peptide barcodes could provide a powerful way of assessing the properties of thousands of designed proteins in solution. We set out to optimize mass spectrometry barcoding for measuring the properties of diverse de novo-designed proteins, including monomeric and oligomeric scaffolds, minibinders, and nanocages. Because the previous approach used a shotgun cloning strategy to stochastically assign barcodes to a small nanobody library, it required next generation sequencing (NGS) to identify unique barcode assignments to designs. While the previous approach proved effective, we reasoned that pre-assignment of barcodes at the DNA oligo level would afford greater overall throughput than a shotgun cloning approach, because pre-assignment would mitigate non-unique pairings of barcode and designed protein. Additionally, the previous approach suffered from barcode dropout at the level of mass spectrometry, likely due to barcode-specific differences in ionization efficiency. Thus, we aimed to identify a set of barcodes that would ionize reliably, further increasing the fidelity of barcode identification and, in turn, throughput.
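The contrast between stochastic linking and pre-assignment can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the default of five barcodes per design, and the in-order allocation from the pool are all assumptions made for the example.

```python
from itertools import islice
from typing import Dict, Iterable, List


def preassign_barcodes(designs: List[str],
                       barcode_pool: Iterable[str],
                       per_design: int = 5) -> Dict[str, List[str]]:
    """Give every design its own disjoint set of barcodes.

    Hypothetical helper: barcodes are handed out in pool order, so each
    barcode maps to exactly one design and no sequencing step is needed
    to learn the pairing.
    """
    # dict.fromkeys drops duplicate barcodes while preserving order.
    pool = iter(dict.fromkeys(barcode_pool))
    assignment: Dict[str, List[str]] = {}
    for design in designs:
        chosen = list(islice(pool, per_design))
        if len(chosen) < per_design:
            raise ValueError("barcode pool too small for this library")
        assignment[design] = chosen
    return assignment


def barcode_to_design(assignment: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the assignment so a detected barcode identifies its design."""
    return {bc: design
            for design, barcodes in assignment.items()
            for bc in barcodes}
```

Because the lookup table is fixed before synthesis, a barcode detected by mass spectrometry resolves to a single design by construction, which is the property that stochastic assignment has to recover after the fact by sequencing.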
We began by seeking to design peptide barcodes in silico that are (1) co-synthesized with designs on an oligonucleotide array; (2) easily purified for mass spectrometry (Figure 1a); (3) minimally perturbative to the attached designs; and (4) efficiently separated and quantified by high resolution orbitrap liquid chromatography-coupled tandem mass spectrometry (LC-MS/MS). To meet criteria (1) and (2), we adapted pET-28, a T7 expression vector, for library cloning of 300-nt oligos containing barcodes fused to protein designs of up to 74 amino acids, or 154 amino acids if using oligo assembly. Proteins expressed from this vector contain a barcode flanked by a designed protein and either an N- or C-terminal His-tag. Arginine and lysine residues were restricted to the barcode boundaries so that barcodes could be readily isolated from the protein of interest and tags by a sequence of LysC digestion, His-tag purification, and trypsin digestion. Barcodes were limited to 8 to 13 amino acids in length and predicted to generate doubly-charged precursors by electrospray ionization, in order to optimally cover the mass-to-charge range of high resolution orbitrap mass spectrometry. To meet criterion (3), the amino acid content of the barcodes was limited to avoid bulky hydrophobic residues likely to disrupt folding, as well as residues that affect net charge in electrospray ionization, residues likely to undergo chemical modification, and residues that interfere with tryptic cleavage. Finally, to meet criterion (4), LC indexed retention time (iRT, a standardized measurement for predicting elution; Escher et al. 2012) and MS/MS fragmentation spectra were predicted for candidate barcodes using Prosit, and a first-generation set of up to 100,000 barcodes (v1 barcodes) was defined based on separability at the expected m/z and LC resolution.
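As a rough illustration of how the sequence-level criteria could be checked, the Python sketch below filters candidate barcodes and computes the m/z of the doubly-charged precursor from standard monoisotopic residue masses. Several details are assumptions rather than values from the text: the exact excluded residue set (`EXCLUDED`), the requirement that the barcode peptide ends in a single K or R, and the m/z window. Retention time and fragmentation prediction (done with Prosit in the text) are not modeled here.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

# ASSUMPTION: illustrative stand-in for the residue classes the text
# describes (bulky hydrophobic, charge-altering, modification-prone,
# cleavage-interfering). The actual alphabet is not given in the text.
EXCLUDED = set("WFYHMCP")


def precursor_mz(peptide: str, charge: int = 2) -> float:
    """m/z of the protonated peptide at the given charge state."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def passes_barcode_rules(peptide: str,
                         mz_window: tuple = (350.0, 1000.0)) -> bool:
    """Check length, K/R placement, composition, and 2+ precursor m/z."""
    if not 8 <= len(peptide) <= 13:          # length limit from the text
        return False
    if peptide[-1] not in "KR":              # assumed tryptic C-terminus
        return False
    if any(aa in "KR" for aa in peptide[:-1]):  # no internal K/R
        return False
    if any(aa in EXCLUDED for aa in peptide):
        return False
    low, high = mz_window                    # assumed acquisition window
    return low <= precursor_mz(peptide) <= high
```

For example, `passes_barcode_rules("ASTDELVNSGR")` returns `True` under these assumptions, while the same sequence with an internal lysine is rejected. A real design run would add the iRT and fragment-spectrum separability checks on top of this filter.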
We applied our improved v1 mass spectrometry barcoding approach to a set of 520 de novo-designed small beta barrels with six different barrel topologies, four of which were previously identified as protease-resistant via yeast surface display (Kim et al. 2023). Encouraged by the ability to identify successful monomeric designs, we next applied pooled MS barcoding to a ~10x larger library of 4,495 oligomers. The assembly of oligomers in solution is not well recapitulated in yeast display, so library-scale methods have not been applied to designed oligomers to date. For this design library, we expected successful designs to form highly stable complexes due to their large oligomeric interfaces. These complexes first assemble in the clonal environment of individual cells and, if stable, should be maintained through cell lysis, pooled purification, and pooled SEC. Further, to test pooled MS barcoding of larger proteins, we barcoded a set of 5,068 homo-oligomers designed from curved helical repeats, ranging in symmetry from C2 to C6 (26,805 barcodes, minimum 5 barcodes per design, design length 119-156 aa). We further increased the size range to ~215 kDa in a SEC-MS screen of 1,173 one-component I3 nanoparticle interfaces with constant scaffolds.
We compared data-dependent acquisition (DDA) and data-independent acquisition (DIA) for barcode identification by linking ~25,000 v1 barcodes to muGFP for single-sample readouts, which lack the identification benefit of matching peptide IDs between runs. Whereas DIA used the MS2 signals for quantification, our DDA protocol relied on the MS1 signal. The unoptimized DIA protocol (500,000 MS1 resolution, 5 µm silica packing) yielded more peptide identifications than our DDA protocol, supporting the hypothesis that DIA detects these barcodes more reliably at higher pool complexities.
We further improved sensitivity and quantitative accuracy by using columns with finer particle sizes (from 5 µm to 1.9 µm), which provided higher chromatographic resolution, and by reducing the orbitrap MS1 resolution (from 500,000 to 30,000) to permit a more even survey of the sample. Together, these changes doubled the barcode identification rate (30% to 60%). Despite this increase from our optimized DIA protocol, we still observed 40% barcode dropout. Barcodes with a high hydrophilicity score, and particularly barcodes rich in the acidic residues Asp and Glu, were detected at lower rates. Using this information, we generated a v2 barcode set with fewer acidic residues. LC elution times for 95% of detected barcodes were accurately predicted within 0.5 iRT, confirming that Prosit predictions generalize with high accuracy to synthetic peptide sequences, and this library achieved an overall 86.6% barcode recovery rate with the optimized DIA protocol. Further, the high detection rate observed among these barcodes through our purification protocol suggests that the vast majority of these barcodes do not impact soluble expression of muGFP and are, therefore, suitable for screening libraries of proteins for properties related to soluble expression (see barcode criterion 3 above). The v2 barcoding library was developed to improve detection of barcodes to permit high-confidence identifications of hits in complex pools. To evaluate the utility of the v2 barcode library, we set out to screen a library of 1,187 putative de novo tetrahedra.
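The two summary statistics quoted above (the fraction of barcodes whose elution was predicted within a tolerance, and the barcode recovery rate) reduce to simple ratios. The Python sketch below shows one way to compute them. The function names and input layout are hypothetical, and the text does not specify how the authors defined the denominators, so treat the choice here (all expected barcodes for recovery, all paired observations for iRT accuracy) as an assumption.

```python
from typing import Iterable, Sequence


def fraction_within_tolerance(predicted: Sequence[float],
                              observed: Sequence[float],
                              tol: float = 0.5) -> float:
    """Fraction of barcodes whose observed iRT is within tol of prediction."""
    pairs = list(zip(predicted, observed))
    if not pairs:
        raise ValueError("no paired retention times supplied")
    hits = sum(1 for p, o in pairs if abs(p - o) <= tol)
    return hits / len(pairs)


def recovery_rate(detected: Iterable[str], expected: Iterable[str]) -> float:
    """Fraction of expected barcodes that appear among the detected ones."""
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("expected barcode set is empty")
    # Detected peptides that are not in the library are ignored.
    return len(expected_set & set(detected)) / len(expected_set)
```

Under this definition, a recovery rate of 0.6 corresponds to the 40% dropout described for the v1 barcodes with the optimized DIA protocol.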
Sample Description
01_UWPR_beta_barrels:
MS_sample  SEC_Well   Elution_volume_mL
MS_154     1.A.11     9.023
MS_155     1.A.12     9.523
MS_156     1.B.1      10.023
MS_157     1.B.2      10.524
MS_158     1.B.3      11.024
MS_159     1.B.4      11.524
MS_160     1.B.5      12.024
MS_161     1.B.6      12.524
MS_162     1.B.7      13.024
MS_163     1.B.8      13.524
MS_164     1.B.9      14.024
MS_165     1.B.10     14.524
MS_166     1.B.11     15.025
MS_167     1.B.12     15.525
MS_168     1.C.1      16.025
MS_169     1.C.2      16.525
MS_170     1.C.3      17.025
MS_171     1.C.4      17.525
MS_172     1.C.5      18.025
MS_173     1.C.6      18.525
MS_174     1.C.7      19.025

05_UWPR_rolls:
MS_sample  SEC_Well   Elution_volume_mL
MS_191     insoluble
MS_192     soluble
MS_193     injection
MS_212     1.H.10     10.15426254
MS_213     2.A.2      11.15418434
MS_214     2.A.6      12.15399075
MS_215     2.A.10     13.15520859
MS_216     2.B.2      13.65509415
MS_217     2.B.4      14.15519333
MS_218     2.B.6      14.6550684
MS_219     2.B.8      15.15495396
MS_220     2.B.10     15.65505791
MS_221     2.B.12     16.15494537
MS_222     2.C.2      16.65604591
MS_223     2.C.4      17.15616035
MS_224     2.C.6      17.65605545
MS_225     2.C.8      18.15612602
MS_226     2.C.10     18.65602493
MS_227     2.C.12     19.15591621
MS_228     2.D.2      19.65599823
MS_229     2.D.4      20.15587425

23_UWPR_CR_cages:
MS_sample  SEC_Well   Elution_volume_mL
MS_696     1.B.2      8.272531509
MS_697     1.B.3      8.772404671
MS_698     1.B.4      9.272274971
MS_699     1.B.5      9.772377014
MS_700     1.B.6      10.27224064
MS_701     1.B.7      10.77336884
MS_702     1.B.8      11.27346897
MS_703     1.B.9      11.7733345
MS_704     1.B.10     12.27345181
MS_705     1.B.11     12.77331734
MS_706     1.B.12     13.27319908
MS_707     1.C.1      13.77329731
MS_708     1.C.2      14.2731657
MS_709     1.C.3      14.77304649
MS_710     1.C.4      15.27315044
MS_711     1.C.5      15.77424812
MS_712     1.C.6      16.2741394
MS_713     1.C.7      16.7742424
MS_714     1.C.8      17.27409935
MS_715     1.C.9      17.77420807
MS_716     1.C.10     18.27408981
MS_717     1.C.11     18.77394676
MS_718     1.C.12     19.274061

26_UWPR_chip176_BWLM:
MS_sample  SEC_Well   Elution_volume_mL
MS_792     insoluble
MS_793     soluble
MS_794     injection
MS_795     1.A.1      8.28
MS_796     1.A.3      8.78
MS_797     1.A.5      9.28
MS_798     1.A.7      9.78
MS_799     1.A.9      10.28
MS_800     1.A.11     10.78
MS_801     1.B.1      11.28
MS_802     1.B.3      11.77
MS_803     1.B.5      12.27
MS_804     1.B.7      12.77
MS_805     1.B.9      13.27
MS_806     1.B.11     13.77
MS_807     1.C.1      14.27
MS_808     1.C.3      14.77
MS_809     1.C.5      15.27
MS_810     1.C.7      15.77
MS_811     1.C.9      16.27
MS_812     1.C.11     16.77

28_UWPR_chip176_BWLM_S200:
MS_sample  SEC_Well   Elution_volume_mL
MS_912     1.A.10     10.40267754
MS_915     1.A.4      8.90126133
MS_916     1.A.2      8.401278496
MS_917     Injection
MS_920     1.C.2      14.40377331
MS_921     1.B.12     13.90226746
MS_922     1.B.10     13.40219402
MS_923     1.B.8      12.90235329
MS_924     1.B.6      12.40250778
MS_925     1.B.4      11.9024353
MS_926     1.B.2      11.40260124
MS_927     1.A.12     10.90252399
MS_928     1.D.6      18.40488052
MS_929     1.D.4      17.90338707
MS_930     1.D.2      17.40330315
MS_931     1.C.12     16.90345764
MS_932     1.C.10     16.40361786
MS_933     1.C.8      15.90353966
MS_934     1.C.6      15.40369129

30_ms_barcoding_optimization:
MS_1329    01_muGFP_DDA
MS_1342    02_muGFP_DIA
MS_1369    03_gblock_muGFP_DIA
MS_1390    04_gblock_muGFP_DIA_1_9micron_silica
MS_1435    05_new_codes_DIA_replicate1
MS_1436    05_new_codes_DIA_replicate2
MS_1437    05_new_codes_DIA_replicate3

31_HE_Tetrahedra:
MS_sample  SEC_Well   Elution_volume_mL
MS_1444    2.A.1      10.10438251
MS_1445    2.A.3      10.60443211
MS_1446    2.A.5      11.1043005
MS_1447    2.A.7      11.60418034
MS_1448    2.A.9      12.10427666
MS_1449    2.A.11     12.60414886
MS_1450    2.B.1      13.10526466
MS_1451    2.B.3      13.60537148
MS_1452    2.B.5      14.10523796
MS_1453    2.B.7      14.60516167
MS_1454    2.B.9      15.10521507
MS_1455    2.B.11     15.6050787
MS_1456    2.C.1      16.10518265
MS_1457    2.C.3      16.60505867
MS_1458    2.C.5      17.10617065
MS_1459    2.C.7      17.60626793
MS_1460    2.C.9      18.10614777
MS_1461    2.C.11     18.60601044
MS_1462    2.D.1      19.10610962
MS_1463    2.D.3      19.60598946
MS_1464    2.D.5      20.10593414
MS_1465    2.D.7      20.60595322
MS_1466    2.D.9      21.10707283
MS_1467    2.D.11     21.60716629
MS_1468    Injection
MS_1469    Insoluble
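The sample listings pair each MS run with an SEC fraction and its elution volume, so barcode intensities measured across runs can be read as an elution profile. The Python sketch below shows one plausible way to collapse several barcodes into a per-design profile and locate its peak. It is an assumption-laden illustration, not the analysis used for this dataset: the sum-normalization, the median across barcodes, and the function names are all choices made for the example.

```python
from statistics import median
from typing import List, Sequence


def design_profile(barcode_profiles: Sequence[Sequence[float]]) -> List[float]:
    """Median of sum-normalized barcode intensities in each SEC fraction.

    Each inner sequence holds one barcode's intensity per fraction, in the
    same fraction order. Barcodes with no signal at all are skipped.
    """
    normalized = []
    for profile in barcode_profiles:
        total = sum(profile)
        if total > 0:
            normalized.append([x / total for x in profile])
    if not normalized:
        raise ValueError("no barcode with non-zero signal")
    # zip(*...) walks the fractions column by column.
    return [median(column) for column in zip(*normalized)]


def peak_volume(profile: Sequence[float], volumes_ml: Sequence[float]) -> float:
    """Elution volume (mL) of the fraction with the highest signal."""
    if len(profile) != len(volumes_ml):
        raise ValueError("profile and volume lists differ in length")
    best = max(range(len(profile)), key=lambda i: profile[i])
    return volumes_ml[best]
```

As a usage example, with the first four 01_UWPR_beta_barrels fractions (9.023, 9.523, 10.023, 10.524 mL) and two barcodes that both peak in the third fraction, `peak_volume(design_profile(...), volumes)` returns 10.023. Normalizing each barcode before taking the median keeps a single strongly ionizing barcode from dominating the design-level profile.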
Created on 12/24/24, 7:35 PM

David Feldman*1,2, Jeremiah N. Sims*1,3,4, Xinting Li1,2, Richard Johnson5, Stacey Gerben1,2, David E. Kim1,2, Christian Richardson1,6, Brian Koepnick1,2, Helen Eisenach1,2, Derrick R. Hicks1,2, Erin C. Yang1,2, Basile I. M. Wicky1,2, Lukas F. Milles1,2, Asim K. Bera1,2, Alex Kang1,2, Evans Brackenbrough1,2, Emily Joyce1,2, Banumathi Sankaran7, Joshua M. Lubner1,2, Inna Goreshnik1,2, Dionne Vafeados1,2, Aza Allen1,2, Lance Stewart1,2, Michael J. MacCoss5, David Baker1,2,8


  1. Institute for Protein Design, University of Washington, Seattle, WA 98105, USA

  2. Department of Biochemistry, University of Washington, Seattle, WA 98105, USA

  3. Department of Molecular & Cellular Biology, University of Washington, Seattle, WA 98105, USA

  4. Medical Scientist Training Program, University of Washington, Seattle, WA 98105, USA

  5. Department of Genome Sciences, University of Washington, Seattle, WA 98105, USA

  6. Department of Bioengineering, University of Washington, Seattle, WA 98105, USA

  7. Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

  8. Howard Hughes Medical Institute, University of Washington, Seattle, WA 98105, USA


* These authors contributed equally to this work


File                                                              Created              Proteins  Peptides  Precursors  Transitions  Replicates
02_muGFP_DIA_2024-12-24_09-13-49.sky.zip                          2024-12-24 19:18:23  2,373     19,346    19,346      174,114      1
23_UWPR_CR_cages_2024-12-19_09-57-55.sky.zip                      2024-12-24 19:18:23  1,173     2,434     2,434       7,302        23
05_UWPR_rolls_2024-12-16_23-19-35.sky.zip                         2024-12-24 19:18:23  3,060     8,311     8,311       24,933       21
31_HE_Tetrahedra_2024-12-16_23-05-50.sky.zip                      2024-12-24 19:18:23  1,187     8,005     8,005       71,682       26
05_new_codes_DIA_2024-12-16_22-52-27.sky.zip                      2024-12-24 19:18:23  14,969    107,938   107,938     971,442      3
04_gblock_muGFP_DIA_1_9micron_silica_2024-12-16_22-46-14.sky.zip  2024-12-24 19:18:23  2,397     25,992    25,992      233,928      1
03_gblock_muGFP_DIA_2024-12-16_22-41-41.sky.zip                   2024-12-24 19:18:23  2,375     17,980    17,980      161,820      1
01_muGFP_DDA_2024-12-16_22-39-57.sky.zip                          2024-12-24 19:18:23  2,396     939       939         2,817        1
26_UWPR_chip176_BWLM_2024-12-16_22-35-02.sky.zip                  2024-12-24 19:18:23  4,495     5,202     5,202       15,606       21
01_UWPR_beta_barrels_2024-12-16_22-31-57.sky.zip                  2024-12-24 19:18:23  520       2,804     2,804       8,412        19
28_UWPR_chip176_BWLM_S200_2024-12-16_22-30-21.sky.zip             2024-12-24 19:18:23  4,495     4,502     4,502       13,506       19