Improving Antibody CDR Template Selection by Structural Cluster Prediction

Johns Hopkins University
With the advent of high-throughput sequencing, antibody sequences can be acquired at much greater speed than corresponding structures, creating a need for rapid structure determination. Computational modeling is the only feasible method for high-throughput structure determination; however, it does not always produce accurate models. In antibody modeling, the framework regions are well conserved and readily modeled to sub-Angstrom accuracy, but accurate modeling of the complementarity determining region (CDR) loops remains elusive. This is a challenge we must overcome if we are to use models to study antibody function or to design antibodies. Of the six CDR loops, the non-H3 CDR loops (H1, H2, and L1–L3) are easier to model than the H3 loop because they are shorter and vary less in structure and length. Moreover, most of the non-H3 CDR loop structures can be grouped by CDR and length and clustered into a few canonical structure clusters. The ability to accurately predict the correct cluster of a CDR from sequence alone could improve structural modeling. In this thesis, I assessed how well current modeling techniques can identify the CDR canonical structures from sequence alone, and I improved the retrieval accuracy. First, I benchmarked the current CDR loop modeling method in Rosetta and found that it failed to predict the correct canonical structure cluster for 19% of CDRs. Next, I assessed the significance of the failures by comparing them with a random cluster selection model. Then, to improve the accuracy of template selection, I trained a machine learning classifier for each CDR and length group, with sequences as features, and found that the classifier improved the retrieval of canonical structures. This improvement is not achievable by residue position rules alone.
Finally, I propose incorporating canonical class prediction via machine learning to improve canonical structure retrieval accuracy, and I expect this improvement to increase as the less populated CDR clusters become more enriched.
Antibody, complementarity determining regions, CDRs, Rosetta Antibody, protein structural modeling