Description of TMCrys

Method description

TMCrys aims to support the crystallization of transmembrane proteins by predicting, from the protein sequence, the probability of success at several steps of structure determination. XGBoost decision trees were trained on data from the PDBTM database, which hosts structures of transmembrane proteins, and from the TargetTrack database, which was established as part of the Protein Structure Initiative and incorporates hundreds of thousands of structural genomics experiments. Testing was performed on a hold-out test set.

You can find further information in the papers describing the TMCrys method and server:

Julia K. Varga and Gábor E. Tusnády
TMCrys: predict propensity of success for membrane protein crystallization
Bioinformatics (2018) 34, 3126–3130.

Julia K. Varga and Gábor E. Tusnády
The TMCrys server for supporting crystallization of transmembrane proteins
Bioinformatics (2019) 35, 4203–4204.

Creating training and test datasets

Training and test datasets for TMCrys were created using the PDBTM and TargetTrack databases. Entries from PDBTM were categorized as positive examples for all three steps, while entries from TargetTrack were sorted and categorized by the outcome of the experiments.

The training and test data for the whole process were built from the per-step datasets. The positive training and test data of the crystallization step served as the positive training and test sets, respectively. The negative dataset was formed by concatenating the negative datasets of the three steps.
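The assembly of the whole-process dataset can be sketched as follows. This is a minimal illustration, not the original TMCrys scripts: the function name, the dictionary layout, and the entry identifiers are hypothetical, and the step names are taken from the table of dataset sizes.

```python
STEPS = ("solubilization", "purification", "crystallization")


def build_whole_process_sets(step_sets):
    """Combine per-step datasets into the whole-process dataset.

    step_sets maps a step name to {"positive": [...], "negative": [...]}
    (a hypothetical layout chosen for this sketch).
    """
    # Positives come from the crystallization step only.
    positives = list(step_sets["crystallization"]["positive"])
    # Negatives are the plain concatenation of the three negative sets.
    negatives = []
    for step in STEPS:
        negatives.extend(step_sets[step]["negative"])
    return positives, negatives


example = {
    "solubilization": {"positive": ["p1", "p2"], "negative": ["n1"]},
    "purification": {"positive": ["p1"], "negative": ["n2", "n3"]},
    "crystallization": {"positive": ["p1"], "negative": ["n4"]},
}
positives, negatives = build_whole_process_sets(example)
```

The same function would be applied once to the training data and once to the test data, so that the two stay separate.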

Redundancy was removed from the datasets with the CD-HIT package. First, CD-HIT-2D was used to remove entries from the negative dataset that had at least 60% identity to any entry in the positive set. Then every dataset was clustered using CD-HIT, also with a 60% identity threshold.
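The two redundancy-removal passes can be sketched as a small Python wrapper around the CD-HIT command-line tools. This is an illustration under assumptions: the file names are hypothetical, the word size of 4 follows the CD-HIT guidance for identity thresholds between 0.6 and 0.7, and the exact options used for TMCrys are not stated here.

```python
import subprocess

IDENTITY = 0.6  # 60% identity threshold, as stated in the text
WORD_SIZE = 4   # assumption: CD-HIT's suggested word size for 0.6-0.7


def cd_hit_2d_cmd(positive_fasta, negative_fasta, out_fasta):
    """Keep sequences of the second file that are not similar to the first."""
    return ["cd-hit-2d", "-i", positive_fasta, "-i2", negative_fasta,
            "-o", out_fasta, "-c", str(IDENTITY), "-n", str(WORD_SIZE)]


def cd_hit_cmd(in_fasta, out_fasta):
    """Cluster one file and write its representative sequences."""
    return ["cd-hit", "-i", in_fasta, "-o", out_fasta,
            "-c", str(IDENTITY), "-n", str(WORD_SIZE)]


def remove_redundancy(positive_fasta, negative_fasta, run=subprocess.run):
    """Run both passes; output file names are hypothetical."""
    filtered_negative = negative_fasta + ".filtered"
    # Pass 1: drop negatives that resemble any positive entry.
    run(cd_hit_2d_cmd(positive_fasta, negative_fasta, filtered_negative),
        check=True)
    # Pass 2: cluster each dataset on its own.
    run(cd_hit_cmd(positive_fasta, positive_fasta + ".nr"), check=True)
    run(cd_hit_cmd(filtered_negative, filtered_negative + ".nr"), check=True)
    return positive_fasta + ".nr", filtered_negative + ".nr"
```

Calling `remove_redundancy("positive.fasta", "negative.fasta")` requires the CD-HIT binaries on the PATH. The `run` parameter exists only so that the command lines can be inspected without executing them.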

The datasets can be downloaded from here.

Step              Success   Failure   Training   Test   Total
Solubilization       2747      1086       3066    767    3833
Purification         2199       536       2188    547    2735
Crystallization       762      1605       1894    473    2367
Whole process         762      3188       3164    783    3950

Number of entries in the training and test datasets.

Calculating features

Machine learning features were calculated by external programs and in-house scripts. The following groups of variables were calculated with the listed software and packages:

  • Membrane protein topology
  • Signal peptide
  • Amino acid composition
      • Full length protein
      • Transmembrane and extramembrane regions
  • ProtParam
      • Length, molecular weight, GRAVY
      • Half life, instability index, pI
  • protr and AAIndex properties
      • Autocorrelation descriptors
      • Pseudo amino acid compositions and transitions
  • NetSurfP
      • Solvent accessible surface area
      • Exposed and buried residues
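As an illustration of the simplest sequence-derived features in this list, the sketch below computes amino acid composition for the full sequence and for transmembrane and extramembrane regions, plus the GRAVY score. It is a stand-in written for this description, not the ProtParam or in-house code: the function names and the segment format (0-based, half-open index pairs) are assumptions, and the topology itself would come from a separate prediction program. The hydropathy values are the standard Kyte-Doolittle scale.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale, the basis of the GRAVY score.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in seq."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def split_regions(seq, tm_segments):
    """Split seq into transmembrane and extramembrane residues.

    tm_segments is a list of (start, end) pairs, 0-based and half-open
    (a hypothetical format chosen for this sketch).
    """
    in_membrane = [False] * len(seq)
    for start, end in tm_segments:
        for i in range(start, end):
            in_membrane[i] = True
    tm = "".join(aa for aa, flag in zip(seq, in_membrane) if flag)
    extra = "".join(aa for aa, flag in zip(seq, in_membrane) if not flag)
    return tm, extra


tm_part, extra_part = split_regions("MKLLLLGS", [(2, 6)])
tm_composition = amino_acid_composition(tm_part)
```

Computing the composition separately for `tm_part` and `extra_part` gives region-specific variables alongside the full-length ones.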

Training and testing the model

XGBoost decision-tree models were trained, with Bayesian optimization used to tune the training. The classification thresholds were then adjusted to balance sensitivity and specificity.
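One way to balance sensitivity and specificity is to scan candidate thresholds over the predicted scores and keep the one where the two measures are closest. The sketch below shows that idea. The selection criterion is an assumption made for illustration, since the exact rule used for TMCrys is not given here, and the function names are hypothetical. Both classes must be present in the labels.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Scores at or above the threshold are predicted positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(labels, scores):
    """Pick the observed score minimizing |sensitivity - specificity|."""
    def imbalance(threshold):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=imbalance)


labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
best = balanced_threshold(labels, scores)
```

In this toy example the chosen threshold gives a sensitivity and a specificity of 2/3 each. In practice the threshold would be selected on training or validation predictions and then applied unchanged to the test set.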

The model was validated using 10-fold cross-validation. The tests were performed on the hold-out test sets described above. For the whole process, we also ran other available crystallization prediction methods for comparison; these are listed under Curve 4.
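The splitting behind 10-fold cross-validation can be sketched in a few lines. This is a generic illustration rather than the procedure used for TMCrys: the shuffling, the seed, and the function name are assumptions, and whether the original folds were stratified by class is not stated here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


# 3066 is the size of the solubilization training set in the table above.
splits = list(k_fold_indices(3066, k=10))
```

Each entry is used for validation exactly once and for training in the other nine rounds. The hold-out test set is never part of these splits.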

ROC curves of the different models (1-3) and the whole process (4).