The Empirical Comparison of Machine Learning Algorithm for the Class Imbalanced Problem in Conformational Epitope Prediction

Binti Solihah; Azhari Azhari; Aina Musdholifah

doi:10.30595/juita.v9i1.9969

Authors

Binti Solihah Jurusan Teknik Informatika, FTI, Universitas Trisakti
Azhari Azhari
Aina Musdholifah

Keywords:

sampling-based method, class imbalance, conformational epitope, B-cell, machine learning-based

Abstract

A conformational epitope is a part of a protein-based vaccine, and it is challenging to identify experimentally. Computational models have been developed to support identification. However, class imbalance is one of the constraints on achieving optimal performance in conformational B-cell epitope prediction. In this paper, we compare several conformational B-cell epitope prediction models built with non-ensemble and ensemble approaches. To build the non-ensemble models, a sampling method (random undersampling, SMOTE, or cluster-based undersampling) is combined with a decision tree or an SVM. The ensemble models are constructed from a random forest and several variants of the bagging method. The models are validated with 10-fold cross-validation. The experimental results show that the combination of cluster-based undersampling and a decision tree outperformed the other sampling methods, whether combined with the non-ensemble or the ensemble methods. This study provides a baseline for improving existing models that deal with class imbalance in conformational epitope prediction.
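To make the non-ensemble setup concrete, the following is a minimal sketch of one of the combinations named in the abstract: random undersampling paired with a decision tree and scored by 10-fold cross-validation. It is not the authors' implementation. The function names, the synthetic data, the stratified folds, and the balanced-accuracy metric are all assumptions chosen for illustration; the example uses NumPy and scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


def cv_undersampled_tree(X, y, n_splits=10, seed=0):
    """Stratified k-fold CV (an assumption); only training folds are resampled."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        # Balance the training fold; the test fold keeps its original imbalance.
        X_bal, y_bal = random_undersample(X[train_idx], y[train_idx], rng)
        model = DecisionTreeClassifier(random_state=seed).fit(X_bal, y_bal)
        y_pred = model.predict(X[test_idx])
        scores.append(balanced_accuracy_score(y[test_idx], y_pred))
    return np.array(scores)


# Synthetic stand-in data with roughly 10% positives; not the epitope dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
scores = cv_undersampled_tree(X, y)
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Resampling inside the loop, rather than before the split, keeps each test fold at its original class ratio, so the score reflects performance on imbalanced data. Swapping `random_undersample` for a SMOTE or cluster-based sampler, or the tree for an SVM, would give the other non-ensemble combinations under the same protocol.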
