In collaboration with Payame Noor University and Iranian Chemical Science and Technologies Association

Document Type: Full research article

Author

Department of Chemistry, Payame Noor University, 19395-4697, Tehran, Iran

Abstract

Feature selection is crucial in Quantitative Structure-Activity Relationship (QSAR) studies because it improves the performance of learning algorithms and reduces computational cost. This study evaluates the impact of eight variable selection methods on the classification of isoform-selective ligands for the Bcl-2 and Bcl-xL targets using three machine learning techniques: Supervised Kohonen Network (SKN), Support Vector Machine (SVM), and Partial Least Squares Discriminant Analysis (PLS-DA). Classification models were assessed using confusion matrix parameters, 10-fold Venetian blind cross-validation, and test sets.
The results show that PLS-DA and SVM have comparable classification capabilities and that both outperform SKN. However, PLS-DA occasionally leaves some ligands unassigned, which makes SVM the more robust and efficient choice. No variable selection method showed a clear advantage over the others: all achieved around 70% classification accuracy on the validation and test sets. This suggests that the choice of variable selection method does not consistently affect outcomes across all techniques.
Ensuring the reliability of the selected variables requires careful data quality assessment, a literature review, and robust cross-validation. Eliminating redundant features is essential for accurate classification models, as many physicochemical properties may be irrelevant to target bioactivity. While no single method guarantees superior models, selecting important variables remains vital for extracting relevant features. This study highlights the importance of careful variable selection in QSAR studies, emphasizing its role in reducing dimensionality and improving model interpretability. Ultimately, this improves the efficiency of drug discovery by helping to identify safer and more effective compounds at lower time and cost.
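
As an illustration of the evaluation protocol named in the abstract, the following minimal Python sketch shows how Venetian blind folds can be formed and how confusion matrix parameters can be derived for a two-class problem. It is not the authors' implementation. The function names, the choice of Bcl-2 as the positive class, and the handling of unassigned ligands (represented here as `None` and counted against accuracy) are assumptions made only for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def venetian_blind_folds(n_samples: int, n_splits: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, validation_indices) pairs.

    Sample i is held out in fold i % n_splits, so every n_splits-th sample
    ends up in the same validation fold (the "Venetian blind" pattern).
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    folds = []
    for k in range(n_splits):
        val = [i for i in range(n_samples) if i % n_splits == k]
        train = [i for i in range(n_samples) if i % n_splits != k]
        folds.append((train, val))
    return folds


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_parameters(
    y_true: Sequence[str],
    y_pred: Sequence[Optional[str]],
    positive: str,
) -> Dict[str, float]:
    """Compute two-class confusion matrix parameters.

    Assumption: a prediction of None means the ligand was left unassigned.
    Unassigned samples are excluded from the four confusion matrix cells
    but still count in the accuracy denominator.
    """
    tp = fn = fp = tn = unassigned = 0
    for truth, pred in zip(y_true, y_pred):
        if pred is None:
            unassigned += 1
        elif truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    total = tp + fn + fp + tn + unassigned
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, total),
        "not_assigned": _ratio(unassigned, total),
    }


if __name__ == "__main__":
    # Hypothetical labels, for demonstration only.
    truth = ["Bcl-2", "Bcl-2", "Bcl-xL", "Bcl-xL", "Bcl-2"]
    predicted = ["Bcl-2", "Bcl-xL", "Bcl-xL", "Bcl-xL", None]
    print(confusion_parameters(truth, predicted, positive="Bcl-2"))
    print(venetian_blind_folds(25, n_splits=10)[0][1])
```

Because Venetian blind splitting is deterministic and depends on sample order, it is usually applied to data that are not sorted by class; otherwise some folds can end up dominated by one class.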

Keywords

 