Implementation of Feature Selection to Improve the Accuracy of Gender Classification Based on Voice Data with Random Forest
DOI: https://doi.org/10.32815/jitika.v20i1.1204
Keywords: feature selection, gender classification, random forest, voice data
Abstract
Voice-based gender recognition has gained increasing importance in biometrics, security, forensics, and human–computer interaction. While humans can easily distinguish male and female voices, automatic classification remains challenging because acoustic data are highly variable and high-dimensional. This study investigates the role of feature selection in enhancing the performance and efficiency of Random Forest for gender classification. The dataset, obtained from Kaggle, consists of 3,168 balanced voice samples with 23 acoustic features. Using Pearson’s correlation analysis, the five features most strongly correlated with the target variable were selected. Random Forest classification was then conducted using both the full set of 22 features and the reduced set of 5 features. The accuracy gain was marginal (98% to 99%), but computation time decreased substantially, from 0.3 to 0.1 seconds, a 66% reduction. These findings suggest that lightweight correlation-based feature selection can simplify models and enable faster real-time applications without compromising predictive performance. The study emphasizes efficiency rather than accuracy as its main contribution, providing methodological insight for designing scalable and inclusive voice-based gender recognition systems.
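The correlation-based selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the column names (`meanfun`, `iqr`, `noise`), the 1/0 label encoding, and the helper `select_top_k` are all hypothetical and chosen only for illustration. The sketch ranks features by the absolute value of their Pearson correlation with the target and keeps the top k; ranking by absolute value is itself an assumption about how "strongest association" is measured.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_top_k(features, target, k):
    """Return the k feature names with the largest |r| against the target,
    together with the full score table (hypothetical helper)."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data; the encoding 1 = male, 0 = female is an assumption.
target = [1, 1, 1, 0, 0, 0]
features = {
    "meanfun": [0.10, 0.11, 0.12, 0.18, 0.19, 0.20],  # separates the classes
    "iqr": [0.09, 0.08, 0.10, 0.03, 0.04, 0.02],      # also separates them
    "noise": [0.5, 0.1, 0.9, 0.5, 0.1, 0.9],          # same pattern in both classes
}

selected, scores = select_top_k(features, target, k=2)
print(selected)
```

On this toy data the two class-separating columns are kept and the uninformative column is dropped. In a full pipeline, only the selected columns would then be passed to a Random Forest classifier, and training and prediction time could be compared against the run that uses all features.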
Copyright (c) 2025 Jurnal Ilmiah Teknologi Informasi Asia

This work is licensed under a Creative Commons Attribution 4.0 International License.