Implementation of Feature Selection to Improve the Accuracy of Gender Classification Based on Voice Data with Random Forest

Authors

  • Suhardiyanto Suhardiyanto, Universitas PGRI Ronggolawe, Tuban Regency, 62391, Indonesia
  • Fitroh Amaluddin, Universitas PGRI Ronggolawe, Tuban Regency, 62391, Indonesia
  • Aris Wijayanti, Universitas PGRI Ronggolawe, Tuban Regency, 62391, Indonesia

DOI:

https://doi.org/10.32815/jitika.v20i1.1204

Keywords:

feature selection, gender classification, random forest, voice data

Abstract

Voice-based gender recognition is increasingly important in biometrics, security, forensics, and human–computer interaction. Although humans readily distinguish male from female voices, automatic classification remains challenging because of variability and high-dimensional acoustic data. This study investigates how feature selection affects the performance and efficiency of Random Forest for gender classification. The dataset, obtained from Kaggle, consists of 3,168 balanced voice samples with 23 acoustic features. Pearson’s correlation analysis was used to select the five features most strongly associated with the target variable. Random Forest classification was then run on both the full set of 22 features and the reduced set of 5 features. The accuracy gain was marginal (98% to 99%), but computation time fell from 0.3 to 0.1 seconds, a reduction of roughly 66%. These findings suggest that lightweight correlation-based feature selection can simplify models and enable faster real-time applications without compromising predictive performance. The study's main contribution is therefore efficiency rather than accuracy, offering a methodological insight for designing scalable and inclusive voice-based gender recognition systems.
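
The correlation-based selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes a binary 0/1 target and ranks features by the absolute value of Pearson's r. The data are synthetic stand-ins rather than the Kaggle voice dataset, and the names `pearson`, `select_top_k`, `signal_*`, and `noise_*` are hypothetical. The Random Forest training and timing steps are omitted.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_top_k(columns, target, k=5):
    """Return the k feature names with the largest |r| against the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Synthetic stand-in data (assumption: NOT the Kaggle voice dataset).
random.seed(0)
n_samples = 200
target = [i % 2 for i in range(n_samples)]  # balanced 0/1 labels

columns = {}
for j in range(5):
    # Hypothetical informative features: shifted by the label, plus noise.
    columns[f"signal_{j}"] = [2.0 * y + random.gauss(0, 0.5) for y in target]
for j in range(17):
    # Hypothetical uninformative features: pure noise.
    columns[f"noise_{j}"] = [random.gauss(0, 1) for _ in target]

selected = select_top_k(columns, target, k=5)
print(selected)
```

In this toy setup the five label-dependent columns are the ones retained from the 22 candidates. On real acoustic features the ranking depends on the data, and a linear measure such as Pearson's r can miss nonlinear relationships.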

Published

02-11-2025

How to Cite

Suhardiyanto, S., Amaluddin, F., & Wijayanti, A. (2025). Implementation of Feature Selection to Improve the Accuracy of Gender Classification Based on Voice Data with Random Forest. Jurnal Ilmiah Teknologi Informasi Asia, 20(1), 1–7. https://doi.org/10.32815/jitika.v20i1.1204