Optimizing a Logistic Regression Model for Diabetes Prediction Using Correlation-Based Feature Selection

  • Wahyu Nugraha, Universitas Bina Sarana Informatika
  • Muhamad Syarif, Universitas Bina Sarana Informatika
Keywords: Diabetes Prediction, Logistic Regression, Machine Learning, Feature Selection, Correlation Analysis

Abstract

Diabetes mellitus is a pressing global health challenge, and early detection is a key component of effective intervention. Machine learning has shown great potential for predicting diabetes risk. Among the available models, Logistic Regression (LR) is often favored in medical contexts for its high interpretability, although its accuracy frequently lags behind that of more complex black-box models. LR performance is also known to be highly sensitive to the quality and relevance of its input features. This study quantitatively evaluates the impact of a strict correlation-based feature selection strategy on the accuracy of an LR model. Using the "Diabetes Health Indicators" dataset (N = 100,000), it compares two scenarios: (1) a baseline LR model trained on all features (All Input) and (2) an optimized LR model trained only on the subset of features, including engineered features, that have a high absolute correlation with the diabetes diagnosis (Correlated Input). The results show a clear performance improvement: the All Input baseline achieved an accuracy of 80.45%, while the Correlated Input model achieved 85.67%. This gain of 5.22 percentage points indicates that correlation-based feature selection removes noise from irrelevant features and thereby improves the predictive power of the LR model. The study concludes that an optimized Logistic Regression with feature selection offers a strong balance between improved accuracy and the interpretability essential for clinical applications.
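The two scenarios compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The data are synthetic (generated with scikit-learn's `make_classification`, not the Diabetes Health Indicators dataset), and the correlation cutoff `THRESHOLD`, the helper names `select_by_correlation` and `fit_and_score`, the 80/20 split, and the standardization step are all assumptions made for illustration. Correlations are computed on the training split only, a design choice assumed here so that the test labels do not influence feature selection.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the Diabetes Health Indicators dataset.
X_arr, y_arr = make_classification(
    n_samples=5000, n_features=20, n_informative=5, n_redundant=2,
    n_clusters_per_class=1, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])
y = pd.Series(y_arr, name="diabetes")

# Hold out a test split; the 80/20 ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def select_by_correlation(features, target, threshold):
    """Return the columns whose absolute Pearson correlation with the target
    is at least `threshold` (hypothetical helper)."""
    abs_corr = features.corrwith(target).abs()
    return abs_corr[abs_corr >= threshold].index.tolist()

def fit_and_score(columns):
    """Fit a standardized Logistic Regression on `columns` and return
    its accuracy on the held-out test split (hypothetical helper)."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train[columns], y_train)
    return accuracy_score(y_test, model.predict(X_test[columns]))

THRESHOLD = 0.1  # hypothetical cutoff; the abstract does not state one

# Scenario 2 feature subset, chosen from the training data only.
selected = select_by_correlation(X_train, y_train, THRESHOLD)

acc_all = fit_and_score(list(X.columns))  # Scenario 1: All Input
acc_corr = fit_and_score(selected)        # Scenario 2: Correlated Input

print(f"All Input ({X.shape[1]} features): accuracy = {acc_all:.4f}")
print(f"Correlated Input ({len(selected)} features): accuracy = {acc_corr:.4f}")
```

On synthetic data the two accuracies need not reproduce the gap reported in the abstract; the sketch only shows the mechanics of the comparison. Applying it to real data would mean replacing the generated `X` and `y` with the dataset's feature table and diagnosis column.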

Published
2025-12-18