The k-Nearest-Neighbors (kNN) algorithm is used in a variety of fields to classify or predict data. It is a simple algorithm that classifies a data point based on how similar it is to an existing class of data points. One benefit of this model is its simplicity; another is that it is non-parametric, which lets it fit a wide variety of datasets. One drawback is its higher computational cost relative to other models, which limits its speed and performance on big data. Despite this, the model's simplicity makes it easy to understand and to implement in many fields. One such field is healthcare, where kNN models have been used successfully to predict diseases such as diabetes and hypertension. In this paper we focus on the methodology and application of kNN models in healthcare to predict diabetes, a pressing public health problem.
To better understand the role of kNN in healthcare applications, it is important to first review its theoretical foundations, the key factors affecting its performance, and recent advancements in optimizing kNN for large datasets and medical diagnosis, particularly for diabetes prediction.
Theoretical Background of kNN
kNN is a supervised learning algorithm that labels a data point by comparing it to similar, already-labeled data points. It rests on the assumption that data points that are similar to each other must lie close to each other. Zhang (Z. Zhang 2016) gives an introduction to how kNN works and how to run a kNN model in RStudio. He describes the methodology as assigning an unlabeled observation to a class by using labeled examples that are similar to it, and presents the Euclidean distance equation, the default distance metric for kNN. He also describes the impact of the k parameter, which tells the model how many neighbors to consult when classifying a data point, and recommends setting k to the square root of the number of observations in the training dataset.
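Zhang's rule of thumb can be made concrete with a small helper. The sketch below is our own illustration; the function name and the adjustment to an odd k (a common convention to avoid tied binary votes) are not from the thesis:

```python
import math

def suggest_k(n_train):
    """Heuristic starting k: the square root of the training-set size,
    rounded and forced odd so a binary vote cannot end in a tie."""
    k = max(1, round(math.sqrt(n_train)))
    return k if k % 2 == 1 else k + 1

print(suggest_k(400))  # sqrt(400) = 20, bumped to 21
```

This only provides a starting point; the value should still be validated against held-out data.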
Although Zhang's recommendation for the k parameter can be a good starting point, S. Zhang et al. (2017) proposed decision-tree-assisted tuning to optimize k, significantly enhancing accuracy. The authors propose a training stage in which a decision tree selects the ideal k values, making kNN more efficient. They deployed and tested two such methods, called kTree and k*Tree, and found that their approach reduced running costs and increased classification accuracy.
Another major influence on accuracy is the distance metric the model uses to find neighbors. Although Euclidean distance is the default for kNN, other distances can be used. Kataria and Singh (2013) compare different distance metrics in classification algorithms, with a focus on kNN. They first explain how the algorithm uses the nearest k neighbors to classify data points, and how the Euclidean distance does this by drawing a line segment between point a and point b and measuring its length with the Euclidean distance formula. They then describe the "cityblock" or taxicab distance, defined as "the sum of the length of the projections of the line segment", along with the cosine and correlation distances, and compare the performance of the default Euclidean distance against the cityblock, cosine, and correlation distances. In their observations, the Euclidean distance proved more efficient than the others.
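The metrics compared in that study can be written out directly. The sketch below gives plain-Python implementations (the function names are ours) of the Euclidean, cityblock, and cosine distances:

```python
import math

def euclidean(a, b):
    # straight-line distance between points a and b
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    # taxicab distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(cityblock((0, 0), (3, 4)))   # 7
```

Note how the two metrics already disagree on the same pair of points; that disagreement is what makes the choice of metric matter for which neighbors end up "nearest".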
Syriopoulos et al. (Syriopoulos et al. 2023) also reviewed distance metric selection, confirming that Euclidean distance remains the most effective choice for most datasets. However, alternative metrics like Mahalanobis distance can perform better for correlated features. The review emphasized that selecting the right metric is dataset-dependent, influencing classification accuracy.
Challenges in Scaling kNN for Large Datasets
While kNN is simple and effective, it struggles with computational inefficiency when working with large datasets since it must calculate distances for every new observation. This becomes a major challenge in big data, where the sheer volume of information makes traditional kNN methods slow and resource-intensive.
To address this, Deng et al. (Deng et al. 2016) proposed an improved approach called LC-kNN, which combines k-means clustering with kNN to speed up computations and enhance accuracy. By dividing large datasets into smaller clusters, their method reduces the number of distance calculations needed. After extensive testing, the authors found that LC-kNN consistently outperformed standard kNN, achieving higher accuracy and better efficiency. Their study highlights a key limitation of traditional kNN (without optimization, its performance significantly declines on big data) and offers an effective solution to improve its scalability.
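The cluster-then-classify idea behind LC-kNN can be sketched roughly as follows. This is our simplified reconstruction using scikit-learn, not the authors' code; the class name and parameters are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

class ClusteredKNN:
    """Partition the training set with k-means, then run kNN only inside
    the cluster nearest to each query, cutting distance computations."""

    def __init__(self, n_clusters=5, n_neighbors=5):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        labels = self.km.fit_predict(X)
        self.models = {}
        for c in range(self.km.n_clusters):
            mask = labels == c
            k = min(self.n_neighbors, mask.sum())  # small clusters get a smaller k
            self.models[c] = KNeighborsClassifier(n_neighbors=k).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        clusters = self.km.predict(X)  # route each query to its cluster's model
        return np.array([self.models[c].predict(x.reshape(1, -1))[0]
                         for c, x in zip(clusters, X)])

# Two well-separated blobs: each query should take its blob's label.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
preds = ClusteredKNN(n_clusters=2, n_neighbors=3).fit(X, y).predict(
    np.array([[0.5, 0.5], [10.5, 10.5]]))
```

With C clusters of roughly n/C points each, a query is compared against about n/C training points instead of all n, which is the source of the speedup the authors report.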
Continuing and summarizing these ideas, Syriopoulos et al. (Syriopoulos et al. 2023) explored techniques for accelerating kNN computations, such as:
Dimensionality reduction (e.g., PCA, feature selection) to reduce data complexity.
Approximate Nearest Neighbor (ANN) methods to speed up distance calculations.
Hybrid models combining kNN with clustering (e.g., LC-kNN) to improve efficiency.
This approach enhanced both speed and accuracy, making it a promising solution for handling large datasets. In addition, the study categorizes kNN modifications into local hyperplane methods, fuzzy-based models, weighting schemes, and hybrid approaches, demonstrating how these adaptations help tackle issues like class imbalance, computational inefficiency, and sensitivity to noise.
Another key challenge for kNN is its performance in high-dimensional datasets. The 2023 study by Syriopoulos et al. evaluates multiple nearest neighbor search algorithms, such as kd-trees, ball trees, Locality-Sensitive Hashing (LSH), and graph-based search methods, that let kNN scale to larger datasets by minimizing the number of distance calculations.
The enhancements to kNN have substantially increased its performance in terms of speed and accuracy which now allows it to better handle large-scale datasets. However, as Syriopoulos et al. primarily compile prior research rather than conducting empirical comparisons, further work is needed to evaluate these optimizations in real-world medical classification tasks.
kNN in Disease Prediction: Applications & Limitations
Disease Prediction with kNN
kNN has been widely used for diabetes classification and early detection. Ali et al. (Ali et al. 2020) tested six different kNN variants in MATLAB to classify blood glucose levels, finding that fine kNN was the most accurate. Their research highlights how optimizing kNN can improve classification performance, making it a valuable tool in healthcare.
In turn, Saxena et al. (Saxena, Khan, and Singh 2014) used kNN on a diabetes dataset and observed that increasing the number of neighbors (k) led to better accuracy, but only to a certain extent. In their MATLAB-based study, they found that using k = 3 resulted in 70% accuracy, while increasing k to 5 improved it to 75%. Both studies demonstrate how kNN can effectively classify diabetes, with accuracy depending on the choice of k and dataset characteristics. Ongoing research continues to refine kNN, making it a more efficient and reliable tool for medical applications.
Feature selection is another critical factor. Panwar et al. (2016) demonstrated that focusing on just BMI and Diabetes Pedigree Function improved accuracy, suggesting that simplifying feature selection enhances model performance. Suriya and Muthu (2023) showed that kNN is a promising model for predicting type 2 diabetes, achieving its highest accuracy on smaller datasets. The authors tested three datasets of varying sizes, from 692 to 1,853 rows and 9 to 22 dimensions, and found that larger datasets require a higher k value. Notably, reducing dimensionality with PCA did not improve model performance, suggesting that simplifying the data does not always lead to better results in diabetes prediction.

The same finding about PCA's influence on ML models, and kNN in particular, appears in the research of Iparraguirre-Villanueva et al. (2023). They also confirmed that kNN alone is not always the best choice. Comparing kNN with Logistic Regression, Naïve Bayes, and Decision Trees, they found that kNN performed well on balanced datasets but struggled when class imbalances existed. While PCA significantly reduced accuracy for all models, the SMOTE-preprocessed dataset yielded the highest accuracy for the kNN model (79.6%), followed by BNB at 77.2%. This underscores the importance of correct preprocessing techniques in improving kNN accuracy, especially when handling imbalanced datasets.
Khateeb & Usman (Khateeb and Usman 2017) extended kNN’s application to heart disease prediction, demonstrating that feature selection and data balancing techniques significantly impact accuracy. Their study showed that removing irrelevant features did not always improve performance, emphasizing the need for careful feature engineering in medical datasets.
kNN Beyond Prediction: Handling Missing Data
While kNN is widely known for classification, it also plays a key role in data preprocessing for medical machine learning. Altamimi et al. (Altamimi et al. 2024) explored kNN imputation as a method to handle missing values in medical datasets. Their study showed that applying kNN imputation before training a machine learning model significantly improved diabetes prediction accuracy - from 81.13% to 98.59%. This suggests that kNN is not only useful for disease classification but also for improving data quality and completeness in healthcare applications.
Traditional methods often discard incomplete records, but kNN imputation preserves valuable information, leading to more reliable model performance. However, Altamimi et al. (2024) also highlighted challenges such as computational costs and sensitivity to parameter selection, reinforcing the need for further optimization when applying kNN to large-scale medical datasets.
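scikit-learn ships a ready-made version of this idea as KNNImputer. The toy sketch below is our own example with invented values loosely resembling a BMI column; it fills a missing entry from the two most similar rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Column 0 mimics BMI, column 1 a binary indicator; values are invented.
X = np.array([
    [25.0, 1.0],
    [27.0, 1.0],
    [np.nan, 1.0],  # missing BMI to be imputed
    [40.0, 0.0],
])

# Each NaN is replaced by the mean of that feature over the 2 nearest rows,
# where nearness is measured on the features both rows have observed.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled[2, 0])  # mean of 25.0 and 27.0 -> 26.0
```

Unlike dropping the incomplete row, this keeps the record in the training set, which is the advantage Altamimi et al. report.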
Comparing kNN Variants & Hybrid Approaches
Research indicates that kNN works well for diabetes prediction, but recent studies show it does not consistently provide the best results. Theerthagiri et al. (2022) evaluated kNN against multiple machine learning models, including Naïve Bayes, Decision Trees, Extra Trees, Radial Basis Function (RBF) networks, and Multi-Layer Perceptron (MLP), on the Pima Indians Diabetes dataset. kNN performed adequately, but MLP surpassed all other algorithms, achieving the top accuracy of 80.68% and the leading AUC-ROC of 86%. Despite its effectiveness in classification tasks, kNN's primary limitation is its inability to compete with advanced models like neural networks when processing complex datasets.
In turn, Uddin et al. (Uddin et al. 2022) explored advanced kNN variants, including Weighted kNN, Distance-Weighted kNN, and Ensemble kNN. Their findings suggest that:
Weighted kNN improved classification by assigning greater importance to closer neighbors.
Ensemble kNN outperformed standard kNN in disease prediction but required additional computational resources.
Performance was highly sensitive to the choice of distance metric and k value tuning.
Their findings suggest that kNN can be improved through modifications, but it remains highly sensitive to dataset size, feature selection, and distance metric choices. In large-scale healthcare applications, Decision Trees (DT) and ensemble models may offer better trade-offs between accuracy and efficiency. These studies highlight the ongoing debate over kNN’s role in medical classification - whether modifying kNN is the best approach or if other models, such as DT or ensemble learning, provide stronger performance for diagnosing diseases.
kNN continues to be a valuable tool in medical machine learning, offering simplicity and strong performance in classification tasks. However, as research shows, its effectiveness depends on proper feature selection, optimized k values, and preprocessing techniques like imputation. While kNN remains an interpretable and adaptable model, newer methods - such as ensemble learning and neural networks - often outperform it, particularly in large-scale datasets. For our capstone project, exploring feature selection, fine-tuning kNN’s settings, and comparing it to other algorithms could give us valuable insights into its strengths and limitations.
Methods
The kNN algorithm is a nonparametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023). For classification, it rests on the assumption that similar data points lie close to each other. It classifies a data point by using the Euclidean distance formula to find the k nearest points. Once these k points have been found, kNN assigns the new data point to the category held by the majority of those neighbors (Z. Zhang 2016). Figure 1 illustrates this methodology with two distinct classes, hearts and circles. The kNN algorithm attempts to classify the mystery point represented by the red square. With the k parameter set to k = 5, the algorithm uses the Euclidean distance formula to find the 5 nearest neighbors, illustrated by the green circle. From there it simply counts the members of each class and assigns the class holding the majority, which in this case is a heart.
Figure 1. Visual Example of k-Nearest Neighbors (kNN) Classification with k = 5
The red square represents a data point to be classified. The algorithm selects the 5 nearest neighbors within the green circle—3 hearts and 2 circles. Based on the majority vote, the red square is classified as a heart.
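The vote illustrated in Figure 1 can be reproduced in a few lines. The sketch below is a minimal plain-Python implementation (the function name and toy coordinates are ours), with three "heart" points near the query and three "circle" points farther away:

```python
import math
from collections import Counter

def knn_classify(query, X_train, y_train, k=5):
    """Classify `query` by majority vote among its k nearest neighbors."""
    ranked = sorted((math.dist(query, x), label)
                    for x, label in zip(X_train, y_train))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

X = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
y = ["heart", "heart", "heart", "circle", "circle", "circle"]

# The 5 nearest neighbors of (1.5, 1.5) are 3 hearts and 2 circles.
print(knn_classify((1.5, 1.5), X, y, k=5))  # heart
```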
Classification process
The classification process has three distinct steps:
1. Distance calculation
The kNN algorithm first calculates the distance between the data point it's trying to classify and all the points in the training dataset. The most commonly used metric is the Euclidean distance (Theerthagiri, Ruby, and Vidya 2022), which measures the straight-line distance between two points. The formula is:
\[
d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}
\] Figure 2 shows the Euclidean distance formula, where \(X_2 - X_1\) is the horizontal difference and \(Y_2 - Y_1\) is the vertical difference. These two differences are squared so they are positive regardless of direction; squaring also gives greater emphasis to larger distances.
In some cases, Manhattan distance may be used instead. This metric calculates the total absolute difference across dimensions:
\[
d = |X_2 - X_1| + |Y_2 - Y_1|
\]
Unlike Euclidean distance, Manhattan distance follows a grid-like path (horizontal + vertical in Figure 2), making it more suitable for certain types of structured data or for cases where the influence of outliers needs to be minimized (Aggarwal et al. 2015).
2. Neighbor Selection
kNN exposes a parameter k that sets how many neighbors the algorithm consults when classifying an unknown data point. The choice of k matters: too large a k lets the majority class dominate the vote, biasing the model and causing underfitting (Mucherino et al. 2009), while too small a k makes the algorithm overly sensitive to noise and outliers, causing overfitting. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value (Syriopoulos et al. 2023).
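The cross-validation approach can be sketched with scikit-learn. The breast-cancer dataset below is only a stand-in so the example is self-contained; the same loop applies to any training set:

```python
from sklearn.datasets import load_breast_cancer  # stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so no feature dominates

# Mean 5-fold accuracy for each candidate k; pick the best.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9, 11)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```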
3. Classification decision based on majority voting
Once the k-nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have higher influence on the classification decision (Uddin et al. 2022).
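Distance-weighted voting is easy to state in code. In the sketch below (our own toy example), each neighbor votes with weight 1/distance, so a 2-2 tie under plain counting is broken in favor of the closer class:

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest points.
    Each neighbor votes with weight 1/distance, so closer points count more."""
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + 1e-9)  # epsilon guards against zero distance
    return max(votes, key=votes.get)

# Two hearts close by outweigh two circles farther away.
print(weighted_vote([(0.5, "heart"), (0.6, "heart"),
                     (2.0, "circle"), (2.5, "circle")]))  # heart
```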
Assumptions
The k-Nearest Neighbors (kNN) algorithm operates under the assumption that data points with similar features exist in close proximity within the feature space and are therefore likely to belong to the same class (Boateng, Otoo, and Abaye 2020).
Implementation of kNN
Code
library(DiagrammeR)
grViz("digraph {
  graph [layout = dot, rankdir = LR, splines = true, size = 10]
  node [shape = box, style = rounded, fillcolor = lightblue, fontname = Arial, fontsize = 25, penwidth = 2]
  A [label = '1. Load Required Libraries', width = 3, height = 1.5]
  B [label = '2. Import & Explore Dataset', width = 3, height = 1.5]
  C [label = '3. Is preprocessing required?', shape = circle, fillcolor = lightblue, width = 0.8, height = 0.8, fontsize = 25]
  D [label = '3a. Pre-Process the data', width = 3, height = 1.5]
  E [label = '4. Split Dataset into Training & Testing', width = 3, height = 1.5]
  F [label = '5. Hyperparameter tuning', width = 3, height = 1.5]
  G [label = '6. Train kNN Model', width = 3, height = 1.5]
  H [label = '7. Make Predictions', width = 3, height = 1.5]
  I [label = '8. Evaluate Model', width = 3, height = 1.5]
  A -> B
  B -> C
  C -> E [label = 'No', fontsize = 25]
  C -> D [label = 'Yes', fontsize = 25]
  D -> E
  E -> F
  F -> G
  G -> H
  H -> I
  # Edge style
  edge [color = '#8B814C', arrowhead = vee, penwidth = 2]
}")
Pre-processing Data
Data must be prepared before implementing kNN. For the algorithm to work we need to handle missing values, make all values numeric, and normalize or standardize the features. We can also improve accuracy by reducing dimensionality, removing correlated features, and fixing class imbalance if the data calls for it.
Handle missing values: kNN works by calculating distances between data points, and missing values can skew the results. We must handle them by either imputing them or dropping the affected rows.
Make all values numeric: kNN only handles numeric values, so all categorical values must be encoded using either one-hot encoding or label encoding.
Normalize or standardize the features: We must normalize or standardize the features to reduce bias toward large-scale features. We can use a min-max scaler or a standard scaler to do this.
Reduce dimensionality: kNN can struggle to compute meaningful distances when there are too many features. To address this we can use Principal Component Analysis (PCA) to reduce the number of features while retaining most of the variance.
Remove correlated features: kNN works best without redundant features, so we can use a correlation matrix to see which features to drop. For example, it can be good to drop features with low variance or with a high correlation above 0.9, because these are redundant.
Fix class imbalance: Class imbalance can bias the model. We noticed a class imbalance in our dataset and chose to use the Synthetic Minority Over-sampling Technique (SMOTE) to handle it.
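A minimal version of the scaling step looks like this. The frame is a tiny invented example whose column names mimic the CDC dataset; the real pipeline would also cover imputation, encoding, and SMOTE:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented rows; column names mimic the CDC dataset.
df = pd.DataFrame({
    "BMI":     [25, 40, 28],
    "GenHlth": [5, 3, 5],
    "Sex":     [0, 1, 0],
})

# Min-max scaling maps every feature to [0, 1], so a wide-range feature
# like BMI no longer dominates the distance calculation.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled["BMI"].round(2).tolist())  # [0.0, 1.0, 0.2]
```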
Hyperparameter Tuning
To increase the accuracy of the model, there are a few parameters we can adjust.
Find the optimal k parameter: We manually tested several k values and selected the one that provided the best balance of performance metrics.
Change the distance metric: kNN uses the Euclidean distance by default, but we can switch to the Manhattan distance or another metric.
Weights: kNN defaults to a "uniform" weighting that gives the same weight to all neighbors, but it can be set to "distance" so that the closest neighbors have more weight.
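These three knobs can be searched jointly with scikit-learn's GridSearchCV. The breast-cancer data below is only a stand-in so the sketch runs on its own; the grid itself mirrors the parameters discussed above:

```python
from sklearn.datasets import load_breast_cancer  # stand-in for our dataset
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# One candidate value set per tunable parameter discussed above.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```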
Advantages and Limitations
One advantage of kNN is that it is easy to understand and implement, and it maintains good accuracy even with noisy data (Syriopoulos et al. 2023). A serious limitation is its high computational cost: it needs a large amount of memory and must calculate the distance to every data point. kNN also has low accuracy on multidimensional data that contains irrelevant features (Saxena, Khan, and Singh 2014). Having to calculate the distance to all data points makes kNN slow when the number of points grows too large, as is the case with big data, where it spends a significant amount of time computing distances (Deng et al. 2016).
Analysis and Results
Data Exploration
We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. It is a set of data that was gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), which is one of the biggest continuous health surveys in the United States.
The BRFSS is an annual telephone survey that has been ongoing since 1984 and each year, more than 400,000 Americans respond to the survey. It provides important data on health behaviors, chronic diseases, and preventive health care use to help researchers and policymakers understand the health status and risks of the public.
To transfer the data, we used Python and the ucimlrepo package to import the dataset directly from the UCI Machine Learning Repository, following the recommended instructions. This enabled us to easily save, prepare, and analyze the data for the current research.
Code
from ucimlrepo import fetch_ucirepo
import pandas as pd

# Fetch the dataset from the UCI repository
cdc_data = fetch_ucirepo(id=891)

# Combine features and target into a single DataFrame
cdc_data_df = pd.concat([cdc_data.data.features, cdc_data.data.targets], axis=1)

# Save to CSV for the R environment
cdc_data_df.to_csv("cdc_data.csv", index=False)
Data Composition
The dataset consists of 253,680 survey responses collected through the CDC Behavioral Risk Factor Surveillance System (BRFSS). It includes:
1 binary target variable: Diabetes_binary
21 explanatory features covering demographics, health conditions, lifestyle habits, and healthcare access.
This large-scale dataset is well-suited for modeling diabetes risk, providing a mix of binary, ordinal, and continuous variables.
The following table displays the first few rows of the CDC Diabetes Health Indicators dataset.
Code
library(readr)
library(knitr)

# Load dataset in R
cdc_data_df <- read_csv("cdc_data.csv", show_col_types = FALSE)
kable(head(cdc_data_df), caption = "Table 1. The First Few Rows of the CDC Diabetes Dataset")
Table 1. The First Few Rows of the CDC Diabetes Dataset

| HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 | 0 |
| 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 | 0 |
| 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 | 0 |
| 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 | 0 |
| 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 | 0 |
| 1 | 1 | 1 | 25 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 2 | 0 | 1 | 10 | 6 | 8 | 0 |
Feature Overview
The variables fall into four types, each encoded to preserve meaning and support distance-based modeling:
1. Target Variable (1)
Diabetes_binary: Binary classification (0 = No diabetes, 1 = Diabetes/prediabetes)
2. Binary Variables (14)
Encoded as 0 = No, 1 = Yes (except for Sex: 0 = Female, 1 = Male)
Health Conditions: HighBP, HighChol, CholCheck, Stroke, HeartDiseaseorAttack
Healthcare Access & Mobility: AnyHealthcare, NoDocbcCost, DiffWalk, Sex
3. Ordinal Variables (6)
Encoded using ranked integers to reflect meaningful progression:
Self-Reported Health: GenHlth, MentHlth, PhysHlth
Demographics: Age, Education, Income
(Higher values represent worse health or higher socioeconomic levels.)
4. Continuous Variable (1)
BMI: Numeric value for Body Mass Index
The table below provides a detailed breakdown of variable types, descriptions, and value ranges.
Code
# Load necessary packages
library(knitr)

# Create a data frame with variable information
table_data <- data.frame(
  Type = c("Target", "Binary", rep("", 13), "Ordinal", rep("", 5), "Continuous"),
  Variable = c(
    "Diabetes_binary", "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex",
    "GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income", "BMI"
  ),
  Description = c(
    "Indicates whether a person has diabetes",
    "High Blood Pressure", "High Cholesterol",
    "Cholesterol check in the last 5 years",
    "Smoked at least 100 cigarettes in lifetime", "Had a stroke",
    "History of heart disease or attack",
    "Engaged in physical activity in the last 30 days",
    "Regular fruit consumption", "Regular vegetable consumption",
    "Heavy alcohol consumption", "Has health insurance or healthcare access",
    "Could not see a doctor due to cost", "Difficulty walking/climbing stairs",
    "Biological sex", "Self-reported general health (1=Excellent, 5=Poor)",
    "Number of mentally unhealthy days in last 30 days",
    "Number of physically unhealthy days in last 30 days",
    "Age Groups (1 = 18-24, ..., 13 = 80+)",
    "Highest education level (1 = No school, ..., 6 = College graduate)",
    "Household income category (1 = <$10K, ..., 8 = $75K+)",
    "Body Mass Index (BMI), measure of body fat"
  ),
  Range = c(
    rep("(0 = No, 1 = Yes)", 14), "(0 = Female, 1 = Male)",
    "(1 = Excellent, ..., 5 = Poor)", "(0 - 30)", "(0 - 30)",
    "(1 = 18-24, ..., 13 = 80+)", "(1 = No school, ..., 6 = College grad)",
    "(1 = <$10K, ..., 8 = $75K+)", "(12 - 98)"
  )
)

# Print the table with knitr::kable()
kable(table_data, caption = "Table 1. Summary of Explanatory Variables", align = "l")
Table 1. Summary of Explanatory Variables

| Type | Variable | Description | Range |
|---|---|---|---|
| Target | Diabetes_binary | Indicates whether a person has diabetes | (0 = No, 1 = Yes) |
| Binary | HighBP | High Blood Pressure | (0 = No, 1 = Yes) |
| | HighChol | High Cholesterol | (0 = No, 1 = Yes) |
| | CholCheck | Cholesterol check in the last 5 years | (0 = No, 1 = Yes) |
| | Smoker | Smoked at least 100 cigarettes in lifetime | (0 = No, 1 = Yes) |
| | Stroke | Had a stroke | (0 = No, 1 = Yes) |
| | HeartDiseaseorAttack | History of heart disease or attack | (0 = No, 1 = Yes) |
| | PhysActivity | Engaged in physical activity in the last 30 days | (0 = No, 1 = Yes) |
| | Fruits | Regular fruit consumption | (0 = No, 1 = Yes) |
| | Veggies | Regular vegetable consumption | (0 = No, 1 = Yes) |
| | HvyAlcoholConsump | Heavy alcohol consumption | (0 = No, 1 = Yes) |
| | AnyHealthcare | Has health insurance or healthcare access | (0 = No, 1 = Yes) |
| | NoDocbcCost | Could not see a doctor due to cost | (0 = No, 1 = Yes) |
| | DiffWalk | Difficulty walking/climbing stairs | (0 = No, 1 = Yes) |
| | Sex | Biological sex | (0 = Female, 1 = Male) |
| Ordinal | GenHlth | Self-reported general health | (1 = Excellent, ..., 5 = Poor) |
| | MentHlth | Number of mentally unhealthy days in last 30 days | (0 - 30) |
| | PhysHlth | Number of physically unhealthy days in last 30 days | (0 - 30) |
| | Age | Age Groups | (1 = 18-24, ..., 13 = 80+) |
| | Education | Highest education level | (1 = No school, ..., 6 = College grad) |
| | Income | Household income category | (1 = <$10K, ..., 8 = $75K+) |
| Continuous | BMI | Body Mass Index (BMI), measure of body fat | (12 - 98) |
Data Integrity Assessment
In this step, we checked for null values, missing data (NaNs), and duplicate rows to ensure data integrity. Additionally, we identified columns with invalid values such as strings with spaces in numeric fields.
Code
library(knitr)
library(readr)

# Load the dataset
exploratory_df <- read_csv("eda.csv", show_col_types = FALSE)

# Print the table with a caption
kable(exploratory_df, caption = "Table 2: Data Integrity Report")
Table 2: Data Integrity Report

| Data Quality Check | Count |
|---|---|
| Number of Nulls | 0 |
| Missing Data | 0 |
| Duplicate Rows | 24206 |
| Total Rows | 253680 |
There are no missing values, so no imputation is needed. However, 24,206 duplicate records were detected; these need to be analyzed to determine whether they should be removed or weighted to prevent redundancy in model training.
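The duplicate check itself is a one-liner in pandas; the sketch below uses a three-row invented frame in place of cdc_data_df:

```python
import pandas as pd

# Invented stand-in: row 1 duplicates row 0 exactly.
df = pd.DataFrame({"BMI": [25, 25, 30], "HighBP": [1, 1, 0]})

n_dup = df.duplicated().sum()   # rows identical to an earlier row
deduped = df.drop_duplicates()  # keep the first copy of each
print(n_dup, len(deduped))      # 1 2
```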
Exploratory Data Analysis (EDA)
To effectively prepare data for a distance-based model like k-Nearest Neighbors (kNN), it’s critical to understand the statistical properties of the features - including scale, variability, and the presence of outliers.
Figures 3 and 4 summarize the central tendencies and distributional characteristics of selected ordinal and continuous variables: GenHlth, MentHlth, PhysHlth, Age, Education, Income, BMI
Summary Statistics Heatmap
The heatmap below presents descriptive statistics for each variable, including mean, standard deviation, min/max, and quartiles.
BMI stands out with the largest range (min = 12.0, max = 98.0) and highest standard deviation (6.6) - indicating significant variability.
MentHlth and PhysHlth also show wide spreads (standard deviations of 7.4 and 8.7, respectively), reinforcing the need for scaling to prevent these features from dominating distance calculations.
Age also shows moderate variability (std = 3.1), which may impact distance calculations if not scaled.
Ordinal features like GenHlth, Education, and Income are on a much smaller scale (e.g., GenHlth: 1-5), which may cause them to be underweighted unless scaling is applied.
Because features like BMI, MentHlth, PhysHlth, and Age have larger numeric ranges, they can disproportionately influence distance metrics in kNN. This is why feature scaling is essential - it ensures that each feature contributes fairly when calculating similarity.
Outliers in Distribution
The boxplot below further illustrates value distributions and highlights extreme values:
Code
import matplotlib.pyplot as plt
import seaborn as sns

# Select numeric ordinal and continuous variables
cols = ['GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income', 'BMI']

# Create boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(data=cdc_data_df[cols], orient="h", palette="Set2")
plt.title("Figure 4: Boxplot of Ordinal and Continuous Variables")
plt.xlabel("Value")
plt.tight_layout()
plt.show()
Notable Outliers:
MentHlth and PhysHlth exhibit outliers up to 30 — these may reflect long-term health issues but skew distributions.
BMI has a wide distribution with extreme values approaching 98, which can affect both scaling and model sensitivity.
Outliers can mislead distance calculations, making certain data points appear abnormally close or far in feature space.
Practical Steps for Preprocessing:
| Problem | Solution |
|---|---|
| Scale imbalance | StandardScaler / MinMaxScaler |
| Outliers | RobustScaler / Clipping / Removal |
| Skewed features | Consider log or square root transform |
Applying these transformations ensures the distance metric used in kNN remains balanced and sensitive to meaningful differences across all feature dimensions.
Class Imbalance in Diabetes Prevalence
A critical issue in classification problems is target class imbalance. For our Diabetes_binary variable, the majority class (No Diabetes) comprises over 86% of the dataset, while the minority class (Diabetes/Prediabetes) represents only about 14%.
Code
def plot_class_distribution():
    import matplotlib.pyplot as plt
    import seaborn as sns

    target_variable = "Diabetes_binary"
    if target_variable in cdc_data_df.columns:
        class_counts = cdc_data_df[target_variable].value_counts()
        class_percentages = cdc_data_df[target_variable].value_counts(normalize=True) * 100

        plt.figure(figsize=(6, 4))
        ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette="Set2")
        for i, value in enumerate(class_counts.values):
            percentage = class_percentages[i]
            ax.text(i, value + 1000, f"{value} ({percentage:.2f}%)", ha="center", fontsize=12)
        plt.title(f"Figure 5: Class Distribution of {target_variable}")
        plt.ylabel("Count")
        plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
        plt.xticks([0, 1], ["No Diabetes", "Diabetes/Prediabetes"])
        plt.tight_layout()
        plt.show()

plot_class_distribution()
This imbalance can lead to biased model predictions, favoring the dominant class while under-detecting diabetes cases.
To handle this imbalance and improve classification performance, we can oversample the minority class (e.g., with SMOTE) or undersample the majority class. Another option is distance-based weighting (weights='distance' in KNeighborsClassifier), which gives closer neighbors greater influence during prediction and can improve sensitivity to the underrepresented class.
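The effect of the weights='distance' option can be illustrated on a small synthetic imbalanced problem (toy data, not the CDC survey); a probe point placed inside the minority cluster is classified by its nearest, mostly minority, neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(90, 2))   # majority class 0
X_minor = rng.normal(2.0, 0.5, size=(10, 2))   # minority class 1
X = np.vstack([X_major, X_minor])
y = np.array([0] * 90 + [1] * 10)

uniform = KNeighborsClassifier(n_neighbors=15, weights='uniform').fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=15, weights='distance').fit(X, y)

probe = np.array([[2.0, 2.0]])  # a point sitting in minority territory
print(uniform.predict(probe), weighted.predict(probe))
```

With distance weighting, the few nearby minority samples outweigh more distant majority neighbors, which is the mechanism that helps underrepresented cases.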
Correlation Analysis
To better understand how variables relate to each other - and to our target - we generated a correlation heatmap. This helps detect redundant features, multicollinearity, and potential predictors of diabetes.
Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = cdc_data_df.corr()
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(corr_matrix, ax=ax, annot=True, fmt=".2f", cmap="coolwarm",
            linewidths=0.5, vmin=-1, vmax=1)
ax.set_title("Figure 6: Feature Correlation Heatmap")
plt.tight_layout()
plt.show()
Positive Correlations:
• General Health (GenHlth) is strongly correlated with Physical Health (PhysHlth) (0.52) and Difficulty Walking (DiffWalk) (0.45).
As individuals report poorer general health, they experience more physical health issues and mobility limitations.
• Physical Health (PhysHlth) and Difficulty Walking (DiffWalk) (0.47) show a strong link. Those with more days of poor physical health are likely to struggle with mobility.
• Age correlates with High Blood Pressure (0.34) and High Cholesterol (0.27), indicating an increased risk of cardiovascular conditions as people get older.
• Mental Health (MentHlth) and Physical Health (PhysHlth) (0.34) are positively associated. Worsening mental health often coincides with physical health problems.
Negative Correlations:
• Higher Income is associated with better General Health (-0.33), fewer Mobility Issues (-0.30), and better Physical Health (-0.24).
This suggests financial stability improves access to healthcare and promotes a healthier lifestyle.
• Higher Education is linked to better General Health (-0.28) and Mental Health (-0.19). Educated individuals may have better health awareness and coping strategies.
The heatmap confirms well-known health trends: age, high blood pressure, and cholesterol are major risk factors for diabetes. Poor physical and mental health are strongly related, and socioeconomic status (income, education) plays a key role in overall health. These insights highlight the importance of early intervention strategies and lifestyle modifications to prevent chronic diseases like diabetes.
No pair of features exceeds a correlation of ±0.52, so multicollinearity is not a concern. No features need to be dropped due to redundancy.
These patterns support the need for early interventions and lifestyle-focused health strategies.
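The multicollinearity check described above can be automated. The sketch below uses a small random frame purely to show the calls; on the real data the same logic would run on `cdc_data_df.corr()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (independent random columns)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "GenHlth": rng.integers(1, 6, 500),
    "PhysHlth": rng.integers(0, 31, 500),
    "BMI": rng.normal(28, 6, 500),
})

corr = df.corr()
off_diag = corr.where(~np.eye(len(corr), dtype=bool))  # mask the diagonal
max_abs = float(off_diag.abs().max().max())
print(f"largest off-diagonal |r| = {max_abs:.2f}")
```

With the 0.52 maximum reported above, no pair in the survey data crosses a typical 0.8 redundancy cutoff, so no features need dropping.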
Age and BMI Density Analysis by Diabetes Status
Age and Body Mass Index (BMI) are both recognized as key risk factors for diabetes. To explore these relationships, we compared their distributions between individuals with and without diabetes.
Age Distribution:
Figure 7 illustrates the distribution of age categories for individuals with and without diabetes. Although age is represented as an ordinal variable (1-13, likely corresponding to increasing age groups), several trends are apparent:
Individuals with diabetes or prediabetes show a higher density in the upper age categories, especially around values 10–13.
Conversely, the non-diabetic group is more prominent in the mid-range age categories (around 8–11).
The sharp peaks reflect the ordinal nature of the age variable and the likely grouping into discrete bands.
Code
plt.figure(figsize=(10, 6))
sns.kdeplot(data=cdc_data_df, x="Age", hue="Diabetes_binary", fill=True,
            common_norm=False, palette={0: "#80cdc1", 1: "#d6604d"},
            alpha=0.4, linewidth=1.5)
plt.title("Figure 7: Age Density by Diabetes Status")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title="Diabetes Status", labels=["No Diabetes (0)", "Diabetes/Prediabetes (1)"])
plt.tight_layout()
plt.show()
As expected, the prevalence of diabetes increases with age. This distribution confirms the importance of age as a predictive feature and suggests that older adults are at higher risk - aligning with clinical and epidemiological findings.
BMI Distribution:
BMI is a known risk factor for diabetes, and the analysis confirms that individuals with diabetes tend to have slightly higher BMI values on average. The KDE plot below shows a noticeable rightward shift in BMI values for diabetic individuals.
Code
# Set figure size
plt.figure(figsize=(10, 6))

# Ensure Diabetes_binary is integer for filtering and plotting
cdc_data_df["Diabetes_binary"] = cdc_data_df["Diabetes_binary"].astype(int)

# KDE plot for BMI distribution by diabetes status
sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 0]['BMI'],
            label='No Diabetes (0)', color="mediumaquamarine", fill=True)
sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 1]['BMI'],
            label='Diabetes/Prediabetes (1)', color="salmon", fill=True)

# Titles and labels
plt.title('Figure 8: BMI Density by Diabetes Status', fontsize=16)
plt.xlabel('BMI', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend(title='Diabetes Status')

# Show plot
plt.show()
A significant portion of individuals with diabetes have BMI values above 30, supporting established links between obesity and diabetes risk. Despite this, there remains substantial overlap between the two groups, indicating that BMI alone is not a definitive predictor of diabetes.
These observations reinforce the importance of using multiple factors - not just BMI - when modeling diabetes risk.
EDA Summary
This Exploratory Data Analysis (EDA) provides a comprehensive overview of the dataset’s structure, distributions, and key correlations. The findings highlight several critical patterns:
Diabetes prevalence is low (13.9%), leading to a class imbalance that may require resampling techniques.
Age, BMI, and high blood pressure are strong risk factors for diabetes.
Socioeconomic factors (income, education) influence health status, supporting the need for targeted interventions.
The next phase involves data preprocessing, feature selection, and model development to enhance predictive performance.
Modeling and Results
This section explores the performance of the k-Nearest Neighbors (kNN) algorithm for predicting diabetes using the CDC Behavioral Risk Factor Surveillance System dataset. The primary objective was to evaluate how various modeling choices - such as scaling techniques, distance metrics, SMOTE resampling, feature selection, and hyperparameter tuning - impact classification performance, especially for identifying the minority diabetic class.
We tested four distinct kNN configurations, progressively applying different strategies to improve the model:
kNN 1: Baseline model trained on the imbalanced dataset (original distribution) using Euclidean distance and uniform weights.
kNN 2: Tuned model with Manhattan distance, distance-based weighting, and RobustScaler to address outliers.
kNN 3: SMOTE-resampled model using standard preprocessing, to mitigate class imbalance.
kNN 4: Feature-selected model combining SMOTE and top 12 features (via chi-squared test), along with distance weighting.
Each model was evaluated using accuracy, ROC-AUC, recall, precision, and f1-score, with particular emphasis on recall for the diabetic class, given the critical importance of minimizing false negatives in healthcare applications.
Results from the four configurations are summarized in Table 3, highlighting the effect of different preprocessing and tuning choices on kNN’s ability to detect diabetes accurately and fairly.
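The metric set just described maps directly onto scikit-learn calls. The sketch below uses placeholder labels and scores purely to show the computation; the numbers are illustrative, not the study's results.

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

# Placeholder predictions standing in for a fitted model's output
y_test  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_proba = [0.1, 0.2, 0.15, 0.3, 0.6, 0.4, 0.8, 0.9, 0.35, 0.7]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_proba),
    "precision_1": precision_score(y_test, y_pred),
    "recall_1": recall_score(y_test, y_pred),  # emphasized: false negatives are costly
    "f1_1": f1_score(y_test, y_pred),
}
print({k: round(v, 2) for k, v in metrics.items()})
```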
Data Preprocessing
Preprocessing is a critical step for models like k-Nearest Neighbors (kNN), which rely on distance-based calculations. If features are not properly scaled or class imbalance is not addressed, kNN’s performance - particularly for detecting minority classes - can degrade significantly.
The dataset originally contained 253,680 survey responses with 21 predictor variables and 1 binary outcome (Diabetes_binary). The following preprocessing steps were applied:
Code
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# 1. Remove duplicates
cdc_df = cdc_data_df.drop_duplicates()

# Define features and target
X = cdc_df.drop(columns=['Diabetes_binary'])
y = cdc_df['Diabetes_binary']

# Calculate class distribution
class_distribution = (y.value_counts(normalize=True) * 100).round(2)

# Print clean output
print("Class Distribution After Removing Duplicates (%):\n" +
      "\n".join([f"Class {label}: {percent:.2f}%"
                 for label, percent in class_distribution.items()]))
Class Distribution After Removing Duplicates (%):
Class 0: 84.71%
Class 1: 15.29%
Duplicates Removed:
A total of 24,206 duplicate rows were dropped to reduce bias and redundancy. This cleanup slightly increased the proportion of the diabetic/prediabetic class - from 13.93% to 15.29% - as duplicates were more prevalent in the majority class.
Code
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# 2. Separate features and target
X = cdc_df.drop(columns=["Diabetes_binary"])
y = cdc_df["Diabetes_binary"]

# 3. Train-test split (on original data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y)

# 4. Identify continuous features
continuous = ["BMI", "MentHlth", "PhysHlth"]

# 5. Scale continuous features
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[continuous] = scaler.fit_transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])

# 6. Apply SMOTE to training set only
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
Train-Test Split:
After removing duplicates, the cleaned dataset was split into a training set (70%) and a test set (30%) using stratified sampling to maintain class proportions. Feature scaling was performed only on the training set prior to SMOTE to preserve correct feature distributions during oversampling. This approach prevented data leakage into the test set and ensured fair model evaluation.
Ordinal Features Retained as Numeric:
Variables such as Age, Education, Income, and GenHlth were preserved in their original numeric form due to their natural ordinal structure. Converting them to one-hot encoding would have removed meaningful ordering between categories (e.g., income levels or education attainment).
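The cost of one-hot encoding an ordinal variable can be seen on a toy income column (hypothetical values, for illustration only): in one-hot space every pair of distinct levels is equally far apart, while the raw numeric coding preserves the ordering that kNN's distance should respect.

```python
import numpy as np
import pandas as pd

# Toy ordinal column standing in for income levels
income = pd.DataFrame({"Income": [1, 2, 8]})
onehot = pd.get_dummies(income["Income"], prefix="Income").astype(int)

# One-hot: levels 1 and 2 are exactly as far apart as levels 1 and 8
d_1_2 = np.linalg.norm(onehot.iloc[0] - onehot.iloc[1])
d_1_8 = np.linalg.norm(onehot.iloc[0] - onehot.iloc[2])

# Numeric coding keeps the intended ordering: |1-2| < |1-8|
print(d_1_2 == d_1_8, abs(1 - 2) < abs(1 - 8))
```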
Continuous Features Scaled:
The variables BMI, MentHlth, and PhysHlth were standardized using StandardScaler to ensure equal contribution during distance calculations, a critical aspect for kNN’s accuracy.
Class Balancing with SMOTE:
To mitigate the impact of class imbalance, Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training set. This produced a balanced 50/50 class distribution, enhancing the model’s ability to identify minority-class (diabetic) cases.
The preprocessing pipeline ensured a clean, balanced, and distance-aware feature space for modeling and evaluation.
Training and Evaluating kNN Models
kNN 1: Baseline Model - Imbalanced Data
As a baseline, we trained a k-Nearest Neighbors (kNN) classifier on the original dataset without addressing class imbalance. The model used standard hyperparameters: 5 neighbors, Euclidean distance (p=2), and uniform weighting.
Evaluation was conducted on a hold-out test set that reflects the real-world class distribution:
The model achieved high overall accuracy (83.3%), driven largely by the dominant majority class. However, it performed poorly in identifying diabetic cases, confirming that kNN is sensitive to class imbalance and may overlook minority outcomes without intervention.
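A minimal reconstruction of this baseline configuration (k=5, Euclidean distance, uniform weights) is sketched below. The synthetic 86/14 dataset and the variable names are stand-ins; the notebook would use the X_train/X_test split created during preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

# Stand-in imbalanced data (~86/14 split, like the survey)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.86], random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=100)

# Baseline kNN 1: default-style hyperparameters
knn_baseline = KNeighborsClassifier(n_neighbors=5, metric='minkowski',
                                    p=2, weights='uniform')
knn_baseline.fit(X_tr, y_tr)

minority_recall = recall_score(y_te, knn_baseline.predict(X_te))
print("minority recall:", round(minority_recall, 2))
```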
kNN 2: kNN with Robust Scaling on Imbalanced Data
To explore whether scaling and distance weighting could improve baseline performance, the second kNN model was trained on the same imbalanced dataset with several enhancements.
This configuration used k = 15, Manhattan distance (p=1), and distance-based weighting, allowing closer neighbors to have greater influence. To reduce the impact of outliers in health-related features like BMI, MentHlth, and PhysHlth, RobustScaler was applied to the continuous variables.
Code
from sklearn.preprocessing import RobustScaler

# Scale continuous features using RobustScaler
scaler = RobustScaler()
X_train_imbal_scaled = scaler.fit_transform(X_train_imbal)
X_test_imbal_scaled = scaler.transform(X_test_imbal)

# Train kNN on imbalanced dataset
knn_scaled_imbalanced = KNeighborsClassifier(
    n_neighbors=15,
    weights='distance',  # Distance-weighted voting
    metric='minkowski',
    p=1                  # Manhattan distance
)
knn_scaled_imbalanced.fit(X_train_imbal_scaled, y_train_imbal)
Compared to the baseline, this model showed slightly improved accuracy (84.1%) and a better ROC-AUC score (0.75). However, recall for the diabetic class remained low (16%), indicating that class imbalance was still a major obstacle, even with enhanced distance metrics and scaling.
kNN 3: kNN Performance on SMOTE-Resampled Data
To directly address class imbalance, the training set was resampled using SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples for the minority class, resulting in a balanced 50/50 distribution.
This kNN model used a slightly different configuration from the baseline: k = 10 (increased from 5), Euclidean distance (p=2), and uniform weighting. The test set remained unchanged to simulate real-world deployment conditions with natural class imbalance.
Code
# Train kNN model on SMOTE-resampled training set
knn_smote = KNeighborsClassifier(n_neighbors=10, metric='minkowski', p=2)
knn_smote.fit(X_train_smote, y_train_smote)
KNeighborsClassifier(n_neighbors=10)
After resampling, the model achieved 69.1% accuracy and a ROC-AUC of 0.73. Most notably, recall for the minority class jumped to 64%, reflecting SMOTE’s effectiveness in improving detection of diabetic cases.
However, precision dropped to 28%, indicating an increase in false positives - a common trade-off in recall-optimized models.
kNN 4: kNN with Feature Selection
To evaluate whether reducing dimensionality could improve model performance, we applied a Chi-Square feature selection technique. This method ranks features by their statistical dependency with the target (Diabetes_binary), helping identify those most relevant for classification.
Based on the scores (shown in Figure 9), the top 12 features were selected for training. These included a mix of continuous and ordinal variables, with continuous features (like BMI, MentHlth, and PhysHlth) scaled using StandardScaler when present.
SelectKBest(k='all', score_func=chi2)
Code
# Store results in DataFrame
chi2_scores = selector.scores_
chi2_results = pd.DataFrame({
    "Feature": X.columns,
    "Chi2 Score": chi2_scores
}).sort_values(by="Chi2 Score", ascending=False)

# Plot top 12 features
plt.figure(figsize=(10, 6))
sns.barplot(x="Chi2 Score", y="Feature", data=chi2_results.head(12), color="seagreen")
plt.title("Figure 9: Top 12 Features by Chi-Square Score")
plt.xlabel("Chi-Square Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
The training data was resampled with SMOTE to address class imbalance. The model configuration included: k = 15, Euclidean distance (p=2), and distance-based weighting.
Code
# Get top 12 features
top_12 = chi2_results.head(12)["Feature"].values
X_top = X_sm[top_12].copy()

# Scale only if needed
continuous = ["BMI", "MentHlth", "PhysHlth"]
for col in continuous:
    if col in X_top.columns:
        X_top[col] = StandardScaler().fit_transform(X_top[[col]])

# Train/test split
X_train_fs, X_test_fs, y_train_fs, y_test_fs = train_test_split(
    X_top, y_sm, test_size=0.3, random_state=42)

# Train kNN (Top 12 features)
knn_fs = KNeighborsClassifier(n_neighbors=15, weights='distance',
                              metric='minkowski', p=2)
knn_fs.fit(X_train_fs, y_train_fs)
# Store ROC for later
fpr_fs, tpr_fs, _ = roc_curve(y_test_fs, y_proba_fs)
This model achieved the best balance between recall and precision, indicating that combining feature selection with SMOTE yielded more stable and accurate classification of diabetic cases.
Comparison of kNN Models with Different Configurations
The first phase of model evaluation focused exclusively on optimizing the k-Nearest Neighbors (kNN) algorithm. Four versions of kNN were trained with varying configurations, including distance metrics, weighting strategies, feature scaling techniques, and class balancing using SMOTE.
Table 3 summarizes the performance of each configuration across key metrics: accuracy, ROC-AUC, precision, recall, and F1-score for the diabetic class (label = 1).
As shown, kNN 4, which combines SMOTE, feature selection, distance weighting, and k = 15, achieved the strongest performance overall. It yielded the highest ROC-AUC score (0.88) and recall (0.88) for diabetic cases, indicating its superior ability to identify high-risk individuals.
In contrast, the baseline models (kNN 1 and kNN 2), trained on the original imbalanced dataset, achieved good overall accuracy but performed poorly on recall, highlighting their limited utility for minority class detection.
These results underscore the importance of both class balancing and dimensionality reduction when applying kNN to imbalanced healthcare datasets. Based on these results, kNN 4 is selected for further comparison with tree-based models in the next analysis phase.
Classifier Comparison: Decision Tree and Random Forest
After identifying the best-performing kNN configuration, we compared it to two widely used tree-based classifiers: Decision Tree (DT) and Random Forest (RF). These models were chosen for their strong performance on structured data and their interpretability - key advantages in healthcare applications.
Both DT and RF were trained using the same train-test split as the kNN models to ensure a consistent basis for comparison. We evaluated each model’s performance on both the original (imbalanced) dataset and a SMOTE-resampled version to examine how they handle class imbalance relative to kNN.
Decision Tree – Imbalanced Dataset
The Decision Tree classifier was first trained on the original, imbalanced dataset without any resampling or feature selection. This setup mirrors the baseline scenario used in kNN evaluation and allows us to observe how a simple tree-based model handles imbalance on its own.
# Store ROC for later
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_proba_dt)
Despite high overall accuracy (86.2%), the model struggled to detect diabetic cases, with a recall of just 15% for the minority class. This reinforces a common issue in medical datasets: models optimize for the majority class at the expense of detecting critical minority outcomes.
Decision Tree – SMOTE-Resampled Dataset
Next, the Decision Tree model was retrained on a SMOTE-resampled dataset to evaluate whether balancing the classes improves performance. The model architecture was unchanged (max_depth = 10) to ensure a fair comparison with its imbalanced counterpart.
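A minimal reconstruction of this retraining step is sketched below, assuming the max_depth = 10 stated above; the balanced synthetic data stands in for the SMOTE-resampled training set (the notebook would pass X_train_smote, y_train_smote).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the SMOTE-balanced training data (50/50 classes)
X_bal, y_bal = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.5], random_state=42)

# Same architecture as the imbalanced run: only the training data changes
dt_smote = DecisionTreeClassifier(max_depth=10, random_state=100)
dt_smote.fit(X_bal, y_bal)
print("tree depth:", dt_smote.get_depth())
```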
# Store ROC for later
fpr_dt_sm, tpr_dt_sm, _ = roc_curve(y_test_sm, y_proba_dt_sm)
SMOTE significantly improved the model’s ability to identify diabetic cases, as reflected in the jump in recall. While overall accuracy dropped slightly (from 86.2% to 72.5%), this trade-off is often acceptable in medical contexts where minimizing false negatives is critical.
Random Forest – Imbalanced Dataset
The Random Forest (RF) classifier was evaluated using the original, imbalanced dataset. Unlike kNN or a single Decision Tree, RF typically handles imbalance more gracefully due to its ensemble nature. Additionally, SMOTE was not applied, as oversampling synthetic data can increase the risk of overfitting in ensemble models.
The model was trained with 200 trees, a maximum depth of 15, and evaluated on the same test set used throughout.
Code
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    random_state=100,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
# Store ROC for later
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)
Random Forest showed the best overall accuracy (86.6%) but, like the basic Decision Tree, it suffered from low recall (13%) for the diabetic class. This highlights a recurring issue: even strong classifiers may underperform on minority classes unless explicitly adjusted for imbalance.
Overall Model Comparison
To consolidate the modeling results, we compared the best-performing kNN configuration (kNN 4) with Decision Tree and Random Forest classifiers.
Table 4 summarizes key performance metrics, while Figure 10 displays ROC curves to visualize the models’ ability to distinguish diabetic from non-diabetic individuals.
Table 4: Performance Comparison of Best kNN vs. Tree-Based Models
Model          SMOTE                    Accuracy  ROC_AUC  Precision_1  Recall_1  F1_1
-------------  -----------------------  --------  -------  -----------  --------  ----
KNN            Yes (Feature Selection)  0.78      0.88     0.73         0.88      0.80
Decision Tree  Yes                      0.72      0.80     0.70         0.78      0.74
Decision Tree  No                       0.86      0.81     0.52         0.15      0.24
Random Forest  No                       0.87      0.82     0.59         0.13      0.21
From the comparison:
kNN with Feature Selection and SMOTE achieved the highest ROC-AUC (0.88) and recall for the diabetic class (88%), striking the best balance between sensitivity and precision.
Decision Tree with SMOTE also showed solid recall (78%), but lower AUC (0.80) and overall accuracy.
The non-resampled Decision Tree and Random Forest models performed well on accuracy and recall for the majority class (non-diabetic), but severely underperformed on recall for diabetic cases (15% and 13%, respectively).
These findings highlight the importance of targeted preprocessing and algorithm tuning when addressing imbalanced healthcare data - especially when the priority is to detect rare but critical conditions like diabetes.
Figure 10 (ROC curves) reinforces this performance difference, with kNN clearly outperforming other models in the high-sensitivity region - a critical aspect in medical screening tools.
Code
def plot_roc_auc():
    import matplotlib.pyplot as plt

    # Build a clean plot using object-oriented matplotlib
    fig, ax = plt.subplots(figsize=(8, 6))

    # Plot each ROC curve
    ax.plot(fpr_fs, tpr_fs, label=f"KNN (Feature Selection) (AUC = {roc_auc_fs:.2f})")
    ax.plot(fpr_dt_sm, tpr_dt_sm, label=f"Decision Tree (SMOTE) (AUC = {roc_auc_dt_sm:.2f})")
    ax.plot(fpr_dt, tpr_dt, label=f"Decision Tree (Imbalanced) (AUC = {roc_auc_dt:.2f})")
    ax.plot(fpr_rf, tpr_rf, label=f"Random Forest (Imbalanced) (AUC = {roc_auc_rf:.2f})")

    # Add diagonal reference line
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")

    # Set labels and title
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("Figure 10: ROC Curves for Different Models")

    # Add legend and grid
    ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15), ncol=2, frameon=True)
    ax.grid(True)

    # Render the plot
    plt.tight_layout()
    plt.show()

plot_roc_auc()
The ROC curves show each model's ability to distinguish diabetic from non-diabetic cases. The kNN model with SMOTE and feature selection achieved the highest AUC (0.88), demonstrating the best class separation. Random Forest and Decision Tree performed reasonably well, but neither matched kNN. These results support the conclusion that a carefully tuned kNN model can offer both strong predictive accuracy and clinical utility.
Conclusion
This project investigated the behavior and performance of the k-Nearest Neighbors (kNN) algorithm for classifying diabetes using a real-world health dataset of over 253,000 observations. The focus was not only on achieving high accuracy, but on evaluating how model configuration, preprocessing, and resampling affect kNN’s ability to detect the minority (diabetic) class - critical in healthcare contexts where false negatives carry significant risk.
We explored four kNN variations, altering hyperparameters (k, distance metric), scaling techniques, and class balancing via SMOTE. Among these, Model 4 - which combined SMOTE, Chi-Square feature selection, and distance-weighted voting - emerged as the best configuration. It achieved a strong balance between accuracy (78%), ROC-AUC (0.88), and recall for diabetic cases (0.88), outperforming all other kNN models. Increasing the value of k improved model stability and performance, which is especially beneficial when working with large datasets like ours, where higher k helps smooth out noise and reduce variance in predictions.
To contextualize its performance, the best kNN model was compared to two tree-based classifiers: Decision Tree (DT) and Random Forest (RF). While RF achieved slightly higher overall accuracy (87%) on the imbalanced dataset, its recall for the diabetic class was just 13%, indicating poor sensitivity to minority cases. Similarly, DT models showed higher accuracy but struggled to match the recall and ROC-AUC of the tuned kNN. Only Decision Tree trained on SMOTE approached comparable recall levels (78%) but still lagged behind kNN in ROC-AUC (0.80 vs. 0.88).
These results highlight a crucial trade-off: although tree-based models excel in overall classification, they often underperform in minority class detection when not explicitly rebalanced. In contrast, a carefully tuned and resampled kNN model offers a more balanced and interpretable solution in medical classification tasks.
In sum, this analysis demonstrates how fine-tuning kNN and applying proper preprocessing strategies can significantly improve its performance, even on large, noisy datasets. The findings support kNN’s continued relevance as a simple yet powerful algorithm in healthcare analytics, particularly when paired with balancing and feature selection techniques.
References
Aggarwal, Charu C et al. 2015. Data Mining: The Textbook. Vol. 1. 3. Springer.
Ali, Ameer, Mohammed Alrubei, LF Mohammed Hassan, M Al-Ja’afari, and Saif Abdulwahed. 2020. “Diabetes Classification Based on KNN.” IIUM Engineering Journal 21 (1): 175–81.
Altamimi, Abdulaziz, Aisha Ahmed Alarfaj, Muhammad Umer, Ebtisam Abdullah Alabdulqader, Shtwai Alsubai, Tai-hoon Kim, and Imran Ashraf. 2024. “An Automated Approach to Predict Diabetic Patients Using KNN Imputation and Effective Data Mining Techniques.” BMC Medical Research Methodology 24 (1): 221.
Boateng, Ernest Yeboah, Joseph Otoo, and Daniel A Abaye. 2020. “Basic Tenets of Classification Algorithms k-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review.” Journal of Data Analysis and Information Processing 8 (4): 341–57.
Deng, Zhenyun, Xiaoshu Zhu, Debo Cheng, Ming Zong, and Shichao Zhang. 2016. “Efficient kNN Classification Algorithm for Big Data.” Neurocomputing 195: 143–48.
Iparraguirre-Villanueva, Orlando, Karina Espinola-Linares, Rosalynn Ornella Flores Castañeda, and Michael Cabanillas-Carbonell. 2023. “Application of Machine Learning Models for Early Detection and Accurate Classification of Type 2 Diabetes.” Diagnostics 13 (14): 2383.
Kataria, Aman, and MD Singh. 2013. “A Review of Data Classification Using k-Nearest Neighbour Algorithm.” International Journal of Emerging Technology and Advanced Engineering 3 (6): 354–60.
Khateeb, Nida, and Muhammad Usman. 2017. “Efficient Heart Disease Prediction System Using k-Nearest Neighbor Classification Technique.” In Proceedings of the International Conference on Big Data and Internet of Thing, 21–26.
Mucherino, Antonio, Petraq J Papajorgji, and Panos M Pardalos. 2009. “K-Nearest Neighbor Classification.” Data Mining in Agriculture, 83–106.
Panwar, Madhuri, Amit Acharyya, Rishad A Shafik, and Dwaipayan Biswas. 2016. “K-Nearest Neighbor Based Methodology for Accurate Diagnosis of Diabetes Mellitus.” In 2016 Sixth International Symposium on Embedded Computing and System Design (ISED), 132–36. IEEE.
Saxena, Krati, Zubair Khan, and Shefali Singh. 2014. “Diagnosis of Diabetes Mellitus Using k Nearest Neighbor Algorithm.” International Journal of Computer Science Trends and Technology (IJCST) 2 (4): 36–43.
Suriya, S, and J Joanish Muthu. 2023. “Type 2 Diabetes Prediction Using k-Nearest Neighbor Algorithm.” Journal of Trends in Computer Science and Smart Technology 5 (2): 190–205.
Syriopoulos, Panos K, Nektarios G Kalampalikis, Sotiris B Kotsiantis, and Michael N Vrahatis. 2023. “kNN Classification: A Review.” Annals of Mathematics and Artificial Intelligence, 1–33.
Theerthagiri, Prasannavenkatesan, A Usha Ruby, and J Vidya. 2022. “Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms.” SN Computer Science 4 (1): 72.
Uddin, Shahadat, Ibtisham Haque, Haohui Lu, Mohammad Ali Moni, and Ergun Gide. 2022. “Comparative Performance Analysis of k-Nearest Neighbour (KNN) Algorithm and Its Different Variants for Disease Prediction.” Scientific Reports 12 (1): 6256.
Zhang, Shichao, Xuelong Li, Ming Zong, Xiaofeng Zhu, and Ruili Wang. 2017. “Efficient kNN Classification with Different Numbers of Nearest Neighbors.” IEEE Transactions on Neural Networks and Learning Systems 29 (5): 1774–85.
Zhang, Zhongheng. 2016. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine 4 (11).