builtins.object
    DBSCAN
    KMeans

class DBSCAN(builtins.object)
DBSCAN(X, eps=0.5, min_samples=5, compile_numba=False)
This class implements the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
Args:
X: The data matrix (numpy array).
eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
Methods:
- __init__: Initializes the DBSCAN object with the input parameters.
- fit: Fits the DBSCAN model to the data and assigns cluster labels.
- predict: Predicts the cluster labels for new data points.
- fit_predict: Fits the DBSCAN model and returns cluster labels.
- silhouette_score: Calculates the Silhouette Score for evaluating clustering performance.
- _handle_categorical: Handles categorical columns by one-hot encoding.
- _convert_to_ndarray: Converts input data to a NumPy ndarray and handles categorical columns.
- _custom_distance_matrix: Calculates the pairwise distance matrix using a custom distance calculation method.
Methods defined here:
- __init__(self, X, eps=0.5, min_samples=5, compile_numba=False)
- Initialize the DBSCAN object.
Args:
X: The data matrix (numpy array).
eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
compile_numba: Whether to compile the distance calculations using Numba for performance.
If not compiled, the first call to the numba fitting function will take longer, but subsequent calls will be faster.
- auto_eps(self, min=0.1, max=1.1, precision=0.01, return_scores=False, verbose=False)
- Find the optimal eps value for DBSCAN based on silhouette score.
Args:
min: The minimum eps value to start the search.
max: The maximum eps value to end the search.
precision: The precision of the search.
return_scores: Whether to return a dictionary of (eps, score) pairs.
verbose: Whether to print the silhouette score for each eps value.
Returns:
eps: The optimal eps value.
scores_dict (optional): A dictionary of (eps, score) pairs if return_scores is True.
- fit(self, metric='euclidean', numba=False)
- Fit the DBSCAN model to the data.
Algorithm Steps:
1. Calculate the distance matrix between all points in the dataset.
2. Identify core points based on the minimum number of neighbors within eps distance.
3. Assign cluster labels using depth-first search (DFS) starting from core points.
Args:
metric: The distance metric to use ('euclidean', 'manhattan', or 'cosine').
numba: Whether to use numba for faster computation.
Returns:
labels: The cluster labels for each data point.
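The three algorithm steps above can be sketched in plain NumPy. This is a minimal sketch assuming the Euclidean metric; `dbscan_fit` is a hypothetical stand-alone helper, not this class's API:

```python
import numpy as np

def dbscan_fit(X, eps=0.5, min_samples=5):
    # Step 1: pairwise Euclidean distance matrix, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    n = len(X)
    # Neighborhoods: indices within eps (each point includes itself).
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    # Step 2: core points have at least min_samples neighbors.
    core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)  # -1 marks noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Step 3: depth-first expansion from this core point.
        stack = [i]
        while stack:
            p = stack.pop()
            if labels[p] != -1:
                continue
            labels[p] = cluster
            if core[p]:
                stack.extend(neighbors[p])
        cluster += 1
    return labels
```

Border points (within eps of a core point but not core themselves) are labeled during the expansion but never expanded, which is what distinguishes them from core points in the DFS.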
- fit_predict(self, numba=False)
- Fit the DBSCAN model to the data and return the cluster labels.
Args:
numba: Whether to use numba for faster computation.
Returns:
labels: The cluster labels for the data.
- predict(self, new_X)
- Predict the cluster labels for new data points.
Note: DBSCAN does not naturally support predicting new data points.
Args:
new_X: The data matrix to predict (numpy array).
Returns:
labels: The predicted cluster labels (-1 for noise).
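Since DBSCAN has no native prediction step, one common workaround is to assign each new point the label of its nearest core point, or -1 if no core point lies within eps. A minimal sketch of that idea (`dbscan_predict` and its arguments are hypothetical, not this class's signature):

```python
import numpy as np

def dbscan_predict(new_X, core_points, core_labels, eps=0.5):
    # Assign each new point the cluster of its nearest core point,
    # or -1 (noise) if no core point lies within eps.
    labels = np.full(len(new_X), -1)
    for i, x in enumerate(new_X):
        d = np.linalg.norm(core_points - x, axis=1)
        j = d.argmin()
        if d[j] <= eps:
            labels[i] = core_labels[j]
    return labels
```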
- silhouette_score(self)
- Calculate the silhouette score for evaluating clustering performance.
Returns:
silhouette_score: The computed silhouette score.
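The silhouette score averages, over all points, s = (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b its mean distance to the nearest other cluster. A minimal stand-alone sketch (noise points with label -1 are excluded, a common convention; this helper is illustrative, not the class's implementation):

```python
import numpy as np

def silhouette_score(X, labels):
    # Exclude noise points (label -1) from the evaluation.
    mask = labels != -1
    X, labels = X[mask], labels[mask]
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a excludes the point itself
        if not same.any():  # singleton cluster: s = 0 by convention
            scores.append(0.0)
            continue
        a = dist[i, same].mean()
        # b: mean distance to the nearest *other* cluster.
        b = min(dist[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Scores near +1 indicate tight, well-separated clusters; scores near 0 indicate overlapping clusters.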
Data descriptors defined here:
- __dict__
- dictionary for instance variables
- __weakref__
- list of weak references to the object
class KMeans(builtins.object)
KMeans(X, n_clusters=3, max_iter=300, tol=0.0001)
This class implements the K-Means clustering algorithm along with methods for evaluating the optimal number of clusters and visualizing the clustering results.
Args:
X: The data matrix (numpy array).
n_clusters: The number of clusters.
max_iter: The maximum number of iterations.
tol: The tolerance to declare convergence.
Methods:
- __init__: Initializes the KMeans object with parameters such as the data matrix, number of clusters, maximum iterations, and convergence tolerance.
- _handle_categorical: Handles categorical columns in the input data by one-hot encoding.
- _convert_to_ndarray: Converts input data to a NumPy ndarray and handles categorical columns.
- initialize_centroids: Randomly initializes the centroids for KMeans clustering.
- assign_clusters: Assigns clusters based on the nearest centroid.
- update_centroids: Updates centroids based on the current cluster assignments.
- fit: Fits the KMeans model to the data by iteratively updating centroids and cluster assignments until convergence.
- predict: Predicts the closest cluster each sample in new_X belongs to.
- elbow_method: Implements the elbow method to determine the optimal number of clusters.
- calinski_harabasz_index: Calculates the Calinski-Harabasz Index for evaluating clustering performance.
- davies_bouldin_index: Calculates the Davies-Bouldin Index for evaluating clustering performance.
- silhouette_score: Calculates the Silhouette Score for evaluating clustering performance.
- find_optimal_clusters: Implements methods to find the optimal number of clusters using the elbow method,
Calinski-Harabasz Index, Davies-Bouldin Index, and Silhouette Score. It also plots the evaluation
metrics to aid in determining the optimal k value.
Methods defined here:
- __init__(self, X, n_clusters=3, max_iter=300, tol=0.0001)
- Initialize the KMeans object.
Args:
X: The data matrix (numpy array, pandas DataFrame, or list).
n_clusters: The number of clusters.
max_iter: The maximum number of iterations.
tol: The tolerance to declare convergence.
- assign_clusters(self, centroids)
- Assign clusters based on the nearest centroid.
Args:
centroids: The current centroids.
Returns:
labels: The cluster assignments for each data point.
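Nearest-centroid assignment reduces to one broadcast distance computation. A minimal sketch of the idea as a stand-alone function (not the method's actual body):

```python
import numpy as np

def assign_clusters(X, centroids):
    # Distances from every point to every centroid via broadcasting:
    # result has shape (n_samples, n_clusters).
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    # Each point joins the cluster of its nearest centroid.
    return dist.argmin(axis=1)
```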
- calinski_harabasz_index(self, X, labels, centroids)
- Calculate the Calinski-Harabasz Index for evaluating clustering performance.
Args:
X: The data matrix (numpy array).
labels: The cluster labels for each data point.
centroids: The centroids of the clusters.
Returns:
ch_index: The computed Calinski-Harabasz Index.
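The Calinski-Harabasz Index is the ratio of between-cluster dispersion to within-cluster dispersion, scaled by (n - k) / (k - 1); higher is better. A stand-alone sketch of the standard formula (illustrative, not the method's actual body):

```python
import numpy as np

def calinski_harabasz(X, labels, centroids):
    n, k = len(X), len(centroids)
    overall_mean = X.mean(axis=0)
    # Between-cluster dispersion: cluster sizes times squared
    # distance of each centroid from the overall mean.
    between = sum((labels == c).sum() * np.sum((centroids[c] - overall_mean) ** 2)
                  for c in range(k))
    # Within-cluster dispersion: squared distances to own centroid.
    within = sum(np.sum((X[labels == c] - centroids[c]) ** 2) for c in range(k))
    return (between / (k - 1)) / (within / (n - k))
```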
- davies_bouldin_index(self, X, labels, centroids)
- Calculate the Davies-Bouldin Index for evaluating clustering performance.
Args:
X: The data matrix (numpy array).
labels: The cluster labels for each data point.
centroids: The centroids of the clusters.
Returns:
db_index: The computed Davies-Bouldin Index.
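The Davies-Bouldin Index averages, over clusters, the worst-case ratio of summed intra-cluster scatter to centroid separation; lower is better. A stand-alone sketch of the standard formula (illustrative, not the method's actual body):

```python
import numpy as np

def davies_bouldin(X, labels, centroids):
    k = len(centroids)
    # Scatter: mean distance of each cluster's points to its centroid.
    s = np.array([np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean()
                  for c in range(k)])
    db = 0.0
    for i in range(k):
        # Worst similarity ratio of cluster i with any other cluster.
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        db += max(ratios)
    return db / k
```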
- elbow_method(self, max_k=10)
- Implement the elbow method to determine the optimal number of clusters.
Args:
max_k: The maximum number of clusters to test.
Returns:
distortions: A list of distortions for each k.
- find_optimal_clusters(self, max_k=10, true_k=None, save_dir=None)
- Find the optimal number of clusters using various evaluation metrics and plot the results.
Args:
max_k: The maximum number of clusters to consider.
true_k: The true number of clusters in the data.
save_dir: The directory to save the plot (optional).
Returns:
ch_optimal_k: The optimal number of clusters based on the Calinski-Harabasz Index.
db_optimal_k: The optimal number of clusters based on the Davies-Bouldin Index.
silhouette_optimal_k: The optimal number of clusters based on the Silhouette Score.
- fit(self)
- Fit the KMeans model to the data.
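The fit loop alternates assignment and centroid updates until centroid movement falls below tol. A self-contained sketch of that loop, assuming Euclidean distance (`kmeans_fit` is a hypothetical stand-alone helper; the optional `init` argument, not part of this class, is added here for reproducibility):

```python
import numpy as np

def kmeans_fit(X, n_clusters=3, max_iter=300, tol=1e-4, init=None, seed=None):
    rng = np.random.default_rng(seed)
    # Start from supplied centroids, or pick random data points.
    centroids = (np.asarray(init, dtype=float) if init is not None
                 else X[rng.choice(len(X), n_clusters, replace=False)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep an empty cluster's centroid unchanged.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if (labels == c).any() else centroids[c]
            for c in range(n_clusters)
        ])
        # Declare convergence when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```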
- initialize_centroids(self)
- Randomly initialize the centroids.
Returns:
centroids: The initialized centroids.
- predict(self, new_X)
- Predict the closest cluster each sample in new_X belongs to.
Args:
new_X: The data matrix to predict (numpy array).
Returns:
labels: The predicted cluster labels.
- silhouette_score(self, X, labels)
- Calculate the silhouette score for evaluating clustering performance.
Args:
X: The data matrix (numpy array).
labels: The cluster labels for each data point.
Returns:
silhouette_score: The computed silhouette score.
- update_centroids(self)
- Update the centroids based on the current cluster assignments.
Returns:
centroids: The updated centroids.
Data descriptors defined here:
- __dict__
- dictionary for instance variables
- __weakref__
- list of weak references to the object