builtins.object
    DBSCAN
    KMeans

class DBSCAN(builtins.object)
DBSCAN(X, eps=0.5, min_samples=5, compile_numba=False)
This class implements the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
Args:
X: The data matrix (numpy array).
eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
Methods:
- __init__: Initializes the DBSCAN object with the input parameters.
- fit: Fits the DBSCAN model to the data and assigns cluster labels.
- predict: Predicts the cluster labels for new data points.
- fit_predict: Fits the DBSCAN model and returns cluster labels.
- silhouette_score: Calculates the Silhouette Score for evaluating clustering performance.
- _handle_categorical: Handles categorical columns by one-hot encoding.
- _convert_to_ndarray: Converts input data to a NumPy ndarray and handles categorical columns.
- _custom_distance_matrix: Calculates the pairwise distance matrix using a custom distance calculation method.
Methods defined here:
- __init__(self, X, eps=0.5, min_samples=5, compile_numba=False)
- Initialize the DBSCAN object.
Args:
X: The data matrix (numpy array).
eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
compile_numba: Whether to compile the distance calculations using Numba for performance.
If not compiled, the first call to the numba fitting function will take longer, but subsequent calls will be faster.
- auto_eps(self, min=0.1, max=1.1, precision=0.01, return_scores=False, verbose=False)
- Find the optimal eps value for DBSCAN based on silhouette score.
Args:
min: The minimum eps value to start the search.
max: The maximum eps value to end the search.
precision: The precision of the search.
return_scores: Whether to return a dictionary of (eps, score) pairs.
verbose: Whether to print the silhouette score for each eps value.
Returns:
eps: The optimal eps value.
scores_dict (optional): A dictionary of (eps, score) pairs if return_scores is True.
- fit(self, metric='euclidean', numba=False)
- Fit the DBSCAN model to the data.
Algorithm Steps:
1. Calculate the distance matrix between all points in the dataset.
2. Identify core points based on the minimum number of neighbors within eps distance.
3. Assign cluster labels using depth-first search (DFS) starting from core points.
Args:
metric: The distance metric to use ('euclidean', 'manhattan', or 'cosine').
numba: Whether to use numba for faster computation.
Returns:
labels: The cluster labels for each data point.
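The three algorithm steps above can be sketched in plain NumPy. This is a minimal sketch assuming the Euclidean metric; `dbscan_fit` is a hypothetical stand-alone helper, not this class's API:

```python
import numpy as np

def dbscan_fit(X, eps=0.5, min_samples=5):
    # Step 1: pairwise Euclidean distance matrix, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    n = len(X)
    # Neighborhoods: indices within eps (each point includes itself).
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    # Step 2: core points have at least min_samples neighbors.
    core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)  # -1 marks noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Step 3: depth-first expansion from this core point.
        stack = [i]
        while stack:
            p = stack.pop()
            if labels[p] != -1:
                continue
            labels[p] = cluster
            if core[p]:
                stack.extend(neighbors[p])
        cluster += 1
    return labels
```

Border points (within eps of a core point but not core themselves) are labeled during the expansion but never expanded, which is what distinguishes them from core points in the DFS.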
- fit_predict(self, numba=False)
- Fit the DBSCAN model to the data and return the cluster labels.
Args:
numba: Whether to use numba for faster computation.
Returns:
labels: The cluster labels for the data.
- predict(self, new_X)
- Predict the cluster labels for new data points.
Note: DBSCAN does not naturally support predicting new data points.
Args:
new_X: The data matrix to predict (numpy array).
Returns:
labels: The predicted cluster labels (-1 for noise).
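Since DBSCAN has no native prediction step, one common workaround is to assign each new point the label of its nearest core point, or -1 if no core point lies within eps. A minimal sketch of that idea (`dbscan_predict` and its arguments are hypothetical, not this class's signature):

```python
import numpy as np

def dbscan_predict(new_X, core_points, core_labels, eps=0.5):
    # Assign each new point the cluster of its nearest core point,
    # or -1 (noise) if no core point lies within eps.
    labels = np.full(len(new_X), -1)
    for i, x in enumerate(new_X):
        d = np.linalg.norm(core_points - x, axis=1)
        j = d.argmin()
        if d[j] <= eps:
            labels[i] = core_labels[j]
    return labels
```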
- silhouette_score(self)
- Calculate the silhouette score for evaluating clustering performance.
Returns:
silhouette_score: The computed silhouette score.
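The silhouette score averages, over all points, s = (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b its mean distance to the nearest other cluster. A minimal stand-alone sketch (noise points with label -1 are excluded, a common convention; this helper is illustrative, not the class's implementation):

```python
import numpy as np

def silhouette_score(X, labels):
    # Exclude noise points (label -1) from the evaluation.
    mask = labels != -1
    X, labels = X[mask], labels[mask]
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a excludes the point itself
        if not same.any():  # singleton cluster: s = 0 by convention
            scores.append(0.0)
            continue
        a = dist[i, same].mean()
        # b: mean distance to the nearest *other* cluster.
        b = min(dist[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Scores near +1 indicate tight, well-separated clusters; scores near 0 indicate overlapping clusters.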
Data descriptors defined here:
- __dict__
- dictionary for instance variables
- __weakref__
- list of weak references to the object
class KMeans(builtins.object)
KMeans(X, n_clusters=3, max_iter=300, tol=0.0001)
This class implements the K-Means clustering algorithm along with methods for evaluating the optimal number of clusters and visualizing the clustering results.
Args:
X: The data matrix (numpy array).
n_clusters: The number of clusters.
max_iter: The maximum number of iterations.
tol: The tolerance to declare convergence.
Methods:
- __init__: Initializes the KMeans object with parameters such as the data matrix, number of clusters, maximum iterations, and convergence tolerance.
- _handle_categorical: Handles categorical columns in the input data by one-hot encoding.
- _convert_to_ndarray: Converts input data to a NumPy ndarray and handles categorical columns.
- initialize_centroids: Randomly initializes the centroids for KMeans clustering.
- assign_clusters: Assigns clusters based on the nearest centroid.
- update_centroids: Updates centroids based on the current cluster assignments.
- fit: Fits the KMeans model to the data by iteratively updating centroids and cluster assignments until convergence.
- predict: Predicts the closest cluster each sample in new_X belongs to.
- elbow_method: Implements the elbow method to determine the optimal number of clusters.
- calinski_harabasz_index: Calculates the Calinski-Harabasz Index for evaluating clustering performance.
- davies_bouldin_index: Calculates the Davies-Bouldin Index for evaluating clustering performance.
- silhouette_score: Calculates the Silhouette Score for evaluating clustering performance.
- find_optimal_clusters: Implements methods to find the optimal number of clusters using the elbow method,
Calinski-Harabasz Index, Davies-Bouldin Index, and Silhouette Score. It also plots the evaluation
metrics to aid in determining the optimal k value.
Methods defined here:
- __init__(self, X, n_clusters=3, max_iter=300, tol=0.0001)
- Initialize the KMeans object.
Args:
X: The data matrix (numpy array, pandas DataFrame, or list).
n_clusters: The number of clusters.
max_iter: The maximum number of iterations.
tol: The tolerance to declare convergence.
- assign_clusters(self, centroids)
- Assign clusters based on the nearest centroid.
Args:
centroids: The current centroids.
Returns:
labels: The cluster assignments for each data point.
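Nearest-centroid assignment reduces to one broadcast distance computation. A minimal sketch of the idea as a stand-alone function (not the method's actual body):

```python
import numpy as np

def assign_clusters(X, centroids):
    # Distances from every point to every centroid via broadcasting:
    # result has shape (n_samples, n_clusters).
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    # Each point joins the cluster of its nearest centroid.
    return dist.argmin(axis=1)
```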
- calinski_harabasz_index(self, X, labels, centroids)
- Calculate the Calinski-Harabasz Index for evaluating clustering performance.
Args:
X: The data matrix (numpy array).
labels: The cluster labels for each data point.
centroids: The centroids of the clusters.
Returns:
ch_index: The computed Calinski-Harabasz Index.
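The Calinski-Harabasz Index is the ratio of between-cluster dispersion to within-cluster dispersion, scaled by (n - k) / (k - 1); higher is better. A stand-alone sketch of the standard formula (illustrative, not the method's actual body):

```python
import numpy as np

def calinski_harabasz(X, labels, centroids):
    n, k = len(X), len(centroids)
    overall_mean = X.mean(axis=0)
    # Between-cluster dispersion: cluster sizes times squared
    # distance of each centroid from the overall mean.
    between = sum((labels == c).sum() * np.sum((centroids[c] - overall_mean) ** 2)
                  for c in range(k))
    # Within-cluster dispersion: squared distances to own centroid.
    within = sum(np.sum((X[labels == c] - centroids[c]) ** 2) for c in range(k))
    return (between / (k - 1)) / (within / (n - k))
```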
- davies_bouldin_index(self, X, labels, centroids)
- Calculate the Davies-Bouldin Index for evaluating clustering performance.
Args:
X: The data matrix (numpy array).
labels: The cluster labels for each data point.
centroids: The centroids of the clusters.
Returns:
db_index: The computed Davies-Bouldin Index.
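The Davies-Bouldin Index averages, over clusters, the worst-case ratio of summed intra-cluster scatter to centroid separation; lower is better. A stand-alone sketch of the standard formula (illustrative, not the method's actual body):

```python
import numpy as np

def davies_bouldin(X, labels, centroids):
    k = len(centroids)
    # Scatter: mean distance of each cluster's points to its centroid.
    s = np.array([np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean()
                  for c in range(k)])
    db = 0.0
    for i in range(k):
        # Worst similarity ratio of cluster i with any other cluster.
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        db += max(ratios)
    return db / k
```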
- elbow_method(self, max_k=10)
- Implement the elbow method to determine the optimal number of clusters.
Args:
max_k: The maximum number of clusters to test.
Returns:
distortions: A list of distortions for each k.
- find_optimal_clusters(self, max_k=10, true_k=None, save_dir=None)
- Find the optimal number of clusters using various evaluation metrics and plot the results.
Args:
max_k: The maximum number of clusters to consider.
true_k: The true number of clusters in the data.
save_dir: The directory to save the plot (optional).
Returns:
ch_optimal_k: The optimal number of clusters based on the Calinski-Harabasz Index.
db_optimal_k: The optimal number of clusters based on the Davies-Bouldin Index.
silhouette_optimal_k: The optimal number of clusters based on the Silhouette Score.
- fit(self)
- Fit the KMeans model to the data.
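The fit loop alternates assignment and centroid updates until centroid movement falls below tol. A self-contained sketch of that loop, assuming Euclidean distance (`kmeans_fit` is a hypothetical stand-alone helper; the optional `init` argument, not part of this class, is added here for reproducibility):

```python
import numpy as np

def kmeans_fit(X, n_clusters=3, max_iter=300, tol=1e-4, init=None, seed=None):
    rng = np.random.default_rng(seed)
    # Start from supplied centroids, or pick random data points.
    centroids = (np.asarray(init, dtype=float) if init is not None
                 else X[rng.choice(len(X), n_clusters, replace=False)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep an empty cluster's centroid unchanged.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if (labels == c).any() else centroids[c]
            for c in range(n_clusters)
        ])
        # Declare convergence when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```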
- initialize_centroids(self)
- Randomly initialize the centroids.
Returns:
centroids: The initialized centroids.
- predict(self, new_X)
- Predict the closest cluster each sample in new_X belongs to.
Args:
new_X: The data matrix to predict (numpy array).
Returns:
labels: The predicted cluster labels.
- silhouette_score(self, X, labels)
- Calculate the silhouette score for evaluating clustering performance.
Args:
X: The data matrix (numpy array).
labels: The cluster labels for each data point.
Returns:
silhouette_score: The computed silhouette score.
- update_centroids(self)
- Update the centroids based on the current cluster assignments.
Returns:
centroids: The updated centroids.
Data descriptors defined here:
- __dict__
- dictionary for instance variables
- __weakref__
- list of weak references to the object