Cluster

Create clusters out of our data

Sample clustering with KMeans

from sklearn.cluster import KMeans

# Load data
X = ... # Your data loading code here

# Define the number of clusters
n_clusters = ...

# Initialize the KMeans model
kmeans = KMeans(n_clusters=n_clusters, random_state=42)

# Fit the model to the data
kmeans.fit(X)

# Get the predicted cluster labels for each data point
labels = kmeans.labels_

# Get the cluster centers
centers = kmeans.cluster_centers_

# Evaluate the model (if ground truth labels are available)
from sklearn.metrics import adjusted_rand_score
y_true = ... # Your ground truth labels loading code here
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand Index: {ari:.4f}")
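
The template above leaves the data loading and the cluster count open. As a minimal runnable sketch, the same steps can be tried on synthetic data. The blob centers, sample count, n_init=10, and the *_demo variable names are illustrative assumptions, not part of the original recipe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for real data (assumption): 300 points around
# three well-separated centers, with known ground truth labels.
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=42,
)

# Fit KMeans with the known number of clusters and get the labels
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42)
labels_demo = kmeans_demo.fit_predict(X_demo)

# Compare the predicted labels with the ground truth
ari_demo = adjusted_rand_score(y_demo, labels_demo)
print(f"Adjusted Rand Index: {ari_demo:.4f}")
```

The Adjusted Rand Index ignores how the cluster ids are numbered, so it can be used even though KMeans assigns label numbers arbitrarily.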

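You will usually need to determine a suitable number of clusters. One way to do this is the elbow method: plot the inertia against the number of clusters and look for the point where the curve bends. The silhouette score can be used instead of, or alongside, the elbow method.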

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def plot_elbow(X, max_clusters=10):
    """
    Plots the elbow curve to help choose the optimal number of clusters for KMeans clustering.

    Parameters:
        X (numpy array or pandas DataFrame): The data to be clustered.
        max_clusters (int): The maximum number of clusters to try (default = 10).
    """
    
    # Initialize empty list to store inertia values
    inertia = []

    # Try different numbers of clusters and compute the inertia for each
    for k in range(1, max_clusters+1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(X)
        inertia.append(kmeans.inertia_)

    # Plot the elbow curve
    plt.plot(range(1, max_clusters+1), inertia)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('Elbow Curve')
    plt.show()
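
The silhouette score mentioned above can be computed in a similar loop. The following is a rough sketch rather than a definitive recipe: the helper name best_k_by_silhouette, the synthetic blobs, and n_init=10 are assumptions made for illustration. Note that the silhouette score is only defined for two or more clusters, so the loop starts at k=2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, max_clusters=10):
    """Return the k with the highest mean silhouette score, plus all scores.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    scores = {}
    # The silhouette score needs at least 2 clusters
    for k in range(2, max_clusters + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for real data (assumption): four separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6], [0, -8]],
    cluster_std=0.5,
    random_state=0,
)

best_k, scores = best_k_by_silhouette(X_demo, max_clusters=8)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

Scores range from -1 to 1, and higher values indicate tighter, better separated clusters. Unlike the elbow curve, this gives a single number to maximize, although on real data the maximum can be less clear-cut than on synthetic blobs.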

Other clustering models include DBSCAN, agglomerative clustering, and spectral clustering.
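
These models follow the same fit/predict pattern in scikit-learn, so they can be swapped in with little change. The sketch below is only an illustration on synthetic blobs: the eps, min_samples, and n_clusters values are assumptions chosen for this toy data and would need tuning on real data.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for real data (assumption): three separated blobs
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.5,
    random_state=0,
)

models = {
    # DBSCAN infers the number of clusters from density; it takes no n_clusters
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "Spectral": SpectralClustering(n_clusters=3, random_state=42),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    # DBSCAN marks noise points with the label -1; exclude it from the count
    results[name] = len(set(labels) - {-1})
    print(f"{name}: {results[name]} clusters found")
```

DBSCAN is the odd one out here: it does not take a cluster count and can leave some points unassigned as noise, which is useful when the data contains outliers.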

Reference material

  • https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning