Clustering time series data
Clustering time series data is a technique used to identify similar patterns within a set of time series. It can be used for a variety of applications such as detecting anomalies, grouping similar behaviors, and segmenting data. In this explanation, I will use the k-means algorithm as an example of how to cluster time series data in Python.
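First, we need to prepare our data. For this example, I will use the Daily Minimum Temperatures in Melbourne dataset, loaded with the get_rdataset helper from the statsmodels library. This helper downloads the data from the Rdatasets collection, so it needs an internet connection.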
import numpy as np
import pandas as pd
from statsmodels.datasets import get_rdataset
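# Load the Daily Minimum Temperatures in Melbourne dataset (requires network access).
# NOTE: the dataset name "melbtemp", the package "MASS", and the "date"/"temp" columns
# are taken as given here; if the lookup fails, load the series from a local copy instead.
data = get_rdataset("melbtemp", "MASS").data
ts = pd.Series(data["temp"].to_numpy(), index=pd.to_datetime(data["date"]))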
Next, we will need to transform our time series data into a feature space that can be used for clustering. One common method is to use sliding windows to generate a set of feature vectors, where each vector represents a segment of the time series. For this example, we will use a window size of 30, meaning each vector will represent a segment of 30 consecutive days.
# Define the window size
window_size = 30
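# Generate a set of feature vectors using sliding windows.
# The "+ 1" keeps the final full window, which the range would otherwise drop.
values = ts.to_numpy()
X = np.array([values[i:i + window_size] for i in range(len(values) - window_size + 1)])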
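The same window matrix can be built without a Python loop. The sketch below uses NumPy's sliding_window_view (available in NumPy 1.20 and later) on a synthetic series standing in for ts. The synthetic series, the make_windows helper, and its step parameter are illustrative assumptions and not part of the original example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(values, window_size, step=1):
    """Return a 2-D array whose rows are windows of `window_size` consecutive values.

    `step` (a hypothetical convenience parameter) keeps every step-th window,
    which reduces the overlap between neighbouring rows.
    """
    values = np.asarray(values, dtype=float)
    # sliding_window_view gives a read-only view of shape (n - window_size + 1, window_size)
    windows = sliding_window_view(values, window_size)
    # Copy so the result is an ordinary writable array that owns its data
    return windows[::step].copy()


# Synthetic stand-in for the temperature series: a noisy yearly cycle
rng = np.random.default_rng(0)
days = np.arange(730)
series = 11 + 4 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, size=days.size)

X_all = make_windows(series, window_size=30)
X_weekly = make_windows(series, window_size=30, step=7)
print(X_all.shape)     # (701, 30)
print(X_weekly.shape)  # (101, 30)
```

Because consecutive windows share 29 of their 30 values, neighbouring rows are nearly identical. A larger step trades some resolution for less redundancy in the data passed to the clustering step.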
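Now we can use the k-means algorithm to cluster our data, using the KMeans class from the scikit-learn library. k-means compares windows by Euclidean distance, so it groups windows whose values are close at the same day offsets.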
from sklearn.cluster import KMeans
# Define the number of clusters
n_clusters = 4
# Initialize the k-means algorithm with the number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=0)
# Fit the algorithm to the data
kmeans.fit(X)
# Get the cluster labels for each data point
labels = kmeans.labels_
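After fitting, it helps to check how many windows landed in each cluster and what the typical window in each cluster looks like. The following sketch does this on synthetic data, an assumption made so the snippet runs on its own. n_init=10 is set explicitly only to keep the behaviour consistent across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the window matrix X: 30-day windows of a noisy yearly cycle
rng = np.random.default_rng(0)
days = np.arange(730)
series = 11 + 4 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, size=days.size)
window_size = 30
X = np.array([series[i:i + window_size] for i in range(len(series) - window_size + 1)])

n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Number of windows assigned to each cluster
sizes = np.bincount(labels, minlength=n_clusters)

# Each row of cluster_centers_ is a centroid with one value per day in the window
centers = kmeans.cluster_centers_

for k in range(n_clusters):
    print(f"cluster {k}: {sizes[k]} windows, mean level {centers[k].mean():.1f}")
```

A cluster with very few windows, or two centroids that look almost the same, can be a hint that a different number of clusters is worth trying.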
Finally, we can visualize the results by plotting the time series in each cluster.
import matplotlib.pyplot as plt
# Plot the time series in each cluster
fig, axs = plt.subplots(n_clusters, figsize=(10, 8))
for i in range(n_clusters):
    axs[i].set_title("Cluster " + str(i))
    for j in range(len(X)):
        if labels[j] == i:
            axs[i].plot(X[j])
plt.show()
This produces one subplot per cluster, each overlaying every 30-day window assigned to that cluster.
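The plots show the shape of the windows in each cluster but not when those windows occur. One way to recover that, sketched here on a synthetic daily series, is to pair each label with the start date of its window and tabulate clusters by month. The dates, the values, and the names window_labels and by_month are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic daily series standing in for ts (dates and values are made up)
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=730, freq="D")
phase = 2 * np.pi * np.arange(len(index)) / 365
ts = pd.Series(11 + 4 * np.sin(phase) + rng.normal(0, 1, size=len(index)), index=index)

window_size = 30
values = ts.to_numpy()
n_windows = len(values) - window_size + 1
X = np.array([values[i:i + window_size] for i in range(n_windows)])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Window i covers positions i .. i + window_size - 1, so it starts at ts.index[i]
window_labels = pd.Series(labels, index=ts.index[:n_windows], name="cluster")

# Count how many windows starting in each calendar month fall into each cluster
by_month = pd.crosstab(
    window_labels.index.month.to_numpy(),
    window_labels.to_numpy(),
    rownames=["month"],
    colnames=["cluster"],
)
print(by_month)
```

If the clusters line up with particular months, the grouping is probably tracking the seasonal level of the series more than the shape of individual windows.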
Clustering time series data can be a useful technique for understanding and analyzing complex datasets. By identifying similar patterns in the data, we can gain insights into the underlying structure and behavior of the system being studied.