python - Sklearn.KMeans() : Get class centroid labels and reference to a dataset
scikit-learn, k-means, PCA dimensionality reduction
I have a dataset of 2 million rows and 7 columns, with different measurements of home power consumption and a date for each measurement:
- date,
- global_active_power,
- global_reactive_power,
- voltage,
- global_intensity,
- sub_metering_1,
- sub_metering_2,
- sub_metering_3
I put the dataset into a pandas DataFrame, selecting all columns but the date column, then perform a cross-validation split.
import pandas as pd
from sklearn.cross_validation import train_test_split

data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()
I use K-means classification, followed by PCA dimensionality reduction to display the result.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)

x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Now I want to find out which rows fell under a given class, and which dates fell under a given class.
- Is there a way to relate the points on the graph back to an index in my dataset, after PCA?
- Is there some method I don't know of?
- Or is my approach fundamentally flawed?
- Any recommendations?
I am new to this field and am trying to read through lots of code; this is a compilation of several examples I've seen documented.
My goal is to classify the data and then get the dates that fall under each class.
Thank you.
KMeans().predict(X) (docs here)

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters:
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.

Returns:
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
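To make that mapping concrete, here is a minimal sketch (using made-up toy data rather than the power-consumption file): the labels are positional, so the i-th label belongs to the i-th row passed in, and each label is an index into cluster_centers_.

import numpy as np
from sklearn.cluster import KMeans

# toy data: two well-separated groups (illustrative assumption only)
X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.1], [7.9, 8.3]])

k_means = KMeans(n_clusters=2, random_state=0)
k_means.fit(X)

labels = k_means.predict(X)   # one cluster index per row, in row order
print(labels)                 # e.g. [0 0 1 1] (cluster numbering may differ)

# each label indexes into the "code book" of centroids
print(k_means.cluster_centers_[labels[0]])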
The problem with the code you submitted is the use of

train_test_split()

which returns two arrays of random rows from the dataset, ruining the dataset order and making it difficult to correlate the labels returned from the KMeans classification with the sequential dates in the dataset.
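(If you do want to work on a random subsample, one way to keep the link back to the original rows is to split an array of row positions rather than the data itself. This is only a sketch of that idea, not part of the original answer, and it assumes the same file and column slice as above.)

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

df = pd.read_csv('household_power_consumption.txt', delimiter=';')
values = df.iloc[0:, 2:9].dropna()

# split row positions instead of values, so every sampled row stays traceable
idx_sample, idx_rest = train_test_split(np.arange(len(values)), train_size=.01)

hpc_fit = values.values[idx_sample]        # data actually fed to KMeans/PCA
original_rows = values.index[idx_sample]   # maps each sampled row back to the DataFrame (and its date)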
Here's an example:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# read the data into a pandas DataFrame
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
# merge the date and time columns and convert them to datetime objects
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df.set_index(pd.DatetimeIndex(df['datetime']), inplace=True)
df.drop(['date', 'time'], axis=1, inplace=True)

# put the last column (datetime) first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()
# convert the DataFrame to a data array, dropping the datetime column so it is not processed
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values

k_means = KMeans()
k_means.fit(hpc)

# array of cluster indexes, one per row, in the same order as the dataset
classified_data = k_means.labels_

# copy the DataFrame (may be memory intensive, but useful for illustration)
df_processed = df.copy()
df_processed['cluster class'] = pd.Series(classified_data, index=df_processed.index)
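Once the cluster class column is in place, getting the rows (and, because the dates are the index, the dates) for any one class is a plain boolean selection. A short follow-up sketch, assuming the df_processed frame built above and using 0 as an example class number:

# all rows that KMeans assigned to class 0 (0 is just an example class number)
class_0 = df_processed[df_processed['cluster class'] == 0]

# the dates for that class are the DatetimeIndex of the selection
print(class_0.index)

# or count how many measurements landed in each class
print(df_processed['cluster class'].value_counts())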
- Now you can see the result matched against the dataset on the right-hand side.
- Now that it's classified, it's up to you to derive meaning from the classes.
- This is an overall example of how it can be used, from start to finish.
- For displaying the result, look at PCA or make other graphs dependent on the cluster class (see the sketch after this list).
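For that last point, here is one possible sketch (not from the original answer): project the clustered measurements down to two dimensions with PCA and colour each point by its cluster class. It assumes the hpc array and classified_data labels from the example above.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the 7 measurement columns down to 2 components for plotting
hpc_2d = PCA(n_components=2).fit_transform(hpc)

# PCA keeps the row order, so classified_data still lines up point-for-point
plt.scatter(hpc_2d[:, 0], hpc_2d[:, 1], c=classified_data, s=4, cmap=plt.cm.Paired)
plt.xlabel('PCA component 1')
plt.ylabel('PCA component 2')
plt.title('Clusters after PCA projection')
plt.show()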