Machine Learning With Spark and Python
K-Means Algorithm
K-means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called k-means||. The implementation in spark.mllib has the following parameters:
- k is the number of desired clusters.
- maxIterations is the maximum number of iterations to run.
- initializationMode specifies either random initialization or initialization via k-means||.
- runs is the number of times to run the k-means algorithm. K-means is not guaranteed to find a globally optimal solution, so when it is run multiple times on a given dataset, the algorithm returns the best clustering result. Note that this parameter has no effect since Spark 2.0.0.
- initializationSteps determines the number of steps in the k-means|| algorithm.
- epsilon determines the distance threshold within which we consider k-means to have converged.
- initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
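To make the roles of k, maxIterations, epsilon and initialModel more concrete, here is a minimal single-machine sketch of the basic k-means loop in plain Python. It is not Spark's implementation: the function name `lloyd`, its argument names, and the toy points are made up for illustration, and the convergence test (stop once no center moves farther than epsilon) is a simplified reading of the parameter description above.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, initial_centers, max_iterations=20, epsilon=1e-4):
    """Run basic k-means from the given centers (k = len(initial_centers)).

    Returns the final centers and the number of iterations performed.
    """
    centers = [list(c) for c in initial_centers]
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its points.
        # A center with no points keeps its previous position.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        # Convergence check: largest distance any center moved.
        moved = max(math.sqrt(squared_distance(c, n))
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= epsilon:
            break
    return centers, iteration


# Toy data (illustrative only): two well-separated groups in 3-D.
toy_points = [[0.0] * 3, [0.1] * 3, [0.2] * 3,
              [9.0] * 3, [9.1] * 3, [9.2] * 3]
toy_centers, toy_iterations = lloyd(toy_points, [[0.0] * 3, [9.0] * 3])
```

Passing explicit starting centers plays the part of initialModel here. With random or k-means|| initialization, only the choice of those starting centers would differ.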
The following shows an implementation of the K-Means algorithm with Python and Spark.
The data file for the same can be downloaded from
- The following examples can be tested in the PySpark shell.
- In the following example, after loading and parsing the data, we use the KMeans object to cluster the data into two clusters.
- The number of desired clusters is passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE).
- You can reduce this error measure by increasing k. In fact, the optimal k is usually one where there is an “elbow” in the WSSSE graph.
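The steps listed above can be sketched roughly as follows. This is an assumption-laden outline rather than the original listing: it assumes PySpark with the RDD-based `pyspark.mllib` package is available, that the data file holds one point per line as whitespace-separated numbers, and that `sc` is the SparkContext the PySpark shell provides. The names `parse_line`, `cluster_file` and `data_path` are hypothetical, and `data_path` stands for wherever you saved the data file.

```python
def parse_line(line):
    """Turn one whitespace-separated text line into a list of floats."""
    return [float(x) for x in line.split()]


def cluster_file(sc, data_path, k=2, max_iterations=10):
    """Cluster the rows of a text file and return (centers, WSSSE).

    `sc` is an existing SparkContext; `data_path` is a placeholder
    for the location of the data file (not specified here).
    """
    # Imported inside the function so parse_line stays usable
    # on a machine without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # Load and parse the data, caching it because it is read twice.
    points = sc.textFile(data_path).map(parse_line).cache()

    # Train the model; k is the number of desired clusters.
    model = KMeans.train(points, k, maxIterations=max_iterations,
                         initializationMode="k-means||")

    # computeCost sums the squared distance from each point to its
    # nearest center, which is the WSSSE measure described above.
    wssse = model.computeCost(points)
    return model.clusterCenters, wssse
```

To look for the elbow, one option is to call `cluster_file` with the same `data_path` for several values of k and compare the returned WSSSE values.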
The final centers were
[ 0.1, 0.1, 0.1], [ 0.2, 0.2, 0.2], [ 9.2, 9.2, 9.2]