What is a cluster?
A collection of objects that are similar to each other.
Slightly Complex Definition:
A connected component of a level set of the probability density function of the underlying (and unknown) distribution from which our data samples are drawn.
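Written out in symbols, the definition might look like the sketch below. It assumes the "level set" means the upper level set of the density, which is a common convention but is not stated in the text. The symbols f (the unknown density), λ (the chosen level) and d (the number of dimensions) are introduced here only for illustration.

```latex
L(\lambda) = \{\, x \in \mathbb{R}^d : f(x) \ge \lambda \,\}, \qquad \lambda > 0
```

Under that assumption, a cluster at level λ is any connected component of L(λ), i.e. a region where the density stays at or above λ without a gap.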
You are given a problem to solve. What you have is a large amount of data with many dimensions. A human cannot read or understand this data just by looking at its raw form.
Even before you start defining your problem (hypothesis), you need to understand the data by performing exploratory data analysis (EDA) on it. There are multiple ways to do this.
What would be the first thing you do?
A. Perform Clustering
Perfect! Clustering is a good way to identify interesting parts of the data by grouping it.
What is clustering?
Clustering is the process of grouping a sample of data into smaller, naturally similar subgroups called clusters. Below you can see a plot of the iris dataset clustered with the K-Means algorithm.
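To make the idea of grouping concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is an illustration under stated assumptions, not the code behind the plot: it uses a tiny made-up 2D dataset rather than iris, and the function name `kmeans`, the helper `sq_dist`, and the parameters `iterations` and `seed` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Minimal Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initialise centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels


# Toy data (made up): two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary. What matters is which points share a label.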
What would be your first choice of a clustering algorithm?
A. K-Means, K-Medoids, Hierarchical, Spectral, DBSCAN?
Hold On! Not so fast.
Clustering means different things in different applications, and the results vary with the data the algorithm sees. Hence the choice of algorithm also depends on the data. If you are dealing with image data, for example, you should select the algorithm carefully, because most clustering algorithms are instance-based learning methods that are expensive to compute and require a lot of memory. The more data you show to the algorithm, the more memory it occupies.
These algorithms can also take a relatively long time to converge. Some of them have a time complexity of O(n log(n)), and a few alternatives with linear complexity also exist.
What is the input to a clustering algorithm?
A. Just data.
Clustering is an unsupervised technique, i.e., the only input the algorithm requires is plain data. This is unlike supervised algorithms such as classification, which require every record in the sample to be mapped to a label.
After you have settled on an algorithm and fed data to it, what comes next? Determining whether the resulting clusters are good.
What is a good cluster?
A clustering is good if it separates the data cleanly. By that we mean it clearly identifies which data points belong to different clusters and assigns cluster labels to them.
Some technical points to note are:
- The inter-cluster similarity should be low (the distance between different clusters should be large)
- The intra-cluster similarity should be high (the distance between points within a cluster should be small)
If the above properties are satisfied, we can say that the algorithm has resulted in good clusters.
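One rough way to check cluster quality is to compare the average distance between points that share a cluster with the average distance between points in different clusters. In a good clustering the first should be much smaller than the second. The sketch below assumes Euclidean distance. The function name `mean_pairwise_distances` and the toy data are made up for illustration.

```python
import math
from itertools import combinations


def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist;
    otherwise one of the averages would divide by zero.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        if lp == lq:
            intra.append(d)  # both points are in the same cluster
        else:
            inter.append(d)  # points are in different clusters
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy data (made up): two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

print(mean_pairwise_distances(points, good_labels))
print(mean_pairwise_distances(points, bad_labels))
```

With the good labelling the intra-cluster average is small relative to the inter-cluster average. With the mixed labelling the two averages move much closer together.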
How is similarity measured among data points?
Similarity or dissimilarity is measured by the distance between the spatial coordinates of two points.
There are multiple options for this measure:
- Euclidean Distance
- Manhattan Distance
These are the two popular choices, but any other spatial distance metrics will work too.
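The two distances named above can be sketched in a few lines of plain Python. The function names `euclidean` and `manhattan` are chosen here for illustration, and both assume the two points have the same number of coordinates.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

The coordinate differences here are 3 and 4, so the Euclidean distance is the hypotenuse (5) and the Manhattan distance is their sum (7).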
Advantages of K-Means:
- Simple Model
- Easy to understand
- Assigns labels to data automatically
Disadvantages of K-Means:
- K must be chosen manually
- Can converge to a local minimum
- Sensitive to outliers
- Every item gets a label, even items that do not fit any cluster well
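The sensitivity to outliers comes from the fact that a K-Means centroid is the mean of its cluster's points, so a single far-away point can drag it. A small sketch with made-up numbers (the helper name `centroid` is hypothetical):

```python
def centroid(points):
    """Mean of a list of points, computed coordinate by coordinate."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))


# Toy cluster (made up): four points on a unit square.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
print(centroid(cluster))                   # (1.5, 1.5)

# Adding a single distant point moves the centroid far from the original four.
print(centroid(cluster + [(50.0, 50.0)]))  # (11.2, 11.2)
```

Because every item gets a label, the outlier is also still assigned to a cluster rather than being flagged as noise.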
Applications of Clustering
Clustering has wide applications across different domains, but the basic idea remains the same: “Group data into its natural subgroups.”
- Customer Segmentation
- Market Research
- Exploratory Data Analysis
- Image Segmentation
These are just a few examples. Overall, any problem with implicit groups in its data can use clustering.