Silhouette (clustering)
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object lies within its cluster. It was first described by Peter J. Rousseeuw in 1987.[1]
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
Definition
Assume the data have been clustered via any technique, such as k-means, into $k$ clusters. For each datum $i$, let $a(i)$ be the average dissimilarity of $i$ with all other data within the same cluster. We can interpret $a(i)$ as how well $i$ is assigned to its cluster (the smaller the value, the better the assignment). We then define the average dissimilarity of point $i$ to a cluster $c$ as the average of the distance from $i$ to all points in $c$.
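The computation of $a(i)$ can be sketched as follows. This is only an illustration, not a reference implementation: the function names (`euclidean`, `manhattan`, `a_value`), the toy points and their labels are all hypothetical, and returning 0.0 for a cluster with a single member is an assumption of this sketch. The distance function is passed in as a parameter, since any dissimilarity measure may be used.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))

def a_value(i, points, labels, dist):
    """Average dissimilarity of point i to the other members of its own cluster."""
    same = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    if not same:
        # Assumption of this sketch: a point alone in its cluster gets a(i) = 0.
        return 0.0
    return sum(dist(points[i], points[j]) for j in same) / len(same)

# Hypothetical toy data: two clusters in the plane.
points = [(0, 0), (3, 4), (0, 2), (10, 10), (10, 12)]
labels = [0, 0, 0, 1, 1]

a_euclid = a_value(0, points, labels, euclidean)     # (5 + 2) / 2
a_manhattan = a_value(0, points, labels, manhattan)  # (7 + 2) / 2
```

The same point receives a different $a(i)$ under each metric, so silhouette values computed with different metrics are not directly comparable.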
Let $b(i)$ be the lowest average dissimilarity of $i$ to any other cluster, of which $i$ is not a member. The cluster with this lowest average dissimilarity is said to be the "neighbouring cluster" of $i$ because it is the next best fit cluster for point $i$.
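A minimal sketch of finding $b(i)$ and the neighbouring cluster is given below. It assumes one-dimensional data with the absolute difference as the dissimilarity, and at least two clusters. The helper names and the sample values are hypothetical.

```python
def mean_distance_to_cluster(i, members, values):
    """Average |x_i - x_j| over all members j of the given cluster."""
    return sum(abs(values[i] - values[j]) for j in members) / len(members)

def b_value(i, values, labels):
    """Return (b(i), label of the neighbouring cluster) for point i."""
    clusters = {}
    for j, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(j)
    # Average dissimilarity of i to every cluster it does not belong to.
    candidates = {
        lab: mean_distance_to_cluster(i, members, values)
        for lab, members in clusters.items()
        if lab != labels[i]
    }
    # Raises ValueError if there is only one cluster, where b(i) is undefined.
    neighbour = min(candidates, key=candidates.get)
    return candidates[neighbour], neighbour

# Hypothetical 1-D data in three clusters.
values = [1.0, 2.0, 6.0, 8.0, 20.0, 22.0]
labels = ["A", "A", "B", "B", "C", "C"]

b, neighbour = b_value(1, values, labels)
```

For the point with value 2.0, the average distance is 5 to cluster B and 19 to cluster C, so B is its neighbouring cluster.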
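We now define a silhouette:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
which can also be written as:
$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}$$
From the above definition it is clear that
$$-1 \le s(i) \le 1$$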
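The two forms of the definition can be checked against each other numerically. The sketch below assumes non-negative dissimilarities and treats the case where both averages are zero as a silhouette of 0. The function names and the sample pairs are illustrative only.

```python
def silhouette(a, b):
    """Ratio form: (b - a) / max(a, b), with 0 when both averages are zero."""
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom

def silhouette_piecewise(a, b):
    """Equivalent three-case form."""
    if a < b:
        return 1 - a / b
    if a > b:
        return b / a - 1
    return 0.0

# Hypothetical (a, b) pairs: well matched, badly matched, borderline, perfectly tight.
cases = [(1.0, 4.0), (4.0, 1.0), (2.0, 2.0), (0.0, 3.0)]
results = [(silhouette(a, b), silhouette_piecewise(a, b)) for a, b in cases]
```

Both functions agree on every pair, and each result lies in the interval from -1 to 1.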
For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighbouring cluster. Thus an $s(i)$ close to one means that the datum is appropriately clustered. If $s(i)$ is close to negative one, then by the same logic we see that $i$ would be more appropriately placed in its neighbouring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.
The average $s(i)$ over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. Thus the average $s(i)$ over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of $k$ is used in the clustering algorithm (e.g. k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and averages may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.[2]
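As a rough illustration of using the average silhouette to compare numbers of clusters, the sketch below scores three candidate labelings of the same one-dimensional data. The labelings are written by hand and are hypothetical; in practice they would come from running a clustering algorithm such as k-means once for each value of $k$. The absolute difference serves as the dissimilarity, and points in single-member clusters are given a silhouette of 0, which is an assumption of this sketch.

```python
from collections import defaultdict

def silhouette_values(values, labels):
    """Per-point silhouette for 1-D data, using |x - y| as the dissimilarity."""
    clusters = defaultdict(list)
    for j, lab in enumerate(labels):
        clusters[lab].append(j)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumption: a single-member cluster scores 0
            continue
        a = sum(abs(values[i] - values[j]) for j in own) / len(own)
        # Lowest average dissimilarity to any cluster that i is not a member of.
        b = min(
            sum(abs(values[i] - values[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores

def average_silhouette(values, labels):
    scores = silhouette_values(values, labels)
    return sum(scores) / len(scores)

# Hypothetical data with three visible groups, and hand-written labelings for k = 2, 3, 4.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}
averages = {k: average_silhouette(values, labels) for k, labels in candidates.items()}
best_k = max(averages, key=averages.get)
```

On this toy data the labeling with three clusters has the highest average, while merging two groups (k = 2) or splitting one (k = 4) lowers it.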
References
1. Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
2. R.C. de Amorim, C. Hennig (2015). "Recovering the number of clusters in data sets with noise features using feature rescaling factors". Information Sciences 324: 126–145. doi:10.1016/j.ins.2015.06.039.


