import matplotlib.pyplot as plt
import numpy as np
from Bio import Phylo
%matplotlib widgetIn this notebook, we continue to cluster the instances in the iris 2D dataset without looking at the class attribute. In particular, we will use the hierarchical clustering algorithm that implements the agglomerative nesting (AGNES) for different cluster distance measures.
Classes-to-clusters evaluation¶
Similar to the last tutorial, use the Cluster panel to cluster the iris.2D dataset (not not iris dataset):
- Using the
Preprocesspanel, loadiris.2D.arfffrom the Weka data folder. - Using the
Clusterpanel, choose theClustererasHierarchicalClusterer. - The default number of clusters is . Change it to instead, i.e., set
numClustersto 3. - Select
Classes to clusters evaluationas thecluster mode.[1] - Click
Startto run the clustering algorithm.
Single-linkage method¶
The default linkage criterion (linkType) is SINGLE, which means that the hierarchical clustering algorithm iteratively links/merges two clusters and into one to minimize the cluster distance defined as
namely, the minimum distance between points in different clusters. This gives the single-linkage clustering solution.
# YOUR CODE HERE
raise NotImplementedError()
error_rateYOUR ANSWER HERE
Complete-linkage method¶
Repeat the hierarchical clustering procedure with linkType set to COMPLETE. The distance between two clusters is now measured by
which is the maximum distance between points in different clusters. The measure gives the complete-linkage clustering solution.
Assign to error_rate the fraction (NOT percentage) of incorrectly clustered instances.
# YOUR CODE HERE
raise NotImplementedError()
error_rateVisualize the clustering assignments again and explain whether the complete linkage solution is better than the single-linkage solution.
YOUR ANSWER HERE
Dendrogram¶
To visualize the dendrogram in Weka,
- right-click the clustering result buffer, and
- select
Visualize tree.
You should see the dendrogram below for the complete linkage method:
There are two issues with the dendrogram:
- The dendrogram is incomplete and only contains nodes for the first cluster. The agglomerative clustering stopped immediately after having
numClusters = 3clusters. - The leaf nodes have names that are fractions. By default, the names are taken from the last attribute other than the ignored class attribute. In this case, a name such as 0.2 is a
petalwidth, which is not helpful in identifying the samples or evaluating the clustering solution.
To fix the above issues:
- Set
numClustersto 1 so that the agglomerative clustering continues to merge all nodes into a single cluster. - In the preprocess panel, create a new attribute,
ID, as the last attribute to name the leaf node uniquely. You can copy the following configuration into theFilterpanel:weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.AddID -C last -N ID" -F "weka.filters.unsupervised.attribute.NumericToNominal -R last" -F "weka.filters.unsupervised.attribute.NominalToString -C last"
The above uses MultiFilter to compose three filters sequentially into one meta-filter:
AddIDcreates a unique numeric ID for each sample.NumericToNominalfollowed byNominalToStringconverts the ID to a string. The conversion ensures that the clustering algorithm uses it to name the leaf but not in the calculation of cluster distances.
After applying the filter and rerunning the clustering algorithm with numClusters = 1 and class attribute ignored, the resulting dendrogram is as follows:
For the single-linkage method, give the dendrogram where
- the root is the cluster of all nodes, and
- each leaf name a unique ID.
YOUR ANSWER HERE
The visualization is hard to read/analyze, especially for a large data set. The height is also not labeled.
Indeed, the dendrogram is printed in the result buffer as a phylogenetic tree in the Newick format with distances and leaf names. E.g., the dendrogram obtained by the complete linkage method is printed as:
Cluster 0
(((((((((((1:0,2:0):0,29:0):0,(5:0,9:0):0):0,((34:0,48:0):0,50:0):0):0.01695,((3:0,43:0):0,(37:0,39:0):0):0.01695):0.01695,((4:0,(8:0,11:0):0):0,(28:0,(40:0,49:0):0):0):0.0339):0.01982,(((7:0,(18:0,46:0):0):0.01695,(41:0,42:0):0.01695):0.01695,20:0.0339):0.01982):0.03625,(((10:0,38:0):0,(33:0,35:0):0):0.01695,13:0.01695):0.07301):0.01746,(14:0.04498,((15:0,36:0):0.0339,23:0.0339):0.01108):0.06245):0.11748,(((((6:0.01695,27:0.01695):0.01695,((16:0,22:0):0,32:0):0.0339):0.0339,17:0.0678):0.02982,(24:0.04498,44:0.04498):0.05264):0.07663,((((((((12:0,26:0):0,30:0):0,31:0):0,47:0):0.01695,21:0.01695):0.02803,19:0.04498):0.00873,25:0.05371):0.04391,45:0.09762):0.07663):0.05066):1.11921,(((((((((51:0,64:0):0.01695,77:0.01695):0.06263,(((56:0.01695,59:0.01695):0.01695,88:0.0339):0.01982,((66:0,76:0):0.0339,92:0.0339):0.01982):0.02586):0.01804,74:0.09762):0.04423,(((((54:0,72:0):0,90:0):0.01695,(89:0,100:0):0.01695):0.0339,((75:0,98:0):0.01695,(95:0,97:0):0.01695):0.0339):0.02873,(91:0.0339,96:0.0339):0.04568):0.06227):0.10672,(((((((((52:0,(79:0,85:0):0):0,67:0):0,69:0):0.01695,55:0.01695):0.01695,87:0.0339):0.01982,(57:0.0339,86:0.0339):0.01982):0.03625,107:0.08996):0.00766,62:0.09762):0.06153,(((53:0,73:0):0.0339,(120:0.01695,134:0.01695):0.01695):0.05114,(78:0.04498,84:0.04498):0.04006):0.07411):0.08942):0.14931,(((((71:0,(127:0,139:0):0):0.01695,(124:0,128:0):0.01695):0.04879,(((102:0,143:0):0.01695,147:0.01695):0.02803,150:0.04498):0.02076):0.04169,(((111:0.01695,114:0.01695):0.01695,122:0.0339):0.04568,(112:0.04498,148:0.04498):0.03459):0.02785):0.1693,(((104:0.01695,(117:0,138:0):0.01695):0.0678,(109:0.0339,126:0.0339):0.05085):0.09518,(130:0.08996,135:0.08996):0.08996):0.0968):0.12116):0.1883,((((58:0,94:0):0.0339,(61:0,80:0):0.0339):0.06054,99:0.09443):0.10278,((60:0.06574,65:0.06574):0.10434,(((63:0.01695,68:0.01695):0.05085,82:0.0678):0.02982,((70:0.01695,81:0.01695):0.03676,(83:0.01695,93:0.01695):0.03676):0.04391):0.07246):0.02714):0.38897):0.24263,(((((101:0.01695,110:0.01695):0.07301,(136:0.0339,144:0.0339):0.05607):0.01746,((137:0,141:0):0.04498,145:0.04498):0.06245):0.09715,((((103:0.0339,125:0.0339):0.05085,((113:0.01695,129:0.01695):0.01695,140:0.0339):0.05085):0.01288,((105:0.0339,133:0.0339):0.01108,121:0.04498):0.05264):0.0868,(115:0.06574,(((116:0.01695,146:0.01695):0.01695,142:0.0339):0.01695,149:0.05085):0.01489):0.11868):0.02016):0.1177,(((106:0.04498,118:0.04498):0.05264,119:0.09762):0.13421,((108:0.05371,131:0.05371):0.05619,(123:0.05085,132:0.05085):0.05905):0.12193):0.09046):0.50654):0.5153)For the above to be a proper newick format,
Cluster 0should be removed as it is not part of the Newick format; and- a semi-colon
;should be added to the end to complete the definition of a tree.
The text file complete.tree fixes the above issues:
The tree can be loaded using:
from Bio import Phylocl_tree = Phylo.read("complete.tree", "newick")
def plot_tree(tree, **kwargs):
Phylo.draw(
tree,
branch_labels=lambda c: (l := c.branch_length)
and l > 0.1
and f"{l:.4g}"
or None,
**kwargs,
)
plt.gcf().set_size_inches(10, 20)
plt.title("Phylogenetic tree")
plt.tight_layout()
plot_tree(cl_tree)Different from a dendrogram which plots the tree against cophenetic distance, the above plots the phylogenetic tree against the branch length from the root.
To obtain the 3 clusters returned by Weka with :
cl_clusters = [cl_tree.clade[0], cl_tree.clade[1, 0], cl_tree.clade[1, 1]]
cl_clustersThe longer the branch length, the more stable the cluster is, as a longer branch length means the cluster is less easy to split up or merge with other clusters.
To plot the clusters in different colors:
cl_tree.root.color = "grey"
for c, color in zip(cl_clusters, ["red", "blue", "green"]):
c.color = color
plot_tree(cl_tree)To obtain the leaf nodes for each of the clusters:
cl_clusters_leaves = [[i.name for i in c.get_terminals()] for c in cl_clusters]
for j, leaves in enumerate(cl_clusters_leaves):
print(f'Cluster {j} with {len(leaves)} nodes: \n{" ".join(leaves)}\n')The larger the cluster, the more stable it is as such a cluster is supported by many similar instances.
To obtain the cophenetic distance of the root, i.e., the cluster containing all nodes, we can compute the total branch distance from the root to any leaf node, say node 1:
cl_d0 = cl_tree.distance("1")
cl_d0The second largest cophenetic distance at the root splits into 2 clusters is:
cl_d1 = cl_d0 - cl_tree.root.clades[1].branch_length
cl_d1The minimum and maximum cophenetic distance at which we have 3 clusters is:
cl_max_distance, cl_min_distance = cl_d1, cl_d1 - min(
c.branch_length for c in cl_clusters
)
cl_max_distance, cl_min_distanceplot_tree(
cl_tree,
axvspan=((cl_d0 - cl_max_distance, cl_d0 - cl_min_distance), {"facecolor": "0.8"}),
)For more examples, see Biopython Tutorial and Cookbook.
For the single-linkage clustering solution, assign to sl_max_distance and sl_min_distance the maximum and minimum cophenetic distance at which we can obtain 3 clusters.
# YOUR CODE HERE
raise NotImplementedError()
sl_max_distance, sl_min_distance# tests
assert np.isclose(sl_max_distance - sl_min_distance, 0.006629999999999997, rtol=1e-3)To obtain the largest cophenetic distance at which we have at least three clusters:
Similar to the complete-linkage method, plot the phylogenetic tree for the single-linkage method. Furthermore,
- shade the range of branch length at which we can obtain 3 clusters, and
- color the 3 clusters returned by Weka differently.
# YOUR CODE HERE
raise NotImplementedError()By comparing the dendrograms obtained by the single-linkage and complete-linkage methods, explain why the complete-linkage method gives a better solution for the iris dataset.
Hint
Consider the range of cophenetic distance that gives 3 clusters instead of 2.
YOUR ANSWER HERE