
Machine vs Machine

City University of Hong Kong
Figure: Terminator 2
import io
import logging
import os
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import weka.core.jvm as jvm
from IPython import display
from joblib import Memory, Parallel, delayed, dump, load
from scipy.io import arff
from sklearn import ensemble, tree
from sklearn.model_selection import GridSearchCV
from weka.classifiers import Classifier, Evaluation, SingleClassifierEnhancer
from weka.core.converters import Loader

%matplotlib widget
jvm.start(logging_level=logging.ERROR)
# cache to private folder
os.makedirs("private", exist_ok=True)
memory = Memory(location="private", verbose=0)

# To tabulate results of ensemble methods
def tabulate(results):
    # Uses the global max_depth_list and n_estimators_list defined later in the notebook.
    df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    df.insert(0, "n_estimators", n_estimators_list)
    # The results iterate over max_depth in the outer loop and n_estimators in the
    # inner loop, so fill the accuracy columns column-by-column (Fortran order).
    df.loc[:, lambda df: ~df.columns.isin(["n_estimators"])] = np.reshape(
        results, (len(n_estimators_list), len(max_depth_list)), order="F"
    )
    return df

# To plot the dataframe
def plot(df):
    for col in df.columns[1:]:
        plt.plot(df["n_estimators"], df[col], label=col, marker="o")
    plt.legend()
    plt.xlabel("n_estimators")
    plt.ylabel("Accuracies")

# Load file
def load_file(filename):
    # Move the file into the private cache folder (if it is not already there)
    # and load it with joblib.
    if os.path.exists(filename):
        os.replace(filename, "private/" + filename)
    return load("private/" + filename)

In this notebook, we will try to build the best machine to classify the image segmentation dataset:

  • segment-challenge.arff for training, and
  • segment-test.arff for testing.

Knowledge Flow Interface

Weka provides a KnowledgeFlow interface to flow data through a learning algorithm. Unlike the other interfaces (Explorer and Experimenter), the KnowledgeFlow interface can train a classifier incrementally as more and more data become available.

For an overview of the interface, watch the video tutorial by Witten. For more details, refer to the manual here.

Open a layout

  • Run Weka.
  • Click KnowledgeFlow button under Applications.
  • A new untitled layout should open. You can also create a new layout using File->New Layout.
  • To load a layout, click File->Open.... Try opening segment-RF.kf and segment-Adaboost.kf.
  • To save the current layout to a new file, click the menu item File->Save/Save As....

Run a layout

With segment-RF.kf or segment-Adaboost.kf opened:

  • Click the play button to start the training/testing. Unlike in the Explorer interface, we can flow data through multiple classification algorithms simultaneously.
  • If the Status panel at the bottom shows a successful run,
    • right-click any TextViewer and select show results to show the collected result;[1]
    • right-click any GraphViewer and select show plots to show the plots.

The following demonstrates how to use the interface to create a layout that trains and tests a J48 decision tree.

Load the data

To load the training data:

  • Click ArffLoader (under the DataSources folder) from the Design panel on the left, then click anywhere on the layout panel to add it.[2]
  • Similarly, add a ClassAssigner and a TrainingSetMaker (under Evaluation folder) to the layout.
  • Right-click the ArffLoader and select dataSet (under Connections:). Click ClassAssigner in the layout to connect the data to it.[3]
  • Similarly, connect the data from the ClassAssigner to TrainingSetMaker.
  • Right-click the ArffLoader and click Browse... to select the training data segment-challenge.arff.

To load the test data:

  • Add another ArffLoader and ClassAssigner to the layout and connect the data from the former to the latter. Alternatively, instead of adding the same block/connection multiple times, you can select, copy, and paste existing blocks.[4]
  • Add a TestSetMaker (instead of a TrainingSetMaker under the Evaluation folder) and connect the data from the new ClassAssigner to it.
  • Configure the new ArffLoader to load segment-test.arff (instead of segment-challenge.arff).

Setup the classifier

  • Add a J48 (under Classifiers/trees folder)[5] and a ClassifierPerformanceEvaluator[6] (under Evaluation) to the layout.
  • Connect the trainingSet from TrainingSetMaker to J48.
  • Connect the testSet from TestSetMaker to J48.
  • Connect the batchClassifier from J48 to ClassifierPerformanceEvaluator.

Display the result

  • Add two TextViewers and a GraphViewer (under Visualization folder) to the layout.[7]
  • Connect the text and graph from J48 to the first TextViewer and the GraphViewer, respectively.
  • Connect the text from the ClassifierPerformanceEvaluator to the second TextViewer.
# YOUR CODE HERE
raise NotImplementedError()
performance
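
For reference, the same training-and-testing flow can also be reproduced programmatically with python-weka-wrapper. The following is only a minimal sketch of that equivalence (it does not build the KnowledgeFlow layout, and it is not necessarily the intended solution to the cell above); it assumes the JVM started at the top of the notebook is still running and uses the same two ARFF files.

# A minimal sketch: mirror the KnowledgeFlow layout programmatically.
from weka.classifiers import Classifier, Evaluation
from weka.core.converters import Loader

data_url = (
    "https://raw.githubusercontent.com/fracpete/wekamooc/master/dataminingwithweka/data/"
)
loader = Loader(classname="weka.core.converters.ArffLoader")  # ArffLoader block
train = loader.load_url(data_url + "segment-challenge.arff")
train.class_is_last()                                         # ClassAssigner block
test = loader.load_url(data_url + "segment-test.arff")
test.class_is_last()

J48 = Classifier(classname="weka.classifiers.trees.J48")
J48.build_classifier(train)   # trainingSet -> J48
evl = Evaluation(test)
evl.test_model(J48, test)     # batchClassifier -> ClassifierPerformanceEvaluator
print(f"Accuracy: {evl.percent_correct / 100:.4g}")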

Ensemble Methods

Unlike training an individual classifier, ensemble methods

  • train an army of base classifiers and
  • combine their decisions into a final decision.
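
As a toy illustration of combining decisions, the following minimal sketch takes a majority vote over a few simple base classifiers using scikit-learn's VotingClassifier. It runs on the Iris dataset purely for illustration (the segmentation data is only loaded below), and the choice of base classifiers here is arbitrary.

# A toy sketch: combine base classifiers by majority ("hard") voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # each base classifier casts one vote; the majority wins
)
vote.fit(X_tr, y_tr)
print(f"Majority-vote accuracy: {vote.score(X_te, y_te):.4g}")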

We will use the ensemble methods implemented in scikit-learn and Weka. To load the data for scikit-learn:

def load_url(url):
    # Download the ARFF file, parse it with scipy.io.arff, and wrap it in a DataFrame.
    ftpstream = urllib.request.urlopen(url)
    df = pd.DataFrame(arff.loadarff(io.StringIO(ftpstream.read().decode("utf-8")))[0])
    # Return the features and the class labels separately.
    return df.loc[:, lambda df: ~df.columns.isin(["class"])], df["class"].astype(str)


weka_data_path = (
    "https://raw.githubusercontent.com/fracpete/wekamooc/master/dataminingwithweka/data/"
)
X_train, Y_train = load_url(weka_data_path + "segment-challenge.arff")
X_test, Y_test = load_url(weka_data_path + "segment-test.arff")

To load the data for python-weka-wrapper:

loader = Loader(classname="weka.core.converters.ArffLoader")
trainset = loader.load_url(
    weka_data_path + "segment-challenge.arff"
)  # use load_file to load from file instead
trainset.class_is_last()

testset = loader.load_url(weka_data_path + "segment-test.arff")
testset.class_is_last()

Bagging

Bagging (Bootstrap Aggregation) is a simple ensemble method that trains different base classifiers by applying a classification algorithm to different bootstrapped datasets.
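
To make the idea of a bootstrapped dataset concrete, the following minimal sketch draws one bootstrap sample by sampling row indices with replacement (numpy is already imported as np above; the dataset size here is a made-up number for illustration). On average, only roughly 63% (about 1 - 1/e) of the original rows appear in each bootstrap sample, which is what makes the resulting base classifiers differ.

# A minimal sketch of bootstrapping: draw n row indices with replacement from n rows.
rng = np.random.default_rng(0)
n = 1000                                             # illustrative dataset size
bootstrap_idx = rng.integers(0, n, size=n)           # indices drawn with replacement
unique_fraction = len(np.unique(bootstrap_idx)) / n  # roughly 1 - 1/e on average
print(f"Fraction of distinct rows in the bootstrap sample: {unique_fraction:.3f}")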

For instance, the following uses the sklearn.ensemble.BaggingClassifier to train 10 decision trees with a maximum depth of 5:

from sklearn import ensemble
BAG = ensemble.BaggingClassifier(
    estimator=tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0,
)

BAG.fit(X_train, Y_train)
print(f"Accuracy: {BAG.score(X_test, Y_test):.4g}")

The ensemble method can be parallelized for both training and classification by setting the additional parameter n_jobs, i.e., the number of jobs to run in parallel. Different jobs run on different CPU cores or threads. To see the effect, execute the following cell and answer y to the prompt (or just press enter). Your output may look like the following:

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.0s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.1s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.1s finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    1.8s remaining:    1.8s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    1.8s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    1.1s remaining:    3.4s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    1.2s finished
if input("execute? [Y/n] ").lower() != "n":
    for n_jobs in [1, 2, 4, 8]:
        BAG.set_params(n_estimators=1000, verbose=1, n_jobs=n_jobs)
        BAG.fit(X_train, Y_train)

Next, we would like to see the effect of changing the depth and number of estimators. The following are the lists of possible values to explore:

max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

We will define a function bagging(n_estimators, max_depth) that returns the accuracy of Bagging n_estimators decision trees of maximum depth max_depth. To avoid re-training/evaluating a classifier, we additionally cache the result using joblib.Memory:

from joblib import Memory
import os

# cache to private folder
os.makedirs("private", exist_ok=True)
memory = Memory(location="private", verbose=0)
@memory.cache
def bagging(n_estimators, max_depth):
    BAG = ensemble.BaggingClassifier(
        estimator=tree.DecisionTreeClassifier(max_depth=max_depth),
        n_estimators=n_estimators,
        random_state=0,
    )
    BAG.fit(X_train, Y_train)
    return BAG.score(X_test, Y_test)
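
As a quick check of the caching behaviour (the argument values below are chosen arbitrarily), calling the function twice with the same arguments should train only once; the second call is served from the joblib cache:

# First call trains and caches the accuracy; the repeated call returns it from the cache.
print(bagging(n_estimators=10, max_depth=5))
print(bagging(n_estimators=10, max_depth=5))  # no re-training here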

To run bagging in parallel for different choices of parameters, we can use the following tools from joblib:

from joblib import Parallel, delayed

The following will divide the work into 4 jobs to run in parallel:

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  47 out of  54 | elapsed:    2.3s remaining:    0.3s
[Parallel(n_jobs=4)]: Done  54 out of  54 | elapsed:    3.0s finished
if input("execute? [Y/n] ").lower() != "n":
    BAG_results = Parallel(n_jobs=4, verbose=1)(
        delayed(bagging)(n_estimators, max_depth)
        for max_depth in max_depth_list
        for n_estimators in n_estimators_list
    )

To present the result nicely in a DataFrame:

BAG_df = tabulate(BAG_results)
display.display(BAG_df)

Although the above calls bagging again for the same combinations of arguments, the cached results are returned without re-training the classifiers. You can clear the cache with the following:

bagging.clear()

It is helpful to save the DataFrame to a file of our choice. This can be done using joblib.dump:

from joblib import dump
if input("(over-)write file? [Y/n] ").lower() != "n":
    dump(BAG_df, "BAG_df.gz")

Unlike the cache, the dumped file can be loaded anywhere, even outside this notebook:

from joblib import load
BAG_df = load("BAG_df.gz")

To remove the file:

os.remove("BAG_df.gz")

To plot the DataFrame:

plt.figure()
plot(BAG_df)
plt.title(r"Bagging decision trees")
plt.show()

YOUR ANSWER HERE

To apply the ensemble method using python-weka-wrapper instead of scikit-learn:

REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ["-L", "5"]  # -L: maximum tree depth
BAG_weka = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging")
BAG_weka.options = ["-I", "10", "-S", "1"]  # -I: number of iterations (trees), -S: random seed
BAG_weka.classifier = REPTree
BAG_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(BAG_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

The base classifiers for Bagging are trained using REPTree, which is a fast decision tree induction algorithm that is neither C4.5 nor CART.
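
To see the effect of the -L (maximum depth) option on its own, here is a minimal sketch that builds and evaluates a single depth-limited REPTree without Bagging; it reuses the trainset and testset loaded above and is only for illustration.

# A minimal sketch: a single depth-limited REPTree, outside of Bagging.
single_tree = Classifier(classname="weka.classifiers.trees.REPTree")
single_tree.options = ["-L", "5"]  # -L: maximum tree depth
single_tree.build_classifier(trainset)
evl_single = Evaluation(testset)
evl_single.test_model(single_tree, testset)
print(f"Single REPTree accuracy: {evl_single.percent_correct / 100:.4g}")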

max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

if input("execute? [Y/n] ").lower() != "n":
    BAG_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    BAG_weka_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()

    display.display(BAG_weka_df.round(4))

    plt.figure()
    plot(BAG_weka_df)
    plt.title(r"Bagging decision trees")
    plt.show()

    dump(BAG_weka_df, "BAG_weka_df.gz")

Random Forest

Another ensemble method, the random forest, is similar to Bagging decision trees. However, it further diversifies the base classifiers by randomly selecting (or combining) features when growing each tree. The following trains a random forest of 10 decision trees with a maximum depth of 5.

RF = ensemble.RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)
RF.fit(X_train, Y_train)
print(f"Accuracy: {RF.score(X_test, Y_test):.4g}")
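
In scikit-learn, the per-split feature subsampling that distinguishes a random forest from plain Bagging is controlled by the max_features parameter. The following minimal sketch (with parameter values chosen only for illustration) compares subsampling the square root of the number of features against considering all features at each split.

# A minimal sketch: the effect of per-split feature subsampling.
for max_features in ["sqrt", None]:  # None means all features are considered at each split
    rf = ensemble.RandomForestClassifier(
        max_depth=5, n_estimators=10, max_features=max_features, random_state=0
    )
    rf.fit(X_train, Y_train)
    print(f"max_features={max_features}: accuracy {rf.score(X_test, Y_test):.4g}")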

Like Bagging, we can also parallelize the training and classification by setting the n_jobs parameter.

max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]

if input("execute? [Y/n] ").lower() != "n":
    RF_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    RF_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(RF_df.round(4))

    plt.figure()
    plot(RF_df)
    plt.title(r"Random forest")
    plt.show()

    dump(RF_df, "RF_df.gz")

To train a random forest of 10 decision trees with a maximum depth of 5 using python-weka-wrapper:

RF_weka = Classifier(classname="weka.classifiers.trees.RandomForest")
RF_weka.options = ["-I", "10", "-depth", "5", "-S", "1"]  # -I: number of trees, -depth: maximum depth, -S: random seed
RF_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(RF_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]

if input("execute? [Y/n] ").lower() != "n":
    RF_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    RF_weka_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(RF_weka_df.round(4))

    plt.figure()
    plot(RF_weka_df)
    plt.title(r"Random forest")
    plt.show()

    dump(RF_weka_df, "RF_weka_df.gz")

YOUR ANSWER HERE

For sklearn, we can tune the parameters of a classification algorithm using GridSearchCV imported as follows:

from sklearn.model_selection import GridSearchCV

For instance, to tune n_estimators by searching for the best value from n_estimators_list that maximizes the cross-validated accuracy on the training set:

if input("execute? [Y/n] ").lower() != "n":
    max_depth_list = [1, 2, 3, 5, 10, 20]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]
    param_grid = {"n_estimators": n_estimators_list, "max_depth": max_depth_list}

    grid_search = GridSearchCV(
        ensemble.RandomForestClassifier(random_state=0), param_grid, verbose=1, n_jobs=4
    )

    grid_search.fit(X_train, Y_train)

    print(f"Accuracy: {grid_search.score(X_test, Y_test):.4g}")
    print(f"Best parameters: {grid_search.best_params_}")
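
To inspect how every parameter combination fared during cross-validation, not just the best one, the cv_results_ attribute can be viewed as a DataFrame. A minimal sketch, assuming the grid-search cell above was executed so that grid_search exists:

# A minimal sketch: tabulate the cross-validated score of each parameter combination.
cv_df = pd.DataFrame(grid_search.cv_results_)
cols = ["param_max_depth", "param_n_estimators", "mean_test_score", "rank_test_score"]
display.display(cv_df[cols].sort_values("rank_test_score").head(10))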

YOUR ANSWER HERE

AdaBoost

Using AdaBoost, we can boost the performance by adding base classifiers one by one, each trained to correct the errors made by the previously trained classifiers. To train AdaBoost with 10 decision trees of a maximum depth of 5:

ADB = ensemble.AdaBoostClassifier(
    estimator=tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0,
    algorithm='SAMME',
)
ADB.fit(X_train, Y_train)
print(f"Accuracy: {ADB.score(X_test, Y_test):.4g}")
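
Because boosting adds base classifiers one at a time, scikit-learn's staged_score makes it easy to watch the test accuracy evolve as each tree is added. A minimal sketch using the AdaBoost model fitted above:

# A minimal sketch: test accuracy after each boosting iteration of the fitted ADB.
for i, acc in enumerate(ADB.staged_score(X_test, Y_test), start=1):
    print(f"after {i:2d} trees: accuracy {acc:.4g}")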

YOUR ANSWER HERE

max_depth_list = [1, 2, 3, 5, 10]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

if input("execute? [Y/n] ").lower() != "n":
    ADB_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    ADB_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(ADB_df.round(4))

    plt.figure()
    plot(ADB_df)
    plt.title(r"Adaboost decision trees")
    plt.show()

    dump(ADB_df, "ADB_df.gz")

To train AdaBoost with 10 decision trees of maximum depth 5 using python-weka-wrapper:

REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ["-L", "5"]  # -L: maximum tree depth
ADB_weka = SingleClassifierEnhancer(classname="weka.classifiers.meta.AdaBoostM1")
ADB_weka.options = ["-I", "10", "-S", "1"]  # -I: number of iterations, -S: random seed
ADB_weka.classifier = REPTree
ADB_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(ADB_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

Weka uses the multi-class extension AdaBoostM1, which is different from AdaBoost-SAMME.

if input("execute? [Y/n] ").lower() != "n":
    max_depth_list = [1, 2, 3, 5, 10]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
    ADB_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    ADB_weka_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(ADB_weka_df.round(4))

    plt.figure()
    plot(ADB_weka_df)
    plt.title(r"Adaboost decision trees")
    plt.show()

    dump(ADB_weka_df, "ADB_weka_df.gz")

YOUR ANSWER HERE

Challenge

Train your own classifier to achieve the highest possible accuracy. You may:

  • choose different classification algorithms or ensemble methods such as Bagging, Stacking, Voting, and XGBoost.
  • tune the hyper-parameters manually or automatically using GridSearchCV in scikit-learn or CVParameterSelection in Weka.

Post your model and results on Canvas to compete with others. To include your code in this notebook without it consuming excessive time or memory whenever the notebook is rerun, put it inside the body of the conditional if input('execute? [Y/n] ').lower() != 'n':.
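
As one example of the ensemble methods listed above, the following is a minimal sketch of stacking with scikit-learn. The choice of base learners and meta-learner here is arbitrary rather than a recommendation, and it assumes the data-loading cells above have been run so that X_train, Y_train, X_test, and Y_test exist.

if input("execute? [Y/n] ").lower() != "n":
    # A minimal stacking sketch: heterogeneous base learners combined by a
    # logistic-regression meta-learner trained on their cross-validated predictions.
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    STACK = ensemble.StackingClassifier(
        estimators=[
            ("rf", ensemble.RandomForestClassifier(n_estimators=100, random_state=0)),
            ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        n_jobs=4,
    )
    STACK.fit(X_train, Y_train)
    print(f"Accuracy: {STACK.score(X_test, Y_test):.4g}")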

The following is an example using XGBoost with its default parameters.

if input("execute? [Y/n] ").lower() != "n":
    # NOTE: Restart the kernel after installing xgboost
    %pip install xgboost
    import xgboost
    # XGBClassifier requires integer class labels, so encode the string labels
    # consistently across the training and test sets, then split them back.
    codes, uniques = pd.concat([Y_train, Y_test]).factorize()
    Y_train_codes, Y_test_codes = codes[:len(Y_train)], codes[len(Y_train):]
    XGB = xgboost.XGBClassifier(n_jobs=1, use_label_encoder=False)
    XGB.fit(X_train, Y_train_codes)
    print(f"Accuracy: {XGB.score(X_test, Y_test_codes):.4g}")