CS5483

import logging
import pprint

import numpy as np
import weka.core.jvm as jvm
import weka.core.packages as packages
from weka.classifiers import (
    Classifier,
    Evaluation,
    FilteredClassifier,
    SingleClassifierEnhancer,
)
from weka.core.classes import Random, complete_classname
from weka.core.converters import Loader
from weka.filters import Filter

Setup¶

In this notebook, we will train classifiers properly on the skewed dataset for detecting microcalcifications in mammograms.

In particular, we will use the meta classifier ThresholdSelector and the filter SMOTE Synthetic Minority Over-sampling Technique. They need to be installed as additional packages in WEKA. To do so, we have imported packages:

import weka.core.packages as packages

packages must also be enabled for the java virtual machine:

jvm.start(packages=True, logging_level=logging.ERROR)

The following prints the information of the packages we will install:

pkgs = ["thresholdSelector", "SMOTE"]
for item in packages.all_packages():
    if item.name in pkgs:
        pprint.pp(item.metadata)

You may install the packages directly using the Weka package manager instead of downloading the zip files. To install them in python-weka-wrapper, run the following code:

for pkg in pkgs:
    if not packages.is_installed(pkg):
        print(f"Installing {pkg}...")
        packages.install_package(pkg)
    else:
        print(f"Skipping {pkg}, already installed. ")
else:
    print("Done.")

The first time you run the above cell, you should see

Installing thresholdSelector...
Installing SMOTE...
Done.

The next time you run the cell, you should see

Skipping thresholdSelector, already installed. 
Skipping SMOTE, already installed. 
Done.

because the packages have already been installed.

By default, packages are installed under your home directory ~/wekafiles/packages/:

!ls ~/wekafiles/packages

After restarting the kernel, check that the packages have been successfully installed using complete_classname imported by

from weka.core.classes import complete_classname

print(complete_classname("ThresholdSelector"))
print(complete_classname("SMOTE"))
print(packages.installed_packages())

We will use the same mammography dataset from OpenML and J48 as the base classifier. The following loads the dataset into the notebook:

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url("https://www.openml.org/data/download/52214/phpn1jVwe")
data.class_is_last()
pos_class = 1
clf = Classifier(classname="weka.classifiers.trees.J48")

Threshold Selector¶

The meta classifier ThresholdSelector uses the threshold-moving technique to optimize a performance measure you specify, which can be the precision, recall, $F$ -score, etc. See an explanation of the threshold moving technique here.

The following shows how to maximize recall:

tsc = SingleClassifierEnhancer(classname="weka.classifiers.meta.ThresholdSelector")
tsc.options = ["-M", "RECALL"]
tsc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(tsc, data, 10, Random(1))

print(f"maximum recall: {evl.recall(pos_class):.3g}")

The maximum recall is 100%, as expected by setting the threshold to 1.

Exercise 1

Using J48 as the base classifier and 10-fold cross-validation, obtain the highest precision and F-score. Assign the values to max_precision and max_f, respectively.

If you use python-weka-wrapper, be careful that resetting tsc.options may also reset the base classifier to the default one, which is not J48. To ensure that you are using J48, set the base classifier again after the options:

tsc.options=['-M', ___]
tsc.classifier = clf

# YOUR CODE HERE
raise NotImplementedError()
max_precision, max_f

Cost-sensitive Classifier¶

Weka provides a convenient interface for cost/benefit analysis:

In the explorer interface, train J48 on the mammography dataset with 10-fold cross-validation.
Right-click on the result in the result list.
Choose Cost/Benefit analysis and 1 as the positive class value.
Specify the cost matrix.
Click Minimize Cost/Benefit to minimize the cost.

Exercise 2

Assign to cost_matrix the cost matrix that achieves the maximum precision. You can define the cost matrix as follows:

cost_matrix = np.array([[__, __],
                        [__, __]])

Hint

Pay attention to the row and column labels of the confusion matrix. It changes after you specify 1 as the positive class value.

# YOUR CODE HERE
raise NotImplementedError()
cost_matrix

The following test cell demonstrates how to train a meta classifier to minimize the cost defined using the cost matrix you provided.

# tests
csc = SingleClassifierEnhancer(
    classname="weka.classifiers.meta.CostSensitiveClassifier",
    options=[
        "-cost-matrix",
        "["
        + " ; ".join(
            " ".join(str(entry) for entry in cost_matrix[:, i]) for i in range(2)
        )
        + "]",
        "-S",
        "1",
    ],
)
csc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(csc, data, 10, Random(1))

precision = evl.precision(pos_class)
print(f"maximum precision: {precision:.3g}")

SMOTE¶

Synthetic Minority Over-sampling TEchnique (SMOTE) is a filter that up-samples the minority class. Instead of duplicates of the same instance, it creates new samples as convex combinations of existing ones. See a more detailed explanation of SMOTE here.

Exercise 3

Using the FilteredClassifier with J48 as the classifier and SMOTE as the filter, try to tweak the setting of SMOTE to give the highest possible value of $F$ score larger than the maximum one achieved by ThresholdSelector. Assign to smote.options your choice of filter. E.g., you can change the percentage of SMOTE instances to 150% as follows:

smote.options = ['-P', '150']

smote = Filter(classname="weka.filters.supervised.instance.SMOTE")
print("Default smote.options:", smote.options)
# YOUR CODE HERE
raise NotImplementedError()
print("Your smote.options:", smote.options)

# tests
fc = FilteredClassifier()
fc.filter = smote
fc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))

f_score = evl.f_measure(pos_class)
print(f"F-score by SMOTE: {f_score:.3g}")

References¶

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. 10.1613/jair.953

Tutorial 5

Evaluation for Skewed Dataset

Tutorial 6