Frequent-Pattern Analysis

City University of Hong Kong
import logging
import numpy as np
import weka.core.jvm as jvm
from weka.associations import Associator
from weka.core.converters import Loader

jvm.start(logging_level=logging.ERROR)

Association Rule Mining using Weka

We will conduct a market-basket analysis on the supermarket dataset in Weka.

Transaction data

Each instance of the dataset is a transaction, i.e., a customer’s purchase of items in a supermarket. The dataset can be represented as follows:

Using the Explorer interface, load the supermarket.arff dataset in Weka.

Note that most attributes contain only one possible value, namely t. Click the Edit... button to open the data editor. Observe that most attributes have missing values.

In supermarket.arff:

  • Each attribute specified by @attribute can be a product category, a department, or a product with one possible value t:
...
@attribute 'grocery misc' { t}
@attribute 'department11' { t}
@attribute 'baby needs' { t}
@attribute 'bread and cake' { t}
...
  • The last attribute 'total' has two possible values {low, high}:
@attribute 'total' { low, high} % low < 100
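
As a rough sketch of how such rows can be read as transactions, the snippet below maps each row to the set of items it contains. The attribute names come from the excerpt above, but the rows themselves are hypothetical, and the helper row_to_transaction is made up for illustration. It assumes that "?" (the ARFF symbol for a missing value) means the item is absent from the basket.

```python
# Attribute names taken from the excerpt of supermarket.arff above.
attributes = ["grocery misc", "department11", "baby needs", "bread and cake", "total"]

# Hypothetical rows (made up for illustration): "t" marks a purchase and
# "?" is a missing value, interpreted here as "item not in the basket".
rows = [
    ["t", "?", "?", "t", "high"],
    ["?", "?", "t", "t", "low"],
    ["?", "?", "?", "?", "low"],
]


def row_to_transaction(row, names):
    """Return the set of items present in a row, dropping missing values."""
    items = set()
    for name, value in zip(names, row):
        if value == "?":
            continue  # missing value: skip this attribute
        # Single-valued attributes become plain items; 'total' keeps its value.
        items.add(name if value == "t" else f"{name}={value}")
    return items


transactions = [row_to_transaction(r, attributes) for r in rows]
for t in transactions:
    print(sorted(t))
```

Under this reading, a row with many missing values is simply a small basket, which is why the choice of how missing values are treated matters for the rules that are found.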

To understand the dataset further:

  1. Select the Associate tab. By default, Apriori is chosen as the Associator.
  2. Open the GenericObjectEditor and check for a parameter called treatZeroAsMissing. Hover the mouse pointer over the parameter to see more details.
  3. Run the Apriori algorithm with different choices of the parameter treatZeroAsMissing. Observe the difference in the generated rules.

YOUR ANSWER HERE

Association rule

An association rule for market-basket analysis is defined as follows:

We will use python-weka-wrapper for illustration. To load the dataset:

loader = Loader(classname="weka.core.converters.ArffLoader")
weka_data_path = (
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
)
dataset = loader.load_url(
    weka_data_path + "supermarket.arff"
)  # use load_file to load from file instead

To apply the Apriori algorithm with the default settings:

from weka.associations import Associator
apriori = Associator(classname="weka.associations.Apriori")
apriori.build_associations(dataset)
apriori

YOUR ANSWER HERE

To retrieve the rules as a list and print the first rule:

rules = list(apriori.association_rules())
rules[0]

To obtain the set $A$ (the premise) and the set $B$ (the consequence):

rules[0].premise, rules[0].consequence
premise_support = rules[0].premise_support
total_support = rules[0].total_support

The Apriori algorithm returns rules with large enough support:

\begin{align}
\op{support}(A \implies B) &= \op{support}(A \cup B) := \frac{\op{count}(A \cup B)}{|D|}\quad \text{where}\\
\op{count}(A \cup B) &:= \abs{\Set{T\in D|T\supseteq A\cup B}}.
\end{align}

Support is the fraction of transactions that contain all the items in both $A$ and $B$.
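
To make the definition concrete, here is a minimal sketch in plain Python that computes the support count and the support on a toy database. The transactions and the helper names count and support are made up for illustration; they are not taken from supermarket.arff or from python-weka-wrapper.

```python
# Toy transaction database D (made-up data for illustration only).
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def count(itemset, transactions):
    """Number of transactions T in the database that are supersets of itemset."""
    return sum(1 for T in transactions if T >= itemset)


def support(A, B, transactions):
    """Support of the rule A => B, i.e., count(A | B) divided by |D|."""
    return count(A | B, transactions) / len(transactions)


A, B = {"bread"}, {"milk"}
print(count(A | B, D))   # 2 transactions contain both bread and milk
print(support(A, B, D))  # 2 / 4 = 0.5
```

Note that the support depends only on the union of $A$ and $B$, so the rules bread ⟹ milk and milk ⟹ bread have the same support.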

For the first rule, the number 723 at the end of the rule corresponds to the total support count $\op{count}(A\cup B)$.

# YOUR CODE HERE
raise NotImplementedError()
support

<conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.35) printed after the first rule indicates that

  • confidence is used for ranking the rules and
  • the rule has a confidence of 0.92.

By default, the rules are ranked by confidence, which is defined as follows:
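
As a minimal sketch, assuming the standard definition confidence(A ⟹ B) = count(A ∪ B) / count(A), i.e., the fraction of transactions containing $A$ that also contain $B$, the computation on a toy database looks as follows. The data and the helper names are made up for illustration.

```python
# Toy transaction database D (made-up data for illustration only).
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for T in transactions if T >= itemset)


def confidence(A, B, transactions):
    """Assumed standard definition: count(A | B) / count(A)."""
    return count(A | B, transactions) / count(A, transactions)


A, B = {"bread"}, {"milk"}
conf_ab = confidence(A, B, D)  # 2 of the 3 bread transactions contain milk
conf_ba = confidence(B, A, D)  # 2 of the 3 milk transactions contain bread
print(f"{conf_ab:.3g}, {conf_ba:.3g}")
```

Unlike support, confidence is not symmetric in general: swapping premise and consequence changes the denominator. The two values coincide here only because bread and milk happen to appear in the same number of toy transactions.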

In python-weka-wrapper, we can print different metrics as follows:

for n, v in zip(rules[0].metric_names, rules[0].metric_values):
    print(f"{n}: {v:.3g}")
# YOUR CODE HERE
raise NotImplementedError()
premise_support

Lift is another rule quality measure defined as follows:

apriori_lift = Associator(classname="weka.associations.Apriori", options=['-T', '1'])
...

where the value 1 corresponds to Lift.
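
For intuition, the following sketch computes lift on a toy database, assuming the standard definition lift(A ⟹ B) = confidence(A ⟹ B) / support(B). The data and the helper names are made up for illustration and are independent of python-weka-wrapper.

```python
# Toy transaction database D (made-up data for illustration only).
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for T in transactions if T >= itemset)


def lift(A, B, transactions):
    """Assumed standard definition: confidence(A => B) / support(B)."""
    confidence = count(A | B, transactions) / count(A, transactions)
    support_b = count(B, transactions) / len(transactions)
    return confidence / support_b


lift_ab = lift({"bread"}, {"milk"}, D)  # (2/3) / (3/4) = 8/9
print(f"{lift_ab:.3g}")
```

A lift below 1, as in this toy case (about 0.889), suggests that the premise makes the consequence slightly less likely than its baseline frequency, while a lift above 1 suggests a positive association.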

# YOUR CODE HERE
raise NotImplementedError()
lift

YOUR ANSWER HERE

YOUR ANSWER HERE