An Overview of Association Rules

Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. Its goal is to identify strong rules in a database using different measures of interestingness. The documentation of the “arules” R package defines it as follows:

—————————————–

Mining frequent itemsets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro (1991) describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal, Imielinski, and Swami (1993) introduced the problem of mining association rules from transaction data as follows:

I = {i1,i2,…,in}   -> set of n binary attributes called items

D = {t1,t2,…,tm}  -> set of transactions called the database

Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X⇒Y where X,Y ⊆I and X∩Y =∅. The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and the small database below contains five transactions over these items:

1 – milk, bread

2 – bread, butter

3 – beer

4 – milk, bread, butter

5 – bread, butter

An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter.

—————————————–
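The measures of interestingness mentioned above can be made concrete on the supermarket example. The following base R sketch computes support, confidence, and lift for {milk, bread} ⇒ {butter} using the standard definitions: support is the fraction of transactions containing an itemset, confidence is support(X ∪ Y) / support(X), and lift is confidence / support(Y). The helper `itemset_support` and the variable names are illustrative only and are not part of “arules”.

```r
# Toy database from the text: one character vector of items per transaction.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# Fraction of transactions that contain every item of `itemset`
# (hypothetical helper, written for this example).
itemset_support <- function(itemset, db) {
  mean(vapply(db, function(trans) all(itemset %in% trans), logical(1)))
}

lhs <- c("milk", "bread")
rhs <- "butter"

supp_rule <- itemset_support(c(lhs, rhs), transactions)    # support(X and Y together)
conf_rule <- supp_rule / itemset_support(lhs, transactions) # support(X and Y) / support(X)
lift_rule <- conf_rule / itemset_support(rhs, transactions) # confidence / support(Y)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

On this database the rule has support 0.2 (one transaction out of five), confidence 0.5 (one of the two transactions with milk and bread also contains butter), and a lift of about 0.83, so butter is slightly less likely in baskets with milk and bread than in baskets overall.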

Below is a summary (see complete code here) of what you can do with association rules in R using the packages “arules” and “arulesViz”:

library(arules)

##### Epub example #####

data("Epub")

Epub

transactions in sparse format with 15729 transactions (rows) and 936 items (columns)

summary(Epub)

transactions as itemMatrix in sparse format with 15729 rows (elements/itemsets/transactions) and 936 columns (items) and a density of 0.001758755

includes extended item information - examples:
   labels
1 doc_11d
2 doc_13d
3 doc_14c

includes extended transaction information - examples:
      transactionID           TimeStamp
10792  session_4795 2003-01-02 02:59:00

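# size() gives the number of items per transaction; Epub2003 (the 2003 subset) must be built first
year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
Epub2003 <- Epub[year == "2003"]
transactionInfo(Epub2003[size(Epub2003) > 20])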

      transactionID           TimeStamp
11092  session_56e2 2003-04-29 19:30:38
11371  session_6308 2003-08-18 00:16:12

##### Arules Viz #####

library(arulesViz)
data("Groceries")
summary(Groceries)

transactions as itemMatrix in sparse format with 9835 rows (elements/itemsets/transactions) and 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055
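The density reported by summary() is the fraction of non-zero cells in the transaction-by-item matrix, so multiplying it by the number of items gives the mean number of items per transaction. A quick arithmetic check in base R, using only the figures printed in the two summaries (variable names are illustrative):

```r
# Figures copied from the printed summaries of Epub and Groceries.
epub_items <- 936
epub_density <- 0.001758755
groc_items <- 169
groc_density <- 0.02609146

# density = (non-zero cells) / (rows * columns), hence
# mean items per transaction = density * number of items (columns).
epub_mean_size <- epub_density * epub_items
groc_mean_size <- groc_density * groc_items

print(round(c(Epub = epub_mean_size, Groceries = groc_mean_size), 2))
```

This works out to roughly 1.65 items per Epub transaction and 4.41 items per Groceries transaction, which shows how sparse both data sets are.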

# Mining association rules using the Apriori algorithm
rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
rules

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.5    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

# Top three rules with respect to the lift measure
inspect(head(sort(rules, by = "lift"), 3))

  lhs                              rhs              support     confidence lift
1 {Instant food products, soda} => {hamburger meat} 0.001220132 0.6315789  18.99565
2 {soda, popcorn}               => {salty snack}    0.001220132 0.6315789  16.69779
3 {flour, baking powder}        => {sugar}          0.001016777 0.5555556  16.40807
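The three quality columns are linked: support is the share of transactions containing both sides of the rule, confidence is that share divided by the support of the LHS, and lift is confidence divided by the support of the RHS. As a rough check, the base R sketch below recovers the underlying counts for the first rule from the printed values and the 9835 Groceries transactions (variable names are illustrative):

```r
n_transactions <- 9835         # rows reported by summary(Groceries)
# First rule: {Instant food products, soda} => {hamburger meat}
rule_support <- 0.001220132
rule_conf <- 0.6315789
rule_lift <- 18.99565

# Transactions containing LHS and RHS together.
rule_count <- round(rule_support * n_transactions)
# Transactions containing the LHS: confidence = rule_count / lhs_count.
lhs_count <- round(rule_count / rule_conf)
# Support of the RHS alone: lift = confidence / support(RHS).
rhs_support <- rule_conf / rule_lift

print(c(rule_count = rule_count, lhs_count = lhs_count,
        rhs_support = round(rhs_support, 4)))
```

The rule therefore rests on only 12 baskets out of 9835: 19 contain the LHS, 12 of those also contain hamburger meat, and hamburger meat appears in about 3.3% of all transactions, which is why the lift is so high. With a minimum support of 0.001, rules backed by so few transactions should be read with some caution.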

# Plotting rules
# General form: plot(x, method = NULL, measure = "support", shading = "lift", interactive = FALSE, data)
plot(rules)

plot(rules, measure = c("support", "lift"), shading = "confidence")

plot(rules, shading = "order", control = list(main = "Two-key plot"))

# Interactive plotting

sel <- plot(rules, measure = c("support", "lift"), shading = "confidence", interactive = TRUE)

# Matrix based visualizations
subrules <- rules[quality(rules)$confidence > 0.8]
subrules
plot(subrules, method = "matrix", measure = "lift")

plot(subrules, method = "matrix", measure = "lift", control = list(reorder = TRUE))

# Grouped matrix based visualizations
plot(rules, method = "grouped")

# Graph based visualizations

subrules2 <- head(sort(rules, by = "lift"), 10)
plot(subrules2, method = "graph")

plot(subrules2, method = "graph", control = list(type = "items"))

# Export graph as graphml
saveAsGraph(head(sort(rules, by = "lift"), 1000), file = "rules.graphml")

# Parallel coordinates
plot(subrules2, method = "paracoord")

plot(subrules2, method = "paracoord", control = list(reorder = TRUE))

# Double decker plot
oneRule <- sample(rules, 1)
inspect(oneRule)

  lhs                                            rhs          support     confidence lift
1 {other vegetables, frozen vegetables, soda} => {whole milk} 0.001626843 0.5        1.956825

plot(oneRule, method = "doubledecker", data = Groceries)