Archive

Diagrams

Back in 2013 we developed a simple tool to help Antonio Gutierrez Rubí explain the benefits of Open Data and Transparency to Spanish politicians.
The tool was called MAPO, and you can see it in action here. Several months later, I was testing some hierarchical clustering techniques on this set of text files and came up with this (as far as I remember, this is the best clustering output I have ever had). The clusters in the image simply show the sets of words that are closely linked both in the text and in the historical context.

 

transp_text_clsuter_dendogram

A simple yet clarifying view of how the Spanish population is aging: the youth index (% of people < 20 / % of people > 60) from 1991 to 2009.

Source: INE

indice_juventud

 

And the corresponding graph for total population (Y = millions):

poblacion_1991_2009

 

Plus the one for immigrants (Y = thousands):

migrants_1991_2009

As you can read in the official description, “the goal of ggvis is to make it easy to describe interactive web graphics in R”. It combines:

> a grammar of graphics from ggplot2,

> reactive programming from shiny, and

> data transformation pipelines from dplyr.

ggvis graphics are rendered with vega, so you can generate both raster graphics with HTML5 canvas and vector graphics with svg. ggvis is less flexible than raw d3 or vega, but is much more succinct and is tailored to the needs of exploratory data analysis.

Find here a simple script with ggvis basics.
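As a rough illustration of the three ingredients listed above, a minimal ggvis plot might look like the sketch below. It assumes the ggvis package is installed (the plot is skipped otherwise) and uses R's built-in mtcars dataset, which is my choice for illustration and not necessarily the data used in the linked script.

```r
# Minimal ggvis sketch; assumes the ggvis package is installed.
# mtcars ships with R and is used here purely for illustration.
p <- NULL
if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  # Grammar of graphics: map car weight to x and fuel economy to y,
  # then add a point layer; the steps are chained with the %>% pipe.
  p <- mtcars %>%
    ggvis(~wt, ~mpg) %>%
    layer_points()
}
# Printing `p` in an interactive session renders the plot through Vega.
```

The formula notation (~wt, ~mpg) tells ggvis to look the variables up in the data, much like aes() does in ggplot2.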

Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Quoting the definition from the “arules” R package:

—————————————–

Mining frequent itemsets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro (1991) describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal, Imielinski, and Swami (1993) introduced the problem of mining association rules from transaction data as follows:

I = {i1,i2,…,in}   -> set of n binary attributes called items

D = {t1,t2,…,tm}  -> set of transactions called the database

Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X⇒Y where X,Y ⊆I and X∩Y =∅. The sets of items (for short itemsets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items is shown below:

1 – milk, bread

2 – bread, butter

3 – beer

4 – milk, bread, butter

5 – bread, butter

An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter.

—————————————–
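To make the example rule concrete, the sketch below computes its support, confidence and lift by hand in base R on the five transactions of the quoted example. The helper `support` and the variable names are illustrative and not part of arules; lift is taken as the confidence divided by the support of the consequent.

```r
# The five supermarket transactions from the example, as plain character vectors.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# support(): fraction of transactions containing every item of the itemset.
# (Illustrative helper, not an arules function.)
support <- function(itemset) {
  hits <- vapply(transactions, function(tr) all(itemset %in% tr), logical(1))
  sum(hits) / length(transactions)
}

lhs <- c("milk", "bread")
rhs <- "butter"

rule_support    <- support(c(lhs, rhs))            # 1 of 5 transactions
rule_confidence <- rule_support / support(lhs)     # share of LHS baskets that also hold RHS
rule_lift       <- rule_confidence / support(rhs)  # confidence relative to RHS frequency

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

Here the rule has support 0.2 and confidence 0.5, and its lift is below 1: in this tiny database butter is bought slightly less often together with milk and bread than its overall frequency would suggest, so the rule is illustrative, not a strong one.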

Below is a summary (see complete code here) of what you can do with R and association rules using the packages “arules” and “arulesViz”:

library(arules)

##### Epub example #####

data("Epub")

Epub

transactions in sparse format with 15729 transactions (rows) and 936 items (columns)

summary(Epub)

transactions as itemMatrix in sparse format with 15729 rows (elements/itemsets/transactions) and 936 columns (items) and a density of 0.001758755

includes extended item information - examples:
   labels
1 doc_11d
2 doc_13d
3 doc_14c

includes extended transaction information - examples:
      transactionID           TimeStamp
10792  session_4795 2003-01-02 02:59:00

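# size() gives the number of items per transaction
# Epub2003 is assumed here to be the subset of transactions from 2003
year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
Epub2003 <- Epub[year == "2003"]
transactionInfo(Epub2003[size(Epub2003) > 20])

      transactionID           TimeStamp
11092  session_56e2 2003-04-29 19:30:38
11371  session_6308 2003-08-18 00:16:12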

##### Arules Viz #####

library(arulesViz)
data("Groceries")
summary(Groceries)

transactions as itemMatrix in sparse format with 9835 rows (elements/itemsets/transactions) and 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

# Mining association rules using the Apriori algorithm
rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
rules

 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.5    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
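The minimum support passed to apriori() is easier to interpret on a toy example. The sketch below enumerates every itemset of the small supermarket database quoted earlier and keeps those that reach a minimum support. It is a brute-force enumeration in base R for illustration only, not the Apriori algorithm itself (which prunes candidates level by level), and the names `support_of`, `min_support` and `frequent` are my own.

```r
# Toy supermarket database from the quoted arules example.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)
items <- sort(unique(unlist(transactions)))

# Fraction of transactions that contain all items of the itemset.
support_of <- function(itemset) {
  hits <- vapply(transactions, function(tr) all(itemset %in% tr), logical(1))
  sum(hits) / length(transactions)
}

min_support <- 0.4          # hypothetical threshold for this toy data
frequent <- numeric(0)      # named vector: itemset label -> support

# Brute force: try every itemset of size 1, 2, ... up to all items.
for (k in seq_along(items)) {
  for (candidate in combn(items, k, simplify = FALSE)) {
    s <- support_of(candidate)
    if (s >= min_support) {
      frequent[paste(candidate, collapse = ", ")] <- s
    }
  }
}

print(frequent)
```

With a threshold of 0.4 only the single items bread, butter and milk and the pairs {bread, butter} and {bread, milk} survive. On real data the number of candidate itemsets grows exponentially, which is why apriori() relies on pruning and on a sensible minimum support.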

# Top three rules with respect to the lift measure
inspect(head(sort(rules, by = "lift"), 3))

  lhs                              rhs              support     confidence lift
1 {Instant food products, soda} => {hamburger meat} 0.001220132 0.6315789  18.99565
2 {soda, popcorn}               => {salty snack}    0.001220132 0.6315789  16.69779
3 {flour, baking powder}        => {sugar}          0.001016777 0.5555556  16.40807
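The three quality measures in this output are tied together: lift is the confidence divided by the support of the right-hand side, and support times the number of transactions gives the absolute count. As a quick sanity check on the first rule, the base R arithmetic below uses only numbers printed above (the variable names are mine).

```r
# Values copied from the first rule and from summary(Groceries).
n_transactions  <- 9835
rule_support    <- 0.001220132
rule_confidence <- 0.6315789
rule_lift       <- 18.99565

# Absolute number of transactions that contain all items of the rule.
rule_count <- rule_support * n_transactions

# lift = confidence / support(RHS), so the implied support of {hamburger meat} is:
rhs_support <- rule_confidence / rule_lift

print(round(rule_count))
print(round(rhs_support, 4))
```

The first rule is backed by about 12 transactions, and hamburger meat appears in roughly 3% of all baskets. A minimum support of 0.001 therefore means a rule needs about 10 supporting transactions out of 9835 to be reported.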

# Plotting rules
# Signature: plot(x, method = NULL, measure = "support", shading = "lift", interactive = FALSE, data)
plot(rules)

arules_1

plot(rules, measure = c("support", "lift"), shading = "confidence")

arules_2

plot(rules, shading = "order", control = list(main = "Two-key plot"))

arules_3

# Interactive plotting

sel <- plot(rules, measure = c("support", "lift"), shading = "confidence", interactive = TRUE)

arules_4

arules_4_bis

# Matrix based visualizations
subrules <- rules[quality(rules)$confidence > 0.8]
subrules
plot(subrules, method = "matrix", measure = "lift")

arules_5

plot(subrules, method = "matrix", measure = "lift", control = list(reorder = TRUE))

arules_6

# Grouped matrix based visualizations
plot(rules, method = "grouped")

arules_7

# Graph based visualizations

subrules2 <- head(sort(rules, by = "lift"), 10)
plot(subrules2, method = "graph")

arules_8

plot(subrules2, method = "graph", control = list(type = "items"))

arules_9

# Export graph as graphml
saveAsGraph(head(sort(rules, by = "lift"), 1000), file = "rules.graphml")

# Parallel coordinates
plot(subrules2, method = "paracoord")

arules_10

plot(subrules2, method = "paracoord", control = list(reorder = TRUE))

arules_11

# Double decker plot
oneRule <- sample(rules, 1)
inspect(oneRule)

  lhs                                            rhs          support     confidence lift
1 {other vegetables, frozen vegetables, soda} => {whole milk} 0.001626843 0.5        1.956825

plot(oneRule, method = "doubledecker", data = Groceries)

arules_12