# Diagrams

# Testing hierarchical clustering

Back in 2013 we developed a simple tool to help Antonio Gutierrez Rubí explain the benefits of Open Data and Transparency to Spanish politicians.

The tool was called MAPO, and you can see it in action here. Several months later, while testing some hierarchical clustering techniques on this set of text files, I came up with this (as far as I remember, it is the best clustering output I have ever had). The clusters in the image group words that are closely linked both in the text and in its historical context.
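As a minimal sketch of this kind of analysis (not the original MAPO code), words can be clustered hierarchically in base R with `dist()` and `hclust()`. The small term-by-document count matrix below is invented for illustration, and the choice of Euclidean distance and average linkage is an assumption rather than what was used for the image.

```r
# Hypothetical term-by-document counts (rows = words, columns = documents).
# These numbers are invented for illustration; they are not the MAPO data.
term_counts <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    0, 1, 6, 5,
    0, 0, 5, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("transparency", "open_data", "budget", "taxes"),
    paste0("doc", 1:4)
  )
)

# Distance matrix between words (Euclidean distance is an assumption).
word_dist <- dist(term_counts, method = "euclidean")

# Agglomerative hierarchical clustering with average linkage.
word_tree <- hclust(word_dist, method = "average")

# Cut the dendrogram into two clusters and show which word falls where.
word_groups <- cutree(word_tree, k = 2)
print(word_groups)

# plot(word_tree) would draw the dendrogram.
```

With these toy counts, words that appear in the same documents end up in the same cluster, which is the effect described above.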

# distance matrix

# Circular singularity

# Index of youth

A simple yet clarifying view of how the Spanish population is aging: the index of youth (% of people < 20 / % of people > 60) from 1991 to 2009.

Source: INE

And the corresponding graph for the total population (Y = millions):

Plus the one for immigrants (Y = thousands):

# ggvis package

As you can read in the official description, **“the goal of ggvis is to make it easy to describe interactive web graphics in R”**. It combines:

> a grammar of graphics from ggplot2,

> reactive programming from shiny, and

> data transformation pipelines from dplyr.

ggvis graphics are rendered with Vega, so you can generate both raster graphics with HTML5 canvas and vector graphics with SVG. ggvis is less flexible than raw D3 or Vega, but it is much more succinct and is tailored to the needs of exploratory data analysis.

You can find a simple script covering the ggvis basics here.
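As a minimal sketch of those basics (my own illustration with the built-in `mtcars` dataset, not the linked script), the code below touches the three ingredients listed above: a ggplot2-style grammar (`ggvis()` plus layers), dplyr-style pipes (`%>%`), and a shiny-style interactive input (`input_slider()`). The variable names `p` and `p_interactive` are arbitrary.

```r
library(ggvis)

# Static scatter plot: map wt to x and mpg to y, then add a points layer.
p <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points()

# Interactive version: a slider controls the smoothing span of the
# fitted curve drawn on top of the points.
p_interactive <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  layer_smooths(span = input_slider(0.5, 1, value = 1, label = "span"))

# Type `p` or `p_interactive` at the console of an interactive R session
# to render the plot; the interactive one needs a live session to respond.
```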

# World Bank Development indicators

# Corporate financial performance using Quantmod

Quantmod is one of the very best libraries I've seen in a long time. Especially if you like economics, it is a dream come true. The only limit is your own imagination: there are plenty of stocks to chart, analyze and maybe even invest in!

Here you can find a succinct summary of what Quantmod can do. If you prefer to go straight to the code, go here.

# World Exports from 1948 to 2012

World exports (million USD) by country from 1948 to 2012. Data: World Trade Organization

# An overview on Association Rules

Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. It aims to identify strong rules in a database using different measures of interestingness. The definition given by the “arules” R package reads:

—————————————–

Mining frequent itemsets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro (1991) describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal, Imielinski, and Swami (1993) introduced the problem of mining association rules from transaction data as follows:

I = {i1,i2,…,in} -> set of n binary attributes called items

D = {t1,t2,…,tm} -> set of transactions called the database

Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items is shown below (transaction ID, then items):

1 – milk, bread

2 – bread, butter

3 – beer

4 – milk, bread, butter

5 – bread, butter

An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter.

—————————————–
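To make the example rule concrete, the sketch below computes three common interestingness measures (support, confidence and lift) for {milk, bread} ⇒ {butter} on the five supermarket transactions, using base R only. The helper name `support()` is hypothetical (it is not the arules API), and the formulas are the standard definitions: the support of an itemset is the fraction of transactions containing it, confidence(X ⇒ Y) = support(X ∪ Y) / support(X), and lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y).

```r
# The five supermarket transactions from the example.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# Hypothetical helper (not part of arules): fraction of transactions
# that contain every item in `itemset`.
support <- function(itemset, db) {
  mean(vapply(db, function(t) all(itemset %in% t), logical(1)))
}

lhs <- c("milk", "bread")
rhs <- "butter"

# Support of the whole rule: only transaction 4 contains all three items.
rule_support <- support(c(lhs, rhs), transactions)

# Confidence: how often butter appears when milk and bread are bought.
rule_confidence <- rule_support / support(lhs, transactions)

# Lift: confidence relative to the overall frequency of butter.
rule_lift <- rule_confidence / support(rhs, transactions)

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

Here the rule has support 1/5, confidence 1/2 and lift 5/6. A lift below 1 means that, in this toy database, buying milk and bread makes butter slightly less likely than its overall frequency would suggest.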

Below is a summary (see the complete code here) of what you can do with association rules in R using the “arules” and “arulesViz” packages:

    library(arules)

    ##### Epub example #####
    data("Epub")
    Epub
    #> transactions in sparse format with
    #>  15729 transactions (rows) and
    #>  936 items (columns)

    # summary() output (abridged)
    summary(Epub)
    #> transactions as itemMatrix in sparse format with
    #>  15729 rows (elements/itemsets/transactions) and
    #>  936 columns (items) and a density of 0.001758755
    #>
    #> includes extended item information - examples:
    #>    labels
    #> 1 doc_11d
    #> 2 doc_13d
    #> 3 doc_14c
    #>
    #> includes extended transaction information - examples:
    #>       transactionID           TimeStamp
    #> 10792  session_4795 2003-01-02 02:59:00

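    # size function: transactions from 2003 containing more than 20 items
    # (Epub2003 is the 2003 subset of Epub, built from the TimeStamp column)
    year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
    Epub2003 <- Epub[year == "2003"]
    transactionInfo(Epub2003[size(Epub2003) > 20])
    #>       transactionID           TimeStamp
    #> 11092  session_56e2 2003-04-29 19:30:38
    #> 11371  session_6308 2003-08-18 00:16:12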

    ##### Arules Viz #####
    library(arulesViz)
    data("Groceries")

    # summary() output (abridged)
    summary(Groceries)
    #> transactions as itemMatrix in sparse format with
    #>  9835 rows (elements/itemsets/transactions) and
    #>  169 columns (items) and a density of 0.02609146
    #>
    #> most frequent items:
    #>       whole milk other vegetables       rolls/buns             soda
    #>             2513             1903             1809             1715
    #>           yogurt          (Other)
    #>             1372            34055

    # Mining association rules using the Apriori algorithm
    # (apriori() prints the parameter and control settings while it runs)
    rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))
    #> parameter specification:
    #>  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
    #>         0.5    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE
    #>
    #> algorithmic control:
    #>  filter tree heap memopt load sort verbose
    #>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE

    # Print a short description of the mined rule set
    rules

    # Top three rules with respect to the lift measure
    inspect(head(sort(rules, by = "lift"), 3))
    #>   lhs                              rhs              support     confidence lift
    #> 1 {Instant food products, soda} => {hamburger meat} 0.001220132 0.6315789  18.99565
    #> 2 {soda, popcorn}               => {salty snack}    0.001220132 0.6315789  16.69779
    #> 3 {flour, baking powder}        => {sugar}          0.001016777 0.5555556  16.40807

    # Plotting rules
    # General form of the plot method (signature, not a call to run):
    #   plot(x, method = NULL, measure = "support", shading = "lift",
    #        interactive = FALSE, data)
    plot(rules)
    plot(rules, measure = c("support", "lift"), shading = "confidence")
    plot(rules, shading = "order", control = list(main = "Two-key plot"))

    # Interactive plotting
    sel <- plot(rules, measure = c("support", "lift"), shading = "confidence",
                interactive = TRUE)

    # Matrix based visualizations
    subrules <- rules[quality(rules)$confidence > 0.8]
    subrules
    plot(subrules, method = "matrix", measure = "lift")
    plot(subrules, method = "matrix", measure = "lift", control = list(reorder = TRUE))

    # Grouped matrix based visualizations
    plot(rules, method = "grouped")

    # Graph based visualizations
    subrules2 <- head(sort(rules, by = "lift"), 10)
    plot(subrules2, method = "graph")
    plot(subrules2, method = "graph", control = list(type = "items"))

    # Export graph as graphml
    saveAsGraph(head(sort(rules, by = "lift"), 1000), file = "rules.graphml")

    # Parallel coordinates
    plot(subrules2, method = "paracoord")
    plot(subrules2, method = "paracoord", control = list(reorder = TRUE))

    # Double decker plot
    # (sample() draws one rule at random, so the rule shown will vary)
    oneRule <- sample(rules, 1)
    inspect(oneRule)
    #>   lhs                                            rhs          support     confidence lift
    #> 1 {other vegetables, frozen vegetables, soda} => {whole milk} 0.001626843 0.5        1.956825
    plot(oneRule, method = "doubledecker", data = Groceries)