Introduction to this chapter

This chapter covers data analysis of public bike sharing systems around the world. It is the first of a few chapters covering different modes of transport, such as airplanes, buses and cars. The idea of this project is to apply data analysis techniques to real-world data sets. In this first chapter I will show a full data analysis cycle, starting with gathering data and ending with a prediction model of public bike sharing system capacity in London.

This document is written in Markdown and then rendered to HTML for publication on the Internet. The workflow is split into three parts: gathering, storing and modeling data.

Get Public Bike Sharing Systems

The following code comes from get_citybik.R. To gather data, I will use the CityBikes API (www.citybik.es), which gives real-time access to data from hundreds of public bike sharing systems around the world.

1.- Get list of all systems involved

First we need to set our current working directory. Remember to change this directory if you are working with these data on your own machine.

# 1.-GOAL: Get list of bikes systems and complete with full URL 

# Load jsonlite package
library(jsonlite)

# Set working directory (Remember to change this directory on demand)
#setwd("/home/albertogonzalez/Desktop/TDA/chapters/bikes")

Get a list of all networks using the networks endpoint; this tells us which bike systems are available.

# Get all networks involved
network_endpoint = "http://api.citybik.es/v2/networks"
network_endpoint_df <- jsonlite::fromJSON(network_endpoint, simplifyDataFrame = TRUE)

Extract the relevant fields from the list of all networks and add a column with the full API URL.

# Refine endpoint output
company_names = network_endpoint_df$networks
company_ids = company_names$id
company_data = company_names$location

# Build the full URL from the API base and each network's href
base_url = "http://api.citybik.es"
company_url = paste(base_url, company_names$href, sep = "")
company_df = cbind(company_ids, company_data, company_url)

This is the resulting table:

head(company_df)
##                company_ids     city country latitude longitude
## 1               opole-bike    Opole      PL 50.66450  17.92760
## 2 wroclawski-rower-miejski  Wrocław      PL 51.10970  17.04850
## 3                 veturilo Warszawa      PL 52.23450  21.00240
## 4                     bubi Budapest      HU 47.49791  19.04023
## 5                  kvb-rad     Köln      DE 50.93640   6.96053
## 6              bemowo-bike Warszawa      PL 52.26060  20.92900
##                                                  company_url
## 1               http://api.citybik.es/v2/networks/opole-bike
## 2 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 3                 http://api.citybik.es/v2/networks/veturilo
## 4                     http://api.citybik.es/v2/networks/bubi
## 5                  http://api.citybik.es/v2/networks/kvb-rad
## 6              http://api.citybik.es/v2/networks/bemowo-bike

Export a CSV file with the previous data frame, in case we need this information later.

# Export list of companies
write.csv(company_df,file = "list_of_bikes_system_worldwide.csv",row.names = FALSE)

2.- Get data from all systems involved

Now I am going to create a function to get data from all our systems.

I will start with one system and then apply the same function to a set of cities, although you could run it for the more than 400 cities included in this magnificent API.

# 2.- For each system, get data (build DB and cron job)

# Build a function to read, parse and store endpoint data
bike_input = "http://api.citybik.es/v2/networks/wroclawski-rower-miejski"
# Read one system
bike_input_df = jsonlite::fromJSON(bike_input, simplifyDataFrame = TRUE)

bike_input_df_2 = bike_input_df$network
bike_input_df_3 = bike_input_df_2$stations

bike_df = bike_input_df_3[c(1,3:8)]
total_slots = bike_input_df_3$extra$slots
bike_df$total_slots = total_slots

The resulting table of capacity data for a given city:

head(bike_df)
##   empty_slots free_bikes                               id latitude
## 1          15          0 52515b944e4a64584def861922aa6db1 51.12528
## 2          15          0 dfe66f3e0cb8ac29b2386df9d68341d7 51.13208
## 3          13          0 0138e16d65d5db5b6e15970b73686b1a 51.10723
## 4          15          0 7daa95e8b08a9d4b1401d5d7ed71e5c7 51.10098
## 5          15          0 ae5d14252719f50d17084268b9906747 51.09804
## 6          15          0 d56c50f2935797bcffcba9f4237ded56 51.11943
##   longitude                                    name
## 1  16.98439                  Legnicka - Wejherowska
## 2  17.06550                             Pl. Kromera
## 3  17.06133 Politechnika Wrocławska - Gmach Główny 
## 4  17.00830                   Pereca - Grabiszyńska
## 5  17.00680                        Żelazna - Pereca
## 6  17.05151                    Wyszyńskiego - Prusa
##                     timestamp total_slots
## 1 2016-12-31T09:23:28.508000Z          15
## 2 2016-12-31T09:23:28.442000Z          15
## 3 2016-12-31T09:23:28.471000Z          14
## 4 2016-12-31T09:23:28.500000Z          15
## 5 2016-12-31T09:23:28.507000Z          15
## 6 2016-12-31T09:23:28.534000Z          15

Here are some of the systems included in the API:

# Get id + url
input_array = company_df[c(1,6)]
head(input_array)
##                company_ids
## 1               opole-bike
## 2 wroclawski-rower-miejski
## 3                 veturilo
## 4                     bubi
## 5                  kvb-rad
## 6              bemowo-bike
##                                                  company_url
## 1               http://api.citybik.es/v2/networks/opole-bike
## 2 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 3                 http://api.citybik.es/v2/networks/veturilo
## 4                     http://api.citybik.es/v2/networks/bubi
## 5                  http://api.citybik.es/v2/networks/kvb-rad
## 6              http://api.citybik.es/v2/networks/bemowo-bike

I will build two arrays with key information, the system id and its API URL, which are needed to build the final query.

array_id = input_array$company_ids
array_id = as.data.frame(array_id)
array_url = input_array$company_url
array_url = as.data.frame(array_url)
head(array_id)
##                   array_id
## 1               opole-bike
## 2 wroclawski-rower-miejski
## 3                 veturilo
## 4                     bubi
## 5                  kvb-rad
## 6              bemowo-bike
head(array_url)
##                                                    array_url
## 1               http://api.citybik.es/v2/networks/opole-bike
## 2 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 3                 http://api.citybik.es/v2/networks/veturilo
## 4                     http://api.citybik.es/v2/networks/bubi
## 5                  http://api.citybik.es/v2/networks/kvb-rad
## 6              http://api.citybik.es/v2/networks/bemowo-bike

As previously shown, let’s create a function to gather data from one city, and then apply that same function to more than one city.

# Create the function for one system
foo = function(url,id){
  input = as.character(url)
  read_input = jsonlite::fromJSON(input, simplifyDataFrame = TRUE)
  input_df_2 = read_input$network
  input_df_3 = input_df_2$stations
  input_df = input_df_3[c(1,3:8)]
  input_df$company_id = as.character(id)
  return (input_df)
}

Now it’s time to test the function on one city, then on a small number of cities and, ultimately, if you wish, on the whole array of cities.

# Test function for one:
aa = array_url[2, ] 
aa
## [1] http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 458 Levels: http://api.citybik.es/v2/networks/adcb-bikeshare ...
a = foo(aa,"wroclawski-rower-miejski")
head(a)
##   empty_slots free_bikes                               id latitude
## 1          15          0 52515b944e4a64584def861922aa6db1 51.12528
## 2          15          0 dfe66f3e0cb8ac29b2386df9d68341d7 51.13208
## 3          13          0 0138e16d65d5db5b6e15970b73686b1a 51.10723
## 4          15          0 7daa95e8b08a9d4b1401d5d7ed71e5c7 51.10098
## 5          15          0 ae5d14252719f50d17084268b9906747 51.09804
## 6          15          0 d56c50f2935797bcffcba9f4237ded56 51.11943
##   longitude                                    name
## 1  16.98439                  Legnicka - Wejherowska
## 2  17.06550                             Pl. Kromera
## 3  17.06133 Politechnika Wrocławska - Gmach Główny 
## 4  17.00830                   Pereca - Grabiszyńska
## 5  17.00680                        Żelazna - Pereca
## 6  17.05151                    Wyszyńskiego - Prusa
##                     timestamp               company_id
## 1 2016-12-31T09:23:28.508000Z wroclawski-rower-miejski
## 2 2016-12-31T09:23:28.442000Z wroclawski-rower-miejski
## 3 2016-12-31T09:23:28.471000Z wroclawski-rower-miejski
## 4 2016-12-31T09:23:28.500000Z wroclawski-rower-miejski
## 5 2016-12-31T09:23:28.507000Z wroclawski-rower-miejski
## 6 2016-12-31T09:23:28.534000Z wroclawski-rower-miejski

The next lines apply the foo function (get data from one system) to the 452 systems available in CityBikes.

# Apply function to all the bike systems

# u = array_url[1:452,]
# i = array_id[1:452,]
# 
# Map() pairs each URL with its own id; lapply(u, foo, i) would pass
# the whole id vector to every call
# foo_all <- Map(foo, u, i)

The chunk above shows the techniques needed to work with the API. Now I will start getting real-time data for eight European cities every hour and storing it in a SQLite database.

3.- Get data from selected systems and store them in SQLite DB

Using the function created before, I first test one city.

# Test one city.

u = "http://api.citybik.es/v2/networks/dublinbikes"
i = "dublin"

city_ind = foo(u,i)
head(city_ind)
##   empty_slots free_bikes                               id latitude
## 1          19         16 b1c774f178240d5d757854364dafc840 53.34143
## 2          28          2 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3          38          2 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4           2         28 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5          29         11 6bff19daf1747a1e956a899ef8021232 53.34071
## 6          12         23 11b359924d83d802a41856c405d13e61 53.33878
##   longitude                    name                   timestamp company_id
## 1 -6.246720           FENIAN STREET 2016-12-31T09:22:10.080000Z     dublin
## 2 -6.246154               CITY QUAY 2016-12-31T09:22:10.111000Z     dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-12-31T09:22:10.216000Z     dublin
## 4 -6.300217         BROOKFIELD ROAD 2016-12-31T09:22:10.052000Z     dublin
## 5 -6.308191              EMMET ROAD 2016-12-31T09:22:10.219000Z     dublin
## 6 -6.303950             ROTHE ABBEY 2016-12-31T09:22:10.214000Z     dublin

Then I create the SQLite database (just the first time) and store the results for future analysis.

# Create a DB (sqlite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Create DB & Table
# my_db <- src_sqlite( "bikes.sqlite3", create = TRUE)   # create DB
# copy_to( my_db, city_ind, "bikes_records_2", temporary = FALSE)   # create table

Set up a cron job to get and store data for the selected list of systems with the following code (see store_citybik.R).

The following code chunk shows the case of one city (Dublin); in store_citybik.R you can see how I get data for eight different cities in the cron job.
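As a sketch of the scheduling itself (the Rscript and script paths below are placeholders, not the ones from my machine), a crontab entry running the collector at minute 0 of every hour could look like this:

```shell
# Edit the crontab with `crontab -e` and add a line like this
# (both paths are hypothetical; point them at your Rscript and script):
0 * * * * /usr/bin/Rscript /path/to/store_citybik.R >> /path/to/citybik.log 2>&1
```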

# Libraries
library(RSQLite)
library(dplyr)   # src_sqlite comes from dplyr
#setwd("/home/albertogonzalez/Desktop/bestiario/Quadrigram/2016/monografias/bikes")

# Open DB connection
con <- dbConnect(SQLite(),dbname="bikes.sqlite3")
my_db <- src_sqlite( "bikes.sqlite3", create = FALSE)


# Function to get data
foo = function(url,id){
  input = as.character(url)
  read_input = jsonlite::fromJSON(input, simplifyDataFrame = TRUE)
  input_df_2 = read_input$network
  input_df_3 = input_df_2$stations
  input_df = input_df_3[c(1,3:8)]
  input_df$company_id = as.character(id)
  return (input_df)
}


# DUBLIN
u = "http://api.citybik.es/v2/networks/dublinbikes"
i = "dublin"

city_ind = foo(u,i)
dim(city_ind)
## [1] 101   8
head(city_ind)
##   empty_slots free_bikes                               id latitude
## 1          19         16 b1c774f178240d5d757854364dafc840 53.34143
## 2          28          2 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3          38          2 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4           2         28 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5          29         11 6bff19daf1747a1e956a899ef8021232 53.34071
## 6          12         23 11b359924d83d802a41856c405d13e61 53.33878
##   longitude                    name                   timestamp company_id
## 1 -6.246720           FENIAN STREET 2016-12-31T09:22:10.080000Z     dublin
## 2 -6.246154               CITY QUAY 2016-12-31T09:22:10.111000Z     dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-12-31T09:22:10.216000Z     dublin
## 4 -6.300217         BROOKFIELD ROAD 2016-12-31T09:22:10.052000Z     dublin
## 5 -6.308191              EMMET ROAD 2016-12-31T09:22:10.219000Z     dublin
## 6 -6.303950             ROTHE ABBEY 2016-12-31T09:22:10.214000Z     dublin

Once you have the data, you need to insert it into the database.

# Append rows to DB:
db_insert_into( con = my_db$con, table = "bikes_records_2", values = city_ind) # append rows
## [1] TRUE
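A portability note: db_insert_into() was removed from recent versions of dplyr, so the call above may fail on a modern setup. A sketch of the equivalent append using DBI directly (assuming the same database file and table as above):

```r
# Append rows using plain DBI instead of dplyr's removed db_insert_into()
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "bikes.sqlite3")
dbWriteTable(con, "bikes_records_2", city_ind, append = TRUE)
dbDisconnect(con)
```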

4.- Read DB and Model

The final part of this process comes from model_bikes.R. I have been gathering data from those eight cities for a couple of months, so I have a rich dataset (in reality, a bunch of datasets) that can be used to analyze any of those European cities.

Read DB

library(jsonlite)
library(RSQLite)
library(lubridate)
library(tidyr)
library(ggmap)
library(ggthemes)
library(dplyr)

The first thing to do is connect to the database where the data is stored and retrieve it all.

# Open connection
con <- dbConnect(SQLite(),dbname="bikes.sqlite3")
my_db <- src_sqlite( "bikes.sqlite3", create = FALSE)

# Get all data
all_data = dbGetQuery(con,"SELECT * FROM bikes_records_2")
str(all_data)
## 'data.frame':    1075613 obs. of  8 variables:
##  $ empty_slots: int  15 3 8 27 36 34 30 17 40 39 ...
##  $ free_bikes : int  20 26 32 0 4 1 0 3 0 1 ...
##  $ id         : chr  "b1c774f178240d5d757854364dafc840" "135d1f3410c5dbcc49c55a1d522dbbe6" "153ff4dfb7bd8912ef91c10849129c2e" "516ee41c56889c29b8fa5f4ac0e824f4" ...
##  $ latitude   : num  53.3 53.3 53.3 53.3 53.3 ...
##  $ longitude  : num  -6.25 -6.25 -6.25 -6.3 -6.31 ...
##  $ name       : chr  "FENIAN STREET" "CITY QUAY" "FITZWILLIAM SQUARE EAST" "BROOKFIELD ROAD" ...
##  $ timestamp  : chr  "2016-10-17T12:47:15.653000Z" "2016-10-17T12:47:15.673000Z" "2016-10-17T12:47:15.755000Z" "2016-10-17T12:47:15.625000Z" ...
##  $ company_id : chr  "dublin" "dublin" "dublin" "dublin" ...
head(all_data)
##   empty_slots free_bikes                               id latitude
## 1          15         20 b1c774f178240d5d757854364dafc840 53.34143
## 2           3         26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3           8         32 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4          27          0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5          36          4 6bff19daf1747a1e956a899ef8021232 53.34071
## 6          34          1 11b359924d83d802a41856c405d13e61 53.33878
##   longitude                    name                   timestamp company_id
## 1 -6.246720           FENIAN STREET 2016-10-17T12:47:15.653000Z     dublin
## 2 -6.246154               CITY QUAY 2016-10-17T12:47:15.673000Z     dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z     dublin
## 4 -6.300217         BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z     dublin
## 5 -6.308191              EMMET ROAD 2016-10-17T12:47:15.766000Z     dublin
## 6 -6.303950             ROTHE ABBEY 2016-10-17T12:47:15.744000Z     dublin

Let’s apply some time transformations.

# Insert date columns
#all_data$timestamp = as.Date(all_data$timestamp)
all_data$time_2 = ymd_hms(all_data$timestamp)
all_data$week_day = wday(all_data$time_2,label = TRUE)
all_data$hour = hour(all_data$time_2)
# Note: cut() intervals are right-closed, so hour 0 gets NA here;
# adding include.lowest = TRUE would keep midnight in the "night" shift
all_data$shift = cut(all_data$hour,c(0,7,15,23),c("night","morning","afternoon"))

We can define a capacity status in terms of how many bikes are being used at a given station. We label a station as overflow when there are few empty slots left to park in. As far as I have seen in other similar analyses, the thresholds for scoring a station as overflow, balanced or shortage are subjective. In this analysis, stations with 0% to 25% empty slots are scored as overflow, stations with 75% to 100% empty slots are scored as shortage, and everything in between is balanced.

# Total slots and shifts calculations
all_data$total_slots = all_data$empty_slots + all_data$free_bikes

all_data$empty_status = (all_data$empty_slots / all_data$total_slots) * 100
all_data$empty_status_label = cut(all_data$empty_status,c(0,25,75,100),c("overflow","balanced","shortage"))

all_data$empty_status_label[all_data$empty_status==0] <- "overflow"
head(all_data)
##   empty_slots free_bikes                               id latitude
## 1          15         20 b1c774f178240d5d757854364dafc840 53.34143
## 2           3         26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3           8         32 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4          27          0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5          36          4 6bff19daf1747a1e956a899ef8021232 53.34071
## 6          34          1 11b359924d83d802a41856c405d13e61 53.33878
##   longitude                    name                   timestamp company_id
## 1 -6.246720           FENIAN STREET 2016-10-17T12:47:15.653000Z     dublin
## 2 -6.246154               CITY QUAY 2016-10-17T12:47:15.673000Z     dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z     dublin
## 4 -6.300217         BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z     dublin
## 5 -6.308191              EMMET ROAD 2016-10-17T12:47:15.766000Z     dublin
## 6 -6.303950             ROTHE ABBEY 2016-10-17T12:47:15.744000Z     dublin
##                time_2 week_day hour   shift total_slots empty_status
## 1 2016-10-17 12:47:15      Mon   12 morning          35     42.85714
## 2 2016-10-17 12:47:15      Mon   12 morning          29     10.34483
## 3 2016-10-17 12:47:15      Mon   12 morning          40     20.00000
## 4 2016-10-17 12:47:15      Mon   12 morning          27    100.00000
## 5 2016-10-17 12:47:15      Mon   12 morning          40     90.00000
## 6 2016-10-17 12:47:15      Mon   12 morning          35     97.14286
##   empty_status_label
## 1           balanced
## 2           overflow
## 3           overflow
## 4           shortage
## 5           shortage
## 6           shortage

Now it’s time to focus on a single city, in this case, London.

# One City
ci = "london" # line 38 in model bikes.R

ci_df = all_data %>%
  filter(company_id == ci)

dim(ci_df)
## [1] 274291     15
ci_df = na.omit(ci_df)
dim(ci_df)
## [1] 271950     15
head(ci_df)
##   empty_slots free_bikes                               id latitude
## 1           5         11 d79e95513c3b5e1d3c78a65874054481 51.49887
## 2          20          0 090f297e79c030688a8a51755ebeafe3 51.50172
## 3           9          9 7f3020118e56165ed8b2f61899edb971 51.52916
## 4          13         22 67e6c16bce05410ba4b1f0f5000726ea 51.49961
## 5          15          8 d0ca154129878ce71261d563d8da9238 51.53006
## 6           7         19 8e74e96bcc8d74d7621f8201cee921dd 51.49313
##    longitude                                      name
## 1 -0.1374246     200178 - Buckingham Gate, Westminster
## 2 -0.1798542 001231 - Queen's Gate, Kensington Gardens
## 3 -0.1099705       001023 - River Street , Clerkenwell
## 4 -0.1975742   001018 - Phillimore Gardens, Kensington
## 5 -0.1209737  001013 - St. Chad's Street, King's Cross
## 6 -0.1568760    003420 - Sedding Street, Sloane Square
##                     timestamp company_id              time_2 week_day hour
## 1 2016-10-17T14:08:07.773000Z     london 2016-10-17 14:08:07      Mon   14
## 2 2016-10-17T14:08:06.788000Z     london 2016-10-17 14:08:06      Mon   14
## 3 2016-10-17T14:08:04.972000Z     london 2016-10-17 14:08:04      Mon   14
## 4 2016-10-17T14:08:04.980000Z     london 2016-10-17 14:08:04      Mon   14
## 5 2016-10-17T14:08:04.989000Z     london 2016-10-17 14:08:04      Mon   14
## 6 2016-10-17T14:08:04.993000Z     london 2016-10-17 14:08:04      Mon   14
##     shift total_slots empty_status empty_status_label
## 1 morning          16     31.25000           balanced
## 2 morning          20    100.00000           shortage
## 3 morning          18     50.00000           balanced
## 4 morning          35     37.14286           balanced
## 5 morning          23     65.21739           balanced
## 6 morning          26     26.92308           balanced

Mean capacity by weekday, hour and station.

## Group by weekday, hour and station - mean of used_capacity
ci_df_2 = ci_df %>%
  group_by(week_day,hour,name) %>%
  summarise(avg_used_capacity = mean(empty_status, na.rm=TRUE)) %>%
  arrange(desc(avg_used_capacity))


ci_df_ref = as.data.frame(ci_df_2)

head(ci_df_ref)
##   week_day hour                                       name
## 1      Sun    8           000970 - Scala Street, Fitzrovia
## 2      Sun    8              000999 - Queen Street 1, Bank
## 3      Sun    8        001006 - Bayley Street , Bloomsbury
## 4      Sun    8          001016 - Hereford Road, Bayswater
## 5      Sun    8        001043 - Museum of London, Barbican
## 6      Sun    8 001048 - Queen Victoria Street, St. Paul's
##   avg_used_capacity
## 1               100
## 2               100
## 3               100
## 4               100
## 5               100
## 6               100

We can rank this table to extract any given number of stations:

# Top and bottom stations
top_stations_weekday = ci_df_ref[c(1:250),]
bottom_stations_weekday = tail(ci_df_ref,250)

It is also simple to get the top and bottom used stations by grouping just by station:

## Group by station - mean of used_capacity
ci_df_3 = ci_df %>%
  group_by(name) %>%
  summarise(avg_used_capacity = mean(empty_status, na.rm=TRUE)) %>%
  arrange(desc(avg_used_capacity))

ci_df_3_ref = as.data.frame(ci_df_3)
head(ci_df_3_ref)
##                                    name avg_used_capacity
## 1    300079 - Manfred Road, East Putney          91.54527
## 2      200177 - Santos Road, Wandsworth          91.00370
## 3        300046 - Alma Road, Wandsworth          90.65369
## 4            200174 - Mostyn Grove, Bow          89.82960
## 5 300026 - All Saints' Road, Portobello          89.63444
## 6   200063 - Mexfield Road, East Putney          89.46055

See top and bottom used stations:

# Top and Bottom stations
top_stations = ci_df_3_ref[c(1:10),]
top_stations
##                                          name avg_used_capacity
## 1          300079 - Manfred Road, East Putney          91.54527
## 2            200177 - Santos Road, Wandsworth          91.00370
## 3              300046 - Alma Road, Wandsworth          90.65369
## 4                  200174 - Mostyn Grove, Bow          89.82960
## 5       300026 - All Saints' Road, Portobello          89.63444
## 6         200063 - Mexfield Road, East Putney          89.46055
## 7  200220 - Blythe Road West, Shepherd's Bush          88.22927
## 8           200186 - Nantes Close, Wandsworth          88.12582
## 9            200219 - Osiers Road, Wandsworth          88.06595
## 10            200098 - Aintree Street, Fulham          87.16813
bottom_stations = tail(ci_df_3_ref,10)
bottom_stations
##                                       name avg_used_capacity
## 769   022183 - Crinan Street, King's Cross          26.13413
## 770  300234 - Stratford Station, Stratford          25.49022
## 771    001114 - Pont Street, Knightsbridge          25.47857
## 772      002695 - Cadogan Gardens, Chelsea          24.60526
## 773      200238 - Westferry DLR, Limehouse          24.47227
## 774          200050 - Naval Row, Blackwall          24.27386
## 775    200193 - Snowsfields, London Bridge          23.56908
## 776      300022 - Evesham Street, Avondale          21.96792
## 777          200118 - Parkway, Camden Town          21.89531
## 778 003420 - Sedding Street, Sloane Square          18.56592

For a better view, we can attach each station's coordinates (already present in the raw data) by merging on the station name.

## Geolocate main stations
ci_df_geo = ci_df[,c("name","latitude","longitude")] 
ci_df_geo_uni = unique(ci_df_geo)

top_stations_geo = merge(top_stations,ci_df_geo_uni,by = "name")
bottom_stations_geo = merge(bottom_stations,ci_df_geo_uni,by = "name")

head(top_stations_geo)
##                                  name avg_used_capacity latitude
## 1 200063 - Mexfield Road, East Putney          89.46055 51.45682
## 2     200098 - Aintree Street, Fulham          87.16813 51.48102
## 3          200174 - Mostyn Grove, Bow          89.82960 51.53054
## 4    200177 - Santos Road, Wandsworth          91.00370 51.45706
## 5   200186 - Nantes Close, Wandsworth          88.12582 51.46193
## 6    200219 - Osiers Road, Wandsworth          88.06595 51.46067
##    longitude
## 1 -0.2028021
## 2 -0.2099735
## 3 -0.0254920
## 4 -0.2008063
## 5 -0.1807918
## 6 -0.1987354
head(bottom_stations_geo)
##                                     name avg_used_capacity latitude
## 1    001114 - Pont Street, Knightsbridge          25.47857 51.49689
## 2      002695 - Cadogan Gardens, Chelsea          24.60526 51.49246
## 3 003420 - Sedding Street, Sloane Square          18.56592 51.49313
## 4   022183 - Crinan Street, King's Cross          26.13413 51.53447
## 5          200050 - Naval Row, Blackwall          24.27386 51.50898
## 6          200118 - Parkway, Camden Town          21.89531 51.53807
##    longitude
## 1 -0.1612038
## 2 -0.1599190
## 3 -0.1568760
## 4 -0.1222030
## 5 -0.0069900
## 6 -0.1446640

Now that the top and bottom stations have coordinates, we can plot them on a map.

m <- get_map("London", zoom = 12, maptype = "terrain-lines", source = "stamen")
ggmap(m, base_layer = ggplot(aes(x = longitude, y = latitude), data = top_stations_geo)) +
  geom_point(color = "green") +
  theme_minimal() +
  ggtitle("Top Stations")
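The same call, sketched for the bottom stations (reusing the map object m and the bottom_stations_geo data frame built above):

```r
# Plot the least used stations on the same London basemap
library(ggplot2)
library(ggmap)

ggmap(m, base_layer = ggplot(aes(x = longitude, y = latitude), data = bottom_stations_geo)) +
  geom_point(color = "red") +
  theme_minimal() +
  ggtitle("Bottom Stations")
```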