Channel: Search Results for “heatmap” – R-bloggers

Using genomation to analyze methylation profiles from Roadmap epigenomics and ENCODE


(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

The genomation package is a toolkit for annotation and visualization of various genomic data. The package is currently in the development version of Bioconductor. It can be used to analyze high-throughput data, including bisulfite sequencing data. Here, we will visualize the distribution of CpG methylation around promoters and their locations within gene structures on human chromosome 3.

Heatmap and plot of meta-profiles of CpG methylation around promoters

In this example we use data from Reduced Representation Bisulfite Sequencing (RRBS) and
Whole-genome Bisulfite Sequencing (WGBS) techniques and H1 and IMR90 cell types
derived from the ENCODE and the Roadmap Epigenomics Project databases.

We download the datasets and convert them to GRanges objects using rtracklayer and genomation functions. We also use a RefSeq BED file for annotation and extract promoter regions with the readTranscriptFeatures function.
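
A minimal sketch of these steps, with placeholder file names standing in for the actual ENCODE/Roadmap downloads:

library(rtracklayer)
library(genomation)

# methylation calls as GRanges objects (placeholder file names)
h1.wgbs <- import("H1_WGBS_chr3.bed")   # rtracklayer::import infers the format
h1.rrbs <- import("H1_RRBS_chr3.bed")

# promoters, exons, introns and TSSs from a RefSeq BED file
transcripts <- readTranscriptFeatures("refseq.hg18.chr3.bed")
promoters   <- transcripts$promoters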






Now that we have read the files, we can build base-pair resolution matrices of scores (methylation values) for each experiment. The returned list of matrices can be used to draw heatmaps or meta-profiles of methylation ratios around promoters.
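
A sketch of this step, reusing the objects from the sketch above (argument values are illustrative):

# base-pair resolution matrices of methylation scores around promoters
sml <- ScoreMatrixList(targets = list(H1.WGBS = h1.wgbs, H1.RRBS = h1.rrbs),
                       windows = promoters, weight.col = "score",
                       strand.aware = TRUE)

multiHeatMatrix(sml, xcoords = c(-1000, 1000))  # heatmaps around promoters
plotMeta(sml, xcoords = c(-1000, 1000))         # meta-profiles of methylation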














Distribution of covered CpGs across gene regions


genomation facilitates visualization of the locations of features aggregated over exons, introns, promoters and TSSs. To find the distribution of covered CpGs within these gene structures, we will use the transcript features we obtained previously. Here is the breakdown of the code (a sketch follows the list):


  1. Count overlap statistics between our CpGs from the WGBS and RRBS H1 cell type data and gene structures
  2. Calculate the percentage of CpGs overlapping with each annotation
  3. Plot them in the form of pie charts
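
A sketch of these three steps with genomation's annotation functions, again reusing the objects from the sketches above:

# 1. overlap of CpGs with promoters/exons/introns
ann.wgbs <- annotateWithGeneParts(h1.wgbs, transcripts)
ann.rrbs <- annotateWithGeneParts(h1.rrbs, transcripts)

# 2. percentage of CpGs overlapping each annotation category
getTargetAnnotationStats(ann.wgbs, percentage = TRUE)

# 3. pie charts of those percentages
plotTargetAnnotation(ann.wgbs, main = "H1 WGBS CpGs")
plotTargetAnnotation(ann.rrbs, main = "H1 RRBS CpGs")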















Six Ways You Can Make Beautiful Graphs (Like Your Favorite Journalists)


(This article was first published on Plotly Blog, and kindly contributed to R-bloggers)
This post shows how to make graphs like The Economist, New York Times, Vox, 538, Pew, and Quartz. And you can share and embed your beautiful, interactive graphs in apps, blog posts, and web sites. Read on to learn how. If you like interactive graphs and need to securely collaborate with your team, contact us about Plotly Enterprise.





Graphing Political Opinion In The New York Times




The Upshot, a New York Times blog, publishes articles and data visualizations about politics, policy, economics, and everyday life. The visualization below comes from a study of political opinions. Events that occur between the ages of 14-24 are most impactful for the voting patterns and political preferences of the next generations of voters.


The Formative Years: Ages 14-24 are of paramount importance for forming long-term presidential voting preferences.



We’ve used Plotly’s fill option to show the confidence intervals. Hover your mouse to see data; click and drag to zoom. Click the source link to see the NYT original piece (you can add links to Plotly graphs).
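
As a rough illustration of the fill idea (made-up data and the current plotly R API, not the code behind the NYT figure):

library(plotly)

# hypothetical estimate with upper/lower confidence bounds
d <- data.frame(age = 14:80,
                est = sin(14:80 / 10),
                hi  = sin(14:80 / 10) + 0.2,
                lo  = sin(14:80 / 10) - 0.2)

plot_ly(d, x = ~age, y = ~hi, type = "scatter", mode = "lines",
        line = list(color = "transparent"), showlegend = FALSE, name = "upper") %>%
  add_trace(y = ~lo, fill = "tonexty", fillcolor = "rgba(31,119,180,0.2)",
            line = list(color = "transparent"), name = "95% CI") %>%
  add_trace(y = ~est, line = list(color = "rgb(31,119,180)"), name = "estimate")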





When To Show Up At A Party In 538




538 is a news site started by statistician Nate Silver. Their staff studied when people show up at parties. They concluded that “The median arrival time of the 803 guests was a whopping 58 minutes after the party’s designated start time.” We used a line of best fit and subplots.


How to Estimate When People Will Arrive at a Party



What People Think Of The News In Pew




Pew Research publishes polls about issues, attitudes, and trends. The heatmap below comes from a study by Pew concluding that among liberals and conservatives, “[t]here is little overlap in the news sources they turn to and trust.”


Trust Levels of News Sources by Ideological Group



The Illegal Trade In Animal Products In The Economist




The Economist publishes news and analysis on politics, business, finance, science, technology and the connections between them. This plot shows the price per kg of illegal animal products, with a logarithmic x axis.


Too high a price: The illegal trade in animal products



The History Of Cigarettes In Vox




Vox is a general interest news site, with the goal to explain the news. This plot was published in an academic journal then used in a Vox article on tobacco. Vox points out that after 1890, “Cigarettes only went from niche product to mass-market success after the rolling machine improved dramatically.”


Per Capita Consumption of Tobacco in the United States, 1880-1995



The Economics Of Unemployment In Quartz




Quartz is a news outlet for the new global economy. This plot comes from a piece concluding that “America has an unemployment problem, but specifically, it has a long-term unemployment problem.” We’ve styled the notes to be the same color as the lines they identify and placed them beside those lines.


Indexed Unemployment Levels Since the Recession Began



How We Made These Plots & How You Can Too




The most difficult part about making these charts is accessing the data. We often use WebPlotDigitizer to extract the data from published graphs. Then we embed the plots in our blog. To match Plotly’s colors with the original graphic, there are a number of tools available to you, including:





If you’ve made a style you like, you can save and apply that style as a theme. Or, you can save themes from the plots in this post (or any plots from the Plotly feed).





If you’re a developer, you can specify your colors, fonts, data, or styles with our APIs. Python users can embed in IPython Notebooks with matplotlib; R users in RPubs and Shiny with ggplot2; MATLAB users can share MATLAB figures. Every plot is accessible as a static image or as code in Python, R, MATLAB, Julia, JavaScript, or JSON. For example, for the last plot, see:
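
On the R side, a minimal sketch of that workflow with the plotly package; ggplotly converts a ggplot2 object, and publishing to the Plotly service is optional and assumes account credentials:

library(plotly)
library(ggplot2)

# a ggplot2 figure (built-in economics data as a stand-in)
p <- ggplot(economics, aes(date, unemploy)) + geom_line()

# convert it to an interactive Plotly figure
ggplotly(p)

# optionally publish it online
# (api_create() is the current plotly R function for this; older posts used plotly_POST())
# api_create(p, filename = "unemployment")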



Recreating the vaccination heatmaps in R


(This article was first published on Benomics » R, and kindly contributed to R-bloggers)

In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on twitter and facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as for the measles vaccine:

Credit to the WSJ and creators: Tynan DeBold and Dov Friedman


How hard would it be to recreate an R version?

Base R version

Quite recently Mick Watson, a computational biologist based here in Edinburgh, put together a base R version of this figure using heatmap.2 from the gplots package.

If you’re interested in the code for this, I suggest you check out his blog post where he walks the reader through creating the figure, beginning from heatmap defaults.

However, it didn’t take long for someone to pipe up asking for a ggplot2 version (3 minutes in fact…) and that’s my preference too, so I decided to have a go at putting one together.

ggplot2 version

Thankfully the hard work of tracking down the data had already been done for me. To get at it, follow these steps:

  1. Register and log in to “Project Tycho”
  2. Go to level 1 data, then Search and retrieve data
  3. Now change a couple of options: geographic level := state; disease outcome := incidence
  4. Add all states (highlight all at once with Ctrl+A, or Cmd+A on Macs)
  5. Hit submit and scroll down to Click here to download results to excel
  6. Open in Excel and export to CSV

Simple, right?

Now all that’s left to do is a bit of tidying. The data comes in wide format, so can be melted to our ggplot2-friendly long format with:

measles <- melt(measles, id.var=c("YEAR", "WEEK"))

After that we can clean up the column names and use dplyr to aggregate weekly incidence rates into an annual measure:

colnames(measles) <- c("year", "week", "state", "cases")
mdf <- measles %>% group_by(state, year) %>% 
       summarise(c=if(all(is.na(cases))) NA else 
                 sum(cases, na.rm=T))
mdf$state <- factor(mdf$state, levels=rev(levels(mdf$state)))

It’s a bit crude but what I’m doing is summing the weekly incidence rates and leaving NAs if there’s no data for a whole year. This seems to match what’s been done in the WSJ article, though a more interpretable measure could be something like average weekly incidence, as used by Robert Allison in his SAS version (a sketch of that alternative follows).
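
That alternative would only change the aggregation step; a sketch, assuming the same melted data and dplyr pipeline as above:

mdf.avg <- measles %>% group_by(state, year) %>% 
           summarise(c=if(all(is.na(cases))) NA else 
                     mean(cases, na.rm=T))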

After trying to match colours via the OS X utility “digital colour meter” without much success, I instead grabbed the colours and breaks from the original plot’s javascript to make them as close as possible.

In full, the actual ggplot2 command took a fair bit of tweaking:

ggplot(mdf, aes(y=state, x=year, fill=c)) + 
  geom_tile(colour="white", linewidth=2, 
            width=.9, height=.9) + theme_minimal() +
    scale_fill_gradientn(colours=cols, limits=c(0, 4000),
                        breaks=seq(0, 4e3, by=1e3), 
                        na.value=rgb(246, 246, 246, max=255),
                        labels=c("0k", "1k", "2k", "3k", "4k"),
                        guide=guide_colourbar(ticks=T, nbin=50,
                               barheight=.5, label=T, 
                               barwidth=10)) +
  scale_x_continuous(expand=c(0,0), 
                     breaks=seq(1930, 2010, by=10)) +
  geom_segment(x=1963, xend=1963, y=0, yend=51.5, size=.9) +
  labs(x="", y="", fill="") +
  ggtitle("Measles") +
  theme(legend.position=c(.5, -.13),
        legend.direction="horizontal",
        legend.text=element_text(colour="grey20"),
        plot.margin=grid::unit(c(.5,.5,1.5,.5), "cm"),
        axis.text.y=element_text(size=6, family="Helvetica", 
                                 hjust=1),
        axis.text.x=element_text(size=8),
        axis.ticks.y=element_blank(),
        panel.grid=element_blank(),
        title=element_text(hjust=-.07, face="bold", vjust=1, 
                           family="Helvetica"),
        text=element_text(family="URWHelvetica")) +
  annotate("text", label="Vaccine introduced", x=1963, y=53, 
           vjust=1, hjust=0, size=I(3), family="Helvetica")

Result

I’m pretty happy with the outcome but there are a few differences: the ordering is out (someone pointed out the original is ordered by two-letter code rather than full state name) and the fonts are off (as far as I can tell they use “Whitney ScreenSmart” among others).

Obviously the original is an interactive chart which works great with this data. It turns out it was built with the highcharts library, which actually has R bindings via the rCharts package, so in theory the original chart could be entirely recreated in R! However, for now at least, that’ll be left as an exercise for the reader…


Full code to reproduce this graphic is on github.


R User Group Recap: Heatmaps and Using the caret Package


(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap of column z-scores rather than raw values, with annotated rows, using Pearson correlation instead of Euclidean distance as the distance metric.
library(dplyr)
library(NMF)
library(RColorBrewer)
iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
         annRow = iris["Species"], annColors = "Set2",
         distfun = "pearson", treeheight=c(200, 50),
         fontsize=13, cexCol=.7,
         filename="heatmap.png", width=8, height=16)

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.
First, get a look at the data with the featurePlot function in the caret package:
library(caret)
set.seed(42)
data(iris)
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            auto.key = list(columns = 3))

Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:
                     setosa versicolor virginica
Sensitivity            1.00       1.00      0.00
Specificity            1.00       0.50      1.00
Pos Pred Value         1.00       0.50       NaN
Neg Pred Value         1.00       1.00      0.67
Prevalence             0.33       0.33      0.33
Detection Rate         0.33       0.33      0.00
Detection Prevalence   0.33       0.67      0.00
Balanced Accuracy      1.00       0.75      0.50
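
The full training code is in the Meetup repo; below is only a rough sketch of that kind of caret workflow (it assumes the randomForest and pls packages are installed):

set.seed(42)
inTrain  <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

rfFit  <- train(Species ~ ., data = training, method = "rf",  trControl = ctrl)
plsFit <- train(Species ~ ., data = training, method = "pls", trControl = ctrl,
                preProcess = c("center", "scale"))

confusionMatrix(predict(rfFit, testing), testing$Species)
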
A big thanks to Mark and Steve at Hemoshear for putting this together!


Project Tycho, ggplot2 and the shameless stealing of blog ideas


(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

Last week, Mick Watson posted a terrific article on using R to recreate the visualizations in this WSJ article on the impact of vaccination. Someone beat me to the obvious joke.

Someone also beat me to the standard response whenever base R graphics are used.

And despite devoting much of Friday morning to it, I was beaten to publication of a version using ggplot2.

Why then would I even bother to write this post? Well, because I did things a little differently; diversity of opinion and illustration of alternative approaches are good. And because on the internet, it’s quite acceptable to appropriate great ideas from other people when you lack any inspiration yourself. And because I devoted much of Friday morning to it.

Here then is my “exploration of what Mick did already, only using ggplot2 like Ben did already.”

You know what: since we’re close to the 60th anniversary of Salk’s polio vaccine field trial results, let’s do poliomyelitis instead of measles. And let’s normalise cases per 100 000 population, since Mick and Ben did not. There, I claim novelty.

1. Getting the disease data
Follow Mick’s instructions, substituting POLIOMYELITIS for MEASLES. The result is a CSV file. Except: the first 2 lines are not CSV and the header row contains an extra blank field (empty quotes), not found in the data rows. The simplest way to deal with this was to use read.csv(), which auto-creates the extra column (62), calls it “X” and fills it with NA. You can then ignore it. Here’s a function to read the CSV file and aggregate cases by year.

library(plyr)
library(reshape2)
library(ggplot2)

readDiseaseData <- function(csv) {
  dis <- read.csv(csv, skip = 2, header = TRUE, stringsAsFactors = FALSE)
  dis[, 3:62] <- sapply(dis[, 3:62], as.numeric)  
  dis.m   <- melt(dis[, c(1, 3:61)], variable.name = "state", id.vars = "YEAR")
  dis.agg <- aggregate(value ~ state + YEAR, dis.m, sum)
  return(dis.agg)
}

We supply the downloaded CSV file as an argument:

polio <- readDiseaseData("POLIOMYELITIS_Cases_1921-1971_20150410002846.csv")

and here are the first few rows:

head(polio)
          state YEAR value
1      NEBRASKA 1921     1
2    NEW.JERSEY 1921     1
3      NEW.YORK 1921    40
4      NEW.YORK 1922     2
5 MASSACHUSETTS 1923     1
6      NEW.YORK 1923     8

2. Getting the population data
I discovered US state population estimates for the years 1900 – 1990 at the US Census Bureau. The URLs are HTTPS, but omitting the “s” works fine. The data are plain text…which is good but…although the data are somewhat structured (delimited), the files themselves vary a lot. Some contain only estimates, others contain in addition census counts. For earlier decades the numbers are thousands with a comma (so “1,200” = 1 200 000). Later files use millions with no comma. The decade years are split over several lines with different numbers of lines before and inbetween.

To make a long story short, any function to read these files requires many parameters to take all this into account and it looks like this:

getPopData <- function(years = "0009", skip1 = 23, skip2 = 81, rows = 49, names = 1900:1909, keep = 1:11) {
  u  <- paste("http://www.census.gov/popest/data/state/asrh/1980s/tables/st", years, "ts.txt", sep = "")
  p1 <- read.table(u, skip = skip1, nrows = rows, header = F, stringsAsFactors = FALSE)
  p2 <- read.table(u, skip = skip2, nrows = rows, header = F, stringsAsFactors = FALSE)
  p12 <- join(p1, p2, by = "V1")
  p12 <- p12[, keep]
  colnames(p12) <- c("state", names)
  # 1900-1970 are in thousands with commas
  if(as.numeric(substring(years, 1, 1)) < 7) {
    p12[, 2:11] <- sapply(p12[, 2:11], function(x) gsub(",", "", x))
    p12[, 2:11] <- sapply(p12[, 2:11], as.numeric)
    p12[, 2:11] <- sapply(p12[, 2:11], function(x) 1000*x)
  }
  return(p12)
}

So now we can create a list of data frames, one per decade, then use plyr::join_all to join on state and get a big data frame of 51 states x 91 years with population estimates.

popn <- list(p1900 = getPopData(),
             p1910 = getPopData(years = "1019", names = 1910:1919),
             p1920 = getPopData(years = "2029", names = 1920:1929),
             p1930 = getPopData(years = "3039", names = 1930:1939),
             p1940 = getPopData(years = "4049", skip1 = 21, skip2 = 79, names = 1940:1949),
             p1950 = getPopData(years = "5060", skip1 = 27, skip2 = 92, rows = 51, names = 1950:1959, keep = c(1, 3:7, 9:13)),
             p1960 = getPopData(years = "6070", skip1 = 24, skip2 = 86, rows = 51, names = 1960:1969, keep = c(1, 3:7, 9:13)),
             p1970 = getPopData(years = "7080", skip1 = 14, skip2 = 67, rows = 51, names = 1970:1979, keep = c(2:8, 11:14)),
             p1980 = getPopData(years = "8090", skip1 = 11, skip2 = 70, rows = 51, names = 1980:1990, keep = 1:12))

popn.df <- join_all(popn, by = "state", type = "full")

3. Joining the datasets
Next step: join the disease and population data. Although we specified states in the original data download, it includes things that are not states like “UPSTATE.NEW.YORK”, “DISTRICT.OF.COLUMBIA” or “PUERTO.RICO”. So let’s restrict ourselves to the 50 states helpfully supplied as variables in R. First we create a data frame containing state names and abbreviations, then match the abbreviations to the polio data.

statenames <- toupper(state.name)
statenames <- gsub(" ", ".", statenames)
states <- data.frame(sname = statenames, sabb = state.abb)

m <- match(polio$state, states$sname)
polio$abb <- states[m, "sabb"]

Now we can melt the population data, join to the polio data on state abbreviation and calculate cases per 100 000 people.

popn.m <- melt(popn.df)
colnames(popn.m) <- c("abb", "YEAR", "pop")
popn.m$YEAR <- as.numeric(as.character(popn.m$YEAR))
polio.pop <- join(polio, popn.m, by = c("YEAR", "abb"))
polio.pop$cases <- (100000 / polio.pop$pop) * polio.pop$value

head(polio.pop)
          state YEAR value abb      pop      cases
1      NEBRASKA 1921     1  NE  1309000 0.07639419
2    NEW.JERSEY 1921     1  NJ  3297000 0.03033060
3      NEW.YORK 1921    40  NY 10416000 0.38402458
4      NEW.YORK 1922     2  NY 10589000 0.01888752
5 MASSACHUSETTS 1923     1  MA  4057000 0.02464876
6      NEW.YORK 1923     8  NY 10752000 0.07440476

Success! Let’s get plotting.

4. Plotting
We should really indicate where data are missing but for the purposes of this post, I’ll just drop incomplete rows using na.omit().

Technically my first attempt is an abuse of geom_dotplot, but I think it generates quite a nice effect (assuming you’re not colour-blind). Note that years have to be factorised here.

ggplot(na.omit(polio.pop)) + geom_dotplot(aes(x = factor(YEAR), fill = cases), color = "white", binwidth = 1, 
dotsize = 0.4, binpositions = "all", method = "histodot") + facet_grid(abb~.) + theme_bw() + 
scale_fill_continuous(low = "floralwhite", high = "red") + geom_vline(xintercept = 32) + 
scale_y_discrete(breaks = NULL) + theme(panel.border = element_blank(), strip.text.y = element_text(angle = 0)) + 
scale_x_discrete(breaks = seq(min(polio.pop$YEAR), max(polio.pop$YEAR), 5)) + 
labs(x = "Year", y = "cases / 100000", title = "Poliomyelitis 1921 - 1971")
Polio cases, dotplot


For a shape more like the WSJ plots, I use geom_rect. This plot is generated quite a lot faster.

ggplot(na.omit(polio.pop)) + geom_rect(aes(xmin = YEAR, xmax = YEAR+1, ymin = 0, ymax = 12, fill = cases)) + 
facet_grid(abb~.) + theme_bw() + scale_y_discrete(breaks = NULL) + scale_fill_continuous(low = "floralwhite", high = "red") + 
theme(panel.border = element_blank(), panel.margin = unit(1, "mm"), strip.text.y = element_text(angle = 0)) + 
geom_vline(xintercept = 1955) + scale_x_continuous(breaks = seq(min(polio.pop$YEAR), max(polio.pop$YEAR), 5)) + 
labs(x = "Year", y = "cases / 100000", title = "Poliomyelitis 1921 - 1971")

Polio cases, geom_rect



Finally, let’s try the colour palette generated by Mick. From his post, it’s clear that the WSJ fiddled with bin sizes and break points to generate more yellow/orange/red for pre-vaccine years. I haven’t bothered with that here, so things look a little different.
cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10), colorRampPalette(c("yellow", "red"))(30))

ggplot(na.omit(polio.pop)) + geom_rect(aes(xmin = YEAR, xmax = YEAR+1, ymin = 0, ymax = 12, fill = cases), color = "white") + 
facet_grid(abb~.) + theme_bw() +scale_y_discrete(breaks = NULL) + scale_fill_gradientn(colours = cols) + 
theme(panel.border = element_blank(), panel.margin = unit(1, "mm"), strip.text.y = element_text(angle = 0)) + 
geom_vline(xintercept = 1955) + scale_x_continuous(breaks = seq(min(polio.pop$YEAR), max(polio.pop$YEAR), 5)) + 
labs(x = "Year", y = "cases / 100000", title = "Poliomyelitis 1921 - 1971")
Polio cases, geom_rect, new palette


Summary
Not too much work for some quite attractive output, thanks to great R packages; Hadley, love your work.
As ever, the main challenge is getting the raw data into shape. At some point I’ll wrap all this up as Rmarkdown and send it off to RPubs.




R for more powerful clustering


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Vidisha Vachharajani
Freelance Statistical Consultant

R showcases several useful clustering tools, but the one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, making most wonder which it is: a "colorful visual representation of data in a matrix" or "a (thematic) map in which areas are represented in patterns ("heat" colors) that are proportionate to the measurement of some information being displayed on the map"? For our clustering purposes, the former meaning of a heatmap is the appropriate one, while the latter is a choropleth.

The reason why we would want to link the use of a heatmap with hierarchical clustering is the former’s ability to lucidly represent the information in a hierarchical clustering (HC) output, so that it is easily understood and more visually appealing. It also (via the heatmap.2 function in R's gplots package) provides a mechanism for applying HC to both rows and columns of a data matrix, so that it yields meaningful groups that share certain features (within the same group) and are differentiated from each other (across different groups).

Consider the following simple example, which uses the "States" data set in the car package. States contains the following features:

  • region: U. S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
  • pop: Population: in 1,000s.
  • SATV: Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
  • SATM: Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
  • percent: Percentage of graduating high-school students in the state who took the SAT exam.
  • dollars: State spending on public education, in $1000s per student.
  • pay: Average teacher's salary in the state, in $1000s.

We wish to account for all but the first column (region) to create groups of states that are common with respect to the different pieces of information we have about them. For instance, what states are similar vis-a-vis exam scores vs. state education spending? Instead of doing just a hierarchical clustering, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.

Initial_plot

# R CODE (output = "initial_plot.png")
library(gplots)                 # contains the heatmap.2 function
library(car)
States[1:3,]                    # look at the data

scaled <- scale(States[,-1])    # scale all but the first column to make information comparable
heatmap.2(scaled,               # specify the (scaled) data to be used in the heatmap
          cexRow=0.5, cexCol=0.95,  # decrease font size of row/column labels
          scale="none",         # we have already scaled the data
          trace="none")         # cleaner heatmap

This initial heatmap gives us a lot of information about the potential state grouping. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from an hclust() rendering). However, in order to get an even cleaner look, and have groups fall right out of the plot, we can add row and column separators, rendering an "all-the-information-in-one-glance" look. Placement information for the separators comes from the HC dendrograms (both row and column). Let's also play around with the colors to get a "red-yellow-green" effect for the scaling, which will render the underlying information even more clearly. Finally, we'll also eliminate the underlying dendrograms, so we simply have a clean color plot with underlying groups (this option can be easily undone from the code below).
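
For reference, a minimal sketch of the standalone hclust() rendering mentioned above, applied to the same scaled matrix (heatmap.2 uses dist() and hclust() with their defaults internally):

hc.rows <- hclust(dist(scaled))   # Euclidean distance, complete linkage by default
plot(hc.rows, cex = 0.5)          # the classic dendrogram
cutree(hc.rows, k = 5)            # e.g. cut the tree into 5 groups of states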

Final_plot

# R CODE (output = "final_plot.png")
 
# Use color brewer
library(RColorBrewer)
my_palette <- colorRampPalette(c('red','yellow','green'))(256)
 
scaled <- scale(States[,-1])    # scale all but the first column to make information comparable
heatmap.2(scaled,               # specify the (scaled) data to be used in the heatmap
          cexRow=0.5,
          cexCol=0.95,                        # decrease font size of row/column labels
          col=my_palette,                     # read in custom colors
          colsep=c(2,4,5),                    # add the separators that will clarify the plot even more
          rowsep=c(6,14,18,25,30,36,42,47),
          sepcolor="black",
          sepwidth=c(0.01,0.01),
          scale="none",                       # we have already scaled the data
          dendrogram="none",                  # no need to see dendrograms in this one
          trace="none")                       # cleaner heatmap

This plot gives us a nice, clear picture of the groups that come out of the HC implementation, as well as in the context of column (attribute) groups. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, their state spending on education and average teacher salary are much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.

This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.


Time Series Graphs & Eleven Stunning Ways You Can Use Them


(This article was first published on Plotly Blog, and kindly contributed to R-bloggers)
Many graphs use a time series, meaning they measure events over time. William Playfair (1759 - 1823) was a Scottish economist and a pioneer of this approach. Playfair invented the line graph. The graph below, one of his most famous, depicts how in the 1750s the Brits started exporting more than they were importing.





This post shows how you can use Playfair’s approach and many more for making a time series graph. To embed Plotly graphs in your applications, dashboards, and reports, check out Plotly Enterprise.


1. By Year




First we’ll show an example of a standard time series graph. The data is drawn from a paper on shaving trends. The author concludes that the “dynamics of taste”, in this case facial hair, are “common expressions of underlying conditions and sequences in social behavior.” Time is on the x-axis. The y-axis shows the respective percentages of men’s facial hair styles.


Men's Facial Hair Trends, 1842 to 1972



You can click and drag to move the axis, click and drag to zoom, or toggle traces on and off in the legend. The temperature graph below shows how Plotly adjusts data from years to nanoseconds as you zoom. The first timestamp is 2014-12-15 08:55:13.961347, which is how Plotly reads dates. That is, `yyyy-mm-dd HH:MM:SS.ssssss`. Now that’s drilling down.
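
If you are preparing timestamps like that in R, a small sketch of parsing and printing them at microsecond precision:

x <- as.POSIXct("2014-12-15 08:55:13.961347", tz = "UTC")
format(x, "%Y-%m-%d %H:%M:%OS6")   # prints the fractional seconds back out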





One of the special things about Plotly is that you can translate plots and data between programming languages, file formats, and data types. For example, the multiple axis plot below uses stacked plots on the same time scale for different economic indicators. This plot was made using ggplot2’s time scale. We can convert the plot into Plotly, allowing anyone to edit the figure from different programming languages or the Plotly web app.


pce, pop, psavert, uempmed, unemploy



We have a time series tutorial that explains time series graphs, custom date formats, custom hover text labels, and time series plots in MATLAB, Python, and R.


2. Subplots & Small Multiples




Another way to slice your data is by subplots. These histograms were made with R and compare yearly data. Each plot shows the annual number of players who had a given batting average in Major League Baseball.


2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013



You can also display your data using small multiples, a concept developed by Edward Tufte. Small multiples are illustrations of “postage-stamp size”. They use the same graph type to index data by a category or label. Using facets, we’ve plotted a dataset of airline passengers. Each subplot shows the overall travel numbers and a reference line for the thousands of passengers travelling that month.


Jan, Feb, Mar, Apr, May, June, July, Aug, Sep, Oct, Nov, Dec



3. By Month




The heatmap below shows the percentages of people’s birthdays on a given date, gleaned from 480,040 life insurance applications. The x-axis shows months, the y-axis shows the day of the month, and the z-axis shows the % of birthdays on each date.


How Common is Your Birthday?



To show how values in your data are spaced over different months, we can use seasonal boxplots. The boxes represent how the data is spaced for each month; the dots represent outliers. We’ve used ggplot2 to make our plot and added a smoothed fit with a confidence interval. See our box plot tutorial to learn more.


Box plot with Smoothed Fit



We can use a bar chart with error bars to look at data over a monthly interval. In this case, we’re using R to make a graph with error bars showing snowfall in Montreal.


Snowfall in Montreal by Month



4. A Repeated Event With A Category




We may want to look at data that is not strictly a time series, but still represents changes over time. For example, we may want hourly event data. Below we’re showing the most popular hourly reasons to call 311 in NYC, a number you can call for non-emergency help. The plot is from our pandas and SQLite guide.


The 6 Most Common 311 Complaints by Hour in a Day



We can also show a before and after effect to examine changes from an event. The plot below, made in an IPython Notebook, tracks Conservative and Labour election impacts on Pounds and Dollars.


GBP USD during UK general elections by winning party



5. A 3D Graph




We can also use a 3D chart to show events over time. For example, our surface chart below shows the UK swaps term structure, with historical dates along the x-axis, the term structure on the y-axis, and the swap rates on the z-axis. The message: rates are lower than ever. At the long end of the curve we don’t see a massive increase. This example was made using cufflinks, a Python library by Jorge Santos. For more on 3D graphing see our Python, MATLAB, R, and web tutorials.


UK Swap Rates



Sharing & Deploying Plotly




If you liked this post, please consider sharing. We’re @plotlygraphs, or email us at feedback at plot dot ly. We have tutorials that show how to make and embed graphs in your website, blog, or apps. To learn more about how companies are using Plotly Enterprise across different industries, see our customer stories.



Blue period: Analyzing the color of paintings with R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While movies have been getting more orange with time, paintings have been going in the other direction. Paintings today are generally more blue than they were a few hundred years ago.

Painting colors

The image above shows the color spectrum of almost 100,000 paintings created since 1800. Martin Bellander used R to create the image, by scraping images from the BBC YourPaintings site with the help of the rvest package. He then extracted the spectrum from each of the images using the readbitmap and colorspace packages, before combining the data into the time-ordered heatmap above using the plotrix package. (You can find all of the R code on the page linked at the end of this post.)
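
Martin's actual code is linked at the end of this post; as a rough sketch of the idea, a single painting's hue distribution could be summarised like this (placeholder file name, with base R's rgb2hsv standing in for the colorspace conversion):

library(readbitmap)

img <- read.bitmap("painting.jpg")          # height x width x RGB array, values in [0, 1] for PNG/JPEG
px  <- apply(img[, , 1:3], 3, as.vector)    # one row per pixel: R, G, B
hsv <- rgb2hsv(t(px), maxColorValue = 1)    # rows "h", "s", "v"
hue.counts <- hist(hsv["h", ], breaks = seq(0, 1, length.out = 65), plot = FALSE)$counts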

In an article for Significance magazine, Martin suggests a few possible reasons why paintings are getting bluer with time:

  • The colour blue is a relatively new colour word.
  • An increase in dark colours or black might drive the effect if these contain more blue or if the camera registers them as blue to a larger extent.
  • The colours in paintings tend to change over time, e.g. due to the aging of resins.
  • Blue has historically been a very expensive colour, and the decreasing price and increased supply might explain the increased use.

He explores these hypotheses by (for example) looking at just the oil paintings over time, but the result is inconclusive. One possibility that occurs to me is the rising popularity of landscape paintings over time, which might have led to more blue skies being represented in painting. (Any art historians want to chime in?) Check out all the details of Martin's analysis at the link below.

I cannot make bricks without clay: The colors of paintings: blue is the new orange

 


Wakefield: Random Data Set (Part II)


(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)
This post is part II of a series detailing the GitHub package, wakefield, for generating random data sets. The first post (part I) was a test run to gauge user interest. I received positive feedback and some ideas for improvements, which I’ll share below. You can view just the R code HERE or a PDF version HERE.

1 Brief Package Description

First we’ll use the pacman package to grab the wakefield package from GitHub and then load it as well as the handy dplyr package.
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/wakefield")
p_load(dplyr, wakefield)
The main function in wakefield is r_data_frame. It takes n (the number of rows) and any number of variable functions that generate random columns. The result is a data frame with named, randomly generated columns. Below is an example; for details see Part I or the README.
set.seed(10)

r_data_frame(n = 30,
    id,
    race,
    age(x = 8:14),
    Gender = sex,
    Time = hour,
    iq,
    grade, 
    height(mean=50, sd = 10),
    died,
    Scoring = rnorm,
    Smoker = valid
)
## Source: local data frame [30 x 11]
## 
##    ID     Race Age Gender     Time  IQ Grade Height  Died    Scoring
## 1  01    White  11   Male 01:00:00 110  90.7     52 FALSE -1.8227126
## 2  02    White   8   Male 01:00:00 111  91.8     36  TRUE  0.3525440
## 3  03    White   9   Male 01:30:00  87  81.3     39 FALSE -1.3484514
## 4  04 Hispanic  14   Male 01:30:00 111  83.2     46  TRUE  0.7076883
## 5  05    White  10 Female 03:30:00  95  80.1     51  TRUE -0.4108909
## 6  06    White  13 Female 04:00:00  97  93.9     61  TRUE -0.4460452
## 7  07    White  13 Female 05:00:00 109  89.5     44  TRUE -1.0411563
## 8  08    White  14   Male 06:00:00 101  92.3     63  TRUE -0.3292247
## 9  09    White  12   Male 06:30:00 110  90.1     52  TRUE -0.2828216
## 10 10    White  11   Male 09:30:00 107  88.4     47 FALSE  0.4324291
## .. ..      ... ...    ...      ... ...   ...    ...   ...        ...
## Variables not shown: Smoker (lgl)

2 Improvements

2.1 Repeated Measures Series

Big thanks to Ananda Mahto for suggesting better handling of repeated measures series and providing concise code to extend this capability. The user may now specify the same variable function multiple times and it is named appropriately:
set.seed(10)

r_data_frame(
    n = 500,
    id,
    age, age, age,
    grade, grade, grade
)
## Source: local data frame [500 x 7]
## 
##     ID Age_1 Age_2 Age_3 Grade_1 Grade_2 Grade_3
## 1  001    28    33    32    80.2    87.2    85.6
## 2  002    24    35    31    89.7    91.7    86.8
## 3  003    26    33    23    92.7    85.7    88.7
## 4  004    31    24    28    82.2    90.0    86.0
## 5  005    21    21    29    86.5    87.0    88.4
## 6  006    23    28    25    85.6    93.5    86.7
## 7  007    24    22    26    89.3    90.3    87.6
## 8  008    24    21    23    92.4    88.3    89.3
## 9  009    29    23    32    86.4    84.4    88.2
## 10 010    26    34    32    97.6    84.2    90.6
## .. ...   ...   ...   ...     ...     ...     ...
But he went further, recommending a shorthand for variable, variable, variable. The r_series function takes a variable function and j, the number of columns. It can also be renamed with the name argument:
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(gpa, 2),
    r_series(likert, 3, name = "Question")
)
## Source: local data frame [100 x 8]
## 
##     ID Age    Sex GPA_1 GPA_2        Question_1        Question_2
## 1  001  28   Male  3.00  4.00 Strongly Disagree   Strongly Agree 
## 2  002  24   Male  3.67  3.67          Disagree           Neutral
## 3  003  26   Male  3.00  4.00          Disagree Strongly Disagree
## 4  004  31   Male  3.67  3.67           Neutral   Strongly Agree 
## 5  005  21 Female  3.00  3.00             Agree   Strongly Agree 
## 6  006  23 Female  3.67  3.67             Agree             Agree
## 7  007  24 Female  3.67  4.00          Disagree Strongly Disagree
## 8  008  24   Male  2.67  3.00   Strongly Agree            Neutral
## 9  009  29 Female  4.00  3.33           Neutral Strongly Disagree
## 10 010  26   Male  4.00  3.00          Disagree Strongly Disagree
## .. ... ...    ...   ...   ...               ...               ...
## Variables not shown: Question_3 (fctr)

2.2 Dummy Coding Expansion of Factors

It is sometimes nice to expand a factor into j (number of groups) dummy coded columns. Here we see a factor version and then a dummy coded version of the same data frame:
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    political
)
## Source: local data frame [100 x 4]
## 
##     ID Age    Sex    Political
## 1  001  28   Male Constitution
## 2  002  24   Male Constitution
## 3  003  26   Male     Democrat
## 4  004  31   Male     Democrat
## 5  005  21 Female Constitution
## 6  006  23 Female     Democrat
## 7  007  24 Female     Democrat
## 8  008  24   Male   Republican
## 9  009  29 Female Constitution
## 10 010  26   Male     Democrat
## .. ... ...    ...          ...
The dummy coded version…
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    r_dummy(sex, prefix = TRUE),
    r_dummy(political)
)
## Source: local data frame [100 x 9]
## 
##     ID Age Sex_Male Sex_Female Constitution Democrat Green Libertarian
## 1  001  28        1          0            1        0     0           0
## 2  002  24        1          0            1        0     0           0
## 3  003  26        1          0            0        1     0           0
## 4  004  31        1          0            0        1     0           0
## 5  005  21        0          1            1        0     0           0
## 6  006  23        0          1            0        1     0           0
## 7  007  24        0          1            0        1     0           0
## 8  008  24        1          0            0        0     0           0
## 9  009  29        0          1            1        0     0           0
## 10 010  26        1          0            0        1     0           0
## .. ... ...      ...        ...          ...      ...   ...         ...
## Variables not shown: Republican (int)

2.3 Factor to Numeric Conversion

There are times when you want a factor and times when you want an integer version. This is particularly useful with Likert-type data and other ordered factors. The as_integer function takes a data.frame and allows the user to specify the indices (j) to convert from factor to numeric. Here I show a factor data.frame and then the integer conversion:
set.seed(10)

r_data_frame(5,
    id, 
    r_series(likert, j = 4, name = "Item")
)
## Source: local data frame [5 x 5]
## 
##   ID          Item_1   Item_2          Item_3            Item_4
## 1  1         Neutral    Agree        Disagree           Neutral
## 2  2           Agree    Agree         Neutral   Strongly Agree 
## 3  3         Neutral    Agree Strongly Agree              Agree
## 4  4        Disagree Disagree         Neutral             Agree
## 5  5 Strongly Agree   Neutral           Agree Strongly Disagree
As integers…
set.seed(10)

r_data_frame(5,
    id, 
    r_series(likert, j = 4, name = "Item")
) %>% 
    as_integer(-1)
## Source: local data frame [5 x 5]
## 
##   ID Item_1 Item_2 Item_3 Item_4
## 1  1      3      4      2      3
## 2  2      4      4      3      5
## 3  3      3      4      5      4
## 4  4      2      2      3      4
## 5  5      5      3      4      1

2.4 Viewing Whole Data Set

dplyr has a nice print method that hides excessive rows and columns. Typically this is great behavior. Sometimes, though, you want to quickly see the whole width of the data set. We can use View, but this is still wide and shows all columns. The peek function shows minimal rows, truncated columns, and prints wide for quick inspection. This is particularly nice for text strings as data. dplyr prints wide data sets like this:
r_data_frame(100,
    id, 
    name,
    sex,
    sentence    
)
## Source: local data frame [100 x 4]
## 
##     ID     Name    Sex
## 1  001   Gerald   Male
## 2  002    Jason   Male
## 3  003 Mitchell   Male
## 4  004      Joe Female
## 5  005   Mickey   Male
## 6  006   Michal   Male
## 7  007   Dannie Female
## 8  008   Jordan   Male
## 9  009     Rudy Female
## 10 010   Sammie Female
## .. ...      ...    ...
## Variables not shown: Sentence (chr)
Now use peek:
r_data_frame(100,
    id, 
    name,
    sex,
    sentence    
) %>% peek
## Source: local data frame [100 x 4]
## 
##     ID    Name    Sex   Sentence
## 1  001     Jae Female Excuse me.
## 2  002 Darnell Female Over the l
## 3  003  Elisha Female First of a
## 4  004  Vernon Female Gentlemen,
## 5  005   Scott   Male That's wha
## 6  006   Kasey Female We don't h
## 7  007 Michael   Male You don't 
## 8  008   Cecil Female I'll get o
## 9  009    Cruz Female They must 
## 10 010  Travis Female Good night
## .. ...     ...    ...        ...

2.5 Visualizing Column Types and NAs

When we build a large random data set it is nice to get a sense of the column types and the missing values. The table_heat function (also the plot method for the tbl_df class) does this. Here I’ll generate a data set, add missing values (r_na), and then plot:
set.seed(10)

r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
   r_na() %>%
   plot(palette = "Set1")

3 Table of Variable Functions

There are currently 66 wakefield-based variable functions to choose from for building columns. Use variables() to see them or variables(TRUE) to see a list of them broken into variable types. Here’s a table version:
age dob height_in month speed
animal dummy income name speed_kph
answer education internet_browser normal speed_mph
area employment iq normal_round state
birth eye language paragraph string
car gender level pet upper
children gpa likert political upper_factor
coin grade likert_5 primary valid
color grade_letter likert_7 race year
date_stamp grade_level lorem_ipsum religion zip_code
death group lower sat
dice hair lower_factor sentence
died height marital sex
dna height_cm military smokes

4 Possible Uses

4.1 Testing Methods

I personally will use this most frequently when I’m testing out a model. For example say you wanted to test psychometric functions, including the cor function, on a randomly generated assessment:
dat <- r_data_frame(120,
    id, 
    sex,
    age,
    r_series(likert, 15, name = "Item")
) %>% 
    as_integer(-c(1:3))

dat %>%
    select(contains("Item")) %>%
    cor %>%
    heatmap

4.2 Unique Student Data for Course Assignments

Sometimes it’s nice if students each have their own data set to work with but one in which you control the parameters. Simply supply the students with a unique integer id and they can use this inside of set.seed with a wakefield r_data_frame you’ve constructed for them in advance. Voila, 25 instant data sets that are structurally the same but randomly different.
student_id <- ## INSERT YOUR ID HERE
    
set.seed(student_id)

dat <- r_data_frame(1000,
    id, 
    gender,
    religion,
    internet_browser,
    language,
    iq,
    sat,
    smokes
)

4.3 Blogging and Online Help Communities

wakefield can make data sharing on blog posts and online help communities (e.g., TalkStats, StackOverflow) fast, accessible, and with little space or cognitive effort. Use variables(TRUE) to see variable functions by class and select the ones you want:
variables(TRUE)
## $character
## [1] "lorem_ipsum" "lower"       "name"        "paragraph"   "sentence"   
## [6] "string"      "upper"       "zip_code"   
## 
## $date
## [1] "birth"      "date_stamp" "dob"       
## 
## $factor
##  [1] "animal"           "answer"           "area"            
##  [4] "car"              "coin"             "color"           
##  [7] "dna"              "education"        "employment"      
## [10] "eye"              "gender"           "grade_level"     
## [13] "group"            "hair"             "internet_browser"
## [16] "language"         "lower_factor"     "marital"         
## [19] "military"         "month"            "pet"             
## [22] "political"        "primary"          "race"            
## [25] "religion"         "sex"              "state"           
## [28] "upper_factor"    
## 
## $integer
## [1] "age"      "children" "dice"     "level"    "year"    
## 
## $logical
## [1] "death"  "died"   "smokes" "valid" 
## 
## $numeric
##  [1] "dummy"        "gpa"          "grade"        "height"      
##  [5] "height_cm"    "height_in"    "income"       "iq"          
##  [9] "normal"       "normal_round" "sat"          "speed"       
## [13] "speed_kph"    "speed_mph"   
## 
## $`ordered factor`
## [1] "grade_letter" "likert"       "likert_5"     "likert_7"
Then throw these inside of r_data_frame to make a quick data set to share.
r_data_frame(8,
    name,
    sex,
    r_series(iq, 3)
) %>%
    peek %>%
    dput

5 Getting Involved

If you’re interested in getting involved with use or contributing you can:
  1. Install and use wakefield
  2. Provide feedback via comments below
  3. Provide feedback (bugs, improvements, and feature requests) via wakefield’s Issues Page
  4. Fork from GitHub and give a Pull Request
Thanks for reading, your feedback is welcomed.
*Get the R code for this post HERE *Get a PDF version of this post HERE


Cohort Analysis with Heatmap


(This article was first published on AnalyzeCore » R language, and kindly contributed to R-bloggers)

Previously I shared the data visualization approach for descriptive analysis of the progress of cohorts with the “layer-cake” chart (part I and part II). In this post, I want to share another interesting visualization that can not only be used for descriptive analysis but is also more helpful for analyzing a large number of cohorts. For instance, if you need to form and analyze weekly cohorts, you would have 52 cohorts within a year.

The Heatmap chart would be helpful for primary analysis and we will study how to create it with the R programming language. But firstly, I would like to give credit to John Egan who shared the idea of using the Cohort Activity Heatmap and to Ben Moore whose great post helped me to reproduce such a beautiful color palette.

The following is my interpretation of using the Heatmap for Cohort Analysis.

Let’s assume we form weekly cohorts and have 100 of them as of the reporting date. We’ve tracked the number of customers who made a purchase and the total gross margin per weekly cohort per time lapse (a week in our case). We can easily calculate two extra values based on these data:

  • per customer gross margin per cohort per week,
  • customer lifetime value (CLV) to date as accumulated gross margin per cohort divided by initial number of customers in the cohort.

In addition, I’ve simulated some purchase patterns that can be plausible, specifically:

  • customers tend to buy actively during the first weeks of the lifetime,
  • we have two seasonal growths of sales every year (e.g. Back to school sales and Black Friday that are accompanied with higher discounts).

Based on these data we can plot at least four types of charts using Heatmap:

  1. Cohort activity, based on the number of customers who made a purchase each week (active customers),
  2. Cohort gross margin, based on the total amount of money that the cohort brought each week,
  3. Per customer gross margin, based on the average gross margin that the cohort brought each week,
  4. Cohort CLV to date, based on cumulative CLV to date.

Furthermore, charts can be represented based on calendar dates and the serial number of the week of the lifetime (e.g. 1st week, 2nd week, etc. from the first purchase date) as well. Therefore, we can see the influence of seasonality or other occurrences on all existing cohorts as of calendar date and the progress of each cohort comparing to the others based on the serial number of the week of the lifetime.

And our eight charts are the following:

heatmap_1_1 heatmap_1_2 heatmap_2_1 heatmap_2_2 heatmap_3_1 heatmap_3_2 heatmap_4_1 heatmap_4_2

We have placed dates (calendar or week of lifetime) on the x-axis and cohorts on the y-axis. The color of the heatmap represents the value (number of customers, gross margin, per customer gross margin and CLV to date).

Based on this type of visualization we can easily identify general purchasing behaviors, for instance:

  • number of customers who made a purchase was at its highest at the beginning of the cohort’s lifetime (first four weeks) and has increased in the sales seasons,
  • although the number of customers who made purchases increased in the sales seasons, the gross margin hasn't increased accordingly. In other words, lower prices haven't been compensated for by a higher number of customers,
  • if our average customer acquisition cost (CAC) is $50, for example, we can find that almost all cohorts have been repaid by CLV to date within the 32-35 weeks of lifetime,
  • and so on.

You can produce this example via the following R code:


#loading libraries
library(dplyr)
library(ggplot2)
library(reshape2)

#simulating dataset
cohorts <- data.frame()
set.seed(10)
for (i in c(1:100)) {
 coh <- data.frame(cohort=i,
 date=c(i:100),
 week.lt=c(1:(100-i+1)),
 num=replicate(1, sample(c(1:40), 100-i+1, rep=TRUE)),
 av=replicate(1, sample(c(5:10), 100-i+1, rep=TRUE)))
 coh$num[coh$week.lt==1] <- sample(c(90:100), 1, rep=TRUE)
 ifelse(max(coh$date)>1, coh$num[coh$week.lt==2] <- sample(c(75:90), 1, rep=TRUE), NA)
 ifelse(max(coh$date)>2, coh$num[coh$week.lt==3] <- sample(c(60:75), 1, rep=TRUE), NA)
 ifelse(max(coh$date)>3, coh$num[coh$week.lt==4] <- sample(c(40:60), 1, rep=TRUE), NA)
 ifelse(max(coh$date)>34,
 {coh$num[coh$date==35] <- sample(c(60:85), 1, rep=TRUE)
 coh$av[coh$date==35] <- 4},
 NA)
 ifelse(max(coh$date)>47,
 {coh$num[coh$date==48] <- sample(c(60:85), 1, rep=TRUE)
 coh$av[coh$date==48] <- 4},
 NA)
 ifelse(max(coh$date)>86,
 {coh$num[coh$date==87] <- sample(c(60:85), 1, rep=TRUE)
 coh$av[coh$date==87] <- 4},
 NA)
 ifelse(max(coh$date)>99,
 {coh$num[coh$date==100] <- sample(c(60:85), 1, rep=TRUE)
 coh$av[coh$date==100] <- 4},
 NA)
 coh$gr.marg <- coh$av*coh$num
 cohorts <- rbind(cohorts, coh)
}

cohorts$cohort <- formatC(cohorts$cohort, width=3, format='d', flag='0')
cohorts$cohort <- paste('coh:week:', cohorts$cohort, sep='')
cohorts$date <- formatC(cohorts$date, width=3, format='d', flag='0')
cohorts$date <- paste('cal_week:', cohorts$date, sep='')
cohorts$week.lt <- formatC(cohorts$week.lt, width=3, format='d', flag='0')
cohorts$week.lt <- paste('week:', cohorts$week.lt, sep='')

#calculating CLV to date
cohorts <- cohorts %>%
 group_by(cohort) %>%
 mutate(clv=cumsum(gr.marg)/num[week.lt=='week:001'])

#color palette
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

#Heatmap based on Number of active customers
t <- max(cohorts$num)

ggplot(cohorts, aes(y=cohort, x=date, fill=num)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Cohort Activity Heatmap (number of customers who purchased - calendar view)")

ggplot(cohorts, aes(y=cohort, x=week.lt, fill=num)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Cohort Activity Heatmap (number of customers who purchased - lifetime view)")

# Heatmap based on Gross margin
t <- max(cohorts$gr.marg)

ggplot(cohorts, aes(y=cohort, x=date, fill=gr.marg)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Heatmap based on Gross margin (calendar view)")

ggplot(cohorts, aes(y=cohort, x=week.lt, fill=gr.marg)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Heatmap based on Gross margin (lifetime view)")

# Heatmap of per customer gross margin
t <- max(cohorts$av)

ggplot(cohorts, aes(y=cohort, x=date, fill=av)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Heatmap based on per customer gross margin (calendar view)")

ggplot(cohorts, aes(y=cohort, x=week.lt, fill=av)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Heatmap based on per customer gross margin (lifetime view)")

# Heatmap of CLV to date
t <- max(cohorts$clv)

ggplot(cohorts, aes(y=cohort, x=date, fill=clv)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Heatmap based on CLV to date of customers who ever purchased (calendar view)")

ggplot(cohorts, aes(y=cohort, x=week.lt, fill=clv)) +
 theme_minimal() +
 geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
 scale_fill_gradientn(colours=cols, limits=c(0, t),
 breaks=seq(0, t, by=t/4),
 labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
 guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
 theme(legend.position='bottom',
 legend.direction="horizontal",
 plot.title = element_text(size=20, face="bold", vjust=2),
 axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Heatmap based on CLV to date of customers who ever purchased (lifetime view)")

To leave a comment for the author, please follow the link and comment on his blog: AnalyzeCore » R language.


Digging up embedded plots


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

The following multi-panel graph, which graces the cover of the most recent issue of the Journal of Computational and Graphical Statistics (JCGS, Vol. 24, No. 1, March 2015), is from the paper by Grolemund and Wickham entitled Visualizing Complex Data With Embedded Plots. The four plots are noteworthy for a couple of reasons:

  1. They present a superb example of how an embedded plot, with its additional set of axes, can pack more information into the same area required for a traditional scatter plot or heatmap.
  2. They provide clear and prominent testimony to the dreadful toll of civilian casualties from the war in Afghanistan.

[Figure: the four Afghanistan casualty plots from the JCGS cover]

Each plot provides a different view of casualty data collected by the U.S. military between 2004 and 2010 and made available by the WikiLeaks organization. Among other variables the dataset contains longitude and latitude coordinates and casualty statistics for more than 76,000 events. Casualty counts are recorded for four groups: civilian, enemy, Afghan police and coalition forces. The first plot is a simple scatter plot which suffers from severe overplotting that obscures the patterns in the data. The second plot, a heatmap, does show how the number of casualties varies by geography but provides no information as to how casualties are distributed among the various groups. The third and fourth plots are embedded plots which respectively show marginal and conditional distribution summaries of the data for different locations.

From the enlarged version below of the plot in the lower right corner it is clear that there are locations where civilian casualties dominate. For example, in the bar plot in the box seven rows down from the top and two columns in from the left (the region around Herat, a city with a population of approximately 435,000 residents), civilian casualties appear to exceed the sum of all others.

[Figure: enlarged view of the conditional-distribution embedded bar plots]

 

Regarding the second point above: I think it was courageous of both the authors and the editors of the journal to call attention to the human tragedy of the Afghan War at a time (2013) when the United States was still heavily invested with "boots on the ground" and the war was generating considerable controversy.

One more mundane reason that I enjoy reading the JCGS is that it is apparently their policy to encourage authors to provide "supplementary materials" including code and data sets where feasible. Many times the supplementary material includes R code, and as you might expect, this was the case with the paper by Grolemund and Wickham. They provide R code for all of their examples as well as the data sets.

I was surprised, however, that my attempt to recreate the cover plot from the code provided (see below) turned out to be a small exercise in reproducible research. Running the code with a recent version of R will most likely generate the error:

Error in layout_base(data, vars, drop = drop) : 
At least one layer must contain all variables used for facetting

Things change, including R. In the sixteen months or so it took for the paper to be published, the code provided with the paper became incompatible with more current releases of R and its packages. See the discussion on GitHub.

Fortunately, R is fairly robust when it comes to reproducing past research. To generate the cover graph I downloaded the Windows binaries for R 3.0.2 and used the checkpoint function, checkpoint("2014-09-18"), to download an internally consistent set of packages required by the R scripts. (Note that the MRAN archive used by checkpoint only goes back to 2014-09-17.) The final step was to use some clever code from Cookbook for R to get all of the plots in a single graph.
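For readers who have not used checkpoint, here is a minimal sketch of that setup (run under R 3.0.2 or a similarly old R); the snapshot date is the one mentioned above, and the final library() calls simply name packages the cover-plot script needs.

install.packages("checkpoint")
library(checkpoint)
checkpoint("2014-09-18")  # install and use CRAN package versions from this MRAN snapshot
library(ggplot2)          # these now resolve to the snapshot versions
library(ggsubplot)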

My take is that even if it involves a bit of digital archaeology, it is well worth the effort to explore embedded subplots. This form of visualization has been percolating for some time. (Grolemund and Wickham trace them back to the 1862 work of Charles Minard.) As the authors point out, embedded subplots are not always appropriate. There is a danger that they could easily lead to a visual complexity that would make them completely uninterpretable. Nevertheless, when they do work, embedded subplots can be spectacularly informative.

The following code, abstracted from the supplementary materials at the link above, will produce the plots in the cover graphic.

# load and clean data that appears in the figures
 
library(reshape2)
library(plyr)
library(maps)
library(ggplot2)
library(ggsubplot)
library(RColorBrewer)  # brewer.pal() is used for the palettes below
 
# getbox by Heike Hoffman, trims map polygons for figure backgrounds
# https://github.com/ggobi/paper-climate/blob/master/code/maps.r
getbox <- function (map, xlim, ylim) {
  # identify all regions involved
  small <- subset(map, (long > xlim[1]) & (long < xlim[2]) & (lat > ylim[1]) & (lat < ylim[2]))
  regions <- unique(small$region)
  small <- subset(map, region %in% regions)  
 
  # now shrink all nodes back to the bounding box
  small$long <- pmax(small$long, xlim[1])
  small$long <- pmin(small$long, xlim[2])
  small$lat <- pmax(small$lat, ylim[1])
  small$lat <- pmin(small$lat, ylim[2])
 
  # Remove slivvers
  small <- ddply(small, "group", function(df) {
    if (diff(range(df$long)) < 1e-6) return(NULL)
    if (diff(range(df$lat)) < 1e-6) return(NULL)
    df
  })
 
  small
}
 
 
## Afghanistan for Figures 2 and 3
## 'world' is built in the authors' maps.r script (linked above); a plain
## ggplot2 world map is a reasonable stand-in here
world <- map_data("world")
afghanistan <- getbox(world, c(60,75), c(28, 39))
map_afghan <- list(
  geom_polygon(aes(long, lat, group = group), data = afghanistan, 
    fill = "grey80", colour = "white", inherit.aes = FALSE, 
    show_guide = FALSE),
  scale_x_continuous("", breaks = NULL, expand = c(0.02, 0)),
  scale_y_continuous("", breaks = NULL, expand = c(0.02, 0)))
 
## Mexico and lower US for Figure 4 (not used in the Figure 2 code below)
## 'both' combines US state and world outlines in the authors' maps.r script;
## the world map alone is a rough stand-in for this bounding box
both <- world
north_america <- getbox(both, xlim = c(-107.5, -80), ylim = c(11, 37.5))
map_north <- list(
  geom_polygon(aes(long, lat, group = group), data = north_america, fill = "grey80", 
    colour = "grey70", inherit.aes = FALSE, show_guide = FALSE),
  scale_x_continuous("", breaks = NULL, expand = c(0.02, 0)),
  scale_y_continuous("", breaks = NULL, expand = c(0.02, 0))) 
 
###############################################################
###                wikileaks Afghan War Diary               ###
###############################################################
 
# casualties data set loaded with ggsubplot and used as is in figure 2
# regional casualty data included as a supplemental file to paper
# how about casualties over time in different parts of the country?
load("casualties-by-region.RData")
 
###############################################################
###                       Figure 2                          ###
###############################################################
 
# Figure 2.a. raw Afghanistan casualty data
ggplot(casualties) + 
  map_afghan +
  geom_point(aes(lon, lat, color = victim), size = 1.75) +
  ggtitle("location of casualties by type") + 
  coord_map() +
  scale_colour_manual(values = rev(brewer.pal(5,"Blues"))[1:4])
ggsave("afgpoints.pdf", width = 7, height = 7)
 
 
 
# Figure 2.b. Afghanistan casualty heat map
ggplot(casualties) + 
  map_afghan +
  geom_bin2d(aes(lon, lat), bins = 15) +
  ggtitle("number of casualties by location") +
  scale_fill_continuous(guide = guide_legend()) +
  coord_map()
ggsave("afgtile.pdf", width = 7, height = 7)
 
 
 
# Figure 2.c. Afghanistan casualty embedded bar graphs (marginal distributions)
ggplot(casualties) + 
  map_afghan +
  geom_subplot2d(aes(lon, lat, 
    subplot = geom_bar(aes(victim, ..count.., fill = victim), 
      color = rev(brewer.pal(5,"Blues"))[1], size = 1/4)), bins = c(15,12), 
      ref = NULL, width = rel(0.8), height = rel(1)) + 
  ggtitle("casualty type by locationn(Marginal distribution)") + 
  coord_map() +
  scale_fill_manual(values = rev(brewer.pal(5,"Blues"))[c(1,4,2,3)]) 
ggsave("casualties.pdf", width = 7, height = 7)
 
 
 
# Figure 2.d. Afghanistan casualty embedded bar graphs (conditional distributions)
ggplot(casualties) + 
  map_afghan +
  geom_subplot2d(aes(lon, lat,
    subplot = geom_bar(aes(victim, ..count.., fill = victim), 
      color = rev(brewer.pal(5,"Blues"))[1], size = 1/4)), bins = c(15,12), 
      ref = ref_box(fill = NA, color = rev(brewer.pal(5,"Blues"))[1]), width = rel(0.7), height = rel(0.6), y_scale = free) + 
  ggtitle("casualty type by locationn(Conditional distribution)") +
  coord_map() +
  scale_fill_manual(values = rev(brewer.pal(5,"Blues"))[c(1,4,2,3)]) 
 
ggsave("casualties2.pdf", width = 7, height = 7)

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Playing with elastichoney data in R


(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

Work has kept @jayjacobs & me quite busy of late, but a small data set posted by @jw_sec this morning made for an opportunity for a quick blog post showing how to do some data manipulation and visualization in R for both security and non-security folk (hey, this may even get more non-security folk looking at security data, which is a definite “win” if so). We sometimes aim a bit high in our posts and forget that many folks are really just starting to learn R. For those just getting started, here’s what’s in store for you:

  • reading and processing JSON data
  • using dplyr and pipe idioms to do some data munging
  • using ggplot for basic data visualization
  • getting away from geography when looking at IPv4 addresses

All the code (and Jordan’s data) is up on github.

Reading in the data

Jordan made the honeypot logs available as a gzip’d JSON file. We’ll use GET from the httr package to download the data once so as not to waste Jordan’s bandwidth, saving it via write_disk, which makes it act like a cache (it won’t try to re-download the file if it exists locally, unless you specify that it should overwrite the file). I wrap the call with try just to suppress the “error” message when the file is already cached. Note that fromJSON reads gzip’d files just like it does straight JSON files.

library(httr)     # GET(), write_disk()
library(jsonlite) # fromJSON()
source_url <- "http://jordan-wright.github.io/downloads/elastichoney_logs.json.gz"
resp <- try(GET(source_url, write_disk("data/elastichoney_logs.json.gz")), silent=TRUE)
elas <- fromJSON("data/elastichoney_logs.json.gz")

Cleaning up the data

You can view Jordan’s blog post to see the structure of the JSON file. It’s got some nested structures that we won’t be focusing on in this post and some that will cause dplyr some angst (some dplyr operations do not like data frames in data frames), so we’ll whittle it down a bit while we also:

  • convert the timestamp text to an actual time format
  • ensure the request method is uniform (all uppercase)

 

library(dplyr)

elas %>%
  select(major, os_name, name, form, source, os, timestamp=`@timestamp`, method,
         device, honeypot, type, minor, os_major, os_minor, patch) %>%
  mutate(timestamp=as.POSIXct(timestamp, format="%Y-%m-%dT%H:%M:%OS"),
         day=as.Date(timestamp),
         method=toupper(method)) -> elas

For those still new to the magrittr (or pipeR) piping idiom, the %>% notation is just a way of avoiding a bunch of nested function calls, which generally makes the code cleaner and helps (IMO) compartmentalize operations. Here we compartmentalize the “select” and “mutate” operations. Here is the resultant data frame:

glimpse(elas)
## Observations: 7808
## Variables:
## $ major     (chr) "2", "2", "6", "6", "6", "6", "6", "2", "6", "2", "2", "2", "2", "2", "2", "2", "2"...
## $ os_name   (chr) "Windows", "Windows", "Windows 2000", "Windows 2000", "Windows 2000", "Windows 2000...
## $ name      (chr) "Python Requests", "Python Requests", "IE", "IE", "IE", "IE", "IE", "Python Request...
## $ form      (chr) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ source    (chr) "58.220.3.207", "58.220.3.207", "115.234.254.53", "115.234.254.53", "115.234.254.53...
## $ os        (chr) "Windows", "Windows", "Windows 2000", "Windows 2000", "Windows 2000", "Windows 2000...
## $ timestamp (time) 2015-03-21 11:39:23, 2015-03-21 11:39:24, 2015-03-21 04:09:27, 2015-03-21 04:29:06...
## $ method    (chr) "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "POST", "GET", "GET", "GET"...
## $ device    (chr) "Other", "Other", "Other", "Other", "Other", "Other", "Other", "Other", "Other", "O...
## $ honeypot  (chr) "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x...
## $ type      (chr) "attack", "attack", "attack", "attack", "attack", "attack", "attack", "attack", "at...
## $ minor     (chr) "4", "4", "0", "0", "0", "0", "0", "4", "0", "4", "4", "4", "4", "4", "4", "4", "4"...
## $ os_major  (chr) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ os_minor  (chr) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ patch     (chr) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ day       (date) 2015-03-21, 2015-03-21, 2015-03-21, 2015-03-21, 2015-03-21, 2015-03-21, 2015-03-21...

You could also look at elas$headers and elas$geoip from the original structure we read in with fromJSON if you want to look a bit more at those. Unless you’re digging deeper or correlating with other data, we’re pretty much left with reporting “what happened” (i.e. basic counting), so let’s visualize a few of the fields.
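Before moving on, if you want to poke at those nested pieces, a minimal sketch is below; elas_raw is an assumed name for a fresh copy of the raw fromJSON() result, since the pipeline above overwrote elas with the trimmed data frame.

elas_raw <- fromJSON("data/elastichoney_logs.json.gz")
str(elas_raw$headers, max.level=1)  # the nested request headers
str(elas_raw$geoip, max.level=1)    # the geo fields we dropped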

Attacks vs Recons

There is a type field in the JSON data which classifies the server contact as either an “attack” (attempt at actually doing something bad) vs “recon” (which I assume is just a test to see if the instance is vulnerable). We can see what the volume looks like per-day pretty easily:

library(ggplot2)

gg <- ggplot(count(elas, day, type), aes(x=day, y=n, group=type))
gg <- gg + geom_bar(stat="identity", aes(fill=type), position="stack")
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(NA, 700))
gg <- gg + scale_x_date(expand=c(0,0))
gg <- gg + scale_fill_manual(name="Type", values=c("#1b6555", "#f3bc33"))
gg <- gg + labs(x=NULL, y="# sources", title="Attacks/Recons per day")
gg <- gg + theme_bw()
gg <- gg + theme(panel.background=element_rect(fill="#96c44722"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg

Here we use dplyr’s count function to count the number of contacts per day by type and then plot it with bars (using some Elasticsearch corporate colors). Some of the less-obvious things to note are:

  • the stat="identity" in geom_bar means to just take the raw y data we gave the function (many of the geom‘s in ggplot are pretty smart and can apply various statistical operations as part of the layer rendering)
  • position="stack" and fill=type will give us a stacked bar chart colored by type. I generally am not a fan of stacked bar charts but they make sense this time
  • expand=c(0,0) reduces the whitespace in the graph, making the bars flush with the axes
  • using limits=c(NA, 700) gives us some breathing room at the top of the bar chart

There’s an interesting spike on April 24th, but we don’t have individual IDs for the honeypots (from what I can tell from the data), so we can’t see if any one was more targeted than another. We can see the top attackers. There are length(unique(elas$source)) == 236 total contact IP addresses in the data set, so let’s see how many were involved in the April 24th spike:

elas %>%
  filter(day==as.Date("2015-04-24")) %>%
  count(source) %>%
  arrange(desc(n))

## Source: local data frame [12 x 2]
## 
##            source   n
## 1   218.4.169.146 144
## 2  61.176.222.160  70
## 3   218.4.169.148  36
## 4     58.42.32.27  24
## 5  121.79.133.179  10
## 6   111.74.239.77   6
## 7  61.160.213.180   4
## 8   107.160.23.56   2
## 9  183.129.153.66   1
## 10 202.109.189.49   1
## 11   219.235.4.22   1
## 12  61.176.223.77   1

218.4.169.146 was quite busy that day (missed previous days++ quota?). Again, we need more info to even try to discern “why”, something to think about when designing an information collection system for further analysis.

Contacts by request type

You can use the following basic structure to look at “contacts by…” for any column that makes sense. For now, we’ll just look at contacts by request type (mostly since that was an example in Jordan’s post).

gg <- ggplot(count(elas, method), aes(x=reorder(method, -n), y=n))
gg <- gg + geom_bar(stat="identity", fill="#1b6555", width=0.5)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + labs(x=NULL, y=NULL, title="Contacts by Request type")
gg <- gg + coord_flip()
gg <- gg + theme_bw()
gg <- gg + theme(panel.background=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg

Top IPs

We can also see who (overall) were the noisiest contacts. This could be useful for reputation analysis but I’m doing it mainly to show some additional dplyr and ggplot work. We’ll count the sources, make a pretty label for them (with % of total) and then plot it.

library(scales)  # percent()

elas %>%
  count(source) %>%
  mutate(pct=percent(n/nrow(elas))) %>%
  arrange(desc(n)) %>%
  head(30) %>%
  mutate(source=sprintf("%s (%s)", source, pct)) -> attack_src

gg <- ggplot(attack_src, aes(x=reorder(source, -n), y=n))
gg <- gg + geom_bar(stat="identity", fill="#1b6555", width=0.5)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + labs(x=NULL, y=NULL, title="Top 30 attackers")
gg <- gg + coord_flip()
gg <- gg + theme_bw()
gg <- gg + theme(panel.background=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg

Better than geography

There’s the standard “geoip” blathering in the data set and a map in the blog post (and, no doubt, on the Kibana dashboard). Attribution issues aside, we can do better than a traditional map. Let’s dust off our ipv4heatmap package and overlay China CIDRs on a Hilbert space IPv4 map. We can grab China CIDRs from data sets maintained by Ivan Erben. I left this as a traditional straight readLines call, but it would be a good exercise for the reader to convert this to the httr/write_disk idiom from above to save them some bandwidth.

library(ipv4heatmap)  # ipv4heatmap(), boundingBoxFromCIDR()
library(data.table)   # rbindlist()
library(pbapply)      # pbsapply()

hm <- ipv4heatmap(elas$source)

china <- grep("^#", readLines("http://www.iwik.org/ipcountry/CN.cidr"), invert=TRUE, value=TRUE)
cidrs <- rbindlist(pbsapply(china, boundingBoxFromCIDR))

hm$gg +
 geom_rect(data=cidrs,
           aes(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax),
           fill="white", alpha=0.1)


China IP space is a major player, but the address blocks are far from contiguous, and there are definitely other network (and geo) sources. You can use dplyr and the other CIDR blocks from Ivan to take a more detailed look.

Wrapping up

There are definitely some further areas to explore in the data set, and I hope this inspired some folks to fire up RStudio and explore the data a bit further. If you find anything interesting, drop a note in the comments. Remember, all the source for the above is on GitHub.

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security.


How to correctly set color in the image() function?


(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)
Sometimes we want to make our own heatmap using the image() function. I recently found it is tricky to set the color option there, as the manual has very little information on col:


col
a list of colors such as that generated by rainbow, heat.colors, topo.colors, terrain.colors or similar functions.

I posted my question on BioStars. The short answer is: unless breaks is set, the range of Z is evenly cut into N intervals (where N = the length of col) and each value in Z is assigned the color of its interval. For example, when x=c(3,1,2,1) and col=c("blue","red",'green','yellow'), the minimum of x gets the first color and the maximum gets the last color. Any value in between is mapped proportionally to a color. In this case, 2 falls in the second interval (following the principle that intervals are open on the left and closed on the right), so it is assigned "red". That's why we see the colors yellow-->blue-->red-->blue.
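Here is a minimal sketch of that mapping rule, reproduced with cut() rather than image() itself (an illustration of the behavior described above, not the actual image() internals):

x <- c(3, 1, 2, 1)
cols <- c("blue", "red", "green", "yellow")
brks <- seq(min(x), max(x), length.out=length(cols)+1)  # 4 equal-width intervals over the range of x
idx <- cut(x, breaks=brks, include.lowest=TRUE, labels=FALSE)
cols[idx]
# [1] "yellow" "blue"   "red"    "blue"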

In practice, unless we want to manually define the color break points, we can just supply the colors (here simply a first and a last one) and image() will automatically map the values in Z onto them.

x <- matrix(sample(0:1, 100, replace=TRUE), nrow=10)  # an example 0/1 matrix (assumed; the original post's x is not shown)
collist <- c(0,1)  # numeric color codes: 0 = background, 1 = black
image(1:ncol(x), 1:nrow(x), as.matrix(t(x)), col=collist, asp=1)

If we want to manually define the color break points, we need to

x=matrix(rnorm(100),nrow=10)*100
xmin=0; xmax=100;
x[x<xmin]=xmin; x[x>xmax]=xmax;
collist<-c("#053061","#2166AC","#4393C3","#92C5DE","#D1E5F0","#F7F7F7","#FDDBC7","#F4A582","#D6604D","#B2182B","#67001F")
ColorRamp<-colorRampPalette(collist)(10000)
ColorLevels<-seq(from=xmin, to=xmax, length=10000)
ColorRamp_ex <- ColorRamp[round(1+(min(x)-xmin)*10000/(xmax-xmin)) : round( (max(x)-xmin)*10000/(xmax-xmin) )]
par(mar=c(2,0,2,0), oma=c(3,3,3,3))
layout(matrix(seq(2),nrow=2,ncol=1),widths=c(1),heights=c(3,0.5))
image(t(as.matrix(x)), col=ColorRamp_ex, las=1, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
image(as.matrix(ColorLevels),col=ColorRamp, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
axis(1,seq(xmin,xmax,10),seq(xmin,xmax,10))

To leave a comment for the author, please follow the link and comment on his blog: One Tip Per Day.


Any R code as a cloud service: R demonstration at BUILD


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At last month's BUILD conference for Microsoft developers in San Francisco, R was front-and-center on the keynote stage.


In the keynote, Microsoft CVP Joseph Sirosh introduced the "language of data": open source R. Sirosh encouraged the audience to learn R, saying "if there is a single language that you choose to learn today .. let it be R". 

The keynote featured a demonstration of genomic data analysis using R. The analysis was based on the 1000 genomes data set stored in the HDInsight Hadoop-in-the-cloud service. Revolution R Enterprise, running on eight Hadoop clusters distributed around the globe (about 1600 cores in total), together with R's Bioconductor suite (specifically the VariantTools and gmapR packages), was used to perform 'variant calling' and calculate, in parallel, the disease risks indicated by a subset of the 1000 genomes. The result was an interactive heat map showing the disease risks for each individual.


The heat map was created by Winston Chang and Joe Cheng from RStudio as an htmlwidget using the D3heatmap package. (You can interact with a variant of the heatmap from the demo here.)

The next part of the demo was to compare an individual's disease risks — as indicated by his or her DNA — to the population. Joseph Sirosh had his own DNA sequence for this purpose, which he submitted via a Windows Phone app to an Azure service running R. This is easy to do with Azure ML Studio: just put your R code as part of a workflow, and an API will automatically be generated on request. In this way you can publish any R code as an API to the cloud, which is then callable by any connected application.


You can watch the entire keynote presentation below, and the R demo begins at around the 23 minute mark.

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Visualization and Analysis of Reddit’s "The Button" Data


(This article was first published on everyday analytics, and kindly contributed to R-bloggers)

Introduction

People are weird. And if there’s anything that’s greater collective proof of this fact than Reddit, you’d be hard pressed to find it. I tend to put reddit in the same bucket as companies like Google, Amazon and Netflix, where they have enough money, or freedom, or both, to say something like “wouldn’t it be cool if….?” and then they do it simply because they can.

Enter “the button” (/r/thebutton), reddit’s great social experiment that appeared on April Fool’s Day of this year. An enticing blue rectangle with a timer that counts down from 60 to zero that’s reset when the button is pushed, with no explanation as to what happens when the time is allowed to run out. Sound familiar? The catch here being that it was an experience shared by anyone who visited the site, and each user also only got one press (though many made attempts to game the system, at least initially).

Finally, the timer reached zero, the last button press being at 2015-06-05 21:49:53.069000 UTC, and the game (rather anti-climactically, I might offer) ended.

What does this have to do with people being weird? Well, an entire mythology was built up around the button, amongst other things. Okay, maybe interesting is a better word. And maybe we’re just talking about your average redditor.

Either way, what interests me is that when the experiment ended, all the data were made available. So let’s have a look shall we?

Background

The dataset consists of just four fields:

  • press time, the date and time the button was pressed
  • flair, the flair the user was assigned given what the timer was at when they pushed the button
  • css, the flair class given to the user
  • outage press, a Boolean indicator as to whether the press occurred during a site outage

The data span a time period from 2015-04-01 16:10:04.468000 to 2015-06-05 21:49:53.069000, with a total of 1,008,316 rows (unique presses).

I found there was css missing for some rows, and a lot of “non presser” flair (users who were not eligible to press the button as their account was created after the event started). For these I used a “missing” value of -1 for the number of seconds remaining when the button was pushed; otherwise it could be stripped from the css field.

Analysis

With this data set, we’re looking at a pretty straightforward categorical time series.

Overall Activity in Time

First we can just look at the total number of button presses, regardless of what the clock said (i.e. where in the countdown they occurred), by plotting the raw number of presses per day:

Hmmm… you can see there is a massive spike at the beginning of the graph and much, much fewer presses for the rest of the duration of the experiment. In fact, nearly 32% of all clicks occurred in the first day, and over half (51.3%) in the first two days.

I think this has something to do with both the initial interest in the experiment when it first was announced, and also with the fact that the higher the counter is kept, the more people can press the button in the same time period (more on this later).

Perhaps a logarithmic graph for the y-axis would be more suitable?
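The original figures (and the author’s plotting code) are not reproduced here, but a minimal sketch of such a daily-count plot might look like the following; it assumes the press data loaded into a data frame called presses with a POSIXct press_time column (both assumed names) and uses dplyr and ggplot2.

library(dplyr)
library(ggplot2)

presses %>%
  count(day = as.Date(press_time)) %>%
  ggplot(aes(x = day, y = n)) +
  geom_col() +
  scale_y_log10() +  # drop this line for the raw (linear) version
  labs(x = NULL, y = "button presses per day")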
That’s better. We can see the big drop-off in the first two days or so, and also that little blip around the 18th of May is more apparent. This is likely tied to one of several technical glitches which are noted in the button wiki.

For a more granular look, let’s do the hourly presses as well (with a log scale):

Cool. The spike on the 18th seems to be mainly around one hour with about a thousand presses, and we can see too that perhaps there’s some kind of periodic behavior in the data on an hourly basis? If we exclude some of the earlier data we can also go back to not using a log scale for the y-axis:

Let’s look more into the hours of the day when the button presses occur. We can create a simple bar plot of the count of button presses by hour overall:

You can see that the vast majority occurred around 5 PM and then there is a drop-off after that, with the lows being in the morning hours between about 7 and noon. Note that all the timestamps for the button pushes are in Universal Time. Unfortunately there is no geo data, but assuming most redditors who pushed the button are within the continental United States (a rather fair assumption) the high between 5-7 PM would be 11 AM to 1 PM (so, around your lunch hour at work).

But wait, that was just the overall sum of hours over the whole time period. Is there a daily pattern? What about by hour and day of week? Are most redditors pushing the button on the weekend or are they doing it at work (or during school)? We should look into this in more detail.

Hmm, nope! The majority of the clicks occurred Wednesday-Thursday night. But as we know from the previous graphs, the vast majority also occurred within the first two days, which happened to be a Wednesday and Thursday. So the figures above aren’t really that insightful, and perhaps it would make more sense to look at the trend in time across both day and hour? That would give us the figure below:
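A sketch of that kind of day-by-hour heatmap, under the same assumptions as the earlier snippet (a presses data frame with a press_time column), could be:

library(dplyr)
library(ggplot2)

presses %>%
  count(day = as.Date(press_time),
        hour = as.integer(format(press_time, "%H"))) %>%
  ggplot(aes(x = day, y = hour, fill = n)) +
  geom_tile() +
  scale_fill_continuous(trans = "log10") +  # log colour scale, as in the post
  labs(x = NULL, y = "hour of day (UTC)", fill = "presses")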

As we saw before, there is a huge amount of clicks in the first few days (the first few hours even) so even with log scaling it’s hard to pick out a clear pattern. But most of the presses appear to be present in the bands after 15:00 and before 07:00. You can see the clicks around the outage on the 18th of May were in the same high period, around 18:00 and into the next day.

Maybe alternate colouring would help?

That’s better. Also if we exclude the flurry of activity in the first few days or so, we can drop the logarithmic scaling and see the other data in more detail:

Activity by Seconds Remaining
So far we’ve only looked at the button press activity by the counts in time. What about the time remaining for the presses? That’s what determined each individual reddit user’s flair, and was the basis for all the discussion around the button.

The reddit code granted flairs which were specific to the time remaining when the button was pushed.  For example, if there were 34 seconds remaining, then the css would be “34s”, so it was easy to strip these and convert into numeric data. There were also those that did not press the button who were given the “non presser” flair (6957 rows, ~0.69%), as well as a small number of entries missing flair (67, <0.01%), which I gave the placeholder value of -1.

The remaining flair classes served as a bucketing which functioned very much like a histogram:

Color         Have they pressed?   Can they press?   Timer number when pressed
Grey/Gray     N                    Y                 NA
Purple        Y                    N                 60.00 ~ 51.01
Blue          Y                    N                 51.00 ~ 41.01
Green         Y                    N                 41.00 ~ 31.01
Yellow        Y                    N                 31.00 ~ 21.01
Orange        Y                    N                 21.00 ~ 11.01
Red           Y                    N                 11.00 ~ 00.00
Silver/White  N                    N                 NA

We can see this if we plot a histogram of the button presses using the CSS class (which gives the more granular seconds remaining) and the same breaks as above:
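Again, only a sketch: assuming a numeric seconds_remaining column has been stripped out of the css field (an assumed name), with -1 marking the missing values, the histogram could be built as:

library(ggplot2)

flair_breaks <- c(0, 11, 21, 31, 41, 51, 60)  # bucket edges mirroring the flair table above
ggplot(subset(presses, seconds_remaining >= 0), aes(x = seconds_remaining)) +
  geom_histogram(breaks = flair_breaks, colour = "white") +
  labs(x = "seconds remaining at press", y = "count")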

We can see there is much greater proportion of those who pressed within 51-60s left, and there is falloff from there (power law). This is in line with what we saw in the time series graphs: the more the button was pressed, the more presses could occur in a given interval of time, and so we expect that most of those presses occurred during the peak activity at the beginning of the experiment (which we’ll soon examine).

What’s different from the documentation above from the button wiki is the “cheater” class, which was given to those who tried to game the system by doing things like disconnecting their internet and pressing the button multiple times (as far as I can tell). You can see that plotting a bar graph is similar to the above histogram with the difference being contained in the “cheater” class:

Furthermore, looking over the time period, how are the presses distributed in each class? What about in the cheater class? We can plot a more granular histogram:

Here we can more clearly see the exponential nature of the distribution, as well as little ‘bumps’ around the 10, 20, 30 and 45 second marks. Unfortunately this doesn’t tell us anything about the cheater class as it still has valid second values. So let’s do a boxplot by css class as well, showing both the classes (buckets) as well as their distributions:

Obviously each class has to fit into a certain range given their definition, but we can see some are more skewed than others (e.g. class for 51-60s is highly negatively skewed, whereas the class for 41-50 has median around 45). Also we can see that the majority of the cheater class is right near the 60 mark.

If we want to be fancier we can also plot the boxplot using just the points themselves and adding jitter:

This shows the skew of the distributions per class/bucket (focus around “round” times like 10, 30, 45s, etc.) as before, as well as how the vast majority of the cheater class appears to be at 59s mark.

Presses by seconds remaining and in time
Lastly we can combine the analyses above and look at how the quantity and proportion of button presses varies in time by the class and number of seconds remaining.

First we can look at the raw count of presses per css type per day as a line graph. Note again the scale on the y-axis is logarithmic:

This is a bit noisy, but we can see that the press-6 class (presses with 51-60s remaining) dominates at the beginning, then tapers off toward the end. Presses in the 0-10 class did not appear until after April 15, then eventually overtook the quicker presses, as would have to be the case in order for the timer to run out. The cheater class starts very high with the press-6 class, then drops off significantly and continues to decrease. I would have liked to break this out into small multiples for more clarity, but it’s not the easiest to do using ggplot.

Another way to look at it would be to look at the percent of presses by class per day. I’ve written previously about how stacked area graphs are not your friend, but in this case it’s actually not too bad (plus I wanted to learn how to do it in ggplot). If anything it shows the increase in presses in the 51-60 range right after the outage on May 18, and the increase in the 0-10 range toward the end (green):

This is all very well and good, but let’s get more granular. We can visualize the data at a finer grain using heatmaps, with the second values taken from the user flair, to get a much more detailed picture. First we’ll look at a heatmap of this by hour over the time period:

Again, the scaling is logarithmic for the counts (here the fill colour). We can see some interesting patterns emerging, but it’s a little too sparse as there are a lot of hours without presses for a particular second value. Let’s really get granular and use all the data on the per second level!
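A sketch of that per-second heatmap, with the same assumed presses data frame (press_time and seconds_remaining columns), could look like:

library(dplyr)
library(ggplot2)

presses %>%
  filter(seconds_remaining >= 0) %>%
  count(day = as.Date(press_time), seconds_remaining) %>%
  ggplot(aes(x = day, y = seconds_remaining, fill = n)) +
  geom_tile() +
  scale_fill_continuous(trans = "log10") +  # use the default linear scale for the zoomed-in version
  labs(x = NULL, y = "seconds remaining", fill = "presses")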

On the left is the data for the whole period with a logarithmic scale, whereas the figure on the right excludes some of the earlier data and uses a linear scale. We can see the beginning peak activity in the upper left-hand corner, and then these interesting bands around the 5, 10, 20, 30, and 45 marks forming and gaining strength over time (particularly toward the end). Interestingly, in addition to the resurgence in near-instantaneous presses after the outage around May 18, there was also a hotspot of presses around the 45s mark close to the end of April. Alternate colouring below:

Finally, we can divide by the number of presses per day and calculate the percent each number of seconds remaining made up over the time period. That gives the figures below:

Here the flurry of activity at the beginning continues to be prominent, but the bands also stand out a little more on a daily basis. We can also see how the proportion of clicks for the smaller number of seconds remaining continues to increase until finally the timer is allowed to run out.

Conclusion

The button experiment is over. In the end there was no momentous meaning to it all, no grand scheme or plan, no hatch exploding into the jungle, just an announcement that the thread would be archived. Again, somewhat anti-climactic.
But it was an interesting experiment, and an interesting data set, given that by its very nature the amount of data that could exist in a given interval of time depended on how quickly the button was being pressed.
And I think it really says something about what the internet allows us to do (both in terms of creating something simply for the sake of it, and collecting and analyzing data), and also about people’s desire to find patterns and create meaning in things, no matter what they are. If you’d asked me, I never would have guessed religions would have sprung up around something as simple as pushing a button. But then again, religions have sprung up around stranger things.
You can read and discuss in the button aftermath thread, and if you want to have a go at it yourself, the code and data are below. Until next time I’ll just keep pressing on.

References & Resources

the button press data (from reddit’s github)
 
R code for plots
 
/r/thebutton

To leave a comment for the author, please follow the link and comment on his blog: everyday analytics.


d3heatmap: Interactive heat maps


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to announce d3heatmap, our new package for generating interactive heat maps using d3.js and htmlwidgets. Tal Galili, author of dendextend, collaborated with us on this package.

d3heatmap is designed to have a familiar feature set and API for anyone who has used heatmap or heatmap.2 to create static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way.

d3heatmap includes the following features:

  • Shows the row/column/value under the mouse cursor
  • Click row/column labels to highlight
  • Drag a rectangle over the image to zoom in
  • Works from the R console, in RStudio, with R Markdown, and with Shiny (a minimal Shiny sketch follows this list)
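Here is a minimal sketch of the Shiny case, assuming the standard htmlwidgets Shiny bindings d3heatmapOutput() and renderD3heatmap() exported by the package:

library(shiny)
library(d3heatmap)

ui <- fluidPage(
  d3heatmapOutput("heatmap")
)

server <- function(input, output, session) {
  output$heatmap <- renderD3heatmap(
    d3heatmap(mtcars, scale = "column")
  )
}

shinyApp(ui, server)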

Installation

install.packages("d3heatmap")

Examples

Here’s a very simple example (source: flowingdata):

url <-"http://datasets.flowingdata.com/ppg2008.csv"
nba_players


You can easily customize the colors using the colors parameter. This can take an RColorBrewer palette name, a vector of colors, or a function that takes (potentially scaled) data points as input and returns colors.

Let’s modify the previous example by using the "Blues" colorbrewer palette, and dropping the clustering and dendrograms:

d3heatmap(nba_players, scale = "column", dendrogram = "none",
    colors = "Blues")


If you want to use discrete colors instead of continuous, you can use the col_* functions from the scales package.

d3heatmap(nba_players, scale = "column", dendrogram = "none",
    colors = scales::col_quantile("Blues", NULL, 5))

Thanks to integration with the dendextend package, you can customize dendrograms with cluster colors:

d3heatmap(nba_players, colors = "Blues", scale = "col",
    dendrogram = "row", k_row = 3)

For issue reports or feature requests, please see our GitHub repo.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.


dendextend version 1.0.1 + useR!2015 presentation


(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)

My R package dendextend (version 1.0.1) is now on CRAN!

The dendextend package offers a set of functions for extending dendrogram objects in R, letting you visualize and compare trees of hierarchical clusterings. With it you can (1) adjust a tree’s graphical parameters – the color, size, type, etc. of its branches, nodes and labels – and (2) visually and statistically compare different dendrograms to one another.
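As a flavour of both points, here is a minimal sketch (a toy example of mine, not taken from the package vignettes) that colours branches by cluster and then compares two trees with a tanglegram:

library(dendextend)

hc1 <- hclust(dist(mtcars), method = "complete")
hc2 <- hclust(dist(mtcars), method = "average")

dend1 <- as.dendrogram(hc1) %>% set("branches_k_color", k = 3) %>% set("labels_cex", 0.6)
dend2 <- as.dendrogram(hc2) %>% set("branches_k_color", k = 3) %>% set("labels_cex", 0.6)

plot(dend1)                 # (1) adjusted graphical parameters
tanglegram(dend1, dend2)    # (2) visual comparison of the two trees
entanglement(dend1, dend2)  # and a simple numeric measure of how aligned they are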

The previous release of dendextend (0.18.3) was half a year ago, and this version includes many new features and functions.

To help you discover how dendextend can solve your dendrogram/hierarchical-clustering issues, you may consult one of the following vignettes:

Here is an example figure from the first vignette (analyzing the Iris dataset)


 

This week, at useR!2015, I will give a talk on the package. This will offer a quick example, and a step-by-step example of some of the most basic/useful functions of the package. Here are the slides:

 

Lastly, I would like to mention the new d3heatmap package for interactive heat maps. This package is by Joe Cheng from RStudio, and integrates well with dendrograms in general and dendextend in particular (thanks to some lovely GitHub-commit discussion between Joe and me). You are invited to see lively examples of the package in the post at the RStudio blog. Here is just one quick example:

d3heatmap(nba_players, colors = "Blues", scale = "col", dendrogram = "row", k_row = 3)


To leave a comment for the author, please follow the link and comment on his blog: R-statistics blog » R.


Seeing Data as the Product of Underlying Structural Forms


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

Matrix factorization follows from the realization that nothing forces us to accept the data as given. We start with objects placed in rows and record observations on those objects arrayed along the top in columns. Neither the objects nor the measurements need to be preserved in their original form.

It is helpful to remember that the entries in our data matrix are the result of choices made earlier, for not everything that can be recorded gets tallied. We must decide on the unit of analysis (What objects go in the rows?) and the level of detail in our measurements (What variables go in the columns?). For example, multilevel data can be aggregated as we deem appropriate, so that our objects can be classroom means rather than individual students nested within classrooms and our measures can be total number correct rather than separate right and wrong for each item on the test.

Even without a prior structure, one could separately cluster the rows and the columns using two different distance matrices, that is, a clustering of the rows/columns with distances calculated using the columns/rows. As an example, we can retrieve a hierarchical cluster heat map from an earlier post created using the R heatmap function.

Here, yellow represents a higher rating, and lower ratings are in red. Objects with similar column patterns would be grouped together into a cluster and treated as if they constitute a single aggregate object. Thus, when asked about technology adoption, there appears to be a segment toward the bottom of the heatmap who foresee a positive return on investment and are not concerned with potential problems. The flip side appears toward the top, with the pattern of higher yellows and lower reds suggesting more worries than anticipated joys. The middle rows seem to contain individuals falling somewhere between these two extremes.

A similar clustering could be performed for the columns. Any two columns with similar patterns down the rows can be combined and an average score calculated. We started with 8 columns and could end with four separate 2-column clusters or two separate 4-column clusters, depending on where we want to draw the line. A cutting point for the number of row clusters seems less obvious, but it is clear that some aggregation is possible. As a result, we have reordered the rows and columns, one at a time, to reveal an underlying structure: technology adoption is viewed in terms of potential gains and losses with individuals arrayed along a dimension anchored by gain and loss endpoints. 
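A minimal sketch of this kind of “dual” clustering in base R is shown below; the ratings matrix is a generic stand-in, since the data behind the heatmap from the earlier post are not included here.

# toy ratings matrix: 50 respondents by 8 items, rated 1-9
set.seed(1)
ratings <- matrix(sample(1:9, 50 * 8, replace = TRUE), nrow = 50,
                  dimnames = list(paste0("resp", 1:50), paste0("item", 1:8)))

# heatmap() clusters rows and columns separately, each with its own distance matrix
heatmap(ratings, distfun = dist, hclustfun = hclust,
        col = colorRampPalette(c("red", "yellow"))(25), scale = "none")

# the same two clusterings made explicit
row_clusters <- hclust(dist(ratings))     # distances between rows computed over the columns
col_clusters <- hclust(dist(t(ratings)))  # distances between columns computed over the rows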

Before we reach any conclusion concerning the usefulness of this type of “dual” clustering, we might wish to recall that the data come from a study of attitudes toward technology acceptance with the wording sufficiently general that every participant could provide ratings for all the items. If we had, instead, asked about concrete steps and concerns toward actual implementation, we might have found small and large companies living in different worlds and focusing on different details. I referred to this as local subspaces in a prior post, and it applies here because larger companies have in-house IT departments and IT changes a company’s point-of-view.

To be clear, conversation is filled with likes and dislikes supported by accompanying feelings and beliefs. This level of generality permits us to communicate quickly and easily. The advertising tells you that the product reduces costs, and you fill in the details only to learn later that you set your cost-savings expectations too high.

We need to invent jargon in order to get specific, but jargon requires considerable time and effort to acquire, a task achieved by only a few with specialized expertise (e.g., the chief financial officer in the case of cost cutting). The head of the company and the head of information technology may well be talking about two different domains when each speaks of reliability concerns. If we want the rows of our data matrix to include the entire range of diverse players in technological decision making while keeping the variables concrete, then we will need a substitute for the above “dual” scaling.

We may wish to restate our dilemma. Causal models, such as the following technology acceptance model (TAM), often guide our data collection and analysis with all the inputs measured as attitude ratings.

Using a data set called technology from the R package plspm, one can estimate and test the entire partial least squares path model. Only the latent variables have been shown in the above diagram (here is a link to a more complete graphic), but it should be clear from the description of the variables in the technology data set that Perceived Usefulness (U) is inferred from ratings of useful, accomplish quickly, increase productivity and enhance effectiveness. However, this is not a model of real-world adoption that depends on so many specific factors concerning quality, cost, availability, service, selection and technical support. The devil is in the details, but causal modeling struggles with high dimensional and sparse data. Consequently, we end up with a model of how people talk about technology acceptance and not a model of the adoption process pushed forward by its internal advocates and slowed by its sticking points, detours and dead ends.
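For readers who want to try this, a rough sketch of the plspm call is below. The inner (path) matrix follows the usual TAM structure, but the block definitions are placeholders: the actual indicator columns of the technology data set need to be looked up and filled in.

library(plspm)

# inner model: lower-triangular matrix of TAM paths
# (EOU = perceived ease of use, U = perceived usefulness, INT = intention to use)
EOU <- c(0, 0, 0)
U   <- c(1, 0, 0)  # EOU -> U
INT <- c(1, 1, 0)  # EOU -> INT, U -> INT
tam_path <- rbind(EOU, U, INT)
colnames(tam_path) <- rownames(tam_path)

# outer model: which columns measure which latent variable (placeholder indices)
tam_blocks <- list(1:4, 5:8, 9:10)
tam_modes  <- rep("A", 3)  # reflective indicators

tam_fit <- plspm(technology, tam_path, tam_blocks, modes = tam_modes)
summary(tam_fit)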

Yet, the causal model is so appealing. First comes perception and then follows intention. The model does all the heavy lifting with causation asserted by the path diagram and not discovered in the data. All I need is a rating scale and some general attitude statements that everyone can answer because they avoid any specifics that might result in a DK (don’t know) or NA (not applicable). Although ultimately the details are where technology is accepted or rejected, they just do not fit into the path model. We would need to look elsewhere in R for a solution.

As I have been arguing for some time in this blog, the data that we wish to collect ought to have concrete referents. Although we begin with detailed measures, it is our hope that the data matrix can be simplified and expressed as the product of underlying structural forms. At least with technology adoption and other evolving product categories, there seems to be a linkage between product differentiation and consumer fragmentation. Perhaps surprisingly, matrix factorization with its roots as a computational routine from linear algebra seems to be able to untangle the factors responsible for such coevolving networks.

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

dendextend: a package for visualizing, adjusting, and comparing dendrograms (based on a paper from “bioinformatics”)

(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)

This post on the dendextend package is based on my recent paper in the journal Bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released under a CC-BY license, I am permitted (and delighted) to republish it here in full:

Abstract

Summary: dendextend is an R package for creating and comparing visually appealing tree diagrams. dendextend provides utility functions for manipulating dendrogram objects (their color, shape, and content) as well as several advanced methods for comparing trees to one another (both statistically and visually). As such, dendextend offers a flexible framework for enhancing R’s rich ecosystem of packages for performing hierarchical clustering of items.

Availability: The dendextend R package (including detailed introductory vignettes) is available under the GPL-2 Open Source license and is freely available to download from CRAN at: (http://cran.r-project.org/package=dendextend)

Contact: Tal.Galili@math.tau.ac.il

Introduction

Hierarchical Cluster Analysis (HCA) is a widely used family of unsupervised statistical methods for classifying a set of items into some hierarchy of clusters (groups) according to the similarities among the items. The R language (R Core Team, 2014) – a leading, cross-platform, and open source statistical programming environment – has many implementations of HCA algorithms (Hornik, 2014; Chipman and Tibshirani, 2006; Witten and Tibshirani, 2010; Schmidtlein et al., 2010). The outputs of these various algorithms are stored in the hclust object class, while the dendrogram class is an alternative object class that is often used as the go-to intermediate representation for visualizing an HCA output.

In many R packages, a figure output is adjusted by supplying the plot function with both an object to be plotted and various graphical parameters to be modified (colors, sizes, etc.). The (base R) plot.dendrogram function behaves differently: it is given a dendrogram object that already contains within itself (most of) the graphical parameters to be used when plotting the tree. Internally, the dendrogram class is represented as a nested list of lists with attributes for colors, height, etc. (with useful methods from the stats package). Until now, no comprehensive framework has been available in R for flexibly controlling the various attributes of dendrogram class objects.

The dendextend package aims to fill this gap by providing a significant number of new functions for controlling a dendrogram’s structure and graphical attributes. It also implements methods for visually and statistically comparing different dendrogram objects. The package is extensively validated through unit-testing (Wickham, 2011), offers a C++ speed-up (Eddelbuettel and François, 2011) for some of the core functions through the dendextendRcpp package, and includes three detailed vignettes.

The dendextend package is primarily geared towards HCA. For phylogeny analysis, the phylo object class (from the ape package) is recommended (Paradis et al., 2004). A comprehensive comparison of dendextend, ape, as well as other software for tree analysis, is available in the supplementary materials.

Description

Updating a dendrogram for visualization

The function set(dend, what, value), in dendextend, accepts a dendrogram (i.e.: dend) as input and returns it after some adjustment. The parameter what is a character indicating the property of the tree to be adjusted (see Table 1) based on value. The user can repeatedly funnel a tree through different configurations of the set function until a desired outcome is reached.

Fig. 1. A dendrogram after modifying various graphical attributes

fig_01

Fig. 1 is created by clustering a vector of 1 to 5 into a dendrogram:

library(dendextend); library(magrittr)  # the %>% pipe comes from magrittr
dend0 <- 1:5 %>% dist %>% hclust %>% as.dendrogram

The above code uses the convenient forward-pipe operator %>% (Bache and Wickham, 2014), which is just like running:

dend0 <- as.dendrogram(hclust(dist(1:5)))

Next, the tree is plotted after repeatedly using the set function:

dend0 %>% set("labels_color")  %>%  
set("labels_cex", c(2,3))   %>%
 set("branches_lwd", c(2,4)) %>% set("branches_k_lty", k=3)  %>%  set("branches_k_color", k = 3)%>% plot

The “value” vector is recycled in a depth-first fashion, with the root node considered as having a branch (which is not plotted by default). The parameters of the new tree can be explored using the functions get_nodes_attr and get_leaves_attr. Also, we can rotate and prune a tree with the respective functions.
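For example, continuing with dend0 from above (a small sketch; the exact output depends on the clustering):

get_leaves_attr(dend0, "label")               # the label attribute of each leaf
get_nodes_attr(dend0, "height")               # the height attribute of every node
dend0 %>% rotate(5:1) %>% plot                # reverse the order of the leaves
dend0 %>% prune(labels(dend0)[1:2]) %>% plot  # drop the first two leaves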

Table 1. Available options for the “what” parameter when using the set function for adjusting the look of a dendrogram

  • Set the labels’ names, color (per color, or with k clusters), size, or turn them to character: labels, labels_colors, labels_cex, labels_to_character
  • Set the leaves’ point type, color, size, height: leaves_pch, leaves_col, leaves_cex, hang_leaves
  • Set all nodes’ point type, color, size: nodes_pch, nodes_col, nodes_cex
  • Set branches’ line type, color, width (per branch, based on clustering the labels, and for specific labels): branches_lty, branches_col, branches_lwd, branches_k_color, by_labels_branches_lty, by_labels_branches_col, by_labels_branches_lwd

Fig. 2. A tanglegram for comparing two clustering algorithms used on 15 flowers from the Iris dataset. Similar sub-trees are connected by lines of the same color, while branches leading to distinct sub-trees are marked by a dashed line.

fig_02

Comparing two dendrograms

The tanglegram function allows the visual comparison of two dendrograms, from different algorithms or experiments, by facing them one in front of the other and connecting their labels with lines. Distinct branches are marked with a dashed line. For easier and nicer plotting, dendlist concatenates the two dendrograms together, while untangle attempts to rotate trees with un-aligned labels in search of a good layout. Fig. 2 demonstrates a comparison of two clustering algorithms (single vs. complete linkage) on a subset of 15 flowers from the famous Iris dataset. The entanglement function measures the quality of the tanglegram layout. The correlation between tree topologies can be calculated with different measures using cor.dendlist (Sokal and Rohlf, 1962), Bk_plot (Fowlkes and Mallows, 1983), or dist.dendlist. Permutation tests and bootstrap confidence intervals are also available. The above methods offer sensitivity and replicability analysis for researchers who are interested in validating their hierarchical clustering results.
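A minimal sketch of this workflow is given below; the random 15-flower subset and the seed are illustrative assumptions, since the exact subset behind Fig. 2 is not reproduced here.

set.seed(23235)
flowers <- iris[sample(nrow(iris), 15), -5]  # 15 flowers, numeric columns only

dend_single   <- flowers %>% dist %>% hclust(method = "single")   %>% as.dendrogram
dend_complete <- flowers %>% dist %>% hclust(method = "complete") %>% as.dendrogram

dends <- dendlist(dend_single, dend_complete) %>% untangle(method = "step1side")
tanglegram(dends)      # visual comparison of the two clusterings
entanglement(dends)    # quality of the layout (lower is better)
cor.dendlist(dends)    # correlation matrix between the tree topologies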

Enhancing other packages

The R ecosystem is abundant with functions that use dendrograms, and dendextend offers many functions for interacting with them and enhancing their visual display: The function rotate_DendSer (Hurley and Earle, 2013) rotates a dendrogram to optimize a visualization-based cost function. Other functions allow the highlighting of the uneven creation of clusters with the dynamicTreeCut package (Langfelder et al., 2008), as well as of “significant” clusters based on the pvclust package (Suzuki and Shimodaira, 2006). The previously mentioned functions can be combined to create a highly customized (rotated, colorful, etc.) static heatmap using heatmap.2 from gplots (Warnes et al., 2014), or a D3 interactive heatmap using the d3heatmap package. The circlize_dendrogram function produces a simple circular tree layout, while more complex circular layouts can be achieved using the circlize package (Gu et al., 2014). Aside from R base graphics, a ggplot2 dendrogram may be created using the as.ggdend function.
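As a hedged sketch (not taken from the paper), the snippet below passes a customized dendrogram to heatmap.2 and then reuses it for a circular and a ggplot2 rendering; mtcars is used purely as an illustrative dataset.

library(gplots)    # provides heatmap.2
library(ggplot2)

x <- as.matrix(mtcars)
row_dend <- x %>% dist %>% hclust %>% as.dendrogram %>% set("branches_k_color", k = 3)

heatmap.2(x, Rowv = row_dend, scale = "column", trace = "none")  # static heatmap
circlize_dendrogram(row_dend)                                    # simple circular layout
ggplot(as.ggdend(row_dend))                                      # ggplot2 dendrogram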

In conclusion – the dendextend package simplifies the creation, comparison, and integration of dendrograms into fine-tuned (publication quality) graphs. A demonstration of the package on various datasets is available in the supplementary materials.

Acknowledgements

This work was made possible thanks to the code and kind support, of Yoav Benjamini, Gavin Simpson, Gregory Jefferis, Marco Gallotta, Johan Renaudie, The R Core Team, Martin Maechler, Kurt Hornik, Uwe Ligges, Andrej-Nikolai Spiess, Steve Horvath, Peter Langfelder, skullkey, Romain Francois, Dirk Eddelbuettel, Kevin Ushey, Mark Van Der Loo, and Andrie de Vries.

Funding: This work was supported in part by the European Research Council under EC–EP7 European Research Council grant PSARPS-297519. Conflict of Interest: none declared.

References

  • Chipman,H. and Tibshirani,R. (2006) Hybrid hierarchical clustering with applications to microarray data. Biostatistics, 7, 286–301.
  • Eddelbuettel,D. and François,R. (2011) Rcpp: Seamless R and C++ Integration. J. Stat. Softw., 40, 1–18.
  • Fowlkes,E.B. and Mallows,C.L. (1983) A Method for Comparing Two Hierarchical Clusterings. J. Am. Stat. Assoc., 78, 553 – 569.
  • Gu,Z. et al. (2014) circlize implements and enhances circular visualization in R. Bioinformatics, 30, 1–2.
  • Maechler,M., Rousseeuw,P., Struyf,A., Hubert,M. and Hornik,K. (2014) cluster: Cluster Analysis Basics and Extensions.
  • Hurley,C.B. and Earle,D. (2013) DendSer: Dendrogram seriation: ordering for visualisation.
  • Langfelder,P. et al. (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24, 719–20.
  • Bache,S.M. and Wickham,H. (2014) magrittr: a forward-pipe operator for R.
  • Paradis,E. et al. (2004) APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics, 20, 289–290.
  • R Core Team (2014) R: A Language and Environment for Statistical Computing.
  • Schmidtlein,S. et al. (2010) A brute-force approach to vegetation classification. J. Veg. Sci., 21, 1162–1171.
  • Sokal,R.R. and Rohlf,F.J. (1962) The comparison of dendrograms by objective methods. Taxon, 11, 33–40.
  • Suzuki,R. and Shimodaira,H. (2006) Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22, 1540–2.
  • Warnes,G.R. et al. (2014) gplots: Various R programming tools for plotting data.
  • Wickham,H. (2011) testthat: Get started with testing. R J., 3, 5–10.
  • Witten,D.M. and Tibshirani,R. (2010) A framework for feature selection in clustering. J. Am. Stat. Assoc., 105, 713–726.

To leave a comment for the author, please follow the link and comment on his blog: R-statistics blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Visualising thefts using heatmaps in ggplot2

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

This is a continuation of my previous article, where I gave a basic overview of how to construct heatmaps in R. Here, I will show you how to use R packages to build a heatmap on top of the map of Chicago to see which areas have the most crime. We will require two packages for the mapping, namely maps and ggmap. We will also use two more packages, dplyr and tidyr.

I will be using the Motor Vehicle Theft Data from Chicago, which can be obtained from the City of Chicago Data Portal.

The first part of the code is the same as my previous article.
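For completeness, here is a minimal sketch of that first part; the file name below is hypothetical, so use whatever name you gave the CSV exported from the portal.

library(dplyr)
library(tidyr)

## Hypothetical file name for the export from the City of Chicago Data Portal
chicagoMVT <- read.csv('motor_vehicle_thefts.csv', stringsAsFactors = FALSE)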

The second part of the code will contain the following steps:

  • Removing empty locations. For some thefts, the location is not recorded, and the field is left blank. This caused problems for me later on, so I decided to remove all such locations. At the time of writing, empty locations made up about 2500 entries of the total 278000 entries.
  • Splitting the location column into latitude and longitude. The location column consists of data in the form (x, y) and is of the character class. We want x and y to be in separate columns and be of the numeric class. We will use the dplyr and tidyr libraries for this.
  • Getting the map of Chicago. We will use the ggmap library for this.
  • Plotting the location heatmap.

Here is the code:

## Removing empty locations
chicagoMVT$Location[chicagoMVT$Location == ''] <- NA
chicagoMVT <- na.omit(chicagoMVT)

We set all empty values to NA, and then change the original dataset so that it no longer contains the NA values.

## Splitting location into latitude and longitude
chicagoMVT <- chicagoMVT %>% extract(Location, c('Latitude', 'Longitude'), '\\(([^,]+), ([^)]+)\\)')
chicagoMVT$Longitude <- round(as.numeric(chicagoMVT$Longitude), 2)
chicagoMVT$Latitude <- round(as.numeric(chicagoMVT$Latitude), 2)

%>% is called the pipe operator. It comes from the magrittr package and is re-exported by dplyr. The above line of code is the same as writing

chicagoMVT <- extract(chicagoMVT, Location, c('Latitude', 'Longitude'), '\\(([^,]+), ([^)]+)\\)')

As we can see, the pipe operator is very helpful for passing the output of one operation to another. While it isn’t particularly useful here, its usefulness becomes apparent when you have to perform multiple operations and don’t want to create a temporary variable for the result of each of them. I will do a short tutorial and demonstration of dplyr in my next article.

The extract method is from the tidyr package, and I am using it to separate the Location column into Latitude and Longitude. The last parameter of the extract method is a Regular Expression or RegEx for short. I often have problems with RegEx, and this time was no different, and I’d like to thank StackOverflow user nongkrong for helping me with that.

Both dplyr and tidyr are great packages written by Hadley Wickham, the man who revolutionised R. I highly recommend that you check out his body of work and read his books.

Next, we will get the map of Chicago so that we can plot on top of it.

library(ggmap)
chicago <- get_map(location = 'chicago', zoom = 11)

If you would like to see the map, you can use this command:

ggmap(chicago)

which will give the following map
Chicago

Now we will create a data frame containing the coordinates of all the thefts.

locationCrimes <- as.data.frame(table(chicagoMVT$Longitude, chicagoMVT$Latitude))
names(locationCrimes) <- c('long', 'lat', 'Frequency')
locationCrimes$long <- as.numeric(as.character(locationCrimes$long))
locationCrimes$lat <- as.numeric(as.character(locationCrimes$lat))
locationCrimes <- subset(locationCrimes, Frequency > 0)

As we saw in the above map, the axes are named ‘long’ and ‘lat’, so we will use the same naming convention for our data frame. When we create the data frame, the latitude and longitude get converted to the factor class. To convert it back to numeric, we have to first convert them back to character, and then to numeric (otherwise it will give an error). Finally, we remove all data points where there were no crimes recorded. If you don’t do this, the resulting plot will have a lot of tiles plotted on the water, which we don’t want.

ggmap(chicago) +
  geom_tile(data = locationCrimes, aes(x = long, y = lat, alpha = Frequency), fill = 'red') +
  theme(axis.title.y = element_blank(), axis.title.x = element_blank())

alpha = Frequency will set how transparent/opaque each tile is, based on the frequency of crimes in that particular area.
theme(axis.title.y = element_blank(), axis.title.x = element_blank()) will remove the titles for the axes.

This will generate the following plot:
Chicagomap

The plot gives us a pretty good idea of which areas are the most prone to thefts, and thus should be avoided. The police are using advanced crime prediction algorithms to predict where crimes will happen, and prevent them before they occur. Here is a great article by MIT Technology Review on the topic.

The full repo can be found on GitHub.

That’s it for now! I hope you enjoyed the article, and found it helpful. Feel free to leave a comment if you have any questions or contact me on Twitter!

To leave a comment for the author, please follow the link and comment on his blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...