I previously built an interactive online app using Shiny where you can upload your own data, perform basic clustering analysis, and view correlations in a heatmap. I wrote about this in my last post. While this worked, it was very ugly and needed a facelift. With the help of a friend, I implemented a dashboard using shinydashboard.
New implementation
The new implementation organizes the individual apps in a dashboard instead of a single Rmarkdown file. Essentially the individual apps (upload, clustering, etc.) remained the same, apart from some feature improvements. The big difference is that the outer layer is now a normal Shiny app with the standard app.R, ui.R, and server.R files. The other major change is the addition of three new files controlling the dashboard appearance and contents (body.R, sidebar.R, and header.R). See the shinydashboard Get Started page for a basic example and the Structure page for more details.
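For readers who haven't used shinydashboard, a minimal skeleton of the header/sidebar/body split described above might look like the sketch below. This is not the actual app code; the tab and widget names are made up.

library(shiny)
library(shinydashboard)

# in the real app each piece lives in its own file (header.R, sidebar.R, body.R)
header  <- dashboardHeader(title = "Cluster explorer")
sidebar <- dashboardSidebar(sidebarMenu(
  menuItem("Upload data", tabName = "upload"),
  menuItem("Clustering",  tabName = "clustering")
))
body <- dashboardBody(tabItems(
  tabItem("upload",     fileInput("file", "Upload a CSV")),
  tabItem("clustering", plotOutput("heatmap"))
))

ui <- dashboardPage(header, sidebar, body)
server <- function(input, output) {}
shinyApp(ui, server)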
Try it out!
Take a look at the dashboard app and source code and leave a comment to let me know what you think! You can download a sample data set (mtcars) to try out if you don’t have your own.
Previous Implementation
Dashboard Implementation
To leave a comment for the author, please follow the link and comment on his blog: Stefantastic - r.
Backgammon clubs and online forums use a modified form of the Elo rating system to keep track of how well individuals have played and to draw inferences about their underlying strength. Players with higher ratings are inferred to be stronger than those with lower ratings and hence are expected to win; and the longer the match, the more likely it is that the greater skill level will overcome the random chance of the dice. The animated plot below shows the expected probabilities of two players on FIBS (First Internet Backgammon Server), an internet forum where players start at 1500 and the very best reach over 2000.
When the two players have the same rating, they are expected to have a 50/50 chance of winning, which produces the constant diagonal line in the animation above. When one player is better than the other, they have a higher chance of winning, represented for Player A by the increasingly blue space in the bottom right of the plot and by the numbers (estimated probabilities) labelling the contour lines. The actual formula used on FIBS and illustrated above is defined on the FIBS site as:
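Reconstructed from the fibs_p() helper in the code below, the formula is:

P(A wins) = 1 - 1 / (10^((ratingA - ratingB) * sqrt(ML) / 2000) + 1)

where ratingA and ratingB are the two players' current Elo ratings and ML is the match length in points.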
(As an aside, I don’t think there is a strong theoretical reason for the SQRT(ML) in that formula. With the complications of the doubling cube I very much doubt that the relationship of the probability of winning to match length is as simple as that. But it’s a reasonable empirical approximation; I might blog more about that another time.)
Now, here’s how I made that animation. In the R code below, first, I load up some functionality and the Google font I use for this blog. Then I define a function that estimates the probability of a player winning using the FIBS formula above.
library(dplyr)
library(showtext)      # for fonts
library(RColorBrewer)
library(directlabels)  # for labels on contour lines
library(ggplot2)
library(scales)
library(forecast)      # for auto.arima later on

font.add.google("Poppins", "myfont")
showtext.auto()

#================== helper functions =====================
fibs_p <- function(a, b, ml){
  # function to determine the expected probability of player a winning a backgammon match
  # against player b of length ml, with a and b representing their FIBS Elo ratings
  tmp <- 1 - (1 / (10 ^ ((a - b) * sqrt(ml) / 2000) + 1))
  return(tmp)
}
The strategy to create the animation is simple:
Create all the combinations of two players’ Elo ratings.
Loop through 23 possible match lengths, calculating the probabilities of A winning for each combination of Elo ratings at that match length.
Draw a still image heatmap for each of those 23 match lengths using {ggplot2}.
Compile the images into a single animated Gif using ImageMagick.
#================ heat maps showing probability of winning ================
# create a matrix of players A and B's possible Elo ratings
a <- b <- seq(from = 1000, to = 2000, by = 5)
mat <- expand.grid(a, b) %>%
  rename(a = Var1, b = Var2)

# create a folder to hold the images and navigate to it
dir.create("tmp0002")
owd <- setwd("tmp0002")

# cycle through a range of possible match lengths, drawing a plot for each
matchlengths <- seq(from = 1, to = 23, by = 1)
res <- 150

for(i in 1:length(matchlengths)){
  df1 <- mat %>%
    mutate(probs = fibs_p(a = a, b = b, ml = matchlengths[i]))

  p1 <- ggplot(df1, aes(x = a, y = b, z = probs)) +
    geom_tile(aes(fill = probs)) +
    theme_minimal(base_family = "myfont") +
    scale_fill_gradientn(colours = brewer.pal(10, "Spectral"), limits = c(0, 1)) +
    scale_colour_gradientn(colours = "black") +
    stat_contour(aes(colour = ..level..), binwidth = .1) + # force contours to be same distance each image
    labs(x = "Player A Elo rating", y = "Player B Elo rating") +
    theme(legend.position = "none") +
    coord_equal() +
    ggtitle(paste0("Probability of Player A winning a match to ", matchlengths[i]))

  png(paste0(letters[i], ".png"), res * 5, res * 5, res = res, bg = "white")
    print(direct.label(p1)) # direct labels used to label the contour points
  dev.off()
}

# compile the images into a single animated Gif with ImageMagick. Note that
# the {animation} package provides R wrappers to do this but they often get
# confused (eg clashes with Windows' convert) and it's easier to do it explicitly
# yourself by sending an instruction to the system:
system('"C:\\Program Files\\ImageMagick-6.9.1-Q16\\convert" -loop 0 -delay 150 *.png "EloProbs.gif"')

# move the asset over to where needed for the blog
file.copy("EloProbs.gif", "../../img/0002-EloProbs.gif", overwrite = TRUE)

# clean up
setwd(owd)
unlink("tmp0002", recursive = TRUE)
Actual ratings varying, “true” rating constant
The estimated probability of winning against a certain opponent at a given match length might be of interest to backgammon players (I’m surprised it’s not referred to more often), but its most common use is embedded in the calculation of how much each player’s Elo rating changes after a match is decided. That formula is well explained by FIBS and a little involved so I won’t spell it out here, but you can see it in action in the function I provide later in this post. A key point for our purposes is that as the win/loss of a match is a random event, players’ Elo ratings which are based on those events are random variables. Even if we grant the existence of a “real” value for each player, it is unobservable, and can only be estimated by their actual Elo rating in a tournament or internet forum.
In fact, my original motivation for this blog post was to see how much Elo ratings fluctuate due to randomness of individual games. In a later post I’ll have a more complex and realistic simulation, but the chart below shows what happens to a player’s Elo ratings over time in the following situation:
Only two players
They only ever play 5 point matches, and they play 10,000 of them
Player A’s chance of winning the 5 point match is 0.6, and this does not change over time (ie no player is improving in skill relative to the other)
The results are shown in the chart below, and in this unrealistically simple scenario there’s more variation than I’d realised there would be. Player A’s actual Elo rating fluctuates fairly wildly around the true value (which can be calculated as 1578.75), hitting 1650 several times and dropping nearly all the way to 1500 at one point.
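That theoretical value can be derived by inverting the FIBS formula: in this two-player simulation the ratings always sum to 3000, so we only need the rating gap that gives a 0.6 chance of winning a 5 point match. The small calculation below mirrors the tv line in the code that follows.

# solve 1 - 1/(10^(d * sqrt(5)/2000) + 1) = 0.6 for the rating gap d
d <- 2000 * log10(1.5) / sqrt(5)   # about 157.5
(3000 + d) / 2                     # Player A's "true" rating, 1578.75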
Here’s the code for that simulation, including my R function that provides FIBS-style Elo ratings, adjusted for experience as set out by FIBS.
#================= determine change of Elo rating ================
fibs_scores <- function(a, b, winner = "a", ml = 1, axp = 500, bxp = 500){
  # a is Elo rating of player A before match
  # b is Elo rating of player B before match
  # ml is match length
  # axp is total match lengths (experience) of player a until this match
  # bxp is total match lengths (experience) of player b until this match
  # see http://www.fibs.com/ratings.html for formulae

  # calculate experience-correction multipliers:
  multa <- ifelse(axp < 400, 5 - ((axp + ml) / 100), 1)
  multb <- ifelse(bxp < 400, 5 - ((bxp + ml) / 100), 1)

  # probability of A winning, using fibs_p function defined earlier:
  winproba <- fibs_p(a = a, b = b, ml = ml)

  # match value (points to be distributed between the two players):
  matchvalue <- 4 * sqrt(ml)

  # who gets them?:
  if(winner == "b"){
    a <- a - matchvalue * winproba * multa
    b <- b + matchvalue * winproba * multb
  } else {
    a <- a + matchvalue * (1 - winproba) * multa
    b <- b - matchvalue * (1 - winproba) * multb
  }

  return(list(a = a, b = b, axp = axp + ml, bxp = bxp + ml))
}

# test against "baptism by fire" example at http://www.fibs.com/ratings.html
# should be 1540.95:
round(fibs_scores(a = 1500, b = 1925, ml = 7, axp = 0, bxp = 10000, winner = "a")$a, 2)

#================ simulations of how long it takes for scores to stabilise ================
# two people playing each other 5 point games with 0.6 chance of winning
A <- B <- data_frame(rating = 1500, exp = 0)
timeseries <- data_frame(A = A$rating, B = B$rating)

set.seed(123) # for reproducibility
for(i in 1:10000){
  result <- fibs_scores(a = A$rating, b = B$rating,
                        winner = ifelse(runif(1) > 0.4, "a", "b"),
                        axp = A$exp, bxp = B$exp, ml = 5)
  timeseries[i, "A"] <- A$rating <- result$a
  timeseries[i, "B"] <- B$rating <- result$b
  A$exp <- result$axp
  B$exp <- result$bxp
}

timeseries$A <- ts(timeseries$A)
timeseries$B <- ts(timeseries$B)
timeseries$time <- 1:nrow(timeseries)

# theoretical value:
tv <- (log(1 / 0.4 - 1) / log(10) * 2000 / sqrt(5) + 3000) / 2

svg("../img/0002-elo-rating.svg", 8, 5)
ggplot(timeseries, aes(x = time, y = A)) +
  geom_line(colour = "grey50") +
  theme_minimal(base_family = "myfont") +
  scale_y_continuous("Player A's Elo rating") +
  geom_hline(yintercept = tv, colour = "blue") +
  ggtitle("Elo rating of Player A in two player, constant skill simulation") +
  scale_x_continuous("\nNumber of matches played", label = comma) +
  annotate("text", y = tv - 10, x = max(timeseries$time) - 100, label = "Theoretical\nvalue",
           colour = "blue", family = "myfont", size = 3)
dev.off()
In this situation, the resulting player’s Elo rating bouncing around at random is quite well modelled by an autoregressive AR(1) time series model with an intercept at the “true” level of the player’s skill. With an autoregression parameter of around 0.98, the rating at any point in time is obviously very closely related to the previous rating; almost but not quite a random walk. Showing how and why would make this post too long, but it’s a useful factoid to note for future work when we come to more realistic and complex simulations of Elo ratings on FIBS.
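As a quick check of that claim (a sketch, not code from the original post), the forecast package loaded at the top can fit an AR(1) with a mean term directly to the simulated series:

library(forecast)
fit <- Arima(ts(timeseries$A), order = c(1, 0, 0))
coef(fit)  # expect an ar1 coefficient around 0.98 and a mean near 1578.75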
To leave a comment for the author, please follow the link and comment on his blog: Peter's stats stuff - R.
Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, e.g. RNA-seq, bisulfite sequencing or chromatin immunoprecipitation followed by sequencing (ChIP-seq) data.
Recently we added new features to genomation, and here we present them using the example of binding profiles of 6 transcription factors around CTCF binding sites derived from ChIP-seq data. All new functionalities are available in the latest version of genomation, which can be found on its GitHub site.
# install the package from github
library(devtools)
install_github("BIMSBbioinfo/genomation", build_vignettes = FALSE)
Extending genomation to work with paired-end BAM files
Genomation can now work with paired-end BAM files. Read mates are treated as single fragments (i.e. they are stitched together).
Accelerated functions for reading genomic files
This is achieved by using the readr::read_delim function to read genomic files instead of read.table. Additionally, if the skip="auto" argument is provided to readGeneric, or track.line="auto" to other functions that read genomic files (e.g. readBroadPeak), then the UCSC header (and first track line) is detected and skipped.
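For example (the file name here is hypothetical):

# skip = "auto" detects and skips a UCSC header/track line if one is present
peaks <- readGeneric("my_peaks.bed", skip = "auto")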
We use the ScoreMatrixList function to extract coverage values of all transcription factors around the ChIP-seq peaks. ScoreMatrixList was improved by adding a new argument, cores, that indicates the number of cores to be used at the same time (via parallel::mclapply).
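A sketch of that step is below; the target and window objects and the bin number are assumptions, not objects created in the code shown here.

# bam.files: paths to the 6 transcription factor ChIP-seq BAM files
# ctcf.peaks: GRanges of equal-width windows around the CTCF binding sites
sml = ScoreMatrixList(targets = bam.files, windows = ctcf.peaks,
                      bin.num = 50, type = "bam", cores = 4)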
# descriptions of files that contain info about the transcription factors
sampleInfo = read.table(system.file('extdata/SamplesInfo.txt', package='genomationData'),
                        header=TRUE, sep='\t')
# give the ScoreMatrixList elements readable sample names
names(sml) = sampleInfo$sampleName[match(names(sml), sampleInfo$fileName)]
Arithmetic, indicator and logic operations as well as subsetting work on score matrices
Arithmetic, indicator and logic operations work on ScoreMatrix, ScoreMatrixBin and ScoreMatrixList objects, i.e.:
Arith: “+”, “-”, “*”, “^”, “%%”, “%/%”, “/”
Compare: “==”, “>”, “<”, “!=”, “<=”, “>=”
Logic: “&”, “|”
sml1 = sml * 100
sml1
## scoreMatrixlist of length:5
##
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
Subsetting:
sml[[6]] = sml[[1]]
sml
## scoreMatrixlist of length:6
##
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
## 6. scoreMatrix with dims: 1681 50
sml[[6]] <- NULL
Improvements and new arguments in visualization functions
Because the elements of the ScoreMatrixList are on very different signal scales, we scale their rows.
sml.scaled = scaleScoreMatrixList(sml)
Faster heatmaps
The heatMatrix and multiHeatMatrix functions work faster thanks to quicker color assignment. The heatmap profile of scaled coverage shows a colocalization of Ctcf, Rad21 and Znf143.
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500))
New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix
The clustfun argument allows the user to plug other clustering functions into the heatmap function multiHeatMatrix. It has to be a function that returns a vector of integers indicating the cluster to which each row is allocated. Previous versions of multiHeatMatrix could cluster rows of heatmaps using only the k-means algorithm.
# hierarchical clustering with Ward's method for agglomeration into 2 clusters
cl2 <- function(x) cutree(hclust(dist(x), method="ward"), k=2)
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl2)
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix
The clust.matrix argument indicates which matrices are used for clustering. It can be a numerical vector of indexes of matrices or a character vector of names of the ScoreMatrix objects in the ScoreMatrixList. Matrices that are not in clust.matrix are ordered according to the result of the clustering algorithm. By default all matrices are clustered.
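For example, to cluster on the first matrix only while still displaying all profiles (a sketch based on the argument description above):

multiHeatMatrix(sml.scaled, xcoords = c(-500, 500), clustfun = cl2, clust.matrix = 1)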
Central tendencies in line plots: centralTend in plotMeta
We extended the visualization capabilities for meta-plots. The plotMeta function can plot not only the mean but also the median as the central tendency, which can be set using the centralTend argument. Previously the user could plot only the mean.
We added a smoothfun argument to smooth the central tendency as well as the dispersion bands around it, as shown in the next figure. smoothfun has to be a function that returns a list containing a vector of y coordinates (a vector named '$y').
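A sketch of how these arguments fit together; the smoothing function here is just one choice that satisfies the '$y' requirement, not a package default:

plotMeta(sml.scaled, xcoords = c(-500, 500), centralTend = "median",
         smoothfun = function(x) stats::lowess(x, f = 1/5))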
Calculating scores that correspond to k-mer or PWM matrix occurrence: the patternMatrix function
We added a new function, patternMatrix, that calculates k-mer and PWM occurrences over predefined equal-width windows. If one pattern (a character of length 1 or a PWM matrix) is given then it returns a ScoreMatrix; if more than one character or a list of PWM matrices is given then it returns a ScoreMatrixList. It finds either positions of pattern hits above a specified threshold, creating a score matrix filled with 1 (presence of the pattern) and 0 (its absence), or a matrix with the scores themselves. windows can be a DNAStringSet object or a GRanges object (but then the genome argument, a BSgenome object, has to be provided).
# ctcf motif from the JASPAR database
ctcf.pfm = matrix(as.integer(c(87,167,281,56,8,744,40,107,851,5,333,54,12,56,104,372,82,117,402,
                               291,145,49,800,903,13,528,433,11,0,3,12,0,8,733,13,482,322,181,
                               76,414,449,21,0,65,334,48,32,903,566,504,890,775,5,507,307,73,266,
                               459,187,134,36,2,91,11,324,18,3,9,341,8,71,67,17,37,396,59)),
                  ncol = 19, byrow = TRUE)
rownames(ctcf.pfm) <- c("A", "C", "G", "T")
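With the PFM defined, a usage sketch might look like the following; the window object and genome package are assumptions, as they are not created in the code shown here:

# ctcf.peaks: GRanges of equal-width windows; genome must be a BSgenome object
library(BSgenome.Hsapiens.UCSC.hg19)
pm = patternMatrix(pattern = ctcf.pfm, windows = ctcf.peaks,
                   genome = BSgenome.Hsapiens.UCSC.hg19)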
Recently we integrated genomation with Travis CI. It allows users to see the current status of the package, which is updated with every change to the package. Travis automatically runs R CMD check and reports the result. The status shields are shown on the genomation GitHub site: https://github.com/BIMSBbioinfo/genomation
The Maungawhau volcano dataset is an R classic, often used to illustrate 3d plotting.
Being on a Gaussian process kick lately, it seemed fun to try to interpolate the volcano elevation data using a subset of the full dataset as training data.
Even with only 1% of the data, a squared exponential Gaussian process model does a decent job at estimating the true elevation surface (code here):
The upper row of plots shows the true elevation surface, the estimated surface based on 1% of the data (53 of the 5307 cells), and the squared error of the estimate.
The lower plots show the same data in heatmap form, with the location of sampled points shown as crosses.
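The linked code has the full analysis; for readers who just want the gist, here is a minimal self-contained sketch of the same idea with hand-picked hyperparameters and no noise term (53 randomly sampled cells, roughly 1% of the grid):

set.seed(1)
d <- expand.grid(x = 1:nrow(volcano), y = 1:ncol(volcano))
d$z <- as.vector(volcano)
train <- d[sample(nrow(d), 53), ]

# squared exponential covariance between two sets of grid locations
sqexp <- function(A, B, ell = 15, sigma2 = 2500) {
  D2 <- outer(A$x, B$x, "-")^2 + outer(A$y, B$y, "-")^2
  sigma2 * exp(-D2 / (2 * ell^2))
}

K    <- sqexp(train, train) + diag(1e-6, nrow(train))  # small jitter for stability
Ks   <- sqexp(d, train)
pred <- as.vector(Ks %*% solve(K, train$z - mean(train$z))) + mean(train$z)

image(matrix(pred, nrow(volcano)), main = "GP estimate from 53 cells")
mean((pred - d$z)^2)  # overall squared error versus the true surface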
To leave a comment for the author, please follow the link and comment on their blog: Maxwell B. Joseph.
The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against these datasets; in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not large enough to slow things down (which can especially be an issue for visualizing much larger data) but it’s large enough that manual wrangling of the data would be too much trouble.
Con: The visualizations do not really reveal very useful insights into the data. While this is not absolutely necessary for teaching syntax, it would have been a welcome cherry on top…
The raw dataset has a feature I love: it’s been damaged, most likely by being opened in Excel! One of the variables in the dataset is the age group of the biker (BikeAge_gr). Here is the age distribution of bikers as it appears in the original data:
Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing something more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.
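The fix itself is a couple of lines of string replacement (assuming the data has been read into a data frame called bike):

# recode the Excel-mangled age groups back to what they should be
bike$BikeAge_gr[bike$BikeAge_gr == "10-Jun"] <- "6-10"
bike$BikeAge_gr[bike$BikeAge_gr == "15-Nov"] <- "11-15"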
I should admit that I don’t have evidence of Excel causing this issue. However, this is my best guess, since “helping” the user by formatting date fields is standard Excel behaviour. There may be other software out there that also does this that I’m unaware of…
If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.
You can find the app on iTunes and Google Play. It’s a game of trivial pursuits – here’s the logo, now tell me the brand. Each item is scored as right or wrong, and the players must take it all very seriously for there is a Facebook page with cheat sheets for improving one’s total score.
Psychometrics Sees Everything as a Test
What would a psychometrician make of such a game based on brand logo knowledge? Are we measuring one’s level of consumerism (“a preoccupation with and an inclination toward buying consumer goods“)? Everyone knows the most popular brands, but only the most involved are familiar with the logos of the less publicized products. The question for psychometrics is whether it can explain the logos that you identify correctly by knowing only your level of consumption.
For example, if you were a car enthusiast, then you would be able to name all the car logos in the above table. However, if you did not drive a car or watch commercial television or read car ads in print media, you might be familiar with only the most “popular” logos (i.e., the ones that cannot be avoided because their signage is everywhere you look). We make the assumption that everyone falls somewhere between these two extremes along a consumption continuum and assess whether we can reproduce every individual pattern of answers based solely on their location on this single dimension. Shopping intensity or consumerism is the path, and logo identifications are the sensors along that path.
Specifically, if some number N of respondents played this game, it would not be difficult to rank order the 36 logos in the above table along a line stretching from 0% to 100% correct identification. Next, we examine each respondent, starting by sorting the players from those with the fewest correct identifications to those getting the most right. As shown in an earlier post, a heatmap will reveal the relationship between the ease of identifying each logo and the overall logo knowledge for each individual, as measured by their total score over all the brand logos. [The R code required to simulate the data and produce the heatmap can be found at the end of this post.]
You can begin by noting that blue is correct and red is not. Thus, the least knowledgeable players are in the top rows filled with the most red and the least blue. The logos along the x-axis are sorted by difficulty with the hardest to name on the left and the easiest on the right. In general, better players tend to know the harder logos. This is shown by the formation of a blue triangle as one scans towards the lower, right-hand corner. We call this a Guttman scale, and it suggests that both variation among the logos and the players can be described by a single dimension, which we might call logo familiarity or brand presence. However, one must be wary of suggestive names like “brand presence” for over time we forget that we are only measuring logo familiarity and not something more impactful.
Our psychometrician might have analyzed this same data using the R package ltm for latent trait modeling. A hopefully intuitive introduction to item response modeling was posted earlier on this blog. Those results could be summarized with a series of item characteristic curves displaying the relationship between the probability of answering correctly and the underlying trait, labeled ability by default.
As you see in the above plot, the items are arranged from the easiest (V1) to the hardest (V36) with the likelihood of naming the logo increasing as a logistic function of the unobserved consumerism measured as z-scores and called ability because item response theory (IRT) originated in achievement testing. These curves are simple to read and understand. A player with low consumption (e.g., a z-score near -2) has a better than even chance of identifying the most popular logos, but almost zero probability of naming any of the least familiar logos. All those probabilities move up their respective S-curves together as consumers become more involved.
In this example the functional form has been specified, for I have plotted the item characteristic curves from the one-parameter Rasch model. However, a specific functional form is not required, and we could have used the R package KernSmoothIRT to fit a nonparametric model. The topology remains a unidimensional manifold, something similar to Hastie’s principal curve in the R package princurve. Because the term has multiple meanings, I should note that I am using “topology” in a limited sense in order to refer to the shape of the data and not as in topological data analysis.
To be clear, there must be powerful forces at work to constrain logo naming to a one-dimensional continuum. Sequential skills that build on earlier achievements can often be described by a low-dimensional manifold (e.g., learning descriptive statistics before attempting inference since the latter assumes knowledge of the former). We would have needed a different model had our brands been local so that higher shopping intensity would have produced greater familiarity only for those logos available in a given locality (e.g., country-specific brands without an international presence).
The Meaning of Brand Familiarity Depends on Brand Presence in Local Markets
Now, it gets interesting. We started with players differentiated by a single parameter indicating how far they had traveled along a common consumption path. The path markers or sensors are the logos arrayed in decreasing popularity. Everyone shares a common environment with similar exposures to the same brand logos. Most have seen the McDonald’s double-arcing M or the Nike swoosh because both brands have spent a considerable amount of money to buy market presence. On the other hand, Hilton’s “blue H in the swirl” with less market presence would be recognized less often (fourth row and first column in the above brand logo table).
But what if market presence and thus logo popularity depended on your local neighborhood? Even international companies have differential presence in different countries, as well as varying concentration within the same country. Spending and distribution patterns by national, regional and local brands create clusters of differential market presence. Everyone does not share a common logo exposure so that each cluster requires its own brand list. That is, consumers reside in localities with varying degrees of brand presence so that two individuals with identical levels of consumption intensity or consumerism would not be familiar with the same brand logos. Consequently, we need to add a second parameter to each individual’s position along a path specific to their neighborhood. The psychometrician calls this differential item functioning (DIF), and R provides a number of ways of handling the additional mixture parameter.
Overlapping Audiences in the Marketplace of Attention
You may have anticipated the next step as the topology becomes more complex. We began with one pathway marked with brand logos as our sensors. Then, we argued for a mixture model with groups of individuals living in different neighborhoods with different ordering of the brand logos. Finally, we will end by allowing consumers to belong to more than one neighborhood with whatever degree of belonging they desire. We are describing the kind of fragmentation that occurs when consumers seize control and there is more available to them than they can attend to or consider. James Webster outlines this process of audience formation in his book The Marketplace of Attention.
The topology has changed again. There are just too many brand logos, and unless it becomes a competitive game, consumers will derive diminishing returns from continuing search and they typically will stop sooner rather than later. It helps that the market comes preorganized by providers trying to make the sale. Expert reviews and word of mouth guide the search. Yet, it is the consumer who decides what to select from the seemingly endless buffet. In the process, an individual will see and remember only a subset of all possible brand logos. We need a new model – one that simultaneously sorts both rows and columns by grouping together consumers and the brand logos that they are likely to recognize.
A heatmap may help to explain what can be accomplished when we search for joint clusterings of the rows and columns (also known as biclustering). Using an R package for nonnegative matrix factorization (NMF), I will simulate a data set with such a structure and show you the heatmap. Actually, I will display two heatmaps, one without noise so that you can see the pattern and a second with the same pattern but with added noise. Hopefully, the heatmap without noise will enable you to see the same pattern in the second heatmap with additional distortions.
I kept the number of columns at 36 for comparison with the first one-dimensional heatmap that you saw toward the beginning of this post. As before, blue is one, and red is zero. We discover enclaves or silos in the first heatmap without noise (polarization). The boundaries become fuzzier with random variation (fragmentation). I should note that you can see the biclusters in both heatmaps without reordering the rows and columns only because this is how the simulator generates the data. If you wish to see how this can be done with actual data, I have provided a set of links with the code needed to run a NMF in R at the end of my post on Brand and Product Category Representation.
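For readers who want to reproduce something similar, the NMF package ships with a simulator for this kind of block-structured data. The sketch below is only an approximation of the figures, since syntheticNMF() returns continuous nonnegative values rather than the 0/1 data shown:

library(NMF)
library(gplots)

# 500 consumers by 36 logos built from 3 overlapping basis components
X_clean <- syntheticNMF(n = 500, r = 3, p = 36, noise = FALSE)
X_noisy <- syntheticNMF(n = 500, r = 3, p = 36, noise = TRUE)

# plot without reordering rows/columns so the generated blocks stay visible
heatmap.2(X_noisy, Rowv = FALSE, Colv = FALSE, dendrogram = "none",
          col = redblue(16), trace = "none", key = FALSE)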
Finally, although we speak of NMF as a form of simultaneous clustering, the cluster memberships are graded rather than all-or-none (soft vs. hard clustering). This yields a very flexible and expressive topology, which becomes clear when we review the three alternative representations presented in this post. First, we saw how some highly structured data matrices can be reproduced using a single dimension with rows and columns both located on the same continuum (IRT). Next, we asked if there might be discrete groups of rows with each row cluster having its own unique ordering of the columns (mixed IRT). Lastly, we sought a model of audience formation with rows and columns jointly collected together into blocks with graded membership for both the rows and the columns (NMF).
Knowledge is organized as a single dimension when learning is formalized within a curriculum (e.g., a course at an educational institution) or accumulative (e.g., need to know addition before one can learn multiplication). However, coevolving networks of customers and products cannot be described by any one dimension or even a finite mixture of different dimensions. The Internet creates both microgenres and fragmented audiences that require their own topology.
R Code to Produce Figures in this Post

# use psych package to simulate latent trait data
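# a plausible simulation sketch (not necessarily the original call):
# sim.irt() from psych returns a list whose $items element holds the
# 0/1 response matrix used below
library(psych)
set.seed(20150301)
logos <- sim.irt(nvar = 36, n = 1000)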
# Sort data by both item mean
# and person total score
item <- apply(logos$items, 2, mean)
person <- apply(logos$items, 1, sum)
logos$itemsOrd <- logos$items[order(person), order(item)]
# create heatmap
# may need to increase size of plots window in R studio
library(gplots)
heatmap.2(logos$itemsOrd, Rowv = FALSE, Colv = FALSE, dendrogram = "none",
          col = redblue(16), key = TRUE, keysize = 1.5,
          density.info = "none", trace = "none", labRow = NA)
library(ltm)
# two-parameter logistic model
fit <- ltm(logos$items ~ z1)
summary(fit)
Marketing borrows the biological notion of coevolution to explain the progressive “fit” between products and consumers. While evolutionary time may seem a bit slow for product innovation and adoption, the same metaphor can be found in models of assimilation and accommodation from cultural and cognitive psychology.
The digital camera was introduced as an alternative to film, but soon redefined how pictures are taken, stored and shared. The selfie stick is but the latest step in this process by which product usage and product features coevolve over time with previous cycles enabling the next in the chain. Is it the smartphone or the lack of fun that’s killing the camera?
The diffusion of innovation unfolds in the marketplace as a social movement with the behavior of early adopters copied by the more cautious. For example, “cutting the cord” can be a lifestyle change involving both social isolation from conversations among those watching live sporting events and a commitment to learning how to retrieve television-like content from the Internet. The Diary of a Cord-Cutter in 2015 offers a funny and informative qualitative account. Still, one needs the timestamp because cord-cutting is an evolving product category. The market will become larger and more diverse with more heterogeneous customers (assimilation) and greater differentiation of product offerings (accommodation).
So, we should be able to agree that product markets are the outcome of dynamic processes involving both producers and customers (see Sociocognitive Dynamics in a Product Market for a comprehensive overview). User-centered product design takes an additional step and creates fictional customers or personas in order to find the perfect match. Shoppers do something similar when they anticipate how they will use the product they are considering. User types can be real (an actual person) or imagined (a persona). If this analysis is correct, then both customers and producers should be looking at the same data: the cable TV customer to decide if they should become cord-cutters and the cable TV provider to identify potential defectors.
Identifying the Likely Cord-Cutter
We can ask about your subscriptions (cable TV, internet connection, Netflix, Hulu, Amazon Prime, Sling, and so on). It is a long list, and we might get some frequency of usage data at the same time. This may be all that we need, especially if we probe for the details (e.g., cable TV usage would include live sports, on-demand movies, kid’s shows, HBO or other channel subscriptions, and continue until just before respondents become likely to terminate on-line surveys). Concurrently, it might be helpful to know something about your hardware, such as TVs, DVDs, DVRs, media streamers and other stuff.
A form of reverse engineering guides our data collection. Qualitative research and personal experience gives us some idea of the usage types likely to populate our customer base. Cable TV offers a menu of bundled and ala carte hardware and channels. Only some of the alternatives are mutually exclusive; otherwise, you are free to create your own assortment. Internet availability only increases the number of options, which you can watch on a television, a computer, a tablet or a phone. Plus, there is always free broadcast TV captured with an antenna, and there are DVDs that you rent or buy. We ought not to forget DVRs and media streamers (e.g., Roku, Apple TV, Chromecast, and Amazon Fire Stick). Obviously, there is no reason to stop with usage so why not extend the scale to include awareness and familiarity? You might not be a cord-cutter, though you may be on your way if you know all about Sling TV.
Each consumer defines their own personal choices by arranging options in a continually changing pattern that does not depend on existing bundles offered by providers. Consequently, whatever statistical model is chosen must be open to the possibility that every non-contradictory arrangement is possible. Yet, every combination will not survive for some will be dominated by others and never achieve a sustainable audience.
Consumers are listed in U, and a line is drawn to the offerings in V that they might wish to purchase (shown in the center panel). It is this linkage between U and V that produces the consumer and product networks in the two side panels. The A-B and B-C-D cliques of offerings in Projection V would be disjoint without customer U_5. Moreover, the 1-2-3 and 4-5-6-7 consumer clusters are connected by the presence of offering B in V. Removing B or #5 cuts the graph into independent parts.
Actual markets contain many more consumers in U, and the number of choices in V can be extensive. Consumer heterogeneity creates complexities for the marketer trying to discover structure in Projection U. Besides, the task is not any easier for an individual consumer who must select the best from a seemingly overwhelming number of alternatives in Projection V. Luckily, one trick frees the consumer from having to learn all the options that are available and being forced to make all the difficult tradeoffs – simply do as others do (as in observational learning). The other can be someone you know or read about as in the above Diary of a Cord-Cutter in 2015. There is no need for a taxonomy of offerings or a complete classification of user types.
In fact, it has become popular to believe that social diffusion or contagion models describe the actual adoption process (e.g., The Tipping Point). Regardless, over time, the U’s and V’s in the bipartite interactions of customers and offerings come to organize each other through mutual influence. Specifically, potential customers learn about the cord-cutting persona through the social and professional media and at the same time come to group together those offerings that the cord-cutter might purchase. Offerings are not alphabetized or catalogued as an academic exercise. There is money to be saved and entertainment to be discovered. Sorting needs to be goal-directed and efficient. I am ready to binge-watch, and I am looking for a recommendation.
It has taken some time to outline how consumers are able to simplify a complex purchase process by modeling the behavior of others. It is such a common experience, although rational decision theory continues to control our statistical modeling of choice. As you are escorted to your restaurant table, you cannot help but notice a delicious meal being served next to where you are seated. You refuse a menu and simply ask for the same dish. “I’ll Have What She’s Having” works as a decision strategy only when I can identify the “she” and the “what” simultaneously.
If we intend to analyze that data we have just talked about collecting, we will need a statistical model. Happily, the R Project for Statistical Computing implements at least two approaches for such joint identification: a latent clustering of a bipartite network in the latentnet package and a nonnegative matrix factorization in the NMF package. The Davis data from the latentnet R package will serve as our illustration. The R code for all the analyses that will be reported can be found at the end of this post.
We will start with the plot from the latentnet R package. The names are the women in the rows and the numbered E’s are the events in the columns. The events appear to be separated into two groups of E1 to E6 toward the top and E9 to E14 toward the bottom. E7 and E8 seem to occupy a middle position. The names are also divided into an upper and lower grouping with Ruth and Pearl falling between the two clusters. Does this plot not look similar to the earlier bipartite graph from Barabasi? That is, the linkages between the women and the events organize both into two corresponding clusters tied together by at least two women and two events.
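A minimal sketch of a latentnet fit that produces this kind of plot, assuming the standard two-dimensional Euclidean latent space model (the post's exact settings may differ):

library(latentnet)
data(davis)  # Davis Southern Women bipartite network (women by events)
fit <- ergmm(davis ~ euclidean(d = 2))
plot(fit)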
The heatmaps from the NMF reveal the same pattern for the events and the women. You should recall that NMF seeks a lower dimensional representation that will reproduce the original data table with 0s and 1s. In this case, two basis components were extracted. The mixture coefficients for the events vary from 0 to 1 with a darker red indicating a higher contribution for that basis component. The first six events (E1-E6) form the first basis component with the second basis component containing the last six events (E9-E14). As before, E7 and E8 share a more even mixture of the two basis components. Again, most of the women load on one basis component or the other, with Ruth and Pearl traveling freely between both components. As you can easily verify, the names form the same clusters in both plots.
It would help to know something about the events and the women. If E1 through E6 were all of a certain type (e.g., symphony concerts), then we could easily name the first component. Similarly, if all of the women in red at bottom of our basis heatmap played the piano, our results would have at least face validity. A more detailed description of this naming process can be found in a previous example called “What Can We Learn from the Apps on Your Smartphone?“. Those wishing to learn more might want to review the link listed at the end of that post in a note.
Which events should a newcomer attend? If Helen, Nora, Sylvia and Katherine are her friends, the answer is the second cluster of E9-E14. The collaborative filtering of recommender systems enables a novice to decide quickly and easily without a rational appraisal of the feature tradeoffs. Of course, a tradeoff analysis will work as well for we have a joint scaling of products and users. If the event is a concert with a performer you love, then base your decision on a dominating feature. When in tradeoff doubt, go along with your friends.
Finally, brand management can profit from this perspective. Personas work as a design strategy when user types are differentiated by their preference structures and a single individual can represent each group. Although user-centered designers reject segmentations that are based on demographics, attitudes, or benefit statements, a NMF can get very specific and include as many columns as needed (e.g., thousands of movie and even more music recordings). Furthermore, sparsity is not a problem and most of the rows can be empty.
There is no reason why each of the basis components in the above heatmaps could not be summarized by one person and/or one event. However, NMF forms building blocks by jointly clustering many rows and columns. Every potential customer and every possible product configuration are additive compositions built from these blocks. Would not design thinking be better served with several exemplars of each user type rather than trying to generalize from a single individual? Plus, we have the linked columns telling us what attracts each user type in the desired detail provided by the data we collected.
Every Christmas Eve, my family watches Love Actually. Objectively it’s not a particularly, er, good movie, but it’s well-suited for a holiday tradition. (Vox has got my back here).
Even on the eighth or ninth viewing, it’s impressive what an intricate network of characters it builds. This got me wondering how we could visualize the connections quantitatively, based on how often characters share scenes. So last night, while my family was watching the movie, I loaded up RStudio, downloaded a transcript, and started analyzing.
Parsing
It’s easy to use R to parse the raw script into a data frame, using a combination of dplyr, stringr, and tidyr. (For legal reasons I don’t want to host the script file myself, but it’s literally the first Google result for “Love Actually script.” Just copy the .doc contents into a text file called love_actually.txt).
library(dplyr)
library(stringr)
library(tidyr)

raw <- readLines("love_actually.txt")

lines <- data_frame(raw = raw) %>%
    filter(raw != "", !str_detect(raw, "(song)")) %>%
    mutate(is_scene = str_detect(raw, " Scene "),
           scene = cumsum(is_scene)) %>%
    filter(!is_scene) %>%
    separate(raw, c("speaker", "dialogue"), sep = ":", fill = "left") %>%
    group_by(scene, line = cumsum(!is.na(speaker))) %>%
    summarize(speaker = speaker[1], dialogue = str_c(dialogue, collapse = " "))
I also set up a CSV file matching characters to their actors, which you can read in separately. (I chose 20 characters that have notable roles in the story).
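The clustering below needs a binary speaker-by-scene matrix; a plausible construction sketch using reshape2 is shown here (the cast CSV name and its speaker column are assumptions, and this is not necessarily the post's exact code):

library(reshape2)

cast <- read.csv("love_actually_cast.csv", stringsAsFactors = FALSE)

speaker_scene_matrix <- lines %>%
    filter(speaker %in% cast$speaker) %>%
    acast(speaker ~ scene, fun.aggregate = length, value.var = "dialogue")

# convert counts to presence/absence
speaker_scene_matrix <- 1 * (speaker_scene_matrix > 0)
dim(speaker_scene_matrix)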
Whenever we have a matrix, it’s worth trying to cluster it. Let’s start with hierarchical clustering.[1]
norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)
h <- hclust(dist(norm, method = "manhattan"))
plot(h)
This looks about right! Almost all the romantic pairs are together (Natalia/PM; Aurelia/Jamie, Harry/Karen; Karl/Sarah; Juliet/Peter; Jack/Judy) as are the friends (Colin/Tony; Billy/Joe) and family (Daniel/Sam).
One thing this tree is perfect for is giving an ordering that puts similar characters close together:
If you’ve seen the film as many times as I have (you haven’t), you can stare at this graph and the film’s scenes spring out, like notes engraved in vinyl.
One reason it’s good to lay out raw data like this (as opposed to processed metrics like distances) is that anomalies stand out. For instance, look at the last scene: it’s the “coda” at the airport that includes 15 (!) characters. If we’re going to plot this as a network (and we totally are!) we’ve got to ignore that scene, or else it looks like almost everyone is connected to everyone else.
After that, we can create a cooccurrence matrix (see here) containing how many times two characters share scenes:
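A sketch of that step (not necessarily the post's exact code), dropping the final airport scene as discussed above:

non_airport <- speaker_scene_matrix[, -ncol(speaker_scene_matrix)]
cooccur <- non_airport %*% t(non_airport)   # shared-scene counts per character pair
heatmap(cooccur)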
A few patterns pop out of this visualization. We see that the majority of characters are tightly connected (often by the scenes at the school play, or by Karen (Emma Thompson), who is friends or family to many key characters). But we see Bill Nighy’s plotline occurs almost entirely separate from everyone else, and that five other characters are linked to the main network by only a single thread (Sarah’s conversation with Mark at the wedding).
One interesting aspect of this data is that this network builds over the course of the movie, growing nodes and connections as characters and relationships are introduced. There are a few ways to show this evolving network (such as an animation), but I decided to make it an interactive Shiny app, which lets the user specify the scene and shows the network that the movie has built up to that point.
(You can view the code for the Shiny app on GitHub).
Data Actually
Have you heard the complaint that we are “drowning in data”? How about the horror stories about how no one understands statistics, and we need trained statisticians as the “police” to keep people from misinterpreting their methods? It sure makes data science sound like important, dreary work.
Whenever I get gloomy about those topics, I try to spend a little time on silly projects like this, which remind me why I learned statistical programming in the first place. It took minutes to download a movie script and turn it into usable data, and within a few hours, I was able to see the movie in a new way. We’re living in a wonderful world: one with powerful tools like R and Shiny, and one overflowing with resources that are just a Google search away.
Maybe you don’t like ‘Love Actually’; you like Star Wars. Or you like baseball, or you like comparing programming languages. Or you’re interested in dating, or hip hop. Whatever questions you’re interested in, the answers are just a search and a script away. If you look for it, I’ve got a sneaky feeling you’ll find that data actually is all around us.
Footnotes
[1] We made a few important choices in our clustering here. First, we normalized so that the number of scenes for each character adds up to 1: otherwise, we wouldn’t be clustering based on a character’s distribution across scenes so much as the number of scenes they’re in. Secondly, we used Manhattan distance, which for a binary matrix means “how many scenes is one of these characters in that the other isn’t”. Try varying these approaches to see how the clusters change!
To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.
In a previous post, we had ‘mapped’ the culinary diversity in India through a visualization of food consumption patterns. Since then, one of the topics in my to-do list was a visualization of world cuisines. The primary question was similar to that asked of the Indian cuisine: Are cuisines of geographically and culturally closer regions also similar? I recently came across an article on the analysis of recipe ingredients that distinguish the cuisines of the world. The analysis was conducted on a publicly available dataset consisting of ingredients for more than 13,000 recipes from the recipe website Epicurious. Each recipe was also tagged with the cuisine it belonged to, and there were a total of 26 different cuisines. This dataset was initially reported in an analysis of flavor network and principles of food pairing.
In this post, we (re)look at the Epicurious recipe dataset and perform an exploratory analysis and visualization of ingredient frequencies among cuisines. Ingredients that are frequently found in a region’s recipes would also have high consumption in that region, and so an analysis of the ‘ingredient frequency’ of a cuisine should give us similar information to an analysis of ‘ingredient consumption’.
Outline of Analysis Method
Here is part of the first few lines of data from the Epicurious dataset:
Vietnamese  vinegar  cilantro  mint  olive_oil  cayenne  fish  lime_juice
Vietnamese  onion  cayenne  fish  black_pepper  seed  garlic
Vietnamese  garlic  soy_sauce  lime_juice  thai_pepper
Vietnamese  cilantro  shallot  lime_juice  fish  cayenne  ginger  pea
Vietnamese  coriander  vinegar  lemon  lime_juice  fish  cayenne  scallion
Vietnamese  coriander  lemongrass  sesame_oil  beef  root  fish
…
Each row of the dataset lists the ingredients for one recipe and the first column gives the cuisine the recipe belongs to. As the first step in our analysis, we collect ALL the ingredients for each cuisine (over all the recipes for that cuisine). Then we calculate the frequency of occurrence of each ingredient in each cuisine and normalize the frequencies for each cuisine with the number of recipes available for that cuisine. This matrix of normalized ingredient frequencies is used for further analysis.
We use two approaches for the exploratory analysis of the normalized ingredient frequencies: (1) heatmap and (2) principal component analysis (pca), followed by display using biplots. The complete R code for the analysis is given at the end of this post.
Results
There are a total of 350 ingredients occurring in the dataset (among all cuisines). Some of the ingredients occur in just one cuisine, which, though interesting, will not be of much use for the current analysis. For better visual display, we restrict attention to ingredients showing most variation in normalized frequency across cuisines. The results are shown below:
Heatmap:
Biplot:
The figures look self-explanatory and do show the clustering together of geographically nearby regions on the basis of commonly used ingredients. Moreover, we also notice the grouping together of regions with historical travel patterns (North European and American, Spanish_Portuguese and SouthAmerican/Mexican) or historical trading patterns (Indian and Middle East).
We need to further test the stability of the grouping obtained here by including data from the Allrecipes dataset. Also, probably taking the third principal component might dissipate some of the crowd along the PC2 axis. These would be some of the tasks for the next post…
Here is the complete R code used for the analysis:
workdir <- "C:\\Path\\To\\Dataset\\Directory"
datafile <- file.path(workdir,"epic_recipes.txt")
data <- read.table(datafile, fill=TRUE, col.names=1:max(count.fields(datafile)),
na.strings=c("", "NA"), stringsAsFactors = FALSE)
a <- aggregate(data[,-1], by=list(data[,1]), paste, collapse=",")
a$combined <- apply(a[,2:ncol(a)], 1, paste, collapse=",")
a$combined <- gsub(",NA","",a$combined) ## this column contains the totality of all ingredients for a cuisine
cuisines <- as.data.frame(table(data[,1])) ## Number of recipes for each cuisine
freq <- lapply(lapply(strsplit(a$combined,","), table), as.data.frame) ## Frequency of ingredients
names(freq) <- a[,1]
prop <- lapply(seq_along(freq), function(i) {
colnames(freq[[i]])[2] <- names(freq)[i]
freq[[i]][,2] <- freq[[i]][,2]/cuisines[i,2] ## proportion (normalized frequency)
freq[[i]]}
)
names(prop) <- a[,1] ## this is a list of 26 elements, one for each cuisine
final <- Reduce(function(...) merge(..., all=TRUE, by="Var1"), prop)
row.names(final) <- final[,1]
final <- final[,-1]
final[is.na(final)] <- 0 ## If ingredient missing in all recipes, proportion set to zero
final <- t(final) ## proportion matrix
s <- sort(apply(final, 2, sd), decreasing=TRUE)
## Selecting ingredients with maximum variation in frequency among cuisines and
## Using standardized proportions for final analysis
final_imp <- scale(subset(final, select=names(which(s > 0.1))))
## heatmap
library(gplots)
heatmap.2(final_imp, trace="none", margins = c(6,11), col=topo.colors(7),
key=TRUE, key.title=NA, keysize=1.2, density.info="none")
## PCA and biplot
p <- princomp(final_imp)
biplot(p,pc.biplot=TRUE, col=c("black","red"), cex=c(0.9,0.8),
xlim=c(-2.5,2.5), xlab="PC1, 39.7% explained variance", ylab="PC2, 24.5% explained variance")
To leave a comment for the author, please follow the link and comment on their blog: Design Data Decisions » R.
Moritz Stefaner started off 2016 with a very spiffy post on “a visual exploration of the spatial patterns in the endings of German town and village names”. Moritz was exploring some new data processing & visualization tools for the post, but when I saw what he was doing I wondered how hard it would be to do something similar in R and also used it as an opportunity to start practicing a new habit in 2016: packages vs projects.
To state more precisely the goals for this homage, the plan was to:
use as close to the same data sets as Moritz has in his github repo, including the ones in pure javascript
generate an HTML page as output that is as close as possible to the style of Moritz’s visualization
use R for everything (i.e. no “cheating” by sneaking in some javascript via htmlwidgets)
bundle everything into a package to take advantage of all the good stuff that comes with R package validation
You may want to take a look at the result to see if you want to continue reading (I hope you will!).
The Setup
By using an R package as the framework for the visualization, it’s possible to keep the data with the code and also organize and document the code in a way that makes it easy for folks to use and explore without cutting and pasting (or source()ing) code. It also makes it possible to list all the dependencies for the project and help ensure they’ll be installed when someone tries to work with it.
While I could have converted Moritz’s processed data into R data files, I left the CSV intact and the javascript file of suffix groupings also intact to show that R is extremely flexible when it comes to data processing (which is a “duh” for most folks by this point but the use of javascript data structures might give some folks ideas as how to reduce data duplication between projects). Both these files get stored in the inst/alt folder of the source package. I also end up using some CSS for the final visualization and placed that into a file in the same directory, which makes the code that generates the HTML a bit cleaner.
Because R processes some things automatically (like .onAttach) when it interacts with a package, one can have it provide helpful instructions (in this case, how to generate the visualization) in similar fashion to the ggplot2 loading messages.
Similarly, both the package itself and the package functions have documentation to help folks understand what the package and each component are doing.
While read.csv (no need for readr as the file is small) can handle the CSV file, we use the V8 package to source the javascript and convert it to an R object:
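That step looks roughly like the following sketch; the javascript file name and the variable it defines are assumptions:

library(V8)

ctx <- v8()
ctx$source(system.file("alt/suffixes.js", package = "zellingenach"))
suffix_groups <- ctx$get("suffixes")  # name of the javascript variable is an assumption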
We actually turn that into a vector of regular expressions (for town name ending checking) and a list of vectors (for the HTML visualization creation). Check out suffix_regex() and suffix_names() in the source code.
The read_places() function builds a data.frame of the places combined with the suffix grouping(s) they belong to:
# read in the file
plc <- read.csv(system.file("alt/placenames_de.tsv", package="zellingenach"),
                stringsAsFactors=FALSE)

# iterate over each suffix and identify which place names match the grouping
lapply(suf, function(regex) {
  which(stri_detect_regex(plc$name, regex))
}) -> matched_endings

plc$found <- ""

# add which grouping(s) the place was found to a new column
for (i in 1:length(matched_endings)) {
  where_found <- matched_endings[[i]]
  plc$found[where_found] <- paste0(plc$found[where_found], sprintf("%d|", i))
}

# some don't match so get rid of them
mutate(filter(plc, found != ""), found=sub("\\|$", "", found))
I do something a bit different than Moritz in that I allow towns to be part of multiple suffix groups, since:
I’m neither a historian nor expert in German town naming conventions, and
the javascript version and this R version both take a naive approach to suffix mapping.
This means my numbers (for the “#### places” label) will be different for some of my maps.
R has similar shortcut functions (Mortiz uses D3) to make hexgrids out of shapefiles. Here’s the entirety of create_hexgrid():
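A plausible sketch of such a function, using sp's hexagonal sampling, is below (the body and argument names are assumptions, not the post's actual code):

library(sp)

create_hexgrid <- function(shp, cellsize = 0.25) {
  # sample hexagon centres over the shapefile, then turn them into polygons
  pts <- spsample(shp, type = "hexagonal", cellsize = cellsize)
  HexPoints2SpatialPolygons(pts)
}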
You can play with cellsize to change the number of hexes. I tried to find a good number to get close to the # in Moritz’s maps.
This all gets put together in make_maps() where we use ggplot2 to build 52 gridded heatmaps (one for each suffix grouping). I used a log of the counts to map to a binned viridis color scale, so my colors come out a bit different than Moritz’s but the overall patterns are on par with his.
Finally, display_maps() takes the list created by make_maps() and builds out an HTML page, using the htmltools package for the page framework and svglite::htmlSVG to make SVGs of the ggplot objects. NOTE that you can use the output_file option of display_maps() to send the HTML to a file as well as display it in the viewer/browser.
Fin
Because the project is in a package, we can run package checks to see if we’re missing anything, including other package dependencies, function documentation and other details that the package tools are gleeful to point out. We can also include code to test out our various components to ensure they are behaving as expected (i.e. generating the right data/output).
One nice thing about the output is that it’s “responsive”, which means it handles multiple screen sizes quite well. So, if your screen is huge, you’ll have many map boxes on one line, and if it’s small (like the iframe below) it will have fewer.
You’ll see that my maps are a bit bigger than Moritz’s. This is due to both the hex grid size and the fact that the SVG output is just slightly larger overall than the ones made by D3. Of note: I noticed some suffix subtitle components wrapped at the “-”, so I converted the plain hyphens to non-breaking hyphens.
The one downside to using a package for this is that it’s harder to post complete code into a blog post, but you can clone the repo to look at the code and skip the dissection and just generate the visualization locally via:
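In other words, something like the following (the GitHub slug is a placeholder, since the repository location isn’t given in this excerpt):

devtools::install_github("<github-user>/zellingenach")  # placeholder slug; use the actual repo
library(zellingenach)
display_maps()  # builds the HTML page and opens it in the viewer/browser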
What are the current tRends? The image is CC from coco + kelly.
It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on top packages published in 2015, I decided to have my own go at the top R trends of 2015. Contrary to Safferling’s post I’ll try to also (1) look at packages from previous years that hit the big league, (2) see what top R coders we have in the community, and then (3) round up with my own 2015 R experience.
Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the actual release date; if that wasn’t available I scraped it off the CRAN servers. The script now also retrieves package author(s) and description (see the code below for details).
library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)
getCranberriesElmnt <- function(txt, elmnt_name) {
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1) {
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0) end <- length(txt) else end <- end[1]
    desc <-
      txt[1:end] %>%
      gsub(sprintf("^%s: (.+)", elmnt_name), "\\1", .) %>%
      paste(collapse = " ") %>%
      gsub("[ ]{2,}", " ", .) %>%
      gsub(" , ", ", ", .)
  } else if (length(desc) == 0) {
    desc <- paste("No", tolower(elmnt_name))
  } else {
    stop("Could not find ", elmnt_name, " in text: \n",
         paste(txt, collapse = "\n"))
  }
  return(desc)
}
convertCharset <- function(txt) {
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}
getAuthor <- function(txt, package) {
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)) {
    author <- getCranberriesElmnt(txt, "Maintainer")
  }
  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) ||
      is.null(author) ||
      nchar(author) <= 2) {
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html", package))
    author <- cran_txt %>%
      html_nodes("tr") %>%
      html_text %>%
      convertCharset %>%
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
      .[grep("^Author", .)] %>%
      gsub(".*\n", "", .)

    # If not found then the package has probably been
    # removed from the repository
    if (length(author) == 1)
      author <- author
    else
      author <- "No author"
  }

  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # <my@email.com>
  # "John Doe"
  author %<>%
    gsub("^Author: (.+)", "\\1", .) %>%
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>%
    gsub("\\([^)]+\\)", " ", .) %>%
    gsub("([ ]*<[^>]+>)", " ", .) %>%
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>%
    gsub("[ ]{2,}", " ", .) %>%
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>%
    gsub(" , ", ", ", .)
  return(author)
}
getDate <- function(txt, package) {
  date <- grep("^Date/Publication", txt)
  if (length(date) == 1) {
    date <- txt[date] %>%
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*", "\\1", .)
  } else {
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html", package))
    date <- cran_txt %>%
      html_nodes("tr") %>% html_text %>% convertCharset %>%
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
      .[grep("^Published", .)] %>%
      gsub(".*\n", "", .)

    # The main page doesn't contain the original date if new packages
    # have been submitted, we therefore need to check first entry in the archives
    if (cran_txt %>%
        html_nodes("tr") %>% html_text %>%
        gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
        grepl("^Old.{1,4}sources", .) %>% any) {
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/", package))
      pkg_date <-
        archive_txt %>%
        html_nodes("tr") %>%
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5) {
            return(nodes[3] %>% html_text %>% as.Date(format = "%d-%b-%Y"))
          }
        }) %>%
        .[sapply(., length) > 0] %>%
        .[!sapply(., is.na)] %>%
        head(1)
      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch(as.Date(date), error = function(e) "Date missing")
  return(date)
}
getNewPkgStats <- function(published_in) {
  # The parallel is only for making cranlogs requests; we can therefore have
  # more cores than actual cores as this isn't processor intensive while there
  # is considerable wait for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {library(cranlogs)})
  set_default_cluster(cl)
  on.exit(stop_cluster())

  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/", published_in, "/"))
  pkgs <-
    # Select the divs of the package class
    html_nodes(berries, ".package") %>%
    # Extract the text
    html_text %>%
    # Split the lines
    strsplit("[\n]+") %>%
    # Now clean the lines
    lapply(.,
           function(pkg_txt) {
             pkg_txt[sapply(pkg_txt, function(x) {nchar(gsub("^[ \t]+", "", x)) > 0},
                            USE.NAMES = FALSE)] %>%
               gsub("^[ \t]+", "", .)
           })

  # Now we select the new packages
  new_packages <-
    pkgs %>%
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>%
    grep("^New package", .) %>%
    pkgs[.] %>%
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt) {
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", "\\1", txt[1]),
        stringsAsFactors = FALSE
      )
      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)
      return(ret)
    }) %>%
    rbind_all %>%
    # Get the download data in parallel
    partition(name) %>%
    do({
      down <- cran_downloads(.$name[1],
                             from = max(as.Date("2015-01-01"), .$date[1]),
                             to = "2015-12-31")$count
      cbind(.[1, ],
            data.frame(sum = sum(down),
                       avg = mean(down)))
    }) %>%
    collect %>%
    ungroup %>%
    arrange(desc(avg))

  return(new_packages)
}
pkg_list <- lapply(2010:2015, getNewPkgStats)

pkgs <-
  rbind_all(pkg_list) %>%
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))
Downloads and time on CRAN
The longer a package has been on CRAN, the more it gets downloaded. We can illustrate this using simple linear regression; slightly surprisingly, the relationship behaves mostly linearly:
pkgs %<>%
  mutate(time_yrs = time / 365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)

# Test for non-linearity
library(splines)
anova(fit,
      update(fit, . ~ . - time_yrs + ns(time_yrs, 2)))
Analysis of Variance Table
Model 1: avg ~ time
Model 2: avg ~ ns(time, 2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7348 189661922
2 7347 189656567 1 5355.1 0.2075 0.6488
The average number of downloads increases by about 5 downloads per year. It can easily be argued that the average number of downloads isn’t that interesting since the data are skewed, so we can also look at the upper quantiles using quantile regression:
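The quantile regression table itself isn’t reproduced in this excerpt; a sketch of how such estimates could be produced with the quantreg package (an assumption, the original code may differ):

library(quantreg)
rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = c(0.5, 0.75, 0.95, 0.99))
summary(rq_fit)  # one slope per quantile: extra average downloads per year on CRAN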
The above table conveys a slightly more interesting picture. Most packages don’t get that much attention while the top 1% truly reach the masses.
Top downloaded packages
In order to investigate what packages R users have been using during 2015, I’ve looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rate, I’ve split the table by the package release dates. The results are available for browsing below (yes, it is the brand new interactive htmlTable that allows you to collapse cells; note it may not work if you are reading this on R-bloggers, and the link may be lost under certain circumstances).
Query the main R SVN repository to find the versions r-release and r-oldrel refer to, and also all previous R versions and their release dates.
Interface to the libgit2 library, which is a pure C implementation of the Git core methods. Provides access to Git repositories to extract data and running some basic git commands.
Import excel files into R. Supports ‘.xls’ via the embedded ‘libxls’ C library (http://sourceforge.net/projects/libxls/) and ‘.xlsx’ via the embedded ‘RapidXML’ C++ library (http://rapidxml.sourceforge.net). Works on Windows, Mac and Linux without external dependencies.
Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy (total downloads: 9,745; daily average: 217)
Easily translate ggplot2 graphs to an interactive web-based version and/or create custom web-based visualizations directly from R. Once uploaded to a plotly account, plotly graphs (and the data behind them) can be viewed and modified in a web browser.
Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc (total downloads: 24,806; daily average: 120)
Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’ (typically via R Markdown or Shiny). The ‘DataTables’ library has been included in this R package. The package name ‘DT’ is an abbreviation of ‘DataTables’.
Marek Gagolewski and Bartek Tartanus; IBM and other contributors; Unicode, Inc. (total downloads: 1,316,900; daily average: 3,608)
stringi allows for very fast, correct, consistent, and convenient character string/text processing in each locale and any native encoding. Thanks to the use of the ICU library, the package provides R users with a platform-independent functionality known to Java, Perl, Python, PHP and Ruby programmers.
The R6 package allows the creation of classes with reference semantics, similar to R’s built-in reference classes. Compared to reference classes, R6 classes are simpler and lighter-weight, and they are not built on S4 classes so they do not require the methods package. These classes allow public and private members, and they support inheritance.
Interactive plotting functions for use within RStudio. The manipulate function accepts a plotting expression and a set of controls (e.g. slider, picker, checkbox, or button) which are used to dynamically change values within the expression. When a value is changed using its corresponding control the expression is automatically re-executed and the plot is redrawn.
The curl() function provides a drop-in replacement for base url() with better performance and support for http 2.0, ssl (https, ftps), gzip, deflate and other libcurl goodies. This interface is implemented using the RConnection API in order to support incremental processing of both binary and text streams. If you are looking for a more user friendly http client, try the RCurl or httr packages instead.
This package provides functions to make it easy to access the RStudio API when available, and provide informative error messages when not.
This package is a fork of the RJSONIO package by Duncan Temple Lang. It builds on the parser from RJSONIO, but implements a different mapping between R objects and JSON strings. The C code in this package is mostly from Temple Lang, the R code has been rewritten from scratch. In addition to drop-in replacements for fromJSON and toJSON, the package has functions to serialize objects. Furthermore, the package contains a lot of unit tests to make sure that all edge cases are encoded and decoded consistently for use with dynamic data in systems and applications.
John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois (total downloads: 691,280; daily average: 1,894)
Boost provides free peer-reviewed portable C++ source libraries. A large part of Boost is provided as C++ template code which is resolved entirely at compile-time without linking. This package aims to provide the most useful subset of Boost libraries for template use among CRAN package. By placing these libraries in this package, we offer a more efficient distribution system for CRAN as replication of this code in the sources of other packages is avoided.
This package provides syntax highlighting for R source code. Currently it supports LaTeX and HTML output. Source code of other languages can be supported via Andre Simon’s Highlight package.
assertthat is an extension to stopifnot() that makes it easy to declare the pre and post conditions that your code should satisfy, while also producing friendly error messages so that your users know what they’ve done wrong.
httpuv provides low-level socket and protocol support for handling HTTP and WebSocket requests directly from within R. It is primarily intended as a building block for other packages, rather than making it particularly easy to create complete web applications using httpuv alone. httpuv is built on top of the libuv and http-parser C libraries, both of which were developed by Joyent, Inc. (See LICENSE file for libuv and http-parser license information.)
This package provides a framework to perform Non-negative Matrix Factorization (NMF). It implements a set of already published algorithms and seeding methods, and provides a framework to test, develop and plug new/custom algorithms. Most of the built-in algorithms have been optimized in C++, and the main interface function provides an easy way of performing parallel computations on multicore machines.
Implements the Hamming distance and weighted versions of the Levenshtein, restricted Damerau-Levenshtein (optimal string alignment), and Damerau-Levenshtein distance.
An R interface to the C libstemmer library that implements Porter’s word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary. Currently supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
This package provides a general-purpose tool for dynamic report generation in R, which can be used to deal with any type of (plain text) files, including Sweave and HTML. The patterns of code chunks and inline R expressions can be customized. R code is evaluated as if it were copied and pasted in an R terminal thanks to the evaluate package (e.g. we do not need to explicitly print() plots from ggplot2 or lattice). R code can be reformatted by the formatR package so that long lines are automatically wrapped, with indent and spaces being added, and comments being preserved. A simple caching mechanism is provided to cache results from computations for the first time and the computations will be skipped the next time. Almost all common graphics devices, including those in base R and add-on packages like Cairo, cairoDevice and tikzDevice, are built-in with this package and it is straightforward to switch between devices without writing any special functions. The width and height as well as alignment of plots in the output document can be specified in chunk options (the size of plots for graphics devices is still supported as usual). Multiple plots can be recorded in a single code chunk, and it is also allowed to rearrange plots to the end of a chunk or just keep the last plot. Warnings, messages and errors are written in the output document by default (can be turned off). Currently LaTeX, HTML and Markdown are supported, and other output formats can be supported by hook functions. The large collection of hooks in this package makes it possible for the user to control almost everything in the R code input and output. Hooks can be used either to format the output or to run a specified R code fragment before or after a code chunk. Most features are borrowed or inspired by Sweave, cacheSweave, pgfSweave, brew and decumar.
Provides useful tools for working with HTTP connections. Is a simplified wrapper built on top of RCurl. It is much much less configurable but because it only attempts to encompass the most common operations it is also much much simpler.
JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte (total downloads: 636,888; daily average: 1,745)
Markdown is a plain-text formatting syntax that can be converted to XHTML or other formats. This package provides R bindings to the Sundown markdown rendering library.
Shiny makes it incredibly easy to build interactive web applications with R. Automatic “reactive” binding between inputs and outputs and extensive pre-built widgets make it possible to build beautiful, responsive, and powerful applications with minimal effort.
Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data, that is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. See ?Lattice for an introduction.
This package provides some low-level utilities to use for package development. It currently provides managers for multiple package specific options and registries, vignette, unit test and bibtex related utilities. It serves as a base package for packages like NMF, RcppOctave, doRNG, and as an incubator package for other general purposes utilities, that will eventually be packaged separately. It is still under heavy development and changes in the interface(s) are more than likely to happen.
This package contains a set of functions for working with Random Number Generators (RNGs). In particular, it defines a generic S4 framework for getting/setting the current RNG, or RNG data that are embedded into objects for reproducibility. Notably, convenient default methods greatly facilitate the way current RNG settings can be changed.
Douglas Bates, Romain Francois and Dirk Eddelbuettel (total downloads: 634,224; daily average: 1,738)
R and Eigen integration using Rcpp. Eigen is a C++ linear template library for linear algebra: matrices, vectors, numerical solvers and related algorithms. It supports dense and sparse matrices on integer, floating point and complex numbers. The performance on many algorithms is comparable with some of the best implementations based on Lapack and level-3 BLAS. The RcppEigen package includes the header files from the Eigen C++ template library (currently version 3.0.1). Thus users do not need to install Eigen itself in order to use RcppEigen. Eigen is licensed under the GNU LGPL version 3 or later, and also under the GNU GPL version 2 or later. RcppEigen (the Rcpp bindings/bridge to Eigen) is licensed under the GNU GPL version 2 or later, as is the rest of Rcpp.
nloptr is an R interface to NLopt. NLopt is a free/open-source library for nonlinear optimization, providing a common interface for a number of different free optimization routines available online as well as original implementations of various other algorithms. See
Combine multi-dimensional arrays into a single array. This is a generalization of cbind and rbind. Works with vectors, matrices, and higher-dimensional arrays. Also provides functions adrop, asub, and afill for manipulating, extracting and replacing data in arrays.
This package provides a GUI (using gWidgets) to format R source code. Spaces and indent will be added to the code automatically, so that R code will be more readable and tidy.
This is a package that allows conversion to and from data in Javascript object notation (JSON) format. This allows R objects to be inserted into Javascript/ECMAScript/ActionScript code and allows R programmers to read and convert JSON content to R objects. This is an alternative to rjson package. That version is too slow for large data and not extensible, but a very useful prototype. This package uses methods, vectorized operations and C code and callbacks to R functions for deserializing JSON objects to R. In the future, we will implement the deserialization in C. There are some routines that can be used now for particular array types.
R and Armadillo integration using Rcpp. Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. Integer, floating point and complex numbers are supported, as well as a subset of trigonometric and statistics functions. Various matrix decompositions are provided through optional integration with LAPACK and ATLAS libraries. A delayed evaluation approach is employed (during compile time) to combine several operations into one and reduce (or eliminate) the need for temporaries. This is accomplished through recursive templates and template meta-programming. This library is useful if C++ has been decided as the language of choice (due to speed and/or integration capabilities), rather than another language. This Armadillo / C++ integration provides a nice illustration of the capabilities of the Rcpp package for seamless R and C++ integration.
Provide R functions to read/write/format Excel 2007 (xlsx) file formats.
Just as Safferling et al. noted, there is a dominance of technical packages. This is hardly surprising since the majority of the work is data munging. Among these technical packages there are quite a few that are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.
R-star authors
Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into individual authors and splitting the downloads between them, we can find that in 2015 the top R coders were:
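The idea can be sketched roughly like this (not the original code; the comma-splitting is a simplification and a real run would need more author-name cleanup):

library(tidyr)
top_authors <- pkgs %>%
  filter(year == "2015") %>%
  mutate(authors   = strsplit(author, ",[ ]*"),   # split multi-author strings (simplified)
         n_authors = sapply(authors, length),
         share     = sum / n_authors) %>%         # each author gets an equal share of the downloads
  unnest(authors) %>%                             # one row per author/package pair
  group_by(authors) %>%
  summarise(downloads = sum(share)) %>%
  arrange(desc(downloads))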
It is worth mentioning that two of the top coders are companies, RStudio and Revolution Analytics. While I like the fact that R is free and open-source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are taking R into account; it will be interesting to see what the R Consortium will bring to the community. I think r-hub is incredibly interesting and will hopefully make my life as an R-package developer easier.
My own 2015-R-experience
My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging but these are negligible.
When I originally tried dplyr out I came from the plyr environment and was disappointed by the lack of parallelization, and I found the concepts a little odd when thinking the plyr way. I had been using sqldf a lot in my data munging and merging, but when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio I find the dplyr workflow both intuitive and more productive than my previous one.
When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:
DiagrammeR An interesting new way of producing diagrams. I’ve used it for gantt charts but it allows for much more.
checkmate A neat package for checking function arguments.
covr An excellent package for testing how much of a package’s code is tested.
The site R-bloggers.com is now 6 years young. It strives to be an (unofficial) online news and tutorials website for the R community, written by over 600 bloggers who agreed to contribute their R articles to the website. In 2015, the site served almost 17.7 million pageviews to readers worldwide.
In celebration of R-bloggers’ 6th birth-month, here are the top 100 most read R posts written in 2015. Enjoy:
p.s.: 2015 was also a great year for R-users.com, a job board site for R users. If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds). If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).
The state of Utah (my adopted home) has an Open Data Catalog with lots of interesting data sets, including a collection of health care indicators from 2014 for the 29 counties in Utah. The observations for each county include measurements such as the infant mortality rate, the percent of people who don’t have insurance, what percent of people have diabetes, and so forth. Let’s see how these health care indicators are related to each other and if we can use these data to cluster Utah counties into similar groups.
Something to Keep in Mind
Before we start, let’s look at one demographic map of Utah that is important to remember.
The population in Utah is not evenly distributed among counties. Salt Lake County, where I live, has a population over 1 million people and the rest of the counties have much lower populations. Utah County, just to the south of Salt Lake, has a population that is about half of Salt Lake’s, and the numbers go down very quickly after that; there are a number of counties with populations only in the 1000s. This will affect both the actual health care indicators (rural populations can have different healthcare issues than more urban ones) and the measurements of the health care indicators.
Getting Started
The data sets at Utah’s Open Data Catalog can be downloaded via Socrata Open API. Let’s load the data, fix the data types, and remove the row that contains numbers for the state as a whole.
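A minimal sketch of that step (not the original code; the export URL is a placeholder and the label of the state-wide row is an assumption):

allHealth <- read.csv("https://opendata.utah.gov/<health-indicators-2014>.csv",  # placeholder URL
                      stringsAsFactors = FALSE)
numCols <- setdiff(names(allHealth), "County")
allHealth[numCols] <- lapply(allHealth[numCols], as.numeric)      # fix the data types
allHealth <- allHealth[allHealth$County != "State of Utah", ]     # assumed label for the state-wide row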
Now let’s explore how some of these health care indicators are related to each other. Some of the indicators are correlated with each other in ways that make sense.
ggplot(data = allHealth, aes(x = Median.Household.Income, y = Children.Eligible.Free.Lunch...Free.Lunch)) +
  geom_point(alpha = 0.6, size = 3) +
  stat_smooth(method = "lm") +
  geom_point(data = subset(allHealth, County == "Salt Lake"), size = 5, colour = "maroon") +
  xlab("Median household income (dollars)") +
  ylab("Children eligible for free lunch (percent)")
I’ve highlighted Salt Lake County in this plot and the following ones, just to give some context. The correlation coefficient between these two economic/health indicators is -0.652 with a 95% confidence interval from -0.822 to -0.374. Counties with higher incomes have fewer children eligible for free lunch.
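This kind of estimate (and its 95% confidence interval) can be computed with cor.test(), using the same columns as in the plot above:

cor.test(allHealth$Median.Household.Income,
         allHealth$Children.Eligible.Free.Lunch...Free.Lunch)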
ggplot(data = allHealth, aes(x = X65.and.over, y = X..Diabetic)) +
  geom_point(alpha = 0.6, size = 3) +
  stat_smooth(method = "lm") +
  geom_point(data = subset(allHealth, County == "Salt Lake"), size = 5, colour = "maroon") +
  xlab("Population over 65 (percent)") +
  ylab("Diabetic population (percent)")
The correlation coefficient between the population percentage over 65 and the percentage of the population with diabetes is 0.667 with a 95% confidence interval from 0.398 to 0.831. Counties with more older people in them have more people with diabetes in them. Notice that Salt Lake County has less than 10% of its population 65 or older; we are very young here in Utah, the youngest in the nation, in fact.
Then there are lots of health care indicators that are not correlated with each other.
ggplot(data = allHealth, aes(x = Premature.Age.adjusted.Mortality, y = X..Uninsured.Children.1)) +
  geom_point(alpha = 0.6, size = 3) +
  stat_smooth(method = "lm") +
  geom_point(data = subset(allHealth, County == "Salt Lake"), size = 5, colour = "maroon") +
  xlab("Premature mortality rate (per 100,000 population)") +
  ylab("Uninsured children (percent)")
The correlation coefficient between the percentage of uninsured children and premature age adjusted mortality is 0.221 with a 95% confidence interval from -0.173 to 0.555.
To facilitate exploring all of the health care indicators in the data set, I made a Shiny app where the user can plot any two indicators from the data set, add a linear regression line, calculate a correlation coefficient, and highlight any county of choice. Use the app to explore the data, and check out the code for the app on Github.
Woe Is Me, NA Values…
The clustering analysis we would like to do requires that each county has complete information for all columns, i.e. no missing values. The populations of some Utah counties are so low that some of these health care indicators cannot be measured or are zero. Let’s look at how this plays out.
health <- allHealth[, c(4:5, 18, 22, 24, 27, 31, 34, 36, 38, 42, 44, 48, 51, 55, 60, 63, 64)]
rownames(health) <- allHealth$County
colnames(health) <- c("PercentUnder18", "PercentOver65", "DiabeticRate", "HIVRate",
                      "PrematureMortalityRate", "InfantMortalityRate", "ChildMortalityRate",
                      "LimitedAccessToFood", "FoodInsecure", "MotorDeathRate", "DrugDeathRate",
                      "Uninsured", "UninsuredChildren", "HealthCareCosts", "CouldNotSeeDr",
                      "MedianIncome", "ChildrenFreeLunch", "HomicideRate")
scaledhealth <- scale(health)
library(viridis)
heatmap(scaledhealth, Colv = NA, Rowv = NA, margins = c(10, 4),
        main = "Heatmap of Data Set Values", col = viridis(32, 1))
The values for the health indicators have been scaled for this heat map (otherwise, for example, the numbers for the median income would swamp out the numbers for the HIV rate because of the units they are measured with). The blank spaces in the heat map show where we have NA values to deal with. HIV/AIDs is not a very common disease and there are no reported cases of HIV in many of the sparsely populated counties in Utah. It probably makes sense to just put a zero in those spots because more urban areas have more HIV cases. Having an infant die is also quite uncommon in the United States and there are many counties in Utah where no infants died in 2014. Does it make sense to just put a zero in those spots?
Probably not, right? Outcomes for newborn babies appear to be worse in more rural counties. While plugging in a zero for the infant mortality rate in a county where zero newborns died does make sense on one level, it is a problematic thing to do.
One option is to impute the missing values based on the values for other, similar counties. One possible method for this is the random forest, an ensemble decision tree algorithm.
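A call along the following lines (a sketch, assuming the missForest package) produces the iteration log shown below:

library(missForest)
set.seed(2016)                       # illustrative seed, for reproducibility
healthimputed <- missForest(health)  # imputed data in $ximp, imputation error estimate in $OOBerror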
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
We can access the new matrix with the imputed values via healthimputed$ximp. Unfortunately, this was not a screaming success because some of the columns have so few real measured values; the mean squared error was not good and this approach doesn’t seem like a good idea. The good news is that I tested the rest of this analysis both with the random forest imputed data and just replacing NA values with 0, and the results were pretty much the same. There were some minor differences in exactly how the counties clustered, but no major differences in the main results. Given that, let’s just replace all the NA values with zeroes, scale and center the data, and move forward.
health[is.na(health)] <- 0
health <- scale(health)
Principal Component Wonderfulness
We can think of a data set like this as a high-dimensional space where each county is at a certain spot in that space. At this point in the analysis we are working with 18 columns of observations. We removed the columns that directly measure how many people live in each county such as population number, percentage of population who are rural dwellers, etc. and kept the columns on health care indicators such as child mortality rate, homicide rate, and percentage of population who is uninsured. Thus we have an 18-dimensional space and each county is located at its own spot in that space. Principal component analysis is a way to project these data points onto a new, special coordinate system. In our new coordinate system, each coordinate, or principal component, is a weighted sum of the original coordinates. The first principal component has the most variance in the data in its direction, the second principal component has the second most variance in the data in its direction, and so forth. Let’s do it!
myPCA<-prcomp(health)
Welp, that was easy.
I just love PCA; it’s one of my very favorite algorithmic-y things. Let’s see what the first few of the principal components actually look like.
library(reshape2)
melted <- melt(myPCA$rotation[, 1:9])
ggplot(data = melted) +
  theme(legend.position = "none",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  xlab("Health care indicator measurements") +
  ylab("Relative importance in each principle component") +
  ggtitle("Variables in Principal Component Analysis") +
  geom_bar(aes(x = Var1, y = value, fill = Var1), stat = "identity") +
  facet_wrap(~Var2)
So each of these components is orthogonal to the others, and the colored bars show the contribution of each original health care indicator to that principal component. Each principal component is uncorrelated with the others and together, the principal components contain the information in the data set. Let’s zoom in on the first principal component, the one that has the largest variance and accounts for the most variability between the counties.
ggplot(data = melted[melted$Var2 == "PC1", ]) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        axis.ticks.x = element_blank()) +
  xlab("Health care indicator measurements") +
  ylab("Relative importance in principle component") +
  ggtitle("Variables in PC1") +
  geom_bar(aes(x = Var1, y = value, fill = Var1), stat = "identity")
We can see here that counties with higher positive values for PC1 (the component that accounts for the most variability among the counties) have fewer children, more older people, low HIV and homicide rates, are poorer, and have more uninsured people. These sound like more rural counties.
It’s Clustering Time
Now let’s see if this data set of health care indicators can be used to cluster similar counties together. Clustering is an example of unsupervised machine learning, where we want to use an algorithm to find structure in unlabeled data. Let’s begin with hierarchical clustering. This method of clustering begins with all the individual items (counties, in our case) alone by themselves and then starts merging them into clusters with the items that are closest to them within the space we are considering. First, the algorithm merges them into two-item clusters, then it will merge another nearby item into each cluster, and so forth, until all the items are merged together into one big cluster. We can examine the tree structure the algorithm used to do the clustering to see what kind of clustering makes sense for the data, given the context, etc. Hierarchical clustering can be done with different methods of computing the distance (or similarity) of the items.
Let’s use the fpc package, a package with lots of resources for clustering algorithms, to do some hierarchical clustering of this county health data. Let’s do the hierarchical clustering algorithm, but let’s do it with bootstrap resampling of the county sample to assess how stable the clusters are to individual counties within the sample and what the best method for computing the distance/similarity is.
I tested different methods for computing the distance and found Ward clustering to be the most stable. The bootstrap results also indicate that 3 clusters is a stable, sensible choice. Let’s look at the results for these parameters for the hierarchical clustering.
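The call looked something like this (a sketch using fpc’s clusterboot with the Ward settings described above; the exact arguments are assumptions):

library(fpc)
set.seed(506)  # illustrative seed
myClusterBoot <- clusterboot(health,
                             clustermethod = hclustCBI,  # hierarchical clustering interface
                             method = "ward.D",          # Ward linkage
                             k = 3,                      # three clusters
                             count = FALSE)              # B defaults to 100 bootstrap resamples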
myClusterBoot$bootmean
## [1] 0.8090913 0.7216360 0.6972370
myClusterBoot$bootbrd
## [1] 10 26 21
The bootmean value measures the cluster stability, where a value close to 1 indicates a stable cluster. The bootbrd value measures how many times (out of the 100 resampling runs) that cluster dissolved. These three clusters look pretty stable, so let’s take a look at how the hierarchical clustering algorithm has grouped the counties here.
library(dendextend)
myDend <- health %>%
  dist %>%
  hclust(method = "ward.D") %>%
  as.dendrogram %>%
  set("branches_k_color", k = 3) %>%
  set("labels_col", k = 3) %>%
  hang.dendrogram(hang_height = 0.8)
par(mar = c(3, 3, 3, 7))
plot(myDend, horiz = TRUE, main = "Clustering in Utah County Health Care Indicators")
The scale along the bottom shows a measure of how separated the branches of the tree structure are. As a resident of Utah, these county names look like they may be in a certain order to me; let’s check it out. What if I looked at county names ordered from lowest population to highest? (This is from the original data frame, not the data used to do the clustering.)
Yes indeed! The pink counties are the lowest population counties, the green ones are intermediate in population, and the blue counties are the most populous. The hierarchical clustering algorithm groups the counties by population based on their health care indicators.
Another algorithm for grouping similar objects is k-means clustering. K-means clustering works a bit differently than hierarchical clustering. You decide ahead of time how many clusters you are going to have (the number k) and randomly pick centers for each cluster (perhaps by picking data points at random to be the centers of each cluster). Then, the algorithm assigns each data point (county, in our case) to the closest cluster. After the clusters have their new members, the algorithm calculates new centers for each cluster. These steps of calculating the centers and assigning points to the clusters are repeated until the assignment of points to clusters converges (hopefully to a real minimum). Then you have your final cluster assignments!
The kmeansruns function in the fpc library will run k-means clustering many times to find the best clustering.
myKmeans<-kmeansruns(health,krange=1:5)
Helpfully, this function estimates the number of clusters in the data; it can use two different methods for this estimate but both give the same answer for our county health data here. If we include 1 in the range for krange, this function also tests whether there should even be more than one cluster at all. For the county health data, the best k is 2. Let’s plot what this k-means clustering looks like.
library(ggfortify)
library(ggrepel)
set.seed(2346)
autoplot(kmeans(health, 2), data = health, size = 3, aes = 0.8) +
  ggtitle("K-Means Clustering of Utah Counties") +
  theme(legend.position = "none") +
  geom_label_repel(aes(PC1, PC2,
                       fill = factor(kmeans(health, 2)$cluster),
                       label = rownames(health)),
                   fontface = 'bold', color = 'white',
                   box.padding = unit(0.5, "lines"))
This plot puts the counties on a plane where the x-axis is the first principal component and the y-axis is the second principal component; this kind of plotting can be helpful to show how data points are different from each other. Like with hierarchical clustering, k-means clustering has grouped counties by population. The cluster on the right is a low-population cluster while the cluster on the left is a high population cluster.
Remember that when we looked in detail at PC1, lower negative values of PC1 correspond to a higher homicide rate, higher HIV rate, more children and fewer older people, higher income, lower rates of being food insecure and of children eligible for free lunch, etc. Notice which counties have the lowest negative values for PC1: the three most populous counties in Utah. That is heartening to see.
The methods for estimating numbers of clusters in the k-means algorithm indicated that 2 was the best number, but we can do a 3-cluster k-means clustering to compare to the groups found by hierarchical clustering.
set.seed(2350)
autoplot(kmeans(health, 3), data = health, size = 3, aes = 0.8) +
  ggtitle("K-Means Clustering of Utah Counties") +
  theme(legend.position = "none") +
  geom_label_repel(aes(PC1, PC2,
                       fill = factor(kmeans(health, 3)$cluster),
                       label = rownames(health)),
                   fontface = 'bold', color = 'white',
                   box.padding = unit(0.5, "lines"))
These groups are very similar to those found by hierarchical clustering.
The End
If you have a skeptical turn of mind (as I tend to do), you might suggest that what the clustering algorithms are actually finding is just how many NA values each county had. The least populous counties have the most NA values, counties with more intermediate populations have just a few NA values, and the most populous counties have none. There are a couple of things to consider about this perspective. One is that the pattern of NA values is not random; it could be considered informative in itself so perhaps it is not a problem if that affected the clustering results. Another is that I tested this clustering analysis with a subset of the data that excluded the columns that had many missing values (HIV rate, homicide rate, infant mortality rate, and child mortality rate). The clustering results still showed groups of low and high population counties, although the results were messier since there was less data and the excluded columns were highly predictive. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback and other perspectives!
To leave a comment for the author, please follow the link and comment on their blog: data science ish.
Although ChIPseeker was designed for ChIP-seq annotation, I am very glad to find that others use it to annotate other data as well, including copy number variants and DNA breakpoints.
annotatePeak
Several parameters, including sameStrand, ignoreOverlap, ignoreUpstream and ignoreDownstream, were added to annotatePeak, as requested by @crazyhottommy for using ChIPseeker to annotate breakpoints from whole genome sequencing data.
Another parameter, overlap, was also introduced. By default overlap="TSS" and only genes overlapping the TSS will be reported as the nearest gene. If overlap="all", then any gene overlapping the peak will be reported as the nearest gene, whether or not the overlap is at the TSS region.
annotatePeak now also supports annotating data against user-defined regions by passing TxDb=user_defined_GRanges.
getBioRegion
The getPromoters() function prepares a GRanges object of promoter regions using user-specified upstream and downstream distances from the transcription start site (TSS). We can then align the peaks that map to these regions and visualize the profile or heatmap of ChIP binding to the TSS regions.
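The typical promoter workflow looks something like this (a sketch; the TxDb object, window size and peak object are assumptions):

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoter  <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tagMatrix <- getTagMatrix(peak, windows = promoter)            # 'peak' is a GRanges of peaks
plotAvgProf(tagMatrix, xlim = c(-3000, 3000))                  # average binding profile around the TSS
tagHeatmap(tagMatrix, xlim = c(-3000, 3000), color = "red")    # heatmap of binding around the TSS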
Users (1 and 2) were interested in the intensity of peaks binding to the start of introns/exons, so ChIPseeker provides a new function, getBioRegion, to output a GRanges object of intron/exon start regions.
ChIPseeker incorporates the GEO database and supports data mining to infer cooperative regulation. The data has been updated and ChIPseeker now contains information on 19,348 bed files.
clusterProfiler
We compared clusterProfiler with GSEA-P (released by the Broad Institute); the p-values calculated by the two programs are almost identical.
For comparing biological themes, clusterProfiler supports formulas to express complex conditions, and faceting is supported for visualizing complex results.
The read.gmt function parses the GMT file format from the Molecular Signatures Database, so that gene set collections from this database can be used in clusterProfiler for both the hypergeometric test and GSEA.
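For example (a sketch; the GMT file name and the gene vectors are assumptions):

gmt  <- read.gmt("hallmark.gmt")            # a gene set collection downloaded from MSigDB
egmt <- enricher(gene, TERM2GENE = gmt)     # hypergeometric test against those gene sets
gsea <- GSEA(geneList, TERM2GENE = gmt)     # GSEA with a ranked gene list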
KEGG Module is now supported just like KEGG Pathway; clusterProfiler queries the online annotation data, which keeps the annotation always up to date.
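KEGG Module enrichment is then run in the same style as pathway enrichment (a sketch, assuming the enrichMKEGG interface mirrors enrichKEGG and that gene is a vector of KEGG gene IDs):

kk  <- enrichKEGG(gene = gene, organism = "hsa")    # KEGG Pathway, queried online
mkk <- enrichMKEGG(gene = gene, organism = "hsa")   # KEGG Module, queried online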
The KEGG database is updated quite frequently. KEGG.db, which has not been updated since 2012, contains annotation for 5,894 human genes. In Feb. 2015, when clusterProfiler first supported querying online KEGG data, KEGG contained annotation for 6,861 human genes, and today it has 7,018 human genes annotated. Most of the tools/webservers use out-dated data (e.g. DAVID, not updated since 2010, with 5,085 human genes annotated by KEGG), and the analysis results may change completely if we use recently updated data. clusterProfiler is more reliable in this regard, as we always use the latest data.
In addition to the bitr function, which translates biological IDs using an OrgDb object, we provide bitr_kegg, which uses the KEGG API for translating biological IDs. It supports more than 4,000 species (which can be searched via the search_kegg_species function), as in the KEGG Pathway and Module analyses.
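A sketch of the call (the gene vector and ID types here are illustrative):

eg2up <- bitr_kegg(gene, fromType = "kegg", toType = "uniprot", organism = "hsa")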
The function calls of enrichGO and gseGO were changed. Now not only species with an OrgDb available in Bioconductor can be analyzed, but any species with an OrgDb, which can be queried online via AnnotationHub or built from the user’s own data. With this update, enrichGO and gseGO accept any gene ID type, as long as that ID type is supported by the OrgDb.
GO enrichment analysis always outputs redundant terms, so we implemented a simplify function that removes redundant terms by calculating GO semantic similarity using GOSemSim. Several useful utilities, including dropGO, go2ont, go2term, gofilter and gsfilter, are also provided.
I bumped the version to 3.0.0 for the following three reasons:
the changes of function calls
can analyze any ontology/pathway annotation (supports user’s customize annotation data)
can analyze all species that have annotation available (e.g. more than 4000 species for KEGG)
Although the package was very simple when I first published it, I keep updating it and adding new features based on my own ideas or users’ requests. The package is now indeed in good shape. Here is the summary.
This package implements methods to analyze and visualize functional profiles of genomic coordinates (supported by ChIPseeker), gene and gene clusters.
clusterProfiler supports both hypergeometric test and Gene Set Enrichment Analysis for many ontologies/pathways, including:
setReadable (convert IDs stored in an enrichResult object to gene symbols)
simplify (remove redundant GO terms, supported via GOSemSim)
DOSE
DOSE now tests bimodal distributions separately in GSEA, and the output p-values are more conservative.
The maxGSSize parameter was added, with a default value of 500. Usually if a gene set contains more than 500 genes, its probability of being called significant by GSEA rises quite dramatically.
The gsfilter function restricts enriched results by minimal and maximal gene set sizes.
upsetplot was implemented to visualize the overlap of enriched gene sets.
The dot sizes in enrichMap are now scaled by category sizes.
All these changes also affect clusterProfiler and ReactomePA.
ggtree
I put more effort into extending ggtree than into all the other packages combined. The major new features are listed here, while small improvements and bug fixes can be found in the NEWS file.
CHANGES IN VERSION 1.3.16
------------------------
o geom_treescale() supports family argument <2016-04-27, Wed>
+ https://github.com/GuangchuangYu/ggtree/issues/56
o update fortify.phylo to work with phylo that has missing value of edge length <2016-04-21, Thu>
+ https://github.com/GuangchuangYu/ggtree/issues/54
o support passing textConnection(text_string) as a file <2016-04-21, Thu>
+ contributed by Casey Dunn <casey_dunn@brown.edu>
+ https://github.com/GuangchuangYu/ggtree/pull/55#issuecomment-212859693
CHANGES IN VERSION 1.3.15
------------------------
o geom_tiplab2 supports parameter hjust <2016-04-18, Mon>
o geom_tiplab and geom_tiplab2 support using geom_label2 by passing geom="label" <2016-04-07, Thu>
o geom_label2 that support subsetting <2016-04-07, Thu>
o geom_tiplab2 for adding tip label of circular layout <2016-04-06, Wed>
o use plot$plot_env to access ggplot2 parameter <2016-04-06, Wed>
o geom_taxalink for connecting related taxa <2016-04-01, Fri>
o geom_range for adding range of HPD to present uncertainty of evolutionary inference <2016-04-01, Fri>
CHANGES IN VERSION 1.3.14
------------------------
o geom_tiplab works with NA values, compatible with collapse <2016-03-05, Sat>
o update theme_tree2 due to the issue of https://github.com/hadley/ggplot2/issues/1567 <2016-03-05, Sat>
o offset works in `align=FALSE` with `annotation_image` function <2016-02-23, Tue>
+ see https://github.com/GuangchuangYu/ggtree/issues/46
o subview and inset now supports annotating with img files <2016-02-23, Tue>
CHANGES IN VERSION 1.3.13
------------------------
o add example of rescale_tree function in treeAnnotation.Rmd <2016-02-07, Sun>
o geom_cladelabel works with collapse <2016-02-07, Sun>
+ see https://github.com/GuangchuangYu/ggtree/issues/38
CHANGES IN VERSION 1.3.12
------------------------
o exchange function name of geom_tree and geom_tree2 <2016-01-25, Mon>
o solved issues of geom_tree2 <2016-01-25, Mon>
+ https://github.com/hadley/ggplot2/issues/1512
o colnames_level parameter in gheatmap <2016-01-25, Mon>
o raxml2nwk function for converting raxml bootstrap tree to newick format <2016-01-25, Mon>
CHANGES IN VERSION 1.3.11
------------------------
o solved issues of geom_tree2 <2016-01-25, Mon>
+ https://github.com/GuangchuangYu/ggtree/issues/36
o change compute_group() to compute_panel in geom_tree2() <2016-01-21, Thu>
+ fixed issue, https://github.com/GuangchuangYu/ggtree/issues/36
o support phyloseq object <2016-01-21, Thu>
o update geom_point2, geom_text2 and geom_segment2 to support setup_tree_data <2016-01-21, Thu>
o implement geom_tree2 layer that support duplicated node records via the setup_tree_data function <2016-01-21, Thu>
o rescale_tree function for rescaling branch length of tree object <2016-01-20, Wed>
o upgrade set_branch_length, now branch can be rescaled using feature in extraInfo slot <2016-01-20, Wed>
CHANGES IN VERSION 1.3.10
------------------------
o remove dependency of gridExtra by implementing multiplot function instead of using grid.arrange <2016-01-20, Wed>
o remove dependency of colorspace <2016-01-20, Wed>
o support phylip tree format and update vignette of phylip example <2016-01-15, Fri>
CHANGES IN VERSION 1.3.9
------------------------
o optimize getYcoord <2016-01-14, Thu>
o add 'multiPhylo' example in 'Tree Visualization' vignette <2016-01-13, Wed>
o viewClade, scaleClade, collapse, expand, rotate, flip, get_taxa_name and scale_x_ggtree accepts input tree_view=NULL.
these function will access the last plot if tree_view=NULL. <2016-01-13, Wed>
+ > ggtree(rtree(30)); viewClade(node=35) works. no need to pipe.
CHANGES IN VERSION 1.3.8
------------------------
o add example of viewClade in 'Tree Manipulation' vignette <2016-01-13, Wed>
o add viewClade function <2016-01-12, Tue>
o support obkData object defined by OutbreakTools <2016-01-12, Tue>
o update vignettes <2016-01-07, Thu>
o 05 advance tree annotation vignette <2016-01-04, Mon>
o export theme_inset <2016-01-04, Mon>
o inset, nodebar, nodepie functions <2015-12-31, Thu>
CHANGES IN VERSION 1.3.7
------------------------
o split the long vignette to several vignettes
+ 00 ggtree <2015-12-29, Tue>
+ 01 tree data import <2015-12-28, Mon>
+ 02 tree visualization <2015-12-28, Mon>
+ 03 tree manipulation <2015-12-28, Mon>
+ 04 tree annotation <2015-12-29, Tue>
CHANGES IN VERSION 1.3.6
------------------------
o MRCA function for finding Most Recent Common Ancestor among a vector of tips <2015-12-22, Tue>
o geom_cladelabel: add bar and label to annotate a clade <2015-12-21, Mon>
- remove annotation_clade and annotation_clade2 functions.
o geom_treescale: tree scale layer. (add_legend was removed) <2015-12-21, Mon>
CHANGES IN VERSION 1.3.5
------------------------
o bug fixed, read.nhx now works with scientific notation <2015-11-30, Mon>
+ see https://github.com/GuangchuangYu/ggtree/issues/30
CHANGES IN VERSION 1.3.4
------------------------
o rename beast feature when name conflict with reserve keywords (label, branch, etc) <2015-11-27, Fri>
o get_clade_position function <2015-11-26, Thu>
+ https://github.com/GuangchuangYu/ggtree/issues/28
o get_heatmap_column_position function <2015-11-25, Wed>
+ see https://github.com/GuangchuangYu/ggtree/issues/26
o support NHX (New Hampshire X) format via read.nhx function <2015-11-17, Tue>
o bug fixed in extract.treeinfo.jplace <2015-11-17, Thu>
CHANGES IN VERSION 1.3.3
------------------------
o support color=NULL in gheatmap, then no colored line will draw within the heatmap <2015-10-30, Fri>
o add `angle` for also rectangular, so that it will be available for layout='rectangular' following by coord_polar() <2015-10-27, Tue>
CHANGES IN VERSION 1.3.2
------------------------
o update vignette, add example of ape bootstrap and phangorn ancestral sequences <2015-10-26, Mon>
o add support of ape bootstrap analysis <2015-10-26, Mon>
see https://github.com/GuangchuangYu/ggtree/issues/20
o add support of ancestral sequences inferred by phangorn <2015-10-26, Mon>
see https://github.com/GuangchuangYu/ggtree/issues/21
CHANGES IN VERSION 1.3.1
------------------------
o change angle to angle + 90, so that label will in radial direction <2015-10-22, Thu>
+ see https://github.com/GuangchuangYu/ggtree/issues/17
o na.rm should be always passed to layer(), fixed it in geom_hilight and geom_text2 <2015-10-21, Wed>
+ see https://github.com/hadley/ggplot2/issues/1380
o matching beast stats with tree using internal node number instead of label <2015-10-20, Tue>
Although ChIPseeker was designed for ChIP-seq annotation, I am very glad to find that someone else use it to annotate other data including copy number variants and DNA breakpoints.
annotatePeak
Several parameters including sameStrand,ignoreOverlap, ignoreUpstream and ignoreDownstream were added in annotatePeak requested by @crazyhottommy for using ChIPseeker to annotate breakpoints from whole genome sequencing data.
Another parameter overlap was also introduced. By default overlap="TSS" and only overlap with TSS will be reported as the nearest gene. If overlap="all", then gene overlap with peak will be reported as nearest gene, no matter the overlap is at TSS region or not.
Now annotatePeak also support using user’s customize regions to annotate their data by passing TxDb=user_defined_GRanges.
getBioRegion
getPromoters() function prepare a GRanges object of promoter regions by user specific upstream and downstream distance from Transcript Start Site (TSS). Then we can align the peaks that are mapping to these regions and visualize the profile or heatmap of ChIP binding to the TSS regions.
Users (1 and 2) are interesting in the intensity of peaks binding to the start of intron/exon, and ChIPseeker provides a new function getBioRegion to output GRanges object of Intron/Exon start regions.
ChIPseeker incorporates GEO database and supports data mining to infer cooperative regulation. The data was updated and now ChIPseeker contains 19348 bed file information.
clusterProfiler
We compare clusterProfiler with GSEA-P (which released by broad institute), the p-values calculated by these two software are almost identical.
For comparing biological themes, clusterProfiler supports formula to express complex conditions and facet is supported to visualize complex result.
read.gmt function for parsing GMT file format from Molecular Signatures Database, so that gene set collections in this database can be used in clusterProfiler for both hypergeometric test and GSEA.
KEGG Module was supported just like the KEGG Pathway, clusterProfiler will query the online annotation data which keep the annotation data alwasy updated.
The KEGG database was updated quite frequently. The KEGG.db which was not updated since 2012, it contains annotation of 5894 human genes. In Feb. 2015, when clusterProfiler first supports querying online KEGG data, KEGG contains annotation of 6861 human genes and today it has 7018 human genes annotated. Most of the tools/webservers used out-dated data (e.g. DAVID not updated since 2010, 5085 human genes annotated by KEGG), the analyzed result may totally changed if we use a recently updated data. Indeed clusterProfiler is more reliable as we always use the latest data.
In addition to bitr function that can translate biological ID using OrgDb object, we provides bitr_kegg that uses KEGG API for translating biological ID. It supports more than 4000 species (can be search via the search_kegg_species function) as in KEGG Pathway and Module analyses.
The function called of enrichGO and gseGO was changed. Now not only species that have OrgDb available in Bioconductor can be analyzed but also all species that have an OrgDb can be analyzed which can be query online via AnnotationHub or build with user’s own data. With this update, enrichGO and gseGO can input any gene ID type if only the ID type was supported in the OrgDb.
GO enrichment analysis always outputs redundant terms, so we implemented a simplify function to remove redundant terms by calculating GO semantic similarity using GOSemSim. Several useful utilities, including dropGO, go2ont, go2term, gofilter and gsfilter, are also provided.
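For instance (a sketch; genes is a placeholder vector of Entrez IDs and the gsfilter argument names are assumed from its help page):
library(clusterProfiler)
library(org.Hs.eg.db)
ego  <- enrichGO(genes, OrgDb = org.Hs.eg.db, ont = "BP")
ego2 <- simplify(ego, cutoff = 0.7, by = "p.adjust", select_fun = min)  # drop redundant GO terms
ego3 <- dropGO(ego, level = 5)                                          # drop terms at a given GO level
ego4 <- gsfilter(ego, by = "GSSize", min = 10, max = 500)               # restrict gene set sizes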
I bumped the version to 3.0.0 for the following three reasons:
the change of function calls
it can analyze any ontology/pathway annotation (supports user-customized annotation data)
it can analyze all species that have annotation available (e.g. more than 4000 species for KEGG)
Although the package was very simple when I first published it, I keep updating it and adding new features, based on my own ideas or users' requests. The package is now in good shape. Here is a summary.
This package implements methods to analyze and visualize functional profiles of genomic coordinates (supported by ChIPseeker), genes and gene clusters.
clusterProfiler supports both the hypergeometric test and Gene Set Enrichment Analysis for many ontologies/pathways, including:
setReadable (convert IDs stored in the enrichResult object to gene symbols)
simplify (remove redundant GO terms, supported via GOSemSim)
DOSE
DOSE now tests the bimodal (positive and negative) enrichment score distributions separately in GSEA, and the output p-values are more conservative.
A maxGSSize parameter was added, with a default value of 500. Usually, if a gene set contains more than 500 genes, its probability of being called significant by GSEA rises quite dramatically.
The gsfilter function restricts enrichment results to minimal and maximal gene set sizes.
upsetplot was implemented to visualize the overlap of enriched gene sets.
The dot sizes in enrichMap are now scaled by category size.
All these changes also affect clusterProfiler and ReactomePA.
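A sketch of the new options (geneList is a placeholder ranked, named vector of Entrez IDs and genes a placeholder gene vector):
library(DOSE)
gsea <- gseDO(geneList, maxGSSize = 500)  # very large gene sets are skipped
edo  <- enrichDO(genes)
edo2 <- gsfilter(edo, by = "GSSize", min = 10, max = 500)  # post-hoc size filter
upsetplot(edo)  # overlap of genes among enriched DO terms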
ggtree
I put more effort into extending ggtree than into all the other packages combined. The major new features are listed here, while small improvements and bug fixes can be found in the NEWS file.
CHANGES IN VERSION 1.3.16
------------------------
o geom_treescale() supports family argument <2016-04-27, Wed>
+ https://github.com/GuangchuangYu/ggtree/issues/56
o update fortify.phylo to work with phylo that has missing value of edge length <2016-04-21, Thu>
+ https://github.com/GuangchuangYu/ggtree/issues/54
o support passing textConnection(text_string) as a file <2016-04-21, Thu>
+ contributed by Casey Dunn <casey_dunn@brown.edu>
+ https://github.com/GuangchuangYu/ggtree/pull/55#issuecomment-212859693
CHANGES IN VERSION 1.3.15
------------------------
o geom_tiplab2 supports parameter hjust <2016-04-18, Mon>
o geom_tiplab and geom_tiplab2 support using geom_label2 by passing geom="label" <2016-04-07, Thu>
o geom_label2 that support subsetting <2016-04-07, Thu>
o geom_tiplab2 for adding tip label of circular layout <2016-04-06, Wed>
o use plot$plot_env to access ggplot2 parameter <2016-04-06, Wed>
o geom_taxalink for connecting related taxa <2016-04-01, Fri>
o geom_range for adding range of HPD to present uncertainty of evolutionary inference <2016-04-01, Fri>
CHANGES IN VERSION 1.3.14
------------------------
o geom_tiplab works with NA values, compatible with collapse <2016-03-05, Sat>
o update theme_tree2 due to the issue of https://github.com/hadley/ggplot2/issues/1567 <2016-03-05, Sat>
o offset works in `align=FALSE` with `annotation_image` function <2016-02-23, Tue>
+ see https://github.com/GuangchuangYu/ggtree/issues/46
o subview and inset now supports annotating with img files <2016-02-23, Tue>
CHANGES IN VERSION 1.3.13
------------------------
o add example of rescale_tree function in treeAnnotation.Rmd <2016-02-07, Sun>
o geom_cladelabel works with collapse <2016-02-07, Sun>
+ see https://github.com/GuangchuangYu/ggtree/issues/38
CHANGES IN VERSION 1.3.12
------------------------
o exchange function name of geom_tree and geom_tree2 <2016-01-25, Mon>
o solved issues of geom_tree2 <2016-01-25, Mon>
+ https://github.com/hadley/ggplot2/issues/1512
o colnames_level parameter in gheatmap <2016-01-25, Mon>
o raxml2nwk function for converting raxml bootstrap tree to newick format <2016-01-25, Mon>
CHANGES IN VERSION 1.3.11
------------------------
o solved issues of geom_tree2 <2016-01-25, Mon>
+ https://github.com/GuangchuangYu/ggtree/issues/36
o change compute_group() to compute_panel in geom_tree2() <2016-01-21, Thu>
+ fixed issue, https://github.com/GuangchuangYu/ggtree/issues/36
o support phyloseq object <2016-01-21, Thu>
o update geom_point2, geom_text2 and geom_segment2 to support setup_tree_data <2016-01-21, Thu>
o implement geom_tree2 layer that support duplicated node records via the setup_tree_data function <2016-01-21, Thu>
o rescale_tree function for rescaling branch length of tree object <2016-01-20, Wed>
o upgrade set_branch_length, now branch can be rescaled using feature in extraInfo slot <2016-01-20, Wed>
CHANGES IN VERSION 1.3.10
------------------------
o remove dependency of gridExtra by implementing multiplot function instead of using grid.arrange <2016-01-20, Wed>
o remove dependency of colorspace <2016-01-20, Wed>
o support phylip tree format and update vignette of phylip example <2016-01-15, Fri>
CHANGES IN VERSION 1.3.9
------------------------
o optimize getYcoord <2016-01-14, Thu>
o add 'multiPhylo' example in 'Tree Visualization' vignette <2016-01-13, Wed>
o viewClade, scaleClade, collapse, expand, rotate, flip, get_taxa_name and scale_x_ggtree accepts input tree_view=NULL.
these function will access the last plot if tree_view=NULL. <2016-01-13, Wed>
+ > ggtree(rtree(30)); viewClade(node=35) works. no need to pipe.
CHANGES IN VERSION 1.3.8
------------------------
o add example of viewClade in 'Tree Manipulation' vignette <2016-01-13, Wed>
o add viewClade function <2016-01-12, Tue>
o support obkData object defined by OutbreakTools <2016-01-12, Tue>
o update vignettes <2016-01-07, Thu>
o 05 advance tree annotation vignette <2016-01-04, Mon>
o export theme_inset <2016-01-04, Mon>
o inset, nodebar, nodepie functions <2015-12-31, Thu>
CHANGES IN VERSION 1.3.7
------------------------
o split the long vignette to several vignettes
+ 00 ggtree <2015-12-29, Tue>
+ 01 tree data import <2015-12-28, Mon>
+ 02 tree visualization <2015-12-28, Mon>
+ 03 tree manipulation <2015-12-28, Mon>
+ 04 tree annotation <2015-12-29, Tue>
CHANGES IN VERSION 1.3.6
------------------------
o MRCA function for finding Most Recent Common Ancestor among a vector of tips <2015-12-22, Tue>
o geom_cladelabel: add bar and label to annotate a clade <2015-12-21, Mon>
- remove annotation_clade and annotation_clade2 functions.
o geom_treescale: tree scale layer. (add_legend was removed) <2015-12-21, Mon>
CHANGES IN VERSION 1.3.5
------------------------
o bug fixed, read.nhx now works with scientific notation <2015-11-30, Mon>
+ see https://github.com/GuangchuangYu/ggtree/issues/30
CHANGES IN VERSION 1.3.4
------------------------
o rename beast feature when name conflict with reserve keywords (label, branch, etc) <2015-11-27, Fri>
o get_clade_position function <2015-11-26, Thu>
+ https://github.com/GuangchuangYu/ggtree/issues/28
o get_heatmap_column_position function <2015-11-25, Wed>
+ see https://github.com/GuangchuangYu/ggtree/issues/26
o support NHX (New Hampshire X) format via read.nhx function <2015-11-17, Tue>
o bug fixed in extract.treeinfo.jplace <2015-11-17, Thu>
CHANGES IN VERSION 1.3.3
------------------------
o support color=NULL in gheatmap, then no colored line will draw within the heatmap <2015-10-30, Fri>
o add `angle` for also rectangular, so that it will be available for layout='rectangular' following by coord_polar() <2015-10-27, Tue>
CHANGES IN VERSION 1.3.2
------------------------
o update vignette, add example of ape bootstrap and phangorn ancestral sequences <2015-10-26, Mon>
o add support of ape bootstrap analysis <2015-10-26, Mon>
+ see https://github.com/GuangchuangYu/ggtree/issues/20
o add support of ancestral sequences inferred by phangorn <2015-10-26, Mon>
+ see https://github.com/GuangchuangYu/ggtree/issues/21
CHANGES IN VERSION 1.3.1
------------------------
o change angle to angle + 90, so that label will in radial direction <2015-10-22, Thu>
+ see https://github.com/GuangchuangYu/ggtree/issues/17
o na.rm should be always passed to layer(), fixed it in geom_hilight and geom_text2 <2015-10-21, Wed>
+ see https://github.com/hadley/ggplot2/issues/1380
o matching beast stats with tree using internal node number instead of label <2015-10-20, Tue>
GOSemSim
The IC data was updated using the updated OrgDb packages.
ReactomePA
The internal implementation was updated according to the changes in DOSE.