Search Results for "heatmap" – R-bloggers

A Dashboard Implementation Of Clustering App


(This article was first published on Stefantastic - r, and kindly contributed to R-bloggers)

Background

I previously built an interactive, online App using Shiny where you can upload your own data, perform basic clustering analysis, and view correlations in a heatmap. I wrote about this in my last post. While this worked, it was very ugly and needed a face lift. With the help of a friend, I implemented a dashboard using shinydashboard.

New implementation

The new implementation organizes the individual apps in a dashboard instead of a single Rmarkdown file. Essentially the individual apps (upload, clustering, etc.) remained the same (except for some feature improvements). The big difference is that the outer layer is now a normal shiny app with the standard app.R, ui.R, and server.R files. The other major change is the addition of three new files controlling the dashboard appearance and contents (body.R, sidebar.R, and header.R). See the shinydashboard Get Started for a basic example and the Structure page for more details.
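For readers who have not used shinydashboard before, a minimal sketch of how those pieces fit together is below. The layout calls are the standard shinydashboard API, but the tab names, inputs and plot are purely illustrative and are not taken from the actual app; in the app itself the header, sidebar and body objects live in the separate header.R, sidebar.R and body.R files mentioned above.

# app.R -- minimal shinydashboard skeleton (illustrative, not the actual app code)
library(shiny)
library(shinydashboard)

header  <- dashboardHeader(title = "Clustering app")

sidebar <- dashboardSidebar(
  sidebarMenu(
    menuItem("Upload", tabName = "upload"),
    menuItem("Clustering", tabName = "clustering")
  )
)

body <- dashboardBody(
  tabItems(
    tabItem(tabName = "upload", fileInput("file", "Upload a CSV file")),
    tabItem(tabName = "clustering", plotOutput("heatmap"))
  )
)

server <- function(input, output) {
  output$heatmap <- renderPlot({
    req(input$file)                       # wait until a file is uploaded
    dat <- read.csv(input$file$datapath)
    num <- dat[sapply(dat, is.numeric)]   # numeric columns only for the correlation heatmap
    heatmap(cor(num))
  })
}

shinyApp(dashboardPage(header, sidebar, body), server)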

Try it out!

Take a look at the dashboard app and source code and leave a comment to let me know what you think! You can download a sample data set (mtcars) to try out if you don’t have your own.

Previous Implementation

Screenshot of previous implementation

Dashboard Implementation

Screenshot of dashboard implementation

To leave a comment for the author, please follow the link and comment on his blog: Stefantastic - r.


Simulating backgammon players’ Elo ratings


(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Probabilities of winning with a given rating

Backgammon clubs and on-line forums use a modified form of the Elo rating system to keep track of how well individuals have played and to draw inferences about their underlying strength. The higher the rating, the stronger the player is inferred to be and hence the more likely they are to win; and the longer the match, the more likely it is that the greater skill will overcome the random chance of the dice. The animated plot below shows the expected winning probabilities of two players on FIBS (First Internet Backgammon Server), where players start at 1500 and the very best reach over 2000.

Animated heatmap of player probabilities of winning

When the two players have the same rating, they are expected to have a 50/50 chance of winning, which produces the constant diagonal line in the animation above. When one player is better than the other they have a higher chance of winning, represented for Player A by the increasingly blue space in the bottom right of the plot and by the numbers (estimated probabilities) labelling the contour lines. The actual formula used on FIBS and illustrated above is defined on the FIBS site as:

Winning prob. = 1 - (1 / (10 ^ ((YOU - HIM) * SQRT(ML) / 2000) + 1))

(As an aside, I don’t think there is a strong theoretical reason for the SQRT(ML) in that formula. With the complications of the doubling cube I very much doubt that the relationship of the probability of winning to match length is as simple as that. But it’s a reasonable empirical approximation; I might blog more about that another time.)

Now, here’s how I made that animation. In the R code below, first, I load up some functionality and the Google font I use for this blog. Then I define a function that estimates the probability of a player winning using the FIBS formula above.

library(dplyr)
library(showtext) # for fonts
library(RColorBrewer)
library(directlabels) # for labels on contour lines
library(ggplot2)
library(scales)
library(forecast) # for auto.arima later on

font.add.google("Poppins", "myfont")
showtext.auto()
#==================helper functions=====================

fibs_p <- function(a, b, ml){
   # function to determine the expected probability of player a winning a backgammon match
   # against player b of length ml, with a and b representing their FIBS Elo ratings
   tmp <- 1 - (1 / (10 ^ ((a - b) * sqrt(ml) / 2000) + 1))
   return(tmp)
}

The strategy to create the animation is simple:

  • Create all the combinations of two players’ Elo ratings.
  • Loop through 23 possible match lengths, calculating the probabilities of A winning for each combination of Elo ratings at that match length
  • Draw a still image heatmap for each of those 23 match lengths using {ggplot2}
  • Compile the images into a single animated Gif using ImageMagick.
#================heat maps showing probability of winning

# create a matrix of players A and B's possible Elo ratings
a <- b <- seq(from = 1000, to = 2000, by = 5)
mat <- expand.grid(a, b) %>%
   rename(a = Var1, b = Var2)

# create a folder to hold the images and navigate to it
dir.create("tmp0002")
owd <- setwd("tmp0002")

# cycle through a range of possible match lengths, drawing a plot for each
matchlengths <- seq(from = 1, to = 23, by = 1)
res <- 150

for(i in 1:length(matchlengths)){
   df1 <- mat %>% mutate(probs = fibs_p(a = a, b = b, ml = matchlengths[i]))
   
   p1 <- ggplot(df1,  aes(x = a, y = b, z = probs)) +
      geom_tile(aes(fill = probs)) +
      theme_minimal(base_family = "myfont") +
      scale_fill_gradientn(colours = brewer.pal(10, "Spectral"), limits = c(0, 1)) +
      scale_colour_gradientn(colours = "black") +
      stat_contour(aes(colour = ..level..), binwidth = .1) + # force contours to be same distance each image
      labs(x = "Player A Elo rating", y = "Player B Elo rating") +
      theme(legend.position = "none") +
      coord_equal() +
      ggtitle(paste0("Probability of Player A winning a match to ", matchlengths[i]))
   
   png(paste0(letters[i], ".png"), res * 5, res * 5, res = res, bg = "white")
      print(direct.label(p1)) # direct labels used to label the contour points
   dev.off()
}

# compile the images into a single animated Gif with ImageMagick.  Note that
# the {animation} package provides R wrappers to do this but they often get
# confused (eg clashes with Windows' convert) and it's easier to do it explicitly
# yourself by sending an instruction to the system:
system('"C:\\Program Files\\ImageMagick-6.9.1-Q16\\convert" -loop 0 -delay 150 *.png "EloProbs.gif"')

# move the asset over to where needed for the blog
file.copy("EloProbs.gif", "../../img/0002-EloProbs.gif", overwrite = TRUE)

# clean up
setwd(owd)
unlink("tmp0002", recursive = TRUE)

Actual ratings varying, “true” rating constant

The estimated probability of winning against a certain opponent at a given match length might be of interest to backgammon players (I’m surprised it’s not referred to more often), but its most common use is embedded in the calculation of how much each player’s Elo rating changes after a match is decided. That formula is well explained by FIBS and a little involved so I won’t spell it out here, but you can see it in action in the function I provide later in this post. A key point for our purposes is that as the win/loss of a match is a random event, players’ Elo ratings which are based on those events are random variables. Even if we grant the existence of a “real” value for each player, it is unobservable, and can only be estimated by their actual Elo rating in a tournament or internet forum.

In fact, my original motivation for this blog post was to see how much Elo ratings fluctuate due to randomness of individual games. In a later post I’ll have a more complex and realistic simulation, but the chart below shows what happens to a player’s Elo ratings over time in the following situation:

  • Only two players
  • They only ever play 5 point matches, and they play 10,000 of them
  • Player A’s chance of winning the 5 point match is 0.6, and this does not change over time (ie no player is improving in skill relative to the other)

The results are shown in the chart below, and in this unrealistically simple scenario there’s more variation than I’d realised there would be. Player A’s actual Elo rating fluctuates fairly wildly around the true value (which can be calculated as 1578.75), hitting 1650 several times and dropping nearly all the way to 1500 at one point.
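For the curious, the 1578.75 comes from inverting the FIBS probability formula: in this two-player world the two ratings stay roughly mirror images around 1500 (so they sum to about 3000), and we just need the rating at which a 5-point match is won 60% of the time. A quick sketch reusing the fibs_p function defined earlier:

# find the rating a for which fibs_p(a, 3000 - a, ml = 5) = 0.6;
# uniroot does the one-dimensional root finding
uniroot(function(a) fibs_p(a = a, b = 3000 - a, ml = 5) - 0.6,
        interval = c(1500, 1700))$root
# approximately 1578.75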

Ratings of Player A in 2 player simulation

Here’s the code for that simulation, including my R function that provides FIBS-style Elo ratings, adjusted for experience as set out by FIBS.

#=================determine change of Elo rating ================
fibs_scores <- function(a, b, winner = "a", ml = 1, axp = 500, bxp = 500){
   # a is Elo rating of player A before match
   # b is Elo rating of player B before match
   # ml is match length
   # axp is total match lengths (experience) of player a until this match
   # bxp is total match lengths (experience) of player b until this match
   # see http://www.fibs.com/ratings.html for formulae
   
   # calculate experience-correction multipliers:
   multa <- ifelse(axp < 400, 5 - ((axp + ml) / 100), 1)
   multb <- ifelse(bxp < 400, 5 - ((bxp + ml) / 100), 1)
   
   # probability of A winning, using fibs_p function defined earlier:
   winproba <- fibs_p(a = a, b = b, ml = ml)
   
   # match value (points to be distributed between the two players):
   matchvalue <- 4 * sqrt(ml)
   
   # who gets them?:
   if(winner == "b"){
      a <- a - matchvalue * winproba * multa
      b <- b + matchvalue * winproba * multb
   } else {
      a <- a + matchvalue * (1 - winproba) * multa 
      b <- b - matchvalue * (1- winproba) * multb
   }
   
   return(list(a = a, b = b, axp = axp + ml, bxp = bxp + ml))
}

# test against "baptism by fire" example at http://www.fibs.com/ratings.html
# should be 1540.95:
round(fibs_scores(a = 1500, b = 1925, ml = 7, axp = 0, bxp = 10000, winner = "a")$a, 2)

#================simulations of how long it takes for scores to stabilise================
# two people playing each other 5-point matches, with player A having a 0.6 chance of winning each


A <- B <- data_frame(rating = 1500, exp = 0)

timeseries <- data_frame(A = A$rating, B = B$rating)

set.seed(123) # for reproducibility
for (i in 1:10000){
   result <- fibs_scores(a = A$rating, b = B$rating, 
               winner = ifelse(runif(1) > 0.4, "a", "b"),
               axp = A$exp, bxp = B$exp, ml = 5)
   timeseries[i, "A"] <- A$rating <- result$a
   timeseries[i, "B"] <-    B$rating <- result$b
   A$exp <- result$axp
   B$exp <- result$bxp
   
}

timeseries$A <- ts(timeseries$A)
timeseries$B <- ts(timeseries$B)
timeseries$time <- 1:nrow(timeseries)

# theoretical value:
tv <- (log(1/ 0.4 - 1) / log(10)  * 2000 / sqrt(5) + 3000 ) / 2


svg("../img/0002-elo-rating.svg", 8, 5)
ggplot(timeseries, aes(x = time, y = A)) +
   geom_line(colour = "grey50") +
   theme_minimal(base_family = "myfont") +
   scale_y_continuous("Player A's Elo rating") +
   geom_hline(yintercept = tv, colour = "blue") +
   ggtitle("Elo rating of Player A in two player, constant skill simulation") +
   scale_x_continuous("\nNumber of matches played", labels = comma) +
   annotate("text", y = tv - 10, x = max(timeseries$time) - 100, 
            label = "Theoretical\nvalue", colour = "blue", 
            family = "myfont", size = 3)
dev.off()

In this situation, the resulting player’s Elo rating bouncing around at random is quite well modelled by an autoregressive AR(1) time series model with an intercept at the “true” level of the player’s skill. With an autoregression parameter of around 0.98, the rating at any point in time is obviously very closely related to the previous rating; almost but not quite a random walk. Showing how and why would make this post too long, but it’s a useful factoid to note for future work when we come to more realistic and complex simulations of Elo ratings on FIBS.
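As a rough check of that claim (this is my sketch, not part of the original post's figures), the forecast package loaded at the top of the script can fit the model directly; the exact estimates will depend on the simulation seed, but the AR(1) coefficient should come out near 0.98 with a mean near the theoretical value:

# fit an ARIMA model to Player A's simulated rating series; auto.arima typically
# settles on something close to an AR(1) with a mean near the "true" rating
auto.arima(timeseries$A)

# or force a plain AR(1) with an intercept for comparison
Arima(timeseries$A, order = c(1, 0, 0), include.mean = TRUE)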

To leave a comment for the author, please follow the link and comment on his blog: Peter's stats stuff - R.


New features in genomation package


(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)



Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, e.g. RNA-seq, bisulfite sequencing or chromatin immunoprecipitation followed by sequencing (ChIP-seq) data.

Recently we added new features to genomation, and here we present them using the example of ChIP-seq binding profiles of six transcription factors around CTCF binding sites. All new functionality is available in the latest version of genomation, which can be found on its GitHub site.

# install the package from github
library(devtools)
install_github("BIMSBbioinfo/genomation",build_vignettes=FALSE)

Extending genomation to work with paired-end BAM files

Genomation can now work with paired-end BAM files: read mates are stitched together and treated as single fragments.

library(genomation)
genomationDataPath = system.file('extdata',package='genomationData')
bam.files = list.files(genomationDataPath, full.names=TRUE, pattern='bam$')
bam.files = bam.files[!grepl('Cage', bam.files)]

Accelerate functions responsible for reading genomic files

This is achieved by using the readr::read_delim function instead of read.table to read genomic files. Additionally, if the skip="auto" argument is provided to readGeneric, or track.line="auto" to other functions that read genomic files (e.g. readBroadPeak), the UCSC header (and first track line) is detected and skipped automatically.
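For example (the file names here are hypothetical; skip and track.line are the arguments named above):

# skip = "auto" tells readGeneric to detect and skip a UCSC header/track line
peaks <- readGeneric("my_peaks.bed", skip = "auto")

# the same idea for a broadPeak file with a track line
bpk <- readBroadPeak("my_peaks.broadPeak", track.line = "auto")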

library(GenomicRanges)

ctcf.peaks = readBroadPeak(file.path(genomationDataPath,
'wgEncodeBroadHistoneH1hescCtcfStdPk.broadPeak.gz'))
ctcf.peaks = ctcf.peaks[seqnames(ctcf.peaks) == 'chr21']
ctcf.peaks = ctcf.peaks[order(-ctcf.peaks$signalValue)]
ctcf.peaks = resize(ctcf.peaks, width=1000, fix='center')

Parallelizing data processing in ScoreMatrixList

We use the ScoreMatrixList function to extract coverage values of all transcription factors around the ChIP-seq peaks. ScoreMatrixList was improved by adding a new argument, cores, which indicates the number of cores to be used at the same time (via parallel::mclapply).

sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num=50, type='bam', cores=2)

# description file that contains info about the transcription factors
sampleInfo = read.table(system.file('extdata/SamplesInfo.txt',
package='genomationData'), header=TRUE, sep='\t')
names(sml) = sampleInfo$sampleName[match(names(sml),sampleInfo$fileName)]

Arithmetic, indicator and logic operations as well as subsetting work on score matrices

Arithmetic, indicator and logic operations work on ScoreMatrix, ScoreMatrixBin and ScoreMatrixList objects, i.e.:
Arith: "+", "-", "*", "^", "%%", "%/%", "/"
Compare: "==", ">", "<", "!=", "<=", ">="
Logic: "&", "|"

sml1 = sml * 100
sml1
## scoreMatrixlist of length:5
##
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50

Subsetting:

sml[[6]] = sml[[1]]
sml
## scoreMatrixlist of length:6
##
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
## 6. scoreMatrix with dims: 1681 50
sml[[6]] <- NULL

Improvements and new arguments in visualization functions

Because the rows of each element in the ScoreMatrixList are on widely differing signal scales, we scale them.

sml.scaled = scaleScoreMatrixList(sml)

Faster heatmaps

The heatMatrix and multiHeatMatrix functions now work faster, thanks to quicker assignment of colors. The heatmap profile of scaled coverage shows a colocalization of Ctcf, Rad21 and Znf143.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500))

Heatmaps of scaled coverage profiles around CTCF peaks

New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix

The clustfun argument allows other clustering functions to be integrated with the heatmap function multiHeatMatrix. It has to be a function that returns a vector of integers indicating the cluster to which each row is allocated. Previous versions of multiHeatMatrix could cluster heatmap rows using only the k-means algorithm.

# k-means algorithm with 2 clusters
cl1 <- function(x) kmeans(x, centers=2)$cluster
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1)

Heatmaps clustered with k-means into two clusters

# hierarchical clustering with Ward's method for agglomeration into 2 clusters
cl2 <- function(x) cutree(hclust(dist(x), method="ward"), k=2)
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl2)
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

Heatmaps clustered with hierarchical (Ward) clustering into two clusters

Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix

The clust.matrix argument indicates which matrices are used for clustering. It can be a numeric vector of matrix indexes or a character vector of names of the ScoreMatrix objects in the ScoreMatrixList. Matrices that are not in clust.matrix are ordered according to the result of the clustering algorithm. By default all matrices are clustered.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1, clust.matrix = 1)

Heatmaps with clustering based on the first matrix only

Central tendencies in line plots: centralTend in plotMeta

We extended the visualization capabilities for meta-plots. The plotMeta function can now plot not only the mean but also the median as the central tendency, controlled by the centralTend argument. Previously only the mean could be plotted.

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
xcoords=c(-500, 500),
winsorize=c(0,99),
centralTend="mean")

Meta-plot of mean scaled coverage around CTCF peaks

Smoothing central tendency: smoothfun in plotMeta

We added a smoothfun argument to smooth the central tendency as well as the dispersion bands around it, as shown in the next figure. smoothfun has to be a function that returns a list containing a vector of y coordinates (a vector named '$y').

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
xcoords=c(-500, 500),
winsorize=c(0,99),
centralTend="mean",
smoothfun=function(x) stats::smooth.spline(x, spar=0.5))

Meta-plot with smoothed central tendency

Plotting dispersion around central lines in line plots: dispersion in plotMeta

We added a new argument, dispersion, to plotMeta that shows dispersion bands around centralTend. It can take one of the following values:

  • “se” shows standard error of the mean and 95 percent confidence interval for the mean
  • “sd” shows standard deviation and 2*(standard deviation)
  • “IQR” shows 1st and 3rd quartile and confidence interval around the median based on the median +/- 1.57 * IQR/sqrt(n) (notches)
plotMeta(mat=sml, profile.names=names(sml),
xcoords=c(-500, 500),
winsorize=c(0,99),
centralTend="mean",
smoothfun=function(x) stats::smooth.spline(x, spar=0.5),
dispersion="se", lwd=4)

Meta-plot with smoothed mean and standard-error dispersion bands

Calculating scores that correspond to k-mer or PWM matrix occurrence: patternMatrix function

We added a new function, patternMatrix, that calculates k-mer and PWM occurrences over predefined equal-width windows. If one pattern (a character vector of length 1 or a PWM matrix) is given then it returns a ScoreMatrix; if more than one character or a list of PWM matrices is given then it returns a ScoreMatrixList. It either finds positions of pattern hits above a specified threshold and creates a score matrix filled with 1 (presence of the pattern) and 0 (its absence), or returns a matrix with the scores themselves. windows can be a DNAStringSet object or a GRanges object (but then the genome argument, a BSgenome object, has to be provided).

#ctcf motif from the JASPAR database
ctcf.pfm = matrix(as.integer(c(87,167,281,56,8,744,40,107,851,5,333,54,12,56,104,372,82,117,402,
291,145,49,800,903,13,528,433,11,0,3,12,0,8,733,13,482,322,181,
76,414,449,21,0,65,334,48,32,903,566,504,890,775,5,507,307,73,266,
459,187,134,36,2,91,11,324,18,3,9,341,8,71,67,17,37,396,59)),
ncol=19,byrow=TRUE)
rownames(ctcf.pfm) <- c("A","C","G","T")

prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25)
priorProbs = prior.params/sum(prior.params)
postProbs = t( t(ctcf.pfm + prior.params)/(colSums(ctcf.pfm)+sum(prior.params)) )
ctcf.pwm = Biostrings::unitScale(log2(postProbs/priorProbs))

library(BSgenome.Hsapiens.UCSC.hg19)
hg19 = BSgenome.Hsapiens.UCSC.hg19

p = patternMatrix(pattern=ctcf.pwm, windows=ctcf.peaks, genome=hg19, min.score=0.8)

Visualization of the patternMatrix

patternMatrix (here a ScoreMatrix object) can be visualized using e.g. the heatMatrix, heatMeta or plotMeta functions.

heatMatrix(p, xcoords=c(-500, 500), main="CTCF motif")

Heatmap of CTCF motif hits around CTCF peaks

plotMeta(mat=p, xcoords=c(-500, 500), smoothfun=function(x) stats::lowess(x, f = 1/10), 
line.col="red", main="ctcf motif")

Meta-plot of CTCF motif density around CTCF peaks

Integration with Travis CI for auto-testing

Recently we integrated genomation with Travis CI. This lets users see the current status of the package, which is updated with every change to the package. Travis automatically runs R CMD check and reports the result. Badges such as build status and code coverage are shown on the genomation GitHub site:
https://github.com/BIMSBbioinfo/genomation

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3
## [3] rtracklayer_1.28.10 Biostrings_2.36.4
## [5] XVector_0.8.0 GenomicRanges_1.20.8
## [7] GenomeInfoDb_1.4.3 IRanges_2.2.9
## [9] S4Vectors_0.6.6 BiocGenerics_0.14.0
## [11] genomation_1.1.27 BiocInstaller_1.18.5
## [13] devtools_1.9.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 formatR_1.2.1
## [3] futile.logger_1.4.1 plyr_1.8.3
## [5] bitops_1.0-6 futile.options_1.0.0
## [7] tools_3.2.2 zlibbioc_1.14.0
## [9] digest_0.6.8 gridBase_0.4-7
## [11] evaluate_0.8 memoise_0.2.1
## [13] gtable_0.1.2 curl_0.9.3
## [15] yaml_2.1.13 proto_0.3-10
## [17] httr_1.0.0 stringr_1.0.0
## [19] knitr_1.11 data.table_1.9.6
## [21] impute_1.42.0 R6_2.1.1
## [23] plotrix_3.5-12 XML_3.98-1.3
## [25] BiocParallel_1.2.22 seqPattern_1.0.1
## [27] rmarkdown_0.8.1 readr_0.1.1
## [29] reshape2_1.4.1 ggplot2_1.0.1
## [31] lambda.r_1.1.7 magrittr_1.5
## [33] matrixStats_0.14.2 MASS_7.3-44
## [35] scales_0.3.0 Rsamtools_1.20.5
## [37] htmltools_0.2.6 GenomicAlignments_1.4.2
## [39] colorspace_1.2-6 KernSmooth_2.23-15
## [41] stringi_0.5-5 munsell_0.4.2
## [43] RCurl_1.95-4.7 chron_2.3-47
## [45] markdown_0.7.7

To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics.


Maungawhau with a Gaussian process


(This article was first published on Maxwell B. Joseph, and kindly contributed to R-bloggers)

The Maungawhau volcano dataset is an R classic, often used to illustrate 3d plotting.
Being on a Gaussian process kick lately, it seemed fun to try to interpolate the volcano elevation data using a subset of the full dataset as training data.
Even with only 1% of the data, a squared exponential Gaussian process model does a decent job at estimating the true elevation surface (code here):

The upper row of plots shows the true elevation surface, the estimated surface based on 1% of the data (53 of the 5307 cells), and the squared error of the estimate.
The lower plots show the same data in heatmap form, with the location of sampled points shown as crosses.
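The linked code is the author's own; as a rough illustration of the same idea, here is a minimal, self-contained sketch in base R with a squared exponential kernel and hand-picked (not estimated) hyperparameters, so the fit will be cruder than in the post:

# Gaussian process interpolation of the volcano data from ~1% of cells.
# A sketch with hand-picked hyperparameters, not the code linked above.
set.seed(1)

z  <- as.vector(volcano)
xy <- expand.grid(x = 1:nrow(volcano), y = 1:ncol(volcano))

n_train <- round(0.01 * length(z))            # 53 cells, roughly 1% as in the post
idx     <- sample(length(z), n_train)

# squared exponential covariance between two sets of grid points
sq_exp <- function(A, B, ell = 15, sigma2 = 1500) {
  d2 <- outer(A$x, B$x, "-")^2 + outer(A$y, B$y, "-")^2
  sigma2 * exp(-d2 / (2 * ell^2))
}

X_tr <- xy[idx, ]
y_tr <- z[idx]

K   <- sq_exp(X_tr, X_tr) + diag(1, n_train)  # +1 on the diagonal as a small noise term
K_s <- sq_exp(xy, X_tr)

# GP posterior mean, conditioning on the sampled cells (constant mean function)
mu    <- mean(y_tr)
z_hat <- mu + K_s %*% solve(K, y_tr - mu)

est <- matrix(z_hat, nrow = nrow(volcano))

par(mfrow = c(1, 3))
image(volcano, main = "True elevation")
image(est, main = "GP estimate from ~1% of cells")
image((est - volcano)^2, main = "Squared error")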

To leave a comment for the author, please follow the link and comment on their blog: Maxwell B. Joseph.


A two-hour introduction to data analysis in R


(This article was first published on Citizen-Statistician » R Project, and kindly contributed to R-bloggers)

A few weeks ago I gave a two-hour Introduction to R workshop for the Master of Engineering Management students at Duke. The session was organized by the student-led Career Development and Alumni Relations committee within this program. The slides for the workshop can be found here and the source code is available on GitHub.

Why might this be of interest to you?

  • The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
  • The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against these datasets; in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
    • Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
    • Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not large enough to slow things down (which can especially be an issue for visualizing much larger data) but it’s large enough that manual wrangling of the data would be too much trouble.
    • Con: The visualizations do not really reveal very useful insights into the data. While this is not absolutely necessary for teaching syntax, it would have been a welcome cherry on top…
  • The raw dataset has a feature I love — it’s been damaged due (most likely) to being opened in Excel! One of the variables in the dataset is age group of the biker (BikeAge_gr). Here is the age distribution of bikers as they appear in the original data:
 
##    BikeAge_Gr crash_count
##    (chr)      (int)
## 1  0-5        60
## 2  10-Jun     421
## 3  15-Nov     747
## 4  16-19      605
## 5  20-24      680
## 6  25-29      430
## 7  30-39      658
## 8  40-49      920
## 9  50-59      739
## 10 60-69      274
## 11 70         12
## 12 70+        58

Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing something more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.
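As a sketch of what that cleanup might look like with stringr (assuming the crash data has been read into a data frame called bike, which is a name I am making up here; the variable name BikeAge_Gr is as in the output above):

library(dplyr)
library(stringr)

# recode the two age groups that were mangled into dates
bike <- bike %>%
  mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "^10-Jun$", "6-10"),
         BikeAge_Gr = str_replace(BikeAge_Gr, "^15-Nov$", "11-15"))

# check the corrected distribution
bike %>% count(BikeAge_Gr)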

I should admit that I don’t have evidence of Excel causing this issue. However this is my best guess, since “helping” the user by formatting date fields is standard Excel behaviour. There may be other software out there that also does this that I’m unaware of…

If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.

To leave a comment for the author, please follow the link and comment on their blog: Citizen-Statistician » R Project.


Countries of refugees to the US in 2014 and their destinations


(This article was first published on Adventures in Analytics and Visualization, and kindly contributed to R-bloggers)
A tweet from Kyle Walker introduced me to data from the Office of Refugee Resettlement at the US Department of Health and Human Services. Using multiple R packages such as shiny, rCharts, rcdimple, leaflet, and d3heatmap, this post looks at the countries of origin of the 69,986 refugees who arrived in the US in 2014 and their destinations within the US. All charts are interactive, and the code for generating them and the shiny apps can be found in the Refugees repository in my account on GitHub.

Click the link to go to the post:  http://patilv.com/USRefugees/

To leave a comment for the author, please follow the link and comment on their blog: Adventures in Analytics and Visualization.


The Topology Underlying the Brand Logo Naming Game: Unidimensional or Local Neighborhoods?


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

You can find the app on iTunes and Google Play. It’s a game of trivial pursuits – here’s the logo, now tell me the brand. Each item is scored as right or wrong, and the players must take it all very seriously for there is a Facebook page with cheat sheets for improving one’s total score.

Psychometrics Sees Everything as a Test

What would a psychometrician make of such a game based on brand logo knowledge? Are we measuring one’s level of consumerism (“a preoccupation with and an inclination toward buying consumer goods“)? Everyone knows the most popular brands, but only the most involved are familiar with logos of the less publicized products. The question for psychometrics is whether the logos you can identify correctly can be explained by knowing only your level of consumption.

For example, if you were a car enthusiast, then you would be able to name all the car logos in the above table. However, if you did not drive a car or watch commercial television or read car ads in print media, you might be familiar with only the most “popular” logos (i.e., the ones that cannot be avoided because their signage is everywhere you look). We make the assumption that everyone falls somewhere between these two extremes along a consumption continuum and assess whether we can reproduce every individual pattern of answers based solely on their location on this single dimension. Shopping intensity or consumerism is the path, and logo identifications are the sensors along that path.

Specifically, if some number N of respondents played this game, it would not be difficult to rank order the 36 logos in the above table along a line stretching from 0% to 100% correct identification. Next, we examine each respondent, sorting the players from those with the fewest correct identifications to those getting the most right. As shown in an earlier post, a heatmap will reveal the relationship between the ease of identifying each logo and the overall logo knowledge of each individual, as measured by their total score over all the brand logos. [The R code required to simulate the data and produce the heatmap can be found at the end of this post.]

You can begin by noting that blue is correct and red is not. Thus, the least knowledgeable players are in the top rows filled with the most red and the least blue. The logos along the x-axis are sorted by difficulty with the hardest to name on the left and the easiest on the right. In general, better players tend to know the harder logos. This is shown by the formation of a blue triangle as one scans towards the lower, right-hand corner. We call this a Guttman scale, and it suggests that both variation among the logos and the players can be described by a single dimension, which we might call logo familiarity or brand presence. However, one must be wary of suggestive names like “brand presence” for over time we forget that we are only measuring logo familiarity and not something more impactful.

Our psychometrician might have analyzed this same data using the R package ltm for latent trait modeling. A hopefully intuitive introduction to item response modeling was posted earlier on this blog. Those results could be summarized with a series of item characteristic curves displaying the relationship between the probability of answering correctly and the underlying trait, labeled ability by default.

As you see in the above plot, the items are arranged from the easiest (V1) to the hardest (V36) with the likelihood of naming the logo increasing as a logistic function of the unobserved consumerism measured as z-scores and called ability because item response theory (IRT) originated in achievement testing. These curves are simple to read and understand. A player with low consumption (e.g., a z-score near -2) has a better than even chance of identifying the most popular logos, but almost zero probability of naming any of the least familiar logos. All those probabilities move up their respective S-curves together as consumers become more involved.

In this example the functional form has been specified, for I have plotted the item characteristic curves from the one-parameter Rasch model. However, a specific functional form is not required, and we could have used the R package KernSmoothIRT to fit a nonparametric model. The topology remains a unidimensional manifold, something similar to Hastie’s principal curve in the R package princurve. Because the term has multiple meanings, I should note that I am using “topology” in a limited sense in order to refer to the shape of the data and not as in topological data analysis.

To be clear, there must be powerful forces at work to constrain logo naming to a one-dimensional continuum. Sequential skills that build on earlier achievements can often be described by a low-dimensional manifold (e.g., learning descriptive statistics before attempting inference since the latter assumes knowledge of the former). We would have needed a different model had our brands been local so that higher shopping intensity would have produced greater familiarity only for those logos available in a given locality (e.g., country-specific brands without an international presence).

The Meaning of Brand Familiarity Depends on Brand Presence in Local Markets

Now, it gets interesting. We started with players differentiated by a single parameter indicating how far they had traveled along a common consumption path. The path markers or sensors are the logos arrayed in decreasing popularity. Everyone shares a common environment with similar exposures to the same brand logos. Most have seen the McDonald’s double-arcing M or the Nike swoosh because both brands have spent a considerable amount of money to buy market presence. On the other hand, Hilton’s “blue H in the swirl” with less market presence would be recognized less often (fourth row and first column in the above brand logo table).

But what if market presence and thus logo popularity depended on your local neighborhood? Even international companies have differential presence in different countries, as well as varying concentration within the same country. Spending and distribution patterns by national, regional and local brands create clusters of differential market presence. Everyone does not share a common logo exposure so that each cluster requires its own brand list. That is, consumers reside in localities with varying degrees of brand presence so that two individuals with identical levels of consumption intensity or consumerism would not be familiar with the same brand logos. Consequently, we need to add a second parameter to each individual’s position along a path specific to their neighborhood. The psychometrician calls this differential item functioning (DIF), and R provides a number of ways of handling the additional mixture parameter.

Overlapping Audiences in the Marketplace of Attention

You may have anticipated the next step as the topology becomes more complex. We began with one pathway marked with brand logos as our sensors. Then, we argued for a mixture model with groups of individuals living in different neighborhoods with different ordering of the brand logos. Finally, we will end by allowing consumers to belong to more than one neighborhood with whatever degree of belonging they desire. We are describing the kind of fragmentation that occurs when consumers seize control and there is more available to them than they can attend to or consider. James Webster outlines this process of audience formation in his book The Marketplace of Attention.

The topology has changed again. There are just too many brand logos, and unless it becomes a competitive game, consumers will derive diminishing returns from continuing search and they typically will stop sooner rather than later. It helps that the market comes preorganized by providers trying to make the sale. Expert reviews and word of mouth guide the search. Yet, it is the consumer who decides what to select from the seemingly endless buffet. In the process, an individual will see and remember only a subset of all possible brand logos. We need a new model – one that simultaneously sorts both rows and columns by grouping together consumers and the brand logos that they are likely to recognize.

A heatmap may help to explain what can be accomplished when we search for joint clusterings of the rows and columns (also known as biclustering). Using an R package for nonnegative matrix factorization (NMF), I will simulate a data set with such a structure and show you the heatmap. Actually, I will display two heatmaps, one without noise so that you can see the pattern and a second with the same pattern but with added noise. Hopefully, the heatmap without noise will enable you to see the same pattern in the second heatmap with additional distortions.

I kept the number of columns at 36 for comparison with the first one-dimensional heatmap that you saw toward the beginning of this post. As before, blue is one, and red is zero. We discover enclaves or silos in the first heatmap without noise (polarization). The boundaries become fuzzier with random variation (fragmentation). I should note that you can see the biclusters in both heatmaps without reordering the rows and columns only because this is how the simulator generates the data. If you wish to see how this can be done with actual data, I have provided a set of links with the code needed to run a NMF in R at the end of my post on Brand and Product Category Representation.

Finally, although we speak of NMF as a form of simultaneous clustering, the cluster memberships are graded rather than all-or-none (soft vs. hard clustering). This yields a very flexible and expressive topology, which becomes clear when we review the three alternative representations presented in this post. First, we saw how some highly structured data matrices can be reproduced using a single dimension with rows and columns both located on the same continuum (IRT). Next, we asked if there might be discrete groups of rows with each row cluster having its own unique ordering of the columns (mixed IRT). Lastly, we sought a model of audience formation with rows and columns jointly collected together into blocks with graded membership for both the rows and the columns (NMF).

Knowledge is organized as a single dimension when learning is formalized within a curriculum (e.g., a course at an educational institution) or accumulative (e.g., need to know addition before one can learn multiplication). However, coevolving networks of customers and products cannot be described by any one dimension or even a finite mixture of different dimensions. The Internet creates both microgenres and fragmented audiences that require their own topology.

R Code to Produce Figures in this Post

# use psych package to simulate latent trait data

library(psych)
logos<-sim.irt(nvar=36, n=500, mod="logistic")
 
# Sort data by both item mean
# and person total score
item<-apply(logos$items,2,mean)
person<-apply(logos$items,1,sum)
logos$itemsOrd<-logos$items[order(person),order(item)]
 
# create heatmap
# may need to increase size of plots window in R studio
library(gplots)
heatmap.2(logos$itemsOrd, Rowv=FALSE, Colv=FALSE,
dendrogram="none", col=redblue(16),
key=T, keysize=1.5, density.info="none",
trace="none", labRow=NA)
 
library(ltm)
# two-parameter logistic model
fit<-ltm(logos$items ~ z1)
summary(fit)
 
# item characteristic curves
plot(fit)
 
# constrains slopes to be equal
fit2<-rasch(logos$items)
plot(fit2)
summary(fit2)
 
library(NMF)
# generate a synthetic dataset with
# 500 rows and three groupings of
# columns (1-10, 11-20, and 21-36)
n <- 500
counts <- c(10, 10, 16)
 
# no noise
V1 <- syntheticNMF(n, counts, noise=FALSE)
V1[V1>0]<-1
 
# with noise
V2 <- syntheticNMF(n, counts)
V2[V2>0]<-1
 
# produce heatmap with and without noise
heatmap.2(V1, Rowv=FALSE, Colv=FALSE,
dendrogram="none", col=redblue(16),
key=T, keysize=1.5, density.info="none",
trace="none", labRow=NA)
heatmap.2(V2, Rowv=FALSE, Colv=FALSE,
dendrogram="none", col=redblue(16),
key=T, keysize=1.5, density.info="none",
trace="none", labRow=NA)


To leave a comment for the author, please follow the link and comment on their blog: Engaging Market Research.


Modeling How Consumers Simplify the Purchase Process by Copying Others


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
A Flower That Fits the Bill

Marketing borrows the biological notion of coevolution to explain the progressive “fit” between products and consumers. While evolutionary time may seem a bit slow for product innovation and adoption, the same metaphor can be found in models of assimilation and accommodation from cultural and cognitive psychology.

The digital camera was introduced as an alternative to film, but soon redefined how pictures are taken, stored and shared. The selfie stick is but the latest step in this process by which product usage and product features coevolve over time with previous cycles enabling the next in the chain. Is it the smartphone or the lack of fun that’s killing the camera?

The diffusion of innovation unfolds in the marketplace as a social movement with the behavior of early adopters copied by the more cautious. For example, “cutting the cord” can be a lifestyle change involving both social isolation from conversations among those watching live sporting events and a commitment to learning how to retrieve television-like content from the Internet. The Diary of a Cord-Cutter in 2015 offers a funny and informative qualitative account. Still, one needs the timestamp because cord-cutting is an evolving product category. The market will become larger and more diverse with more heterogeneous customers (assimilation) and greater differentiation of product offerings (accommodation).

So, we should be able to agree that product markets are the outcome of dynamic processes involving both producers and customers (see Sociocognitive Dynamics in a Product Market for a comprehensive overview). User-centered product design takes an additional step and creates fictional customers or personas in order to find the perfect match. Shoppers do something similar when they anticipate how they will use the product they are considering. User types can be real (an actual person) or imagined (a persona). If this analysis is correct, then both customers and producers should be looking at the same data: the cable TV customer to decide if they should become cord-cutters and the cable TV provider to identify potential defectors.

Identifying the Likely Cord-Cutter

We can ask about your subscriptions (cable TV, internet connection, Netflix, Hulu, Amazon Prime, Sling, and so on). It is a long list, and we might get some frequency-of-usage data at the same time. This may be all that we need, especially if we probe for the details (e.g., cable TV usage would include live sports, on-demand movies, kids’ shows, HBO or other channel subscriptions, continuing until just before respondents become likely to terminate on-line surveys). Concurrently, it might be helpful to know something about your hardware, such as TVs, DVDs, DVRs, media streamers and other devices.

A form of reverse engineering guides our data collection. Qualitative research and personal experience give us some idea of the usage types likely to populate our customer base. Cable TV offers a menu of bundled and à la carte hardware and channels. Only some of the alternatives are mutually exclusive; otherwise, you are free to create your own assortment. Internet availability only increases the number of options, which you can watch on a television, a computer, a tablet or a phone. Plus, there is always free broadcast TV captured with an antenna, and DVDs that you rent or buy. We ought not to forget DVRs and media streamers (e.g., Roku, Apple TV, Chromecast, and Amazon Fire Stick). Obviously, there is no reason to stop with usage, so why not extend the scale to include awareness and familiarity? You might not be a cord-cutter, though you may be on your way if you know all about Sling TV.

Traditional segmentation will not be able to represent this degree of complexity.

Each consumer defines their own personal choices by arranging options in a continually changing pattern that does not depend on existing bundles offered by providers. Consequently, whatever statistical model is chosen must be open to the possibility that every non-contradictory arrangement is possible. Yet, every combination will not survive for some will be dominated by others and never achieve a sustainable audience.

We could display this attraction between consumers and offerings as a bipartite graph (Figure 2.9 from Barabasi’s Network Science).

Consumers are listed in U, and a line is drawn to the offerings in V that they might wish to purchase (shown in the center panel). It is this linkage between U and V that produces the consumer and product networks in the two side panels. The A-B and B-C-D cliques of offerings in Projection V would be disjoint without customer U_5. Moreover, the 1-2-3 and 4-5-6-7 consumer clusters are connected by the presence of offering B in V. Removing B or #5 cuts the graph into independent parts.

Actual markets contain many more consumers in U, and the number of choices in V can be extensive. Consumer heterogeneity creates complexities for the marketer trying to discover structure in Projection U. Besides, the task is not any easier for an individual consumer who must select the best from a seemingly overwhelming number of alternatives in Projection V. Luckily, one trick frees the consumer from having to learn all the options that are available and being forced to make all the difficult tradeoffs – simply do as others do (as in observational learning). The other can be someone you know or read about as in the above Diary of a Cord-Cutter in 2015. There is no need for a taxonomy of offerings or a complete classification of user types.

In fact, it has become popular to believe that social diffusion or contagion models describe the actual adoption process (e.g., The Tipping Point). Regardless, over time, the U’s and V’s in the bipartite interactions of customers and offerings come to organize each other through mutual influence. Specifically, potential customers learn about the cord-cutting persona through the social and professional media and at the same time come to group together those offerings that the cord-cutter might purchase. Offerings are not alphabetized or catalogued as an academic exercise. There is money to be saved and entertainment to be discovered. Sorting needs to be goal-directed and efficient. I am ready to binge-watch, and I am looking for a recommendation.

I’ll Have What She’s Having

It has taken some time to outline how consumers are able to simplify a complex purchase process by modeling the behavior of others. It is such a common experience, although rational decision theory continues to control our statistical modeling of choice. As you are escorted to your restaurant table, you cannot help but notice a delicious meal being served next to where you are seated. You refuse a menu and simply ask for the same dish. “I’ll Have What She’s Having” works as a decision strategy only when I can identify the “she” and the “what” simultaneously.

If we intend to analyze that data we have just talked about collecting, we will need a statistical model. Happily, the R Project for Statistical Computing implements at least two approaches for such joint identification: a latent clustering of a bipartite network in the latentnet package and a nonnegative matrix factorization in the NMF package. The Davis data from the latentnet R package will serve as our illustration. The R code for all the analyses that will be reported can be found at the end of this post.

Stephen Borgatti is a good place to begin with his two-mode social network analysis of the Davis data. The rows are 18 women, the columns are 14 events, and the cells are zero or one depending on whether or not each woman attended each event. The nature of the events has not been specified, but since I am in marketing, I prefer to think of the events as if they were movies seen or concerts attended (i.e., events requiring the purchase of tickets). You will find a latentnet tutorial covering the analysis of this same data as a bipartite network (section 6.3). Finally, a paper by Michael Brusco called “Analysis of two-mode network data using nonnegative matrix factorization” provides a detailed treatment of the NMF approach.

We will start with the plot from the latentnet R package. The names are the women in the rows and the numbered E’s are the events in the columns. The events appear to be separated into two groups of E1 to E6 toward the top and E9 to E14 toward the bottom. E7 and E8 seem to occupy a middle position. The names are also divided into an upper and lower grouping with Ruth and Pearl falling between the two clusters. Does this plot not look similar to the earlier bipartite graph from Barabasi? That is, the linkages between the women and the events organize both into two corresponding clusters tied together by at least two women and two events.

The heatmaps from the NMF reveal the same pattern for the events and the women. You should recall that NMF seeks a lower-dimensional representation that will reproduce the original data table with 0s and 1s. In this case, two basis components were extracted. The mixture coefficients for the events vary from 0 to 1 with a darker red indicating a higher contribution for that basis component. The first six events (E1-E6) form the first basis component with the second basis component containing the last six events (E9-E14). As before, E7 and E8 share a more even mixture of the two basis components. Again, most of the women load on one basis component or the other, with Ruth and Pearl traveling freely between both components. As you can easily verify, the names form the same clusters in both plots.

It would help to know something about the events and the women. If E1 through E6 were all of a certain type (e.g., symphony concerts), then we could easily name the first component. Similarly, if all of the women in red at bottom of our basis heatmap played the piano, our results would have at least face validity. A more detailed description of this naming process can be found in a previous example called “What Can We Learn from the Apps on Your Smartphone?“. Those wishing to learn more might want to review the link listed at the end of that post in a note.

Which events should a newcomer attend? If Helen, Nora, Sylvia and Katherine are her friends, the answer is the second cluster of E9-E14. The collaborative filtering of recommender systems enables a novice to decide quickly and easily without a rational appraisal of the feature tradeoffs. Of course, a tradeoff analysis will work as well for we have a joint scaling of products and users. If the event is a concert with a performer you love, then base your decision on a dominating feature. When in tradeoff doubt, go along with your friends.

Finally, brand management can profit from this perspective. Personas work as a design strategy when user types are differentiated by their preference structures and a single individual can represent each group. Although user-centered designers reject segmentations that are based on demographics, attitudes, or benefit statements, a NMF can get very specific and include as many columns as needed (e.g., thousands of movie and even more music recordings). Furthermore, sparsity is not a problem and most of the rows can be empty.

There is no reason why each of the basis components in the above heatmaps could not be summarized by one person and/or one event. However, NMF forms building blocks by jointly clustering many rows and columns. Every potential customer and every possible product configuration are additive compositions built from these blocks. Would not design thinking be better served with several exemplars of each user type rather than trying to generalize from a single individual? Plus, we have the linked columns telling us what attracts each user type in the desired detail provided by the data we collected.

R Code to Produce Plots

library(latentnet)
data(davis)
davis.fit<-ergmm(davis~bilinear(d=2)+rsociality)
plot(davis.fit,pie=TRUE,rand.eff="sociality",labels=TRUE)
 
library(NMF)
data_matrix<-as.matrix.network(davis)
fit<-nmf(data_matrix, 2, "lee", nrun=20)
par(mfrow = c(1, 2))
basismap(fit)
coefmap(fit)


To leave a comment for the author, please follow the link and comment on their blog: Engaging Market Research.


Analyzing networks of characters in ‘Love Actually’


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Every Christmas Eve, my family watches Love Actually. Objectively it’s not a particularly, er, good movie, but it’s well-suited for a holiday tradition. (Vox has got my back here).

Even on the eighth or ninth viewing, it’s impressive what an intricate network of characters it builds. This got me wondering how we could visualize the connections quantitatively, based on how often characters share scenes. So last night, while my family was watching the movie, I loaded up RStudio, downloaded a transcript, and started analyzing.

Parsing

It’s easy to use R to parse the raw script into a data frame, using a combination of dplyr, stringr, and tidyr. (For legal reasons I don’t want to host the script file myself, but it’s literally the first Google result for “Love Actually script.” Just copy the .doc contents into a text file called love_actually.txt).

library(dplyr)
library(stringr)
library(tidyr)

raw <- readLines("love_actually.txt")

lines <- data_frame(raw = raw) %>%
    filter(raw != "", !str_detect(raw, "(song)")) %>%
    mutate(is_scene = str_detect(raw, " Scene "),
           scene = cumsum(is_scene)) %>%
    filter(!is_scene) %>%
    separate(raw, c("speaker", "dialogue"), sep = ":", fill = "left") %>%
    group_by(scene, line = cumsum(!is.na(speaker))) %>%
    summarize(speaker = speaker[1], dialogue = str_c(dialogue, collapse = " "))

I also set up a CSV file matching characters to their actors, which you can read in separately. (I chose 20 characters that have notable roles in the story).

cast <- read.csv(url("http://varianceexplained.org/files/love_actually_cast.csv"))

lines <- lines %>%
    inner_join(cast) %>%
    mutate(character = paste0(speaker, " (", actor, ")"))

Now we have a tidy data frame with one row per line, along with columns describing the scene number and characters:

lines data.frame

From here it’s easy to count the lines-per-scene-per-character, and to turn it into a binary speaker-by-scene matrix.

by_speaker_scene <- lines %>%
    count(scene, character)

by_speaker_scene
## Source: local data frame [162 x 3]
## Groups: scene [?]
## 
##    scene                character     n
##    (int)                    (chr) (int)
## 1      2       Billy (Bill Nighy)     5
## 2      2      Joe (Gregor Fisher)     3
## 3      3      Jamie (Colin Firth)     5
## 4      4     Daniel (Liam Neeson)     3
## 5      4    Karen (Emma Thompson)     6
## 6      5    Colin (Kris Marshall)     4
## 7      6    Jack (Martin Freeman)     2
## 8      6       Judy (Joanna Page)     1
## 9      7    Mark (Andrew Lincoln)     4
## 10     7 Peter (Chiwetel Ejiofor)     4
## ..   ...                      ...   ...
library(reshape2)
speaker_scene_matrix <- by_speaker_scene %>%
    acast(character ~ scene, fun.aggregate = length)

dim(speaker_scene_matrix)
## [1] 20 76

Now we can get to the interesting stuff!

Analysis

Whenever we have a matrix, it’s worth trying to cluster it. Let’s start with hierarchical clustering.1

norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)

h <- hclust(dist(norm, method = "manhattan"))

plot(h)

center

This looks about right! Almost all the romantic pairs are together (Natalia/PM; Aurelia/Jamie, Harry/Karen; Karl/Sarah; Juliet/Peter; Jack/Judy) as are the friends (Colin/Tony; Billy/Joe) and family (Daniel/Sam).

One thing this tree is perfect for is giving an ordering that puts similar characters close together:

ordering <- h$labels[h$order]
ordering
##  [1] "Natalie (Martine McCutcheon)" "PM (Hugh Grant)"             
##  [3] "Aurelia (Lúcia Moniz)"        "Jamie (Colin Firth)"         
##  [5] "Daniel (Liam Neeson)"         "Sam (Thomas Sangster)"       
##  [7] "Jack (Martin Freeman)"        "Judy (Joanna Page)"          
##  [9] "Colin (Kris Marshall)"        "Tony (Abdul Salis)"          
## [11] "Billy (Bill Nighy)"           "Joe (Gregor Fisher)"         
## [13] "Mark (Andrew Lincoln)"        "Juliet (Keira Knightley)"    
## [15] "Peter (Chiwetel Ejiofor)"     "Karl (Rodrigo Santoro)"      
## [17] "Sarah (Laura Linney)"         "Mia (Heike Makatsch)"        
## [19] "Harry (Alan Rickman)"         "Karen (Emma Thompson)"

This ordering can be used to make other graphs more informative. For instance, we can visualize a timeline of all scenes:

scenes <- by_speaker_scene %>%
    filter(n() > 1) %>%        # scenes with > 1 character
    ungroup() %>%
    mutate(scene = as.numeric(factor(scene)),
           character = factor(character, levels = ordering))

library(ggplot2)
ggplot(scenes, aes(scene, character)) +
    geom_point() +
    geom_path(aes(group = scene))

center

If you’ve seen the film as many times as I have (you haven’t), you can stare at this graph and the film’s scenes spring out, like notes engraved in vinyl.

One reason it’s good to lay out raw data like this (as opposed to processed metrics like distances) is that anomalies stand out. For instance, look at the last scene: it’s the “coda” at the airport that includes 15 (!) characters. If we’re going to plot this as a network (and we totally are!) we’ve got to ignore that scene, or else it looks like almost everyone is connected to everyone else.

After that, we can create a cooccurrence matrix (see here) containing how many times two characters share scenes:

non_airport_scenes <- speaker_scene_matrix[, colSums(speaker_scene_matrix) < 10]

cooccur <- non_airport_scenes %*% t(non_airport_scenes)

heatmap(cooccur)

center

This gives us a sense of how the clustering in the above graph occurred. We can then use the igraph package to plot the network.

library(igraph)
g <- graph.adjacency(cooccur, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, edge.width = E(g)$weight)

center

A few patterns pop out of this visualization. We see that the majority of characters are tightly connected (often by the scenes at the school play, or by Karen (Emma Thompson), who is friends or family to many key characters). But we see Bill Nighy’s plotline occurs almost entirely separate from everyone else, and that five other characters are linked to the main network by only a single thread (Sarah’s conversation with Mark at the wedding).

One interesting aspect of this data is that this network builds over the course of the movie, growing nodes and connections as characters and relationships are introduced. There are a few ways to show this evolving network (such as an animation), but I decided to make it an interactive Shiny app, which lets the user specify the scene and shows the network that the movie has built up to that point.


network Shiny app

(You can view the code for the Shiny app on GitHub).
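
For a sense of how such an app can be wired up, here is a minimal, illustrative sketch that reuses the speaker_scene_matrix built above; the slider name and layout are my own assumptions, and the real app linked above differs in its details.

library(shiny)
library(igraph)

ui <- fluidPage(
  sliderInput("scene_cap", "Show the network up to scene:",
              min = 2, max = ncol(speaker_scene_matrix), value = 10, step = 1),
  plotOutput("network")
)

server <- function(input, output) {
  output$network <- renderPlot({
    m <- speaker_scene_matrix[, seq_len(input$scene_cap), drop = FALSE]
    co <- m %*% t(m)   # cooccurrences among the scenes shown so far
    g <- graph.adjacency(co, weighted = TRUE, mode = "undirected", diag = FALSE)
    plot(g, edge.width = E(g)$weight)
  })
}

shinyApp(ui, server)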

Data Actually

Have you heard the complaint that we are “drowning in data”? How about the horror stories about how no one understands statistics, and we need trained statisticians as the “police” to keep people from misinterpreting their methods? It sure makes data science sound like important, dreary work.

Whenever I get gloomy about those topics, I try to spend a little time on silly projects like this, which remind me why I learned statistical programming in the first place. It took minutes to download a movie script and turn it into usable data, and within a few hours, I was able to see the movie in a new way. We’re living in a wonderful world: one with powerful tools like R and Shiny, and one overflowing with resources that are just a Google search away.

Maybe you don’t like ‘Love Actually’; you like Star Wars. Or you like baseball, or you like comparing programming languages. Or you’re interested in dating, or hip hop. Whatever questions you’re interested in, the answers are just a search and a script away. If you look for it, I’ve got a sneaky feeling you’ll find that data actually is all around us.

Footnotes

  1. We made a few important choices in our clustering here. First, we normalized so that the number of scenes for each character adds up to 1: otherwise, we wouldn’t be clustering based on a character’s distribution across scenes so much as the number of scenes they’re in. Secondly, we used Manhattan distance, which for a binary matrix means “how many scenes is one of these characters in that the other isn’t”. Try varying these approaches to see how the clusters change!
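
Taking up that invitation, here is one illustrative variation (not from the original post): drop the row normalization, treat the matrix as binary presence/absence, and use binary (Jaccard-style) distance instead.

binary_matrix <- speaker_scene_matrix > 0
h2 <- hclust(dist(binary_matrix, method = "binary"))
plot(h2)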

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A Visualization of World Cuisines


(This article was first published on Design Data Decisions » R, and kindly contributed to R-bloggers)

In a previous post, we had ‘mapped’ the culinary diversity in India through a visualization of food consumption patterns. Since then, one of the topics in my to-do list was a visualization of world cuisines. The primary question was similar to that asked of the Indian cuisine: Are cuisines of geographically and culturally closer regions also similar? I recently came across an article on the analysis of recipe ingredients that distinguish the cuisines of the world. The analysis was conducted on a publicly available dataset consisting of ingredients for more than 13,000 recipes from the recipe website Epicurious. Each recipe was also tagged with the cuisine it belonged to, and there were a total of 26 different cuisines. This dataset was initially reported in an analysis of flavor network and principles of food pairing.

In this post, we (re)look at the Epicurious recipe dataset and perform an exploratory analysis and visualization of ingredient frequencies among cuisines. Ingredients that are frequently found in a region’s recipes would also have high consumption in that region, so an analysis of the ‘ingredient frequency’ of a cuisine should give us similar information to an analysis of ‘ingredient consumption’.

Outline of Analysis Method

Here is a part of the first few lines of data from the Epicurious dataset:

 Vietnamese vinegar cilantro mint olive_oil cayenne fish lime_juice
Vietnamese onion cayenne fish black_pepper seed garlic
Vietnamese garlic soy_sauce lime_juice thai_pepper
Vietnamese cilantro shallot lime_juice fish cayenne ginger  pea
Vietnamese coriander vinegar lemon lime_juice fish cayenne  scallion
Vietnamese coriander lemongrass sesame_oil beef root fish

Each row of the dataset lists the ingredients for one recipe and the first column gives the cuisine the recipe belongs to. As the first step in our analysis, we collect ALL the ingredients for each cuisine (over all the recipes for that cuisine). Then we calculate the frequency of occurrence of each ingredient in each cuisine and normalize the frequencies for each cuisine with the number of recipes available for that cuisine. This matrix of normalized ingredient frequencies is used for further analysis.

We use two approaches for the exploratory analysis of the normalized ingredient frequencies: (1) heatmap and (2) principal component analysis (pca), followed by display using biplots. The complete R code for the analysis is given at the end of this post.

Results

There are a total of 350 ingredients occurring in the dataset (among all cuisines). Some of the ingredients occur in just one cuisine, which, though interesting, will not be of much use for the current analysis. For better visual display, we restrict attention to ingredients showing most variation in normalized frequency across cuisines. The results are shown below:

Heatmap:

heatmap_food

 Biplot:

biplot_food

The figures are fairly self-explanatory and do show the clustering together of geographically nearby regions on the basis of commonly used ingredients. Moreover, we also notice the grouping together of regions with historical travel patterns (North Europe and American, Spanish_Portuguese and SouthAmerican/Mexican) or historical trading patterns (Indian and Middle East).

We still need to test the stability of the grouping obtained here by including data from the Allrecipes dataset. Also, taking the third principal component might spread out some of the crowding along the PC2 axis. These would be some of the tasks for the next post…

Here is the complete R code used for the analysis:

workdir <- "C:\\Path\\To\\Dataset\\Directory"
datafile <- file.path(workdir,"epic_recipes.txt")
data <- read.table(datafile, fill=TRUE, col.names=1:max(count.fields(datafile)),
na.strings=c("", "NA"), stringsAsFactors = FALSE)

a <- aggregate(data[,-1], by=list(data[,1]), paste, collapse=",")
a$combined <- apply(a[,2:ncol(a)], 1, paste, collapse=",")
a$combined <- gsub(",NA","",a$combined) ## this column contains the totality of all ingredients for a cuisine

cuisines <- as.data.frame(table(data[,1])) ## Number of recipes for each cuisine
freq <- lapply(lapply(strsplit(a$combined,","), table), as.data.frame) ## Frequency of ingredients
names(freq) <- a[,1]
prop <- lapply(seq_along(freq), function(i) {
colnames(freq[[i]])[2] <- names(freq)[i]
freq[[i]][,2] <- freq[[i]][,2]/cuisines[i,2] ## proportion (normalized frequency)
freq[[i]]}
)
names(prop) <- a[,1] ## this is a list of 26 elements, one for each cuisine

final <- Reduce(function(...) merge(..., all=TRUE, by="Var1"), prop)
row.names(final) <- final[,1]
final <- final[,-1]
final[is.na(final)] <- 0 ## If ingredient missing in all recipes, proportion set to zero
final <- t(final) ## proportion matrix

s <- sort(apply(final, 2, sd), decreasing=TRUE)
## Selecting ingredients with maximum variation in frequency among cuisines and
## Using standardized proportions for final analysis
final_imp <- scale(subset(final, select=names(which(s > 0.1)))) 

## heatmap 
library(gplots) 
heatmap.2(final_imp, trace="none", margins = c(6,11), col=topo.colors(7), 
key=TRUE, key.title=NA, keysize=1.2, density.info="none") 

## PCA and biplot 
p <- princomp(final_imp) 
biplot(p,pc.biplot=TRUE, col=c("black","red"), cex=c(0.9,0.8), 
xlim=c(-2.5,2.5), xlab="PC1, 39.7% explained variance", ylab="PC2, 24.5% explained variance") 
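
As a quick, illustrative follow-up to the note above about the third principal component (not part of the original analysis), the same princomp object can be re-plotted against PC1 and PC3:

## Illustrative only: biplot of PC1 vs PC3 from the fit above
biplot(p, choices = c(1, 3), pc.biplot = TRUE, col = c("black", "red"),
       cex = c(0.9, 0.8), xlab = "PC1", ylab = "PC3")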

 

To leave a comment for the author, please follow the link and comment on their blog: Design Data Decisions » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Fun with Heatmaps and Plotly


(This article was first published on Modern Data » R, and kindly contributed to R-bloggers)



Just because we all like numbers doesn’t mean we can’t have some fun.

Here’s wishing everyone a very Happy New Year!

# install.packages("jpeg") 

library(jpeg)
library(plotly)

# Download a jpeg file from imgur
URL <- "http://i.imgur.com/FWsFq6r.jpg"
file <- tempfile()
download.file(URL, file, mode = "wb")

# Read in JPEG file
j <- readJPEG(file)
j <- j[,,1]

# Create an empty matrix
img.mat <-  mat.or.vec(nrow(j), ncol(j))

# Identify elements where there is data
idx <- j > 0

# Add some glitter like effect
img.mat[idx] <-  sample(x = seq(0,1,by = 0.1), size = sum(idx), replace = T)

# Add some glitter to background
idx <-  j == 0
img.mat[idx] <-  sample(seq(0.7,0.9,0.01), size = sum(idx), replace = T)

# Invert the matrix or else it prints upside down
img.mat[nrow(img.mat):1,] <- img.mat[1:nrow(img.mat),]

# Plot !!!
x.axisSettings <- list(
  title = "Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning. -Albert Einstein",
  titlefont = list(
    family = 'Arial, sans-serif',
    size = 12,
    color = 'black'
  ),
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE,
  ticks = ""
)

y.axisSettings <- list(
  title = "",
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE,
  ticks = ""
)


bordercolor = "#ffa64d"
borderwidth = 20

nCol = ncol(img.mat)
nRow = nrow(img.mat)

plot_ly(z = img.mat, colorscale = "Hot", type = "heatmap", showscale = F, hoverinfo = "none") %>%
  layout(xaxis = x.axisSettings,
         yaxis = y.axisSettings,

         # Add a border
         shapes = list(

           # left border
           list(type = 'rect', fillcolor = bordercolor, line = list(color = bordercolor),
                x0 = 0, x1 = borderwidth,
                y0 = 0, y1 = nRow),

           # Right border
           list(type = 'rect', fillcolor = bordercolor, line = list(color = bordercolor),
                x0 = nCol - borderwidth, x1 = nCol,
                y0 = 0, y1 = nRow),

           # Top border
           list(type = 'rect', fillcolor = bordercolor, line = list(color = bordercolor),
                x0 = 0, x1 = nCol,
                y0 = nRow, y1 = nRow - borderwidth),

           # Bottom border
           list(type = 'rect', fillcolor = bordercolor, line = list(color = bordercolor),
                x0 = 0, x1 = nCol,
                y0 = 0, y1 = borderwidth)))



To leave a comment for the author, please follow the link and comment on their blog: Modern Data » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R


(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Moritz Stefaner started off 2016 with a very spiffy post on “a visual exploration of the spatial patterns in the endings of German town and village names”. Moritz was exploring some new data processing & visualization tools for the post, but when I saw what he was doing I wondered how hard it would be to do something similar in R and also used it as an opportunity to start practicing a new habit in 2016: packages vs projects.

To state more precisely the goals for this homage, the plan was to:

  • use as close to the same data sets as Moritz has in his github repo, including the ones in pure javascript
  • generate an HTML page as output that is as close as possible to the style of Moritz’s visualization
  • use R for everything (i.e. no “cheating” by sneaking in some javascript via htmlwidgets)
  • bundle everything into a package to take advantage of all the good stuff that comes with R package validation

You may want to take a look at the result to see if you want to continue reading (I hope you will!).

The Setup

By using an R package as the framework for the visualization, it’s possible to keep the data with the code and also organize and document the code in a way that makes it easy for folks to use and explore without cutting and pasting (or sourcing) code. It also makes it possible to list all the dependencies for the project and help ensure they’ll be installed when someone tries to work with it.

While I could have converted Moritz’s processed data into R data files, I left the CSV intact and the javascript file of suffix groupings also intact to show that R is extremely flexible when it comes to data processing (which is a “duh” for most folks by this point, but the use of javascript data structures might give some folks ideas as to how to reduce data duplication between projects). Both these files get stored in the inst/alt folder of the source package. I also end up using some CSS for the final visualization and placed that into a file in the same directory, which makes the code that generates the HTML a bit cleaner.

Because R processes some things automatically (like .onAttach) when it interacts with a package, one can have it provide helpful instructions (in this case, how to generate the visualization) in similar fashion to the ggplot2 loading messages.
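
As an aside, a minimal sketch of what such a hook can look like (illustrative only, not the package's exact message):

.onAttach <- function(libname, pkgname) {
  packageStartupMessage("Run display_maps() to build and view the visualization.")
}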

Similarly, both the package itself and the package functions have documentation to help folks understand what the package and each component are doing.

The Fun Stuff

The CSV file of places looks something like this:

name,latitude,longitude
Nierskanal,49.01,13.23
Zwiefelhof,49.22,11.18
Zwiefaltendorf,48.21,9.51
Zwiefalten,48.23,9.46
Zwiedorf,53.69,13.05
Zwickgabel,48.58,8.31
Zwickau,50.72,12.48
Zwethau,51.58,13.04
Zwesten,51.05,9.17

and, the suffix groupings list looks like this:

const suffixList = [
  ["ach", "a", "aa", "ah"],
  ["ar", "ahr"],
  ["ate", "te", "nit", "net"],
  ["au", "aue", "oog", "ooge", "ohe", "oie"],
  ["bach", "bach", "bek", "beken", "beck", "bke"],
  ["berg", "bergen", "barg", "bargen"],
  ["born", "bronn"],
  ["bruch", "broich", "brook", "brock", "brauk"],
  ["bruck", "brück", "brügge"],
  ...
];

While read.csv (no need for readr as the file is small) can handle the CSV file, we use the V8 package to source the javascript and convert it to an R object:

ct <- v8()
ct$source(system.file("alt/suffixlist.js", package="zellingenach"))
ct$get("suffixList")

We actually turn that into a vector of regular expressions (for town name ending checking) and a list of vectors (for the HTML visualization creation). Check out suffix_regex() and suffix_names() in the source code.
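
For illustration only (the package's actual suffix_regex() may differ), the core idea can be sketched as turning each grouping into one ending-anchored regular expression:

suffix_list <- ct$get("suffixList")
suffix_regexes <- vapply(suffix_list,
                         function(s) sprintf("(%s)$", paste(s, collapse = "|")),
                         character(1))
head(suffix_regexes)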

The read_places() function builds a data.frame of the places combined with the suffix grouping(s) they belong to:

# read in the file
plc <- read.csv(system.file("alt/placenames_de.tsv", package="zellingenach"),
                stringsAsFactors=FALSE)
 
# iterate over each suffix and identify which place names match the grouping
lapply(suf, function(regex) {
  which(stri_detect_regex(plc$name, regex))
}) -> matched_endings
 
plc$found <- ""
 
# add which grouping(s) the place was found to a new column
for(i in 1:length(matched_endings)) {
  where_found <- matched_endings[[i]]
  plc$found[where_found] <-
    paste0(plc$found[where_found], sprintf("%d|", i))
}
 
# some don't match so get rid of them
mutate(filter(plc, found != ""), found=sub("\\|$", "", found))

I do something a bit different than Moritz in that I allow towns to be part of multiple suffix groups, since:

  • I’m neither a historian nor expert in German town naming conventions, and
  • the javascript version and this R version both take a naive approach to suffix mapping.

This means my numbers (for the “#### places” label) will be different for some of my maps.

R has similar shortcut functions (Moritz uses D3) to make hexgrids out of shapefiles. Here’s the entirety of create_hexgrid():

de_shp <- getData("GADM", country="DEU", level=0, path=tempdir())
 
de_hex_pts <- spsample(de_shp, type="hexagonal", n=10000, cellsize=0.19,
                       offset=c(0.5, 0.5), pretty=TRUE)
 
HexPoints2SpatialPolygons(de_hex_pts)

You can play with cellsize to change the number of hexes. I tried to find a good number to get close to the # in Moritz’s maps.

This all gets put together in make_maps() where we use ggplot2 to build 52 gridded heatmaps (one for each suffix grouping). I used a log of the counts to map to a binned viridis color scale, so my colors come out a bit different than Moritz’s but the overall patterns are on par with his.
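
To give a feel for what one of those 52 panels involves without digging into the package, here is a minimal, illustrative sketch; the data frame hex_df and its columns (long, lat, id, count) are assumptions standing in for the fortified hex polygons and matched place counts, and it uses a continuous viridis scale where the post bins the values:

library(ggplot2)
library(viridis)
ggplot(hex_df, aes(long, lat, group = id, fill = log1p(count))) +
  geom_polygon(colour = NA) +
  scale_fill_viridis(guide = "none") +
  coord_quickmap() +
  theme_void()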

Finally, display_maps() takes the list created by make_maps() and builds out an HTML page, using the htmltools package for the page framework and svglite::htmlSVG to make SVGs of the ggplot objects. NOTE that you can use the output_file option of display_maps() to send the HTML to a file as well as display it in the viewer/browser.

Fin

Because the project is in a package, we can run package checks to see if we’re missing anything, including other package dependencies, function documentation and other details that the package tools are gleeful to point out. We can also include code to test out our various components to ensure they are behaving as expected (i.e. generating the right data/output).

One nice thing about the output is that it’s “responsive”, which means it handles multiple screen sizes quite well. So, if your screen is huge, you’ll have many map boxes on one line and if it’s small (like the iframe below) it will have fewer.

You’ll see that my maps are a bit bigger than Moritz’s. This is due to both the hex grid size and the fact that the SVG output is just slightly larger overall than the ones made by D3. Of note: I noticed some suffix subtitle components wrapped at the “-”, so I converted the plain dashes to non-breaking ones (“‑”).

The one downside to using a package for this is that it’s harder to post complete code into a blog post, but you can clone the repo to look at the code and skip the dissection and just generate the visualization locally via:

devtools::install_github("hrbrmstr/zellingenach")
display_maps()

By targeting SVG & HTML, we can make a cross-platform, crisp and responsive visualization all without leaving RStudio.

If you caught any errors or made something cool with any of the code, please drop an issue on github and a note in the comments (respectively)!

Happy New YeaR!

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R trends in 2015 (based on cranlogs)


(This article was first published on R – G-Forge, and kindly contributed to R-bloggers)
What are the current tRends? The image is CC from coco + kelly.

It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on top packages published in 2015, I decided to have my own go at the top R trends of 2015. Contrary to Safferling’s post I’ll also try to (1) look at packages from previous years that hit the big league, (2) see what top R coders we have in the community, and then (3) round up with my own 2015 R experience.

Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the release date itself; if that wasn’t available I scraped it off the CRAN servers. The script now also retrieves package author(s) and description (see code below for details).

library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)
 
getCranberriesElmnt <- function(txt, elmnt_name){
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1){
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0)
      end <- length(txt)
    else
      end <- end[1]
 
    desc <-
      txt[1:end] %>% 
      gsub(sprintf("^%s: (.+)", elmnt_name),
           "\\1", .) %>% 
      paste(collapse = " ") %>% 
      gsub("[ ]{2,}", " ", .) %>% 
      gsub(" , ", ", ", .)
  }else if (length(desc) == 0){
    desc <- paste("No", tolower(elmnt_name))
  }else{
    stop("Could not find ", elmnt_name, " in text: \n",
         paste(txt, collapse = "\n"))
  }
  return(desc)
}
 
convertCharset <- function(txt){
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}
 
getAuthor <- function(txt, package){
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)){
    author <- getCranberriesElmnt(txt, "Maintainer")
  }
 
  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) || 
      is.null(author) ||
      nchar(author)  <= 2){
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    author <- cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Author", .)] %>% 
      gsub(".*\n", "", .)
 
    # If not found then the package has probably been
    # removed from the repository
    if (length(author) == 1)
      author <- author
    else
      author <- "No author"
  }
 
  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # <my@email.com>
  # "John Doe"
  author %<>% 
    gsub("^Author: (.+)", 
         "\\1", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("\\([^)]+\\)", " ", .) %>% 
    gsub("([ ]*<[^>]+>)", " ", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("[ ]{2,}", " ", .) %>% 
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>% 
    gsub(" , ", ", ", .)
  return(author)
}
 
getDate <- function(txt, package){
  date <- 
    grep("^Date/Publication", txt)
  if (length(date) == 1){
    date <- txt[date] %>% 
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*",
           "\\1", .)
  }else{
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    date <- 
      cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Published", .)] %>% 
      gsub(".*\n", "", .)
 
 
    # The main page doesn't contain the original date if 
    # new packages have been submitted, we therefore need
    # to check first entry in the archives
    if(cran_txt %>% 
       html_nodes("tr") %>% 
       html_text %>% 
       gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
       grepl("^Old.{1,4}sources", .) %>% 
       any){
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                       package))
      pkg_date <- 
        archive_txt %>% 
        html_nodes("tr") %>% 
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5){
            return(nodes[3] %>% 
                     html_text %>% 
                     as.Date(format = "%d-%b-%Y"))
          }
        }) %>% 
        .[sapply(., length) > 0] %>% 
        .[!sapply(., is.na)] %>% 
        head(1)
 
      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch({
    as.Date(date)
  }, error = function(e){
    "Date missing"
  })
  return(date)
}
 
getNewPkgStats <- function(published_in){
  # The parallel is only for making cranlogs requests
  # we can therefore have more cores than actual cores
  # as this isn't processor intensive while there is
  # considerable wait for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {
    library(cranlogs)
  })
  set_default_cluster(cl)
  on.exit(stop_cluster())
 
  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/", published_in, "/"))
  pkgs <- 
    # Select the divs of the package class
    html_nodes(berries, ".package") %>% 
    # Extract the text
    html_text %>% 
    # Split the lines
    strsplit("[\n]+") %>% 
    # Now clean the lines
    lapply(.,
           function(pkg_txt) {
             pkg_txt[sapply(pkg_txt, function(x) { nchar(gsub("^[ \t]+", "", x)) > 0}, 
                            USE.NAMES = FALSE)] %>% 
               gsub("^[ \t]+", "", .) 
           })
 
  # Now we select the new packages
  new_packages <- 
    pkgs %>% 
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>% 
    grep("^New package", .) %>% 
    pkgs[.] %>% 
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt){
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", 
                     "\\1", txt[1]),
        stringsAsFactors = FALSE
      )
 
      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)
 
      return(ret)
    }) %>% 
    rbind_all %>% 
    # Get the download data in parallel
    partition(name) %>% 
    do({
      down <- cran_downloads(.$name[1], 
                             from = max(as.Date("2015-01-01"), .$date[1]), 
                             to = "2015-12-31")$count 
      cbind(.[1,],
            data.frame(sum = sum(down), 
                       avg = mean(down))
      )
    }) %>% 
    collect %>% 
    ungroup %>% 
    arrange(desc(avg))
 
  return(new_packages)
}
 
pkg_list <- 
  lapply(2010:2015,
         getNewPkgStats)
 
pkgs <- 
  rbind_all(pkg_list) %>% 
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))

Downloads and time on CRAN

The longer a package has been on CRAN the more downloads it gets. We can illustrate this using simple linear regression; slightly surprising is that the relationship behaves mostly linearly:

pkgs %<>% 
  mutate(time_yrs = time/365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)
 
# Test for non-linearity
library(splines)
anova(fit,
      update(fit, .~.-time_yrs+ns(time_yrs, 2)))
Analysis of Variance Table

Model 1: avg ~ time
Model 2: avg ~ ns(time, 2)
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1   7348 189661922                           
2   7347 189656567  1    5355.1 0.2075 0.6488

The number of average downloads increases by about 5 downloads per year. It can easily be argued that the average number of downloads isn’t that interesting since the data is skewed; we can therefore also look at the upper quantiles using quantile regression:

library(quantreg)
library(htmlTable)
lapply(c(.5, .75, .95, .99),
       function(tau){
         rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
         rq_sum <- summary(rq_fit)
         c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1), 
           `95 % CI` = txtRound(rq_sum$coefficients[2, 1] + 
                                        c(1,-1) * rq_sum$coefficients[2, 2], 1) %>% 
             paste(collapse = " to "))
       }) %>% 
  do.call(rbind, .) %>% 
  htmlTable(rnames = c("Median",
                       "Upper quartile",
                       "Top 5%",
                       "Top 1%"))
                 Estimate   95 % CI
Median                0.6   0.6 to 0.6
Upper quartile        1.2   1.2 to 1.1
Top 5%                9.7   11.9 to 7.6
Top 1%              182.5   228.2 to 136.9

The above table conveys a slightly more interesting picture. Most packages don’t get that much attention while the top 1% truly reach the masses.

Top downloaded packages

In order to investigate what packages R users have been using during 2015 I’ve looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rates, I’ve split the table by the package release dates. The results are available for browsing below (yes – it is the brand new interactive htmlTable that allows you to collapse cells – note that it may not work if you are reading this on R-bloggers, where the interactivity can be lost under certain circumstances).

Downloads
Name Author Total Average/day Description
Top 10 packages published in 2015
xml2 Hadley Wickham, Jeroen Ooms, RStudio, R Foundation 348,222 1635 Work with XML files …
rversions Gabor Csardi 386,996 1524 Query the main R SVN…
git2r Stefan Widgren 411,709 1303 Interface to the lib…
praise Gabor Csardi, Sindre Sorhus 96,187 673 Build friendly R pac…
readxl David Hoerl 99,386 379 Import excel files i…
readr Hadley Wickham, Romain Francois, R Core Team, RStudio 90,022 337 Read flat/tabular te…
DiagrammeR Richard Iannone 84,259 236 Create diagrams and …
visNetwork Almende B.V. (vis.js library in htmlwidgets/lib, 41,185 233 Provides an R interf…
plotly Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy 9,745 217 Easily translate ggp…
DT Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc 24,806 120 Data objects in R ca…
Top 10 packages published in 2014
stringi Marek Gagolewski and Bartek Tartanus ; IBM and other contributors ; Unicode, Inc. 1,316,900 3608 stringi allows for v…
magrittr Stefan Milton Bache and Hadley Wickham 1,245,662 3413 Provides a mechanism…
mime Yihui Xie 1,038,591 2845 This package guesses…
R6 Winston Chang 920,147 2521 The R6 package allow…
dplyr Hadley Wickham, Romain Francois 778,311 2132 A fast, consistent t…
manipulate JJ Allaire, RStudio 626,191 1716 Interactive plotting…
htmltools RStudio, Inc. 619,171 1696 Tools for HTML gener…
curl Jeroen Ooms 599,704 1643 The curl() function …
lazyeval Hadley Wickham, RStudio 572,546 1569 A disciplined approa…
rstudioapi RStudio 515,665 1413 This package provide…
Top 10 packages published in 2013
jsonlite Jeroen Ooms, Duncan Temple Lang 906,421 2483 This package is a fo…
BH John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois 691,280 1894 Boost provides free …
highr Yihui Xie and Yixuan Qiu 641,052 1756 This package provide…
assertthat Hadley Wickham 527,961 1446 assertthat is an ext…
httpuv RStudio, Inc. 310,699 851 httpuv provides low-…
NLP Kurt Hornik 270,682 742 Basic classes and me…
TH.data Torsten Hothorn 242,060 663 Contains data sets u…
NMF Renaud Gaujoux, Cathal Seoighe 228,807 627 This package provide…
stringdist Mark van der Loo 123,138 337 Implements the Hammi…
SnowballC Milan Bouchet-Valat 104,411 286 An R interface to th…
Top 10 packages published in 2012
gtable Hadley Wickham 1,091,440 2990 Tools to make it eas…
knitr Yihui Xie 792,876 2172 This package provide…
httr Hadley Wickham 785,568 2152 Provides useful tool…
markdown JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte 636,888 1745 Markdown is a plain-…
Matrix Douglas Bates and Martin Maechler 470,468 1289 Classes and methods …
shiny RStudio, Inc. 427,995 1173 Shiny makes it incre…
lattice Deepayan Sarkar 414,716 1136 Lattice is a powerfu…
pkgmaker Renaud Gaujoux 225,796 619 This package provide…
rngtools Renaud Gaujoux 225,125 617 This package contain…
base64enc Simon Urbanek 223,120 611 This package provide…
Top 10 packages published in 2011
scales Hadley Wickham 1,305,000 3575 Scales map data to a…
devtools Hadley Wickham 738,724 2024 Collection of packag…
RcppEigen Douglas Bates, Romain Francois and Dirk Eddelbuettel 634,224 1738 R and Eigen integrat…
fpp Rob J Hyndman 583,505 1599 All data sets requir…
nloptr Jelmer Ypma 583,230 1598 nloptr is an R inter…
pbkrtest Ulrich Halekoh Søren Højsgaard 536,409 1470 Test in linear mixed…
roxygen2 Hadley Wickham, Peter Danenberg, Manuel Eugster 478,765 1312 A Doxygen-like in-so…
whisker Edwin de Jonge 413,068 1132 logicless templating…
doParallel Revolution Analytics 299,717 821 Provides a parallel …
abind Tony Plate and Richard Heiberger 255,151 699 Combine multi-dimens…
Top 10 packages published in 2010
reshape2 Hadley Wickham 1,395,099 3822 Reshape lets you fle…
labeling Justin Talbot 1,104,986 3027 Provides a range of …
evaluate Hadley Wickham 862,082 2362 Parsing and evaluati…
formatR Yihui Xie 640,386 1754 This package provide…
minqa Katharine M. Mullen, John C. Nash, Ravi Varadhan 600,527 1645 Derivative-free opti…
gridExtra Baptiste Auguie 581,140 1592 misc. functions
memoise Hadley Wickham 552,383 1513 Cache the results of…
RJSONIO Duncan Temple Lang 414,373 1135 This is a package th…
RcppArmadillo Romain Francois and Dirk Eddelbuettel 410,368 1124 R and Armadillo inte…
xlsx Adrian A. Dragulescu 401,991 1101 Provide R functions …


Just as Safferling et al. noted, there is a dominance of technical packages. This is hardly surprising since the majority of work is data munging. Among these technical packages there are quite a few that are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.

R-star authors

Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into several entries, and splitting their downloads accordingly, we find that in 2015 the top R coders were:

top_coders <- list(
  "2015" = 
    pkgs %>% 
    filter(format(date, "%Y") == 2015) %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(10),
  "all" =
    pkgs %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(30))
 
interactiveTable(
  do.call(rbind, top_coders) %>% 
    mutate(download_ave = txtInt(download_ave)),
  align = "lrr",
  header = c("Coder", "Total ave. downloads per day", "No. of packages", "Packages"),
  tspanner = c("Top coders 2015",
               "Top coders 2010-2015"),
  n.tspanner = sapply(top_coders, nrow),
  minimized.columns = 4, 
  rnames = FALSE, 
  col.rgroup = c("white", "#F0F0FF"))
Coder Total ave. downloads No. of packages Packages
Top coders 2015
Gabor Csardi 2,312 11 sankey, franc, rvers…
Stefan Widgren 1,563 1 git2r
RStudio 781 16 shinydashboard, with…
Hadley Wickham 695 12 withr, cellranger, c…
Jeroen Ooms 541 10 rjade, js, sodium, w…
Richard Cotton 501 22 assertive.base, asse…
R Foundation 490 1 xml2
David Hoerl 455 1 readxl
Sindre Sorhus 409 2 praise, clisymbols
Richard Iannone 294 2 DiagrammeR, stationa…
Top coders 2010-2015
Hadley Wickham 32,115 55 swirl, lazyeval, ggp…
Yihui Xie 9,739 18 DT, Rd2roxygen, high…
RStudio 9,123 25 shinydashboard, lazy…
Jeroen Ooms 4,221 25 JJcorr, gdtools, bro…
Justin Talbot 3,633 1 labeling
Winston Chang 3,531 17 shinydashboard, font…
Gabor Csardi 3,437 26 praise, clisymbols, …
Romain Francois 2,934 20 int64, LSD, RcppExam…
Duncan Temple Lang 2,854 6 RMendeley, jsonlite,…
Adrian A. Dragulescu 2,456 2 xlsx, xlsxjars
JJ Allaire 2,453 7 manipulate, htmlwidg…
Simon Urbanek 2,369 15 png, fastmatch, jpeg…
Dirk Eddelbuettel 2,094 33 Rblpapi, RcppSMC, RA…
Stefan Milton Bache 2,069 3 import, blatr, magri…
Douglas Bates 1,966 5 PKPDmodels, RcppEige…
Renaud Gaujoux 1,962 6 NMF, doRNG, pkgmaker…
Jelmer Ypma 1,933 2 nloptr, SparseGrid
Rob J Hyndman 1,933 3 hts, fpp, demography
Baptiste Auguie 1,924 2 gridExtra, dielectri…
Ulrich Halekoh Søren Højsgaard 1,764 1 pbkrtest
Martin Maechler 1,682 11 DescTools, stabledis…
Mirai Solutions GmbH 1,603 3 XLConnect, XLConnect…
Stefan Widgren 1,563 1 git2r
Edwin de Jonge 1,513 10 tabplot, tabplotGTK,…
Kurt Hornik 1,476 12 movMF, ROI, qrmtools…
Deepayan Sarkar 1,369 4 qtbase, qtpaint, lat…
Tyler Rinker 1,203 9 cowsay, wakefield, q…
Yixuan Qiu 1,131 12 gdtools, svglite, hi…
Revolution Analytics 1,011 4 doParallel, doSMP, r…
Torsten Hothorn 948 7 MVA, HSAUR3, TH.data…

It is worth mentioning that two of the top coders are companies, RStudio and Revolution Analytics. While I like the fact that R is free and open-source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are taking R into account; it will be interesting to see what the R Consortium will bring to the community. I think the r-hub is incredibly interesting and will hopefully make my life as an R-package developer easier.

My own 2015-R-experience

My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible.

When I originally tried dplyr out I came from the plyr environment and was disappointed by the lack of parallelization; I found the concepts a little odd when thinking the plyr way. I had been using sqldf a lot in my data munging and merging, but when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio I find the dplyr workflow both intuitive and more productive than my previous one.

When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:

  • DiagrammeR An interesting new way of producing diagrams. I’ve used it for gantt charts but it allows for much more.
  • checkmate A neat package for checking function arguments.
  • covr An excellent package for testing how much of a package’s code is tested.
  • rex A package for making regular expressions easier.
  • openxlsx I wish I didn’t have to but I still get a lot of things in Excel-format – perhaps this package solves the Excel-import inferno…
  • R6 The successor to reference classes – after working with the Gmisc::Transition-class I appreciate the need for a better system.


To leave a comment for the author, please follow the link and comment on their blog: R – G-Forge.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

100 “must read” R-bloggers’ posts for 2015


The site R-bloggers.com is now 6 years young. It strives to be an (unofficial) online news and tutorials website for the R community, written by over 600 bloggers who agreed to contribute their R articles to the website. In 2015, the site served almost 17.7 million pageviews to readers worldwide.

In celebration of R-bloggers’ 6th birth-month, here are the top 100 most read R posts written in 2015. Enjoy:

  1. How to Learn R
  2. How to Make a Histogram with Basic R
  3. How to Make a Histogram with ggplot2
  4. Choosing R or Python for data analysis? An infographic
  5. How to Get the Frequency Table of a Categorical Variable as a Data Frame in R
  6. How to perform a Logistic Regression in R
  7. A new interactive interface for learning R online, for free
  8. How to learn R: A flow chart
  9. Learn Statistics and R online from Harvard
  10. Twitter’s new R package for anomaly detection
  11. R 3.2.0 is released (+ using the installr package to upgrade in Windows OS)
  12. What’s the probability that a significant p-value indicates a true effect?
  13. Fitting a neural network in R; neuralnet package
  14. K-means clustering is not a free lunch
  15. Why you should learn R first for data science
  16. How to format your chart and axis titles in ggplot2
  17. Illustrated Guide to ROC and AUC
  18. The Single Most Important Skill for a Data Scientist
  19. A first look at Spark
  20. Change Point Detection in Time Series with R and Tableau
  21. Interactive visualizations with R – a minireview
  22. The leaflet package for online mapping in R
  23. Programmatically create interactive Powerpoint slides with R
  24. My New Favorite Statistics & Data Analysis Book Using R
  25. Dark themes for writing
  26. How to use SparkR within Rstudio?
  27. Shiny 0.12: Interactive Plots with ggplot2
  28. 15 Questions All R Users Have About Plots
  29. This R Data Import Tutorial Is Everything You Need
  30. R in Business Intelligence
  31. 5 New R Packages for Data Scientists
  32. Basic text string functions in R
  33. How to get your very own RStudio Server and Shiny Server with DigitalOcean
  34. Think Bayes: Bayesian Statistics Made Simple
  35. 2014 highlight: Statistical Learning course by Hastie & Tibshirani
  36. ggplot 2.0.0
  37. Machine Learning in R for beginners
  38. Top 77 R posts for 2014 (+R jobs)
  39. Introducing Radiant: A shiny interface for R
  40. Eight New Ideas From Data Visualization Experts
  41. Microsoft Launches Its First Free Online R Course on edX
  42. Imputing missing data with R; MICE package
  43. “Variable Importance Plot” and Variable Selection
  44. The Data Science Industry: Who Does What (Infographic)
  45. d3heatmap: Interactive heat maps
  46. R + ggplot2 Graph Catalog
  47. Time Series Graphs & Eleven Stunning Ways You Can Use Them
  48. Working with “large” datasets, with dplyr and data.table
  49. Why the Ban on P-Values? And What Now?
  50. Part 3a: Plotting with ggplot2
  51. Importing Data Into R – Part Two
  52. How-to go parallel in R – basics + tips
  53. RStudio v0.99 Preview: Graphviz and DiagrammeR
  54. Downloading Option Chain Data from Google Finance in R: An Update
  55. R: single plot with two different y-axes
  56. Generalised Linear Models in R
  57. Hypothesis Testing: Fishing for Trouble
  58. The advantages of using count() to get N-way frequency tables as data frames in R
  59. Playing with R, Shiny Dashboard and Google Analytics Data
  60. Benchmarking Random Forest Implementations
  61. Fuzzy String Matching – a survival skill to tackle unstructured information
  62. Make your R plots interactive
  63. R #6 in IEEE 2015 Top Programming Languages, Rising 3 Places
  64. How To Analyze Data: Seven Modern Remakes Of The Most Famous Graphs Ever Made
  65. dplyr 0.4.0
  66. Installing and Starting SparkR Locally on Windows OS and RStudio
  67. Making R Files Executable (under Windows)
  68. Evaluating Logistic Regression Models
  69. Awesome-R: A curated list of the best add-ons for R
  70. Introducing Distributed Data-structures in R
  71. SAS vs R? The right answer to the wrong question?
  72. But I Don’t Want to Be a Statistician!
  73. Get data out of excel and into R with readxl
  74. Interactive R Notebooks with Jupyter and SageMathCloud
  75. Learning R: Index of Online R Courses, October 2015
  76. R User Group Recap: Heatmaps and Using the caret Package
  77. R Tutorial on Reading and Importing Excel Files into R
  78. R 3.2.2 is released
  79. Wanted: A Perfect Scatterplot (with Marginals)
  80. KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!
  81. Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
  82. 10 Top Tips For Becoming A Better Coder!
  83. James Bond movies
  84. Modeling and Solving Linear Programming with R – Free book
  85. Scraping Web Pages With R
  86. Why you should start by learning data visualization and manipulation
  87. R tutorial on the Apply family of functions
  88. The relation between p-values and the probability H0 is true is not weak enough to ban p-values
  89. A Bayesian Model to Calculate Whether My Wife is Pregnant or Not
  90. First year books
  91. Using rvest to Scrape an HTML Table
  92. dplyr Tutorial: verbs + split-apply
  93. RStudio Clone for Python – Rodeo
  94. Time series outlier detection (a simple R function)
  95. Building Wordclouds in R
  96. Should you teach Python or R for data science?
  97. Free online data mining and machine learning courses by Stanford University
  98. Centering and Standardizing: Don’t Confuse Your Rows with Your Columns
  99. Network analysis with igraph
  100. Regression Models, It’s Not Only About Interpretation 

    (oh hack, why not include a few more posts…)

  101. magrittr: The best thing to have ever happened to R?
  102. How to Speak Data Science
  103. R vs Python: a Survival Analysis with Plotly
  104. 15 Easy Solutions To Your Data Frame Problems In R
  105. R for more powerful clustering
  106. Using the R MatchIt package for propensity score analysis
  107. Interactive charts in R
  108. R is the fastest-growing language on StackOverflow
  109. Hash Table Performance in R: Part I
  110. Review of ‘Advanced R’ by Hadley Wickham
  111. Plotting Time Series in R using Yahoo Finance data
  112. R: the Excel Connection
  113. Cohort Analysis with Heatmap
  114. Data Visualization cheatsheet, plus Spanish translations
  115. Back to basics: High quality plots using base R graphics
  116. 6 Machine Learning Visualizations made in Python and R
  117. An R tutorial for Microsoft Excel users
  118. Connecting R to Everything with IFTTT
  119. Data Manipulation with dplyr
  120. Correlation and Linear Regression
  121. Why has R, despite quirks, been so successful?
  122. Introducing shinyjs: perform common JavaScript operations in Shiny apps using plain R code
  123. R: How to Layout and Design an Infographic
  124. New package for image processing in R
  125. In-database R coming to SQL Server 2016
  126. Making waffle charts in R (with the new ‘waffle’ package)
  127. Revolution Analytics joins Microsoft
  128. Six Ways You Can Make Beautiful Graphs (Like Your Favorite Journalists)

 

p.s.: 2015 was also a great year for R-users.com, a job board site for R users. If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds). If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

 


Health Care Indicators in Utah Counties


(This article was first published on data science ish, and kindly contributed to R-bloggers)

The state of Utah (my adopted home) has an Open Data Catalog with lots of interesting data sets, including a collection of health care indicators from 2014 for the 29 counties in Utah. The observations for each county include measurements such as the infant mortality rate, the percent of people who don’t have insurance, what percent of people have diabetes, and so forth. Let’s see how these health care indicators are related to each other and if we can use these data to cluster Utah counties into similar groups.

Something to Keep in Mind

Before we start, let’s look at one demographic map of Utah that is important to remember.

center

The population in Utah is not evenly distributed among counties. Salt Lake County, where I live, has a population over 1 million people and the rest of the counties have much lower populations. Utah County, just to the south of Salt Lake, has a population that is about half of Salt Lake’s, and the numbers go down very quickly after that; there are a number of counties with populations only in the 1000s. This will affect both the actual health care indicators (rural populations can have different healthcare issues than more urban ones) and the measurements of the health care indicators.

Getting Started

The data sets at Utah’s Open Data Catalog can be downloaded via Socrata Open API. Let’s load the data, fix the data types, and remove the row that contains numbers for the state as a whole.

library(RSocrata)
allHealth <- read.socrata("https://opendata.utah.gov/resource/qmsu-gki4.csv")
allHealth[,3:67] <- lapply(allHealth[,3:67], as.numeric)
allHealth <- allHealth[c(-1),]

Now let’s explore how some of these health care indicators are related to each other. Some of the indicators are correlated with each other in ways that make sense.

library(ggplot2)
ggplot(data = allHealth, 
       aes(x = Median.Household.Income, y = Children.Eligible.Free.Lunch...Free.Lunch)) +
        geom_point(alpha = 0.6, size = 3) +
        stat_smooth(method = "lm") +
        geom_point(data = subset(allHealth, County == "Salt Lake"), size = 5, colour = "maroon") +
        xlab("Median household income (dollars)") +
        ylab("Children eligible for free lunch (percent)")

Scatterplot of median household income vs. children eligible for free lunch

myCor <- cor.test(allHealth$Median.Household.Income, allHealth$Children.Eligible.Free.Lunch...Free.Lunch)

I’ve highlighted Salt Lake County in this plot and the following ones, just to give some context. The correlation coefficient between these two economic/health indicators is -0.652 with a 95% confidence interval from -0.822 to -0.374. Counties with higher incomes have fewer children eligible for free lunch.

ggplot(data = allHealth, 
       aes(x = X65.and.over, y = X..Diabetic)) +
        geom_point(alpha = 0.6, size = 3) +
        stat_smooth(method = "lm") +
        geom_point(data = subset(allHealth, County == "Salt Lake"), size = 5, colour = "maroon") +
        xlab("Population over 65 (percent)") +
        ylab("Diabetic population (percent)")

Scatterplot of population over 65 vs. diabetic population

myCor <- cor.test(allHealth$X65.and.over, allHealth$X..Diabetic)

The correlation coefficient between the population percentage over 65 and the percentage of the population with diabetes is 0.667 with a 95% confidence interval from 0.398 to 0.831. Counties with more older people in them have more people with diabetes in them. Notice that Salt Lake County has less than 10% of its population 65 or older; we are very young here in Utah, the youngest in the nation, in fact.

Then there are lots of health care indicators that are not correlated with each other.

ggplot(data = allHealth, 
       aes(x = Premature.Age.adjusted.Mortality, y = X..Uninsured.Children.1)) +
        geom_point(alpha = 0.6, size = 3) +
        stat_smooth(method = "lm") +
        geom_point(data = subset(allHealth, County == "Salt Lake"), size = 5, colour = "maroon") +
        xlab("Premature mortality rate (per 100,000 population)") +
        ylab("Uninsured children (percent)")

Scatterplot of premature mortality rate vs. uninsured children

myCor <- cor.test(allHealth$Premature.Age.adjusted.Mortality, allHealth$X..Uninsured.Children.1)

The correlation coefficient between the percentage of uninsured children and premature age adjusted mortality is 0.221 with a 95% confidence interval from -0.173 to 0.555.

To facilitate exploring all of the health care indicators in the data set, I made a Shiny app where the user can plot any two indicators from the data set, add a linear regression line, calculate a correlation coefficient, and highlight any county of choice. Use the app to explore the data, and check out the code for the app on Github.

Shiny App Screen Shot

Woe Is Me, NA Values…

The clustering analysis we would like to do requires that each county has complete information for all columns, i.e. no missing values. The populations of some Utah counties are so low that some of these health care indicators cannot be measured or are zero. Let’s look at how this plays out.

health <- allHealth[,c(4:5,18,22,24,27,31,34,36,38,42,44,48,51,55,60,63,64)]
rownames(health) <- allHealth$County
colnames(health) <- c("PercentUnder18",
              "PercentOver65",
              "DiabeticRate", 
              "HIVRate",
              "PrematureMortalityRate",
              "InfantMortalityRate",
              "ChildMortalityRate",
              "LimitedAccessToFood",
              "FoodInsecure", 
              "MotorDeathRate",
              "DrugDeathRate",
              "Uninsured", 
              "UninsuredChildren",
              "HealthCareCosts", 
              "CouldNotSeeDr",
              "MedianIncome",
              "ChildrenFreeLunch",
              "HomicideRate")
scaledhealth <- scale(health)
library(viridis)
heatmap(scaledhealth, Colv = NA, Rowv = NA, margins = c(10,4), 
        main = "Heatmap of Data Set Values", col = viridis(32, 1))

Heatmap of the scaled health care indicator values (blank cells are NA)

The values for the health indicators have been scaled for this heat map (otherwise, for example, the numbers for the median income would swamp out the numbers for the HIV rate because of the units they are measured with). The blank spaces in the heat map show where we have NA values to deal with. HIV/AIDs is not a very common disease and there are no reported cases of HIV in many of the sparsely populated counties in Utah. It probably makes sense to just put a zero in those spots because more urban areas have more HIV cases. Having an infant die is also quite uncommon in the United States and there are many counties in Utah where no infants died in 2014. Does it make sense to just put a zero in those spots?

Plot of infant mortality rate across Utah counties

Probably not, right? Outcomes for newborn babies appear to be worse in more rural counties. While plugging in a zero for the infant mortality rate in a county where zero newborns died does make sense on one level, it is a problematic thing to do.
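Before imputing or zero-filling anything, it can help to count how many values are actually missing in each column. This is just a quick sketch using the health data frame built above:

# count missing values per health care indicator
na_counts <- sort(colSums(is.na(health)), decreasing = TRUE)
na_counts[na_counts > 0]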

One option is to impute the missing values based on the values for other, similar counties. One possible method for this is the random forest, an ensemble decision tree algorithm.

library(missForest)
healthimputed <- missForest(health)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
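One way to gauge how well the imputation worked is the out-of-bag error estimate that missForest returns alongside the imputed matrix; a quick sketch (OOBerror is the normalized RMSE for continuous variables, where values near 0 are good and values near 1 are poor):

# normalized RMSE of the imputation, reported by missForest itself
healthimputed$OOBerror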

We can access the new matrix with the imputed values via healthimputed$ximp. Unfortunately, this was not a screaming success because some of the columns have so few real measured values; the mean squared error was not good and this approach doesn’t seem like a good idea. The good news is that I tested the rest of this analysis both with the random forest imputed data and just replacing NA values with 0, and the results were pretty much the same. There were some minor differences in exactly how the counties clustered, but no major differences in the main results. Given that, let’s just replace all the NA values with zeroes, scale and center the data, and move forward.

health[is.na(health)] <- 0
health <- scale(health)

Principal Component Wonderfulness

We can think of a data set like this as a high-dimensional space where each county is at a certain spot in that space. At this point in the analysis we are working with 18 columns of observations. We removed the columns that directly measure how many people live in each county such as population number, percentage of population who are rural dwellers, etc. and kept the columns on health care indicators such as child mortality rate, homicide rate, and percentage of population who is uninsured. Thus we have an 18-dimensional space and each county is located at its own spot in that space. Principal component analysis is a way to project these data points onto a new, special coordinate system. In our new coordinate system, each coordinate, or principal component, is a weighted sum of the original coordinates. The first principal component has the most variance in the data in its direction, the second principal component has the second most variance in the data in its direction, and so forth. Let’s do it!

myPCA <- prcomp(health)

Welp, that was easy.
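Before digging into the loadings, it is worth checking how much of the variance each component captures. A quick sketch using the summary of the prcomp object:

# proportion of variance explained by the first few principal components
summary(myPCA)$importance["Proportion of Variance", 1:5]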

Success Kid Does PCA

I just love PCA; it’s one of my very favorite algorithmic-y things. Let’s see what the first few of the principal components actually look like.

library(reshape2)
melted <- melt(myPCA$rotation[,1:9])
ggplot(data = melted) +
        theme(legend.position = "none", axis.text.x = element_blank(), 
              axis.ticks.x = element_blank()) + 
        xlab("Health care indicator measurements") +
        ylab("Relative importance in each principle component") +
        ggtitle("Variables in Principal Component Analysis") +
        geom_bar(aes(x=Var1, y=value, fill=Var1), stat="identity") +
        facet_wrap(~Var2)

Bar charts of variable contributions to the first nine principal components

So each of these components are orthogonal to each other, and the colored bars show the contribution of each original health care indicator to that principal component. Each principal component is uncorrelated to the others and together, the principal components contain the information in the data set. Let’s zoom in on the first principal component, the one that has the largest variance and accounts for the most variability between the counties.

ggplot(data = melted[melted$Var2 == "PC1",]) +
         theme(legend.position = "none", 
               axis.text.x= element_text(angle=45, hjust = 1), 
               axis.ticks.x = element_blank()) + 
         xlab("Health care indicator measurements") +
         ylab("Relative importance in principle component") +
         ggtitle("Variables in PC1") +
         geom_bar(aes(x=Var1, y=value, fill=Var1), stat="identity")

Bar chart of variable contributions to PC1

We can see here that counties with higher positive values for PC1 (the component that accounts for the most variability among the counties) have fewer children, more older people, low HIV and homicide rates, are poorer, and have more uninsured people. These sound like more rural counties.

It’s Clustering Time

Now let’s see if this data set of health care indicators can be used to cluster similar counties together. Clustering is an example of unsupervised machine learning, where we want to use an algorithm to find structure in unlabeled data. Let’s begin with hierarchical clustering. This method of clustering begins with all the individual items (counties, in our case) alone by themselves and then starts merging them into clusters with the items that are closest to them within the space we are considering. First, the algorithm merges them into two-item clusters, then it will merge another nearby item into each cluster, and so forth, until all the items are merged together into one big cluster. We can examine the tree structure the algorithm used to do the clustering to see what kind of clustering makes sense for the data, given the context, etc. Hierarchical clustering can be done with different methods of computing the distance (or similarity) of the items.
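In base R the whole idea boils down to computing a distance matrix and handing it to hclust; here is a minimal sketch using Ward linkage (the same linkage we settle on below), just to make the mechanics concrete:

# minimal hierarchical clustering: distances -> tree -> cut into 3 groups
d <- dist(health)                   # Euclidean distances between counties
hc <- hclust(d, method = "ward.D")  # agglomerative clustering with Ward linkage
cutree(hc, k = 3)                   # assign each county to one of 3 clusters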

Let’s use the fpc package, a package with lots of resources for clustering algorithms, to do some hierarchical clustering of this county health data. Let’s do the hierarchical clustering algorithm, but let’s do it with bootstrap resampling of the county sample to assess how stable the clusters are to individual counties within the sample and what the best method for computing the distance/similarity is.

library(fpc)
myClusterBoot <- clusterboot(health,clustermethod=hclustCBI, method="ward.D", k=3, seed = 6789)

I tested different methods for computing the distance and found Ward clustering to be the most stable. The bootstrap results also indicate that 3 clusters is a stable, sensible choice. Let’s look at the results for these parameters for the hierarchical clustering.

myClusterBoot$bootmean
## [1] 0.8090913 0.7216360 0.6972370
myClusterBoot$bootbrd
## [1] 10 26 21

The bootmean value measures the cluster stability, where a value close to 1 indicates a stable cluster. The bootbrd value measures how many times (out of the 100 resampling runs) that cluster dissolved. These three clusters look pretty stable, so let’s take a look at how the hierarchical clustering algorithm has grouped the counties here.

library(dendextend)
myDend <- health %>% dist %>% hclust(method = "ward.D") %>% 
    as.dendrogram %>%
    set("branches_k_color", k = 3) %>% 
    set("labels_col", k=3) %>%
    hang.dendrogram(hang_height=0.8)
par(mar = c(3,3,3,7))
plot(myDend, horiz = TRUE, 
     main = "Clustering in Utah County Health Care Indicators")

Dendrogram of the hierarchical clustering of Utah counties

The scale along the bottom shows a measure of how separated the branches of the tree structure are. As a resident of Utah, these county names look like they may be in a certain order to me; let’s check it out. What if I looked at county names ordered from lowest population to highest? (This is from the original data frame, not the data used to do the clustering.)

allHealth$County[order(allHealth$Population)]
##  [1] "Daggett"    "Piute"      "Rich"       "Wayne"      "Garfield"  
##  [6] "Beaver"     "Kane"       "Grand"      "Morgan"     "Juab"      
## [11] "Emery"      "Millard"    "San Juan"   "Duchesne"   "Sevier"    
## [16] "Carbon"     "Wasatch"    "Sanpete"    "Uintah"     "Summit"    
## [21] "Iron"       "Box Elder"  "Tooele"     "Cache"      "Washington"
## [26] "Weber"      "Davis"      "Utah"       "Salt Lake"

Yes indeed! The pink counties are the lowest population counties, the green ones are intermediate in population, and the blue counties are the most populous. The hierarchical clustering algorithm groups the counties by population based on their health care indicators.

Another algorithm for grouping similar objects is k-means clustering. K-means clustering works a bit differently than hierarchical clustering. You decide ahead of time how many clusters you are going to have (the number k) and randomly pick centers for each cluster (perhaps by picking data points at random to be the centers of each cluster). Then, the algorithm assigns each data point (county, in our case) to the closest cluster. After the clusters have their new members, the algorithm calculates new centers for each cluster. These steps of calculating the centers and assigning points to the clusters are repeated until the assignment of points to clusters converges (hopefully to a real minimum). Then you have your final cluster assignments!
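The base R kmeans function implements exactly this loop; a minimal sketch with k = 2 and several random restarts (to reduce the chance of a poor local minimum) looks like this:

# plain k-means on the scaled health data: 2 clusters, 25 random starts
set.seed(1234)
km <- kmeans(health, centers = 2, nstart = 25)
table(km$cluster)  # how many counties fall in each cluster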

The kmeansruns function in the fpc library will run k-means clustering many times to find the best clustering.

myKmeans <- kmeansruns(health, krange=1:5)
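Before plotting, we can peek at what kmeansruns stored; this is a quick sketch using the object just created (bestk and crit are components of the kmeansruns result):

myKmeans$bestk  # estimated best number of clusters
myKmeans$crit   # criterion value for each k in krange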

Helpfully, this function estimates the number of clusters in the data; it can use two different methods for this estimate but both give the same answer for our county health data here. If we include 1 in the range for krange, this function also tests whether there should even be more than one cluster at all. For the county health data, the best k is 2. Let’s plot what this k-means clustering looks like.

library(ggfortify)
library(ggrepel)
set.seed(2346)
autoplot(kmeans(health, 2), data = health, size = 3, aes = 0.8) + 
        ggtitle("K-Means Clustering of Utah Counties") +
        theme(legend.position="none") + 
        geom_label_repel(aes(PC1, PC2, 
                             fill = factor(kmeans(health, 2)$cluster), 
                             label = rownames(health)),
                         fontface = 'bold', color = 'white', 
                         box.padding = unit(0.5, "lines"))

K-means clustering (k = 2) of Utah counties plotted on the first two principal components

This plot puts the counties on a plane where the x-axis is the first principal component and the y-axis is the second principal component; this kind of plotting can be helpful to show how data points are different from each other. Like with hierarchical clustering, k-means clustering has grouped counties by population. The cluster on the right is a low-population cluster while the cluster on the left is a high population cluster.

Remember that when we looked in detail at PC1, more negative values of PC1 correspond to a higher homicide rate, higher HIV rate, more children and fewer older people, higher income, lower rates of being food insecure and of children eligible for free lunch, etc. Notice which counties have the most negative values for PC1: the three most populous counties in Utah. That is heartening to see.

The methods for estimating numbers of clusters in the k-means algorithm indicated that 2 was the best number, but we can do a 3-cluster k-means clustering to compare to the groups found by hierarchical clustering.

set.seed(2350)
autoplot(kmeans(health, 3), data = health, size = 3, aes = 0.8) + 
        ggtitle("K-Means Clustering of Utah Counties") + 
        theme(legend.position="none") + 
        geom_label_repel(aes(PC1, PC2, 
                             fill = factor(kmeans(health, 3)$cluster), 
                             label = rownames(health)), 
                         fontface = 'bold', color = 'white', 
                         box.padding = unit(0.5, "lines"))

K-means clustering (k = 3) of Utah counties plotted on the first two principal components

These groups are very similar to those found by hierarchical clustering.

The End

If you have a skeptical turn of mind (as I tend to do), you might suggest that what the clustering algorithms are actually finding is just how many NA values each county had. The least populous counties have the most NA values, counties with more intermediate populations have just a few NA values, and the most populous counties have none. There are a couple of things to consider about this perspective. One is that the pattern of NA values is not random; it could be considered informative in itself so perhaps it is not a problem if that affected the clustering results. Another is that I tested this clustering analysis with a subset of the data that excluded the columns that had many missing values (HIV rate, homicide rate, infant mortality rate, and child mortality rate). The clustering results still showed groups of low and high population counties, although the results were messier since there was less data and the excluded columns were highly predictive. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback and other perspectives!

To leave a comment for the author, please follow the link and comment on their blog: data science ish.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

NEWS of my BioC packages


(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

Today is my birthday and it happened to be the release day of Bioconductor 3.3. It’s again the time to reflect what I’ve done in the past year.

ChIPseeker

Although ChIPseeker was designed for ChIP-seq annotation, I am very glad to find that others use it to annotate other data, including copy number variants and DNA breakpoints.

annotatePeak

Several parameters, including sameStrand, ignoreOverlap, ignoreUpstream and ignoreDownstream, were added to annotatePeak at the request of @crazyhottommy for using ChIPseeker to annotate breakpoints from whole genome sequencing data.

Another parameter, overlap, was also introduced. By default overlap="TSS", and only genes overlapping the TSS are reported as the nearest gene. With overlap="all", any gene overlapping the peak is reported as the nearest gene, whether or not the overlap is at the TSS region.

annotatePeak now also supports annotating data with user-defined regions by passing TxDb=user_defined_GRanges.
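As a rough sketch of how these options fit together (the peak file, TxDb package and parameter values here are illustrative placeholders, not taken from the post):

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# hypothetical breakpoint peaks, annotated with the new options described above
peakAnno <- annotatePeak("breakpoints.bed", TxDb = txdb,
                         sameStrand = TRUE,  # only consider genes on the same strand
                         overlap = "all")    # report any overlapping gene as the nearest gene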

getBioRegion

The getPromoters() function prepares a GRanges object of promoter regions using user-specified upstream and downstream distances from the transcription start site (TSS). Then we can align the peaks that map to these regions and visualize the profile or heatmap of ChIP binding to the TSS regions.

Users (1 and 2) are interested in the intensity of peaks binding to the start of introns/exons, and ChIPseeker provides a new function, getBioRegion, to output a GRanges object of intron/exon start regions.
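A hedged sketch of the typical workflow around these functions (the peak file name, TxDb package and window sizes are illustrative only):

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# promoter windows around the TSS, then the binding profile of a hypothetical peak file
promoter  <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tagMatrix <- getTagMatrix(readPeakFile("peaks.bed"), windows = promoter)
plotAvgProf(tagMatrix, xlim = c(-3000, 3000))
# getBioRegion() provides analogous windows around exon/intron starts (see ?getBioRegion)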

visualization

GEO data mining

ChIPseeker incorporates the GEO database and supports data mining to infer cooperative regulation. The data was updated and ChIPseeker now contains information on 19,348 bed files.

clusterProfiler

We compared clusterProfiler with GSEA-P (released by the Broad Institute); the p-values calculated by the two programs are almost identical.

For comparing biological themes, clusterProfiler supports a formula interface to express complex conditions, and faceting is supported to visualize complex results.

The read.gmt function parses the GMT file format from the Molecular Signatures Database, so that gene set collections from this database can be used in clusterProfiler for both the hypergeometric test and GSEA.
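A sketch of how a downloaded MSigDB .gmt file might be plugged in (the file name and the gene vectors are placeholders, not real data):

library(clusterProfiler)
gmt <- read.gmt("h.all.v5.1.entrez.gmt")         # hypothetical hallmark collection from MSigDB

ora  <- enricher(gene_ids, TERM2GENE = gmt)      # gene_ids: a vector of Entrez gene IDs
gsea <- GSEA(ranked_geneList, TERM2GENE = gmt)   # ranked_geneList: named, decreasingly sorted vector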

KEGG Module is supported just like KEGG Pathway; clusterProfiler queries the online annotation data, which keeps the annotation always up to date.

The KEGG database is updated quite frequently. KEGG.db, which has not been updated since 2012, contains annotation for 5894 human genes. In February 2015, when clusterProfiler first supported querying online KEGG data, KEGG contained annotation for 6861 human genes, and today it has 7018 human genes annotated. Most tools/webservers use out-dated data (e.g. DAVID, not updated since 2010, with 5085 human genes annotated by KEGG), and the analysis results may change substantially if we use recently updated data. Indeed clusterProfiler is more reliable as we always use the latest data.

In addition to the bitr function, which translates biological IDs using an OrgDb object, we provide bitr_kegg, which uses the KEGG API to translate biological IDs. It supports more than 4000 species (which can be searched via the search_kegg_species function), as in the KEGG Pathway and Module analyses.
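For illustration, a hedged example of translating IDs through the KEGG API (the organism and gene IDs are placeholders chosen for demonstration):

library(clusterProfiler)
search_kegg_organism("Escherichia coli", by = "scientific_name")  # look up the KEGG organism code
bitr_kegg(c("b0002", "b0003"), fromType = "kegg",
          toType = "uniprot", organism = "eco")                    # translate KEGG gene IDs to UniProt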

The function calls of enrichGO and gseGO were changed. Now not only species that have an OrgDb available in Bioconductor can be analyzed, but any species with an OrgDb, which can be queried online via AnnotationHub or built from the user's own data. With this update, enrichGO and gseGO can take any gene ID type as input, as long as the ID type is supported by the OrgDb.

GO enrichment analysis always outputs redundant terms, so we implemented a simplify function to remove redundant terms by calculating GO semantic similarity using GOSemSim. Several useful utilities, including dropGO, go2ont, go2term, gofilter and gsfilter, are also provided.
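A minimal sketch of trimming redundant GO terms with simplify (the ego object and the cutoff are illustrative; ego is assumed to be an enrichGO result):

library(clusterProfiler)
ego2 <- simplify(ego, cutoff = 0.7, by = "p.adjust", select_fun = min)  # drop terms that are too similar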

I bumped the version to 3.0.0 for the following three reasons:

  • the changes to function calls
  • it can analyze any ontology/pathway annotation (supports user-customized annotation data)
  • it can analyze all species that have annotation available (e.g. more than 4000 species for KEGG)

Although the package was very simple when I first published it, I keep updating it and adding new features from my own ideas or users' requests. Now this package is indeed in good shape. Here is the summary.

This package implements methods to analyze and visualize functional profiles of genomic coordinates (supported by ChIPseeker), genes and gene clusters.

clusterProfiler supports both the hypergeometric test and Gene Set Enrichment Analysis for many ontologies/pathways, including:

clusterProfiler also provides several visualization methods to help interpret enrichment results, including:

  • barplot
  • cnetplot
  • dotplot
  • enrichMap
  • gseaplot
  • plotGOgraph (via topGO package)
  • upsetplot (via UpSetR package)

and several useful utilities:

  • bitr (Biological Id TranslatoR)
  • bitr_kegg (bitr using KEGG source)
  • compareCluster (biological theme comparison)
  • dropGO (screen out GO term of specific level or specific term)
  • go2ont (convert GO ID to Ontology)
  • go2term (convert GO ID to descriptive term)
  • gofilter (restrict result at specific GO level)
  • gsfilter (restrict result by gene set size)
  • search_kegg_organism (search kegg supported organism)
  • setReadable (convert IDs stored enrichResult object to gene symbol)
  • simplify (remove redundant GO terms, supported via GOSemSim)

DOSE

DOSE now tests the bimodal distribution separately in GSEA, and the output p-values are more conservative.

A maxGSSize parameter was added, with a default value of 500. Usually if a gene set contains more than 500 genes, its probability of being called significant by GSEA rises quite dramatically.

The gsfilter function restricts enrichment results by minimal and maximal gene set sizes.

upsetplot was implemented to visualize overlap of enriched gene sets.

The dot sizes in enrichMap are now scaled by category size.

All these changes also affect clusterProfiler and ReactomePA.

ggtree

I put more effort into extending ggtree than into all the other packages combined. The major new features are listed here, while small improvements and bug fixes can be found in the NEWS file.

IO

  • supports NHX file format via read.nhx function
  • supports phylip tree format via read.phylip function
  • raxml2nwk for converting raxml bootstrap tree to newick text
  • all parser functions support passing textConnection(text_string) as a file
  • supports ape bootstrap analysis
  • supports annotating tree with ancestral sequences inferred by phangorn
  • supports obkData object defined by OutbreakTools package
  • supports phyloseq object defined by phyloseq package

layers

  • geom_point2,geom_text2, geom_segment2 and geom_label2 to support subsetting
  • geom_treescale for adding scale of branch length
  • geom_cladelabel for labeling selected clade
  • geom_tiplab2 for adding tiplab of circular tree
  • geom_taxalink for connecting related taxa
  • geom_range for adding range to present uncertainty of branch lengths
  • subview and inset now support annotating with image files

utilities

  • rescale_tree function to rescale branch lengths using numerical variable
  • MRCA for finding Most Recent Common Ancestor among a vector of tips
  • viewClade to zoom in a selected clade

vignettes

Split the long vignette to several small ones and add more examples.

Here is the NEWS record:

CHANGES IN VERSION 1.3.16
------------------------
 o geom_treescale() supports family argument <2016-04-27, Wed>
   + https://github.com/GuangchuangYu/ggtree/issues/56
 o update fortify.phylo to work with phylo that has missing value of edge length <2016-04-21, Thu>
   + https://github.com/GuangchuangYu/ggtree/issues/54
 o support passing textConnection(text_string) as a file <2016-04-21, Thu>
   + contributed by Casey Dunn <casey_dunn@brown.edu>
   + https://github.com/GuangchuangYu/ggtree/pull/55#issuecomment-212859693
 
CHANGES IN VERSION 1.3.15
------------------------
 o geom_tiplab2 supports parameter hjust <2016-04-18, Mon>
 o geom_tiplab and geom_tiplab2 support using geom_label2 by passing geom="label" <2016-04-07, Thu>
 o geom_label2 that support subsetting <2016-04-07, Thu>
 o geom_tiplab2 for adding tip label of circular layout <2016-04-06, Wed>
 o use plot$plot_env to access ggplot2 parameter <2016-04-06, Wed>
 o geom_taxalink for connecting related taxa <2016-04-01, Fri> 
 o geom_range for adding range of HPD to present uncertainty of evolutionary inference <2016-04-01, Fri>
 
CHANGES IN VERSION 1.3.14
------------------------
 o geom_tiplab works with NA values, compatible with collapse <2016-03-05, Sat>
 o update theme_tree2 due to the issue of https://github.com/hadley/ggplot2/issues/1567 <2016-03-05, Sat>
 o offset works in `align=FFALSE` with `annotation_image` function <2016-02-23, Tue>
   + see https://github.com/GuangchuangYu/ggtree/issues/46
 o subview and inset now supports annotating with img files <2016-02-23, Tue>
 
CHANGES IN VERSION 1.3.13
------------------------
 o add example of rescale_tree function in treeAnnotation.Rmd <2016-02-07, Sun>
 o geom_cladelabel works with collapse <2016-02-07, Sun>
   + see https://github.com/GuangchuangYu/ggtree/issues/38 

CHANGES IN VERSION 1.3.12
------------------------
 o exchange function name of geom_tree and geom_tree2  <2016-01-25, Mon>
 o solved issues of geom_tree2 <2016-01-25, Mon>
   + https://github.com/hadley/ggplot2/issues/1512
 o colnames_level parameter in gheatmap <2016-01-25, Mon>
 o raxml2nwk function for converting raxml bootstrap tree to newick format <2016-01-25, Mon> 
 
CHANGES IN VERSION 1.3.11
------------------------
 o solved issues of geom_tree2 <2016-01-25, Mon>
   + https://github.com/GuangchuangYu/ggtree/issues/36
 o change compute_group() to compute_panel in geom_tree2() <2016-01-21, Thu>
   + fixed issue, https://github.com/GuangchuangYu/ggtree/issues/36
 o support phyloseq object <2016-01-21, Thu>
 o update geom_point2, geom_text2 and geom_segment2 to support setup_tree_data <2016-01-21, Thu>
 o implement geom_tree2 layer that support duplicated node records via the setup_tree_data function <2016-01-21, Thu>
 o rescale_tree function for rescaling branch length of tree object <2016-01-20, Wed>
 o upgrade set_branch_length, now branch can be rescaled using feature in extraInfo slot <2016-01-20, Wed>
 
CHANGES IN VERSION 1.3.10
------------------------
 o remove dependency of gridExtra by implementing multiplot function instead of using grid.arrange <2016-01-20, Wed>
 o remove dependency of colorspace <2016-01-20, Wed>
 o support phylip tree format and update vignette of phylip example <2016-01-15, Fri>

CHANGES IN VERSION 1.3.9
------------------------
 o optimize getYcoord <2016-01-14, Thu>
 o add 'multiPhylo' example in 'Tree Visualization' vignette <2016-01-13, Wed>
 o viewClade, scaleClade, collapse, expand, rotate, flip, get_taxa_name and scale_x_ggtree accepts input tree_view=NULL.
   these function will access the last plot if tree_view=NULL. <2016-01-13, Wed>
   + > ggtree(rtree(30)); viewClade(node=35) works. no need to pipe.
 
CHANGES IN VERSION 1.3.8
------------------------
 o add example of viewClade in 'Tree Manipulation' vignette <2016-01-13, Wed>
 o add viewClade function <2016-01-12, Tue>
 o support obkData object defined by OutbreakTools <2016-01-12, Tue>
 o update vignettes <2016-01-07, Thu>
 o 05 advance tree annotation vignette <2016-01-04, Mon>
 o export theme_inset <2016-01-04, Mon>
 o inset, nodebar, nodepie functions <2015-12-31, Thu>
 
CHANGES IN VERSION 1.3.7
------------------------
 o split the long vignette to several vignettes
   + 00 ggtree <2015-12-29, Tue>
   + 01 tree data import <2015-12-28, Mon>
   + 02 tree visualization <2015-12-28, Mon>
   + 03 tree manipulation <2015-12-28, Mon>
   + 04 tree annotation <2015-12-29, Tue>
 
CHANGES IN VERSION 1.3.6
------------------------
 o MRCA function for finding Most Recent Common Ancestor among a vector of tips <2015-12-22, Tue>
 o geom_cladelabel: add bar and label to annotate a clade <2015-12-21, Mon>
   - remove annotation_clade and annotation_clade2 functions.
 o geom_treescale: tree scale layer. (add_legend was removed) <2015-12-21, Mon>
 
CHANGES IN VERSION 1.3.5
------------------------
 o bug fixed, read.nhx now works with scientific notation <2015-11-30, Mon>
   + see https://github.com/GuangchuangYu/ggtree/issues/30

CHANGES IN VERSION 1.3.4
------------------------
 o rename beast feature when name conflict with reserve keywords (label, branch, etc) <2015-11-27, Fri>
 o get_clade_position function <2015-11-26, Thu>
   + https://github.com/GuangchuangYu/ggtree/issues/28
 o get_heatmap_column_position function <2015-11-25, Wed>
   + see https://github.com/GuangchuangYu/ggtree/issues/26
 o support NHX (New Hampshire X) format via read.nhx function <2015-11-17, Tue>
 o bug fixed in extract.treeinfo.jplace <2015-11-17, Thu>
 
CHANGES IN VERSION 1.3.3
------------------------
 o support color=NULL in gheatmap, then no colored line will draw within the heatmap <2015-10-30, Fri>
 o add `angle` for also rectangular, so that it will be available for layout='rectangular' following by coord_polar() <2015-10-27, Tue>
 
CHANGES IN VERSION 1.3.2
------------------------
 o update vignette, add example of ape bootstrap and phangorn ancestral sequences <2015-10-26, Mon>
 o add support of ape bootstrap analysis <2015-10-26, Mon>
   see https://github.com/GuangchuangYu/ggtree/issues/20
 o add support of ancestral sequences inferred by phangorn <2015-10-26, Mon>
   see https://github.com/GuangchuangYu/ggtree/issues/21
 
CHANGES IN VERSION 1.3.1
------------------------
 o change angle to angle + 90, so that label will in radial direction <2015-10-22, Thu>
   + see https://github.com/GuangchuangYu/ggtree/issues/17
 o na.rm should be always passed to layer(), fixed it in geom_hilight and geom_text2 <2015-10-21, Wed>
   + see https://github.com/hadley/ggplot2/issues/1380
 o matching beast stats with tree using internal node number instead of label <2015-10-20, Tue>

GOSemSim

Updated IC data using the updated OrgDb packages.

ReactomePA

We published ReactomePA in Molecular BioSystems.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

   see https://github.com/GuangchuangYu/ggtree/issues/20
 o add support of ancestral sequences inferred by phangorn <2015-10-26, Mon>
   see https://github.com/GuangchuangYu/ggtree/issues/21
 
CHANGES IN VERSION 1.3.1
------------------------
 o change angle to angle + 90, so that label will in radial direction <2015-10-22, Thu>
   + see https://github.com/GuangchuangYu/ggtree/issues/17
 o na.rm should be always passed to layer(), fixed it in geom_hilight and geom_text2 <2015-10-21, Wed>
   + see https://github.com/hadley/ggplot2/issues/1380
 o matching beast stats with tree using internal node number instead of label <2015-10-20, Tue>

GOSemSim

Updated IC data using the updated OrgDb packages.

ReactomePA

The internal implementation was updated to follow the changes in DOSE, and we published ReactomePA in Molecular BioSystems.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang Yu.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Viewing all 152 articles
Browse latest View live