Search Results for "heatmap" – R-bloggers

Attention Is Preference: A Foundation Derived from Brand Involvement Segmentation


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
"A wealth of information creates a poverty of attention."

We categorize our world so that we can ignore most of it. In order to see the figure, everything else must become ground. Once learned, the process seems automatic, and we forget how hard and long it took to achieve automaticity. It is not easy learning how to ride a bicycle, but we never forget. The same can be said of becoming fluent in a foreign language or learning R or deciding what toothpaste to buy. The difficulty of the task varies, yet the process remains the same.
Our attention is selective, as is our exposure to media and marketing communications. Neither is passive, although our awareness is limited because the process is automatic. We do not notice advertising for products we do not buy or use. We walk past aisles in the grocery store and never see anything on the shelves until a recipe or changing circumstances require that we look. We are lost in conversation at the cocktail party until we hear someone call our name. Source separation requires a considerable amount of learning and active processing of which we are unaware until it is brought to our attention.

Attention is Preference and the First Step in Brand Involvement

To attend is to prefer, though that preference may be negative as in avoidance rather than approach. Attention initiates the purchase process, so this is where we should begin our statistical modeling. We are not asking the consumer for inference, "Which of these contributes most to your purchase choice?" We are merely taking stock or checking inventory. If you wish to have more than a simple checklist, you can inquire about awareness, familiarity and usage, for all of these are stored in episodic memory. In a sense, we are measuring attentional intensity with a behaviorally anchored scale. Awareness, familiarity and usage are three hurdles that a brand must surpass in order to achieve success. Attention becomes a continuum measured with milestones as brand involvement progresses from awareness to habit.

Still, the purpose of selective attention is simplification, so that much of the market and its features will never pass the first hurdle. We recognize and attend to that which is already known and familiar, and in the process, all else becomes background. Take a moment the next time you are at the supermarket making your usual purchases. As you reach for your brand, look at the surrounding area for all the substitute products that you never noticed because you were focused on one object on the shelf. In order to focus on one product or brand or feature, we must inhibit our response to all the rest. As the number of alternatives grows, attention becomes scarcer.

The long tail illustrates the type of data that needs to be modeled. If you enter "long tail" into a search engine looking for images, you will discover that the phenomenon seems to be everywhere as a descriptive model of product purchase, feature usage, search results and more. We need to be careful and keep the model descriptive rather than treat it as a claim that the future is selling less of more. For some childlike reason, I personally prefer the following image describing search results, with the long tail represented by the dinosaur, rather than the more traditional depiction of product popularity in the new marketplace.



Unfortunately, this figure conceals the heterogeneity that produces the long tail. In the aggregate we appear to have homogeneity when the tail may be produced by many niche segments seeking distinct sets of products or features. Attention is selective and enables us to ignore most of the market, yet individual consumers attend to their own products and features. Though we may wish to see ourselves as unique individuals, there are always many others with similar needs and interests so that each of us belongs to a community whether we know it or not. Consequently, we start our study of preference by identifying consumer types who live in disparate worlds created by selective exposure and attention to different products and features.

Building a Foundation with Brand Involvement Segmentation

Even without intent, our attention is directed by prior experience and selective exposure through our social network and the means by which we learn about and buy products and services. Sparsity is not accidental but shaped by wants and needs within a particular context. For instance, knowing the brand and type of makeup that you buy tells me a great deal about your age and occupation and social status (and perhaps even your sex). Even if we restrict our sample to those who regularly buy makeup, the variety among users, products and brands is sufficient to generate segments who will never buy the same makeup through the same channel.

Why not just ask about the benefits consumers are seeking or the features that interest them and cluster on the ratings or some type of forced choice (e.g., best-worst scaling)? Such questions do not access episodic memory and do not demand that the respondent relive past events. Instead, the responses are relatively complex constructions controlled by conversational rules that govern how and what we say about ourselves when asked by strangers.

As I have tried to outline in two previous posts, consumers do not possess direct knowledge of their purchase processes. Instead, they observe themselves and infer why they seem to like this but not that. Moreover, unless the question asks for recall of a specific occurrence, the answer will reflect the gist of the memory and measure overall affect (e.g., a halo effect). Thus, let us not contaminate our responses by requesting inference but restrict ourselves to concrete questions that can be answered by more direct retrieval. While all remembering is a constructive process, episodic memory requires less assembly.

Nonnegative Matrix Factorization (NMF)

Do we rely so much on rating scales because our statistical models cannot deal easily with highly skewed variables where the predominant response is never or not applicable? If so, R provides an interface to nonnegative matrix factorization (NMF), an algorithm that thrives on such sparse data matrices. During the past six weeks my posts have presented the R code needed to perform an NMF and have tried to communicate an intuitive sense of how and why such matrix factorization works in practice. You need only look in the titles for the keywords "matrix factorization" to find additional details in those previous posts.

I will draw an analogy with topic modeling in an attempt to explain this approach. Topic modeling starts with a bag of words used in a collection of documents. The assumptions are that the documents cover different topics and that the words used reflect the topics discussed by each document. In our makeup example, we might present a long checklist of brands and products replacing the bag of words in topic modeling. Then, instead of word counts as our intensity measure, we might ask about familiarity using an ordinal intensity scale (e.g., 0=never heard, 1=heard but not familiar, 2=somewhat familiar but never used, 3=used but not regularly, and 4=use regularly). Just as the word "401K" implies that the document deals with a financial topic, regular purchasing of Clinique Repairwear Foundation from Nordstrom helps me locate you within a particular segment of the cosmetics market. Nordstrom is an upscale department store, Clinique is not a mass market brand, and you can probably guess who Repairwear Foundation is for by the name alone.

Unfortunately, it is very difficult in a blog post to present brand usage data at a level of specificity that demonstrates how NMF is capable of handling many variables with much of the data being zeros (no awareness). Therefore, I will only attempt to give a taste of what such an analysis would look like with actual data. I have aggregated the data into brand-level familiarity using the ordinal scale discussed in the previous paragraph. I will not present any R code because the data are proprietary, and instead refer you to previous posts where you can find everything you need to run the nmf function from the NMF package (e.g., continuous or discrete latent structure, learn from top rankings, and pathways in consumer decision journey).

The output can be summarized with two heatmaps: one indicating the "loadings" of the brands on the latent features so that we can name those hidden constructs and the second clustering individuals based on those latent features.
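To make this concrete, here is a minimal sketch of the workflow on simulated data rather than the proprietary survey. The matrix, brand names and rank below are invented for illustration, but the nmf, coefmap and basismap calls are the same ones used in the previous posts.

# simulate an ordinal familiarity matrix (0=never heard ... 4=use regularly)
# for 200 respondents and 30 hypothetical brands
library(NMF)
set.seed(42)
familiarity <- matrix(rbinom(200*30, size=4, prob=0.15), nrow=200, ncol=30,
                      dimnames=list(NULL, paste0("Brand", 1:30)))
familiarity <- familiarity[rowSums(familiarity) > 0, ]  # drop any all-zero respondents

fit <- nmf(familiarity, 6, method="lee", nrun=10)  # 6 latent features, purely as an example
coefmap(fit, tracks=NA)    # brands by latent features ("loadings")
basismap(fit, tracks=NA)   # respondents by latent features (mixing weights)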

Like factor analysis, one can vary the number of latent variables until an acceptable solution is found. The NMF package offers a number of criteria, but interpretability must take precedence. In general, we want to see a lot of yellow, indicating that we have achieved some degree of simple structure. It would be helpful if each latent feature were anchored, that is, if a few rows or columns had values near one. This is a restatement of the varimax criterion in factor rotation (see Varimax, page 3). The variance of the factor loadings is maximized when the distribution is bimodal, and this type of separation is what we are seeking from our NMF.

The dendrogram at the top of the following heatmap displays the results of a hierarchical clustering of the brands based on their association with the latent features. It is a good place to start. I am not going into much detail, but let me name the latent features from Rows 1-6: 1. Direct Sales, 2. Generics, 3. Style, 4. Mass Market, 5. Upscale, and 6. Beauty Tools. The latent features were given names that would be accessible to those with no knowledge of the cosmetics market. That is, differentiated retail markets in general tend to have a lower end with generic brands, a mass market in the middle with the largest share, and a group of more upscale brands at the high end. The distribution channel also has its impact, with direct sales adding differentiation to the usual separation between supermarkets, drug stores, department and specialty stores.


Now, let us look at the same latent features in the second heatmap below using the dendrogram on the left as our guide. You should recall that the rows are consumers, so the hierarchical clustering displayed by the dendrogram can be considered a consumer segmentation. As we work our way down from the top, we see the mass market in Column 4 (looking for both reddish blocks and gaps in the dendrogram), direct sales in Column 1 (again based on darker color but also glancing at the dendrogram), and beauty tools in Column 6. All three of these clusters are shown to be joined by the dendrogram later in the hierarchical process. The upscale consumers in Column 5 form their own cluster according to the dendrogram, as do the generics in Column 2. Finally, Column 3 represents those consumers who are more familiar with artistic brands.


My claim is that segments live in disparate worlds, or at least segregated neighborhoods, defined in this case study by user imagery (e.g., age and social status) and place of purchase (e.g., direct selling, supermarkets and drug stores, and the more upscale department and specialty stores). These segments may use similar vocabulary but probably mean something different. Everyone speaks of product quality and price; however, each segment is applying such terms relative to their own circumstances. The drugstore and the department store shoppers have different price ranges in mind when they tell us that price is not an important consideration in their purchase.

Without knowing the segment or the context, we learn little from asking for importance ratings or forced tradeoffs such as MaxDiff, which is why the word "foundation" describes the brand involvement segmentation. We now have a basis for the interpretation of all perceptual and importance data collected with questions that have no concrete referent. The resulting segments ought to be analyzed separately, for they are different communities speaking their own languages or at least having their own definitions of terms such as cost, quality, innovative, prestige, easy, service and support.

Of course, I have oversimplified to some extent in order for you to see the pattern that can be recovered from the heatmaps. We need to examine the dendrogram more carefully since each individual buys more than one brand of makeup for different occasions (e.g., day and evening, work and social). In fact, NMF is able to get very concrete and analyze the many possible combinations of product, brand, and usage occasion. More importantly, NMF excels with sparse data matrices, so do not be concerned if 90% of your data are zeros. The key to probing episodic memory is maintaining high imagery by asking for specifics with details about the occasion, the product and the brand so that the respondent may relive the experience. It may be a long list, but relevance and realism will encourage the respondent to complete a lengthy but otherwise easy task.

Lastly, one does not need to accept the default hierarchical clustering provided in the heatmap function. Some argue that an all-or-none hard clustering based on the highest latent feature weight or mixing coefficient is sufficient, and it may be if the individuals are well separated. However, you have the weights for every respondent, so any clustering method is an alternative. K-means is often suggested, as it is the workhorse of clustering for good reason. Of course, the choice of clustering method depends on your prior beliefs concerning the underlying cluster structure, which would require some time to discuss. I will only note that I have experimented with some interesting options, including affinity propagation, and have had some success.
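As a minimal sketch of that alternative, assuming fit is an NMF object like the one above, one could pull the mixing weights with basis() and hand them to k-means; the number of centers is only an illustration.

library(NMF)
weights <- basis(fit)                      # respondents by latent features (mixing weights)
set.seed(123)
km <- kmeans(weights, centers=6, nstart=25)
table(km$cluster)                          # segment sizes under the hard clustering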

Postscript: It is not necessary to measure brand involvement across its entire range from attention through acquaintance to familiarity and habit. I have been successful with an awareness checklist. Yes, preference can be accessed with a simple recognition task (e.g., presenting a picture from a retail store with all the toothpastes in their actual places on the shelves and asking which ones they have seen before). Preference is everywhere because affect guides everything we notice, search for, learn about, discuss with others, buy, use, make a habit of, or recommend. All we needed was a statistical model for uncovering the pattern hidden in the data matrix.

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


In case you missed it: August 2014 Roundup


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from August of particular interest to R users:  

R is the most popular software in the KDNuggets poll for the 4th year running.

The frequency of R user group meetings continues to rise, and there are now 147 R user groups worldwide.

A video interview with David Smith, Chief Community Officer at Revolution Analytics, at the useR! 2014 conference.

In a provocative op-ed, Norm Matloff worries that Statistics is losing ground to Computer Science.

A new certification program for Revolution R Enterprise.

An interactive map of R user groups around the world, created with R and Shiny.

Using R to generate calendar entries (and create photo opportunities).

Integrating R with production systems with Domino.

The New York Times compares data science to janitorial work.

Rdocumentation.org provides search for CRAN, GitHub and BioConductor packages and publishes a top-10 list of packages by downloads.

An update to the "airlines" data set (the "iris" of Big Data) with flights through the end of 2012. 

A consultant compares the statistical capabilities of R, Matlab, SAS, Stata and SPSS.

Using heatmaps to explore correlations in financial portfolios.

Video of John Chambers' keynote at the useR! 2014 conference on the interfaces, efficiency, big data and the history of R.

CIO magazine says the open source R language is becoming pervasive.

Reviews of some presentations at the JSM 2014 conference that used R.

GRAN is a new R package to manage package repositories to support reproducibility.

The ASA launches a PR campaign to promote the role of statisticians in society.

Video replay of the webinar Applications in R, featuring examples from several companies using R.

General interest stories (not related to R) in the past month included: dance moves from Japan, an earthquake's signal in personal sensors, a 3-minute movie in less than 4k, smooth time-lapse videos, representing mazes as trees and the view from inside a fireworks display.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


How to publish R and ggplot2 to the web


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Matt Sundquist, Plotly Co-founder

It's delightfully smooth to publish R code, plots, and presentations to the web. For example:

  • Shiny makes interactive apps from R.
  • Pretty R highlights R code for HTML.
  • Slidify makes slides from R Markdown.
  • Knitr and RPubs let you publish R Markdown docs.
  • GitHub and devtools let you quickly release packages and collaborate.

Now, Plotly lets you collaboratively edit and publish interactive ggplot2 graphs using these tools. This post shows how. Find us on GitHub, at feedback@plot.ly, and @plotlygraphs. For more on our ggplot2 and R support, see our API docs.



You can copy and paste the code below — highlighted with Pretty R — into your R console to install Plotly and make an interactive, web-based plot. Or sign-up and generate your own key to add to the script. You control the privacy of your data and plots, own your work, and public sharing is free and unlimited.
install.packages("devtools")  # so we can install from github
library("devtools")
install_github("ropensci/plotly")  # plotly is part of ropensci
library(plotly)
 
py <- plotly(username="r_user_guide", key="mw5isa4yqp")  # open plotly connection
ggiris <- qplot(Petal.Width, Sepal.Length, data = iris, color = Species)
 
py$ggplotly(ggiris)  # send to plotly

Adding py$ggplotly() to your ggplot2 plot creates a Plotly graph online, drawn with D3.js, a popular JavaScript visualization library. The plot, data, and code for making the plot in Julia, Python, R, and MATLAB are all online and editable by you and your collaborators. In this case, it's here: https://plot.ly/~r_user_guide/2; if you forked the plot and wanted to tweak and share it, a new version of the plot would be saved into your profile.



We can share the URL over email or Twitter, add collaborators, export the image and data, or embed the plot in an iframe in this blog post. Click and drag to zoom or hover to see data.



Our iframe points to https://plot.ly/~r_user_guide/1.embed. For more, here is how to embed plots.

Plotting is interoperable, meaning you can make a plot with ggplot2, add data with Python or our Excel plug-in, and edit the plot with someone on your team who uses MATLAB. Your iframe will always show the most up to date version of your plots. For all plots you can edit, share, and download data and plots from within a web GUI, adding fits, styling, and more. Thus, if your ggplot2 plot doesn't precisely translate through to Plotly, you and your team can use the web app to tweak, edit, and style.







Now let's make a plot in a knitr doc (here's a knitr and RPubs tutorial). First, you'll want to open a new R Markdown doc within RStudio.



You can copy and paste this code into RStudio and press "Knit HTML":

## 1. Putting Plotly Graphs in Knitr
 
```{r}
library("knitr")
library("devtools")
url<-"https://plot.ly/~MattSundquist/1971"
plotly_iframe <- paste("<center><iframe scrolling='no' seamless='seamless' style='border:none' src='", url, 
    "/800/1200' width='800' height='1200'></iframe><center>", sep = "")
 
```
`r I(plotly_iframe)`

You'll want to press the "publish" button on the generated RPub preview to push the RPub online with a live graph. A published RPub from the code above is here. You can see how the embedded plot looks in the screenshot below. The RPub shows a Plotly graph of the ggplot2 NBA heatmap from a Learning R post.



Thus, we have three general options to publish interactive plots with your favorite R tools. First, use iframes to embed in RPubs, blogs, and on websites. Or in slides, as seen in Karthik Ram's Slidify presentation from useR 2014.

Second, you can make plots as part of an `.Rmd` document or in IPython Notebooks using R. For a `.Rmd` doc, you specify the `plotly=TRUE` chunk option. Here is an example and source to see the process in action.

Third, you can publish a plot in an iframe in a Shiny app, defining how users interact with your plot. Here is an example with the same plot, and here's how it looks:

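A minimal sketch of that Shiny approach, not taken from the original app: the published plot is embedded with an iframe pointing at the same .embed URL used earlier in this post.

library(shiny)

ui <- fluidPage(
  titlePanel("Embedded Plotly graph"),
  tags$iframe(src="https://plot.ly/~r_user_guide/1.embed",
              width="800", height="600", frameborder="0", seamless="seamless")
)
server <- function(input, output) {}
shinyApp(ui, server)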

A final note. For any Plotly graph, you can call the figure:

py <- plotly("ggplot2examples", "3gazttckd7")  # username and key for your call
figure <- py$get_figure("r_user_guide", 1)  # graph id for plot you want to access
str(figure)

or the data. That means you don't have to store data, plots, and code in different places. It's together and editable on Plotly.
figure$data[]

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


TURF Analysis: A Bad Answer to the Wrong Question


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
Now that R has a package performing Total Unduplicated Reach and Frequency (TURF) Analysis, it might be a good time to issue a warning to all R users. DON'T DO IT!

The technique itself is straight out of media buying from the 1950s. Given n alternative advertising options (e.g., magazines), which set of size k will reach the most readers and be seen the most often? Unduplicated reach is the primary goal because we want everyone in the target audience to see the ad. In addition, it was believed that seeing the ad more than once would make the ad more effective (that is, until wearout), which is why frequency is a component. When TURF is used to create product lines (e.g., flavors of ice cream to carry given limited freezer space), frequency tends to be downplayed and the focus placed on reaching the largest percentage of potential customers. All this seems simple enough until one looks carefully at the details, and then one realizes that we are interpreting random variation.
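To see what is being computed, here is a minimal sketch, separate from the turfR package, in which reach and frequency are calculated directly from a simulated 0/1 buy matrix; all names and numbers below are illustrative.

set.seed(1)
buy <- matrix(rbinom(180*10, 1, 0.2), nrow=180, ncol=10)    # 180 respondents, 10 items

reach <- function(buy, items) mean(rowSums(buy[, items, drop=FALSE]) > 0)
freq  <- function(buy, items) mean(rowSums(buy[, items, drop=FALSE]))

triplets <- combn(10, 3)                                    # all 120 sets of three
scores <- apply(triplets, 2, function(k) reach(buy, k))
triplets[, order(scores, decreasing=TRUE)[1:5]]             # the five "best" triplets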

The R package turfR includes an example showing how to use its turf() function by setting n to 10 and letting k range from 3 to 6.

library(turfR)
data(turf_ex_data)
ex1 <- turf(turf_ex_data, 10, 3:6)
ex1

This code produces a considerable amount of output. I will show only the 10 best triplets from the 120 possible sets of three that can be formed from 10 alternatives. The rchX column gives the weighted proportion of the 180 individuals in the dataset who would buy one of the 10 products listed in the columns labeled with integers from 1 to 10. Thus, according to the first row, 99.9% would buy something if Items 8, 9, and 10 were offered for sale.

   combo     rchX     frqX  1  2  3  4  5  6  7  8  9 10
1    120 0.998673 2.448993  0  0  0  0  0  0  0  1  1  1
2    119 0.998673 2.431064  0  0  0  0  0  0  1  0  1  1
3     99 0.995773 1.984364  0  0  0  1  0  0  0  1  0  1
4    110 0.992894 2.185398  0  0  0  0  1  0  0  0  1  1
5     64 0.991567 1.898693  0  1  0  0  0  0  0  0  1  1
6    109 0.990983 2.106944  0  0  0  0  1  0  0  1  0  1
7     97 0.990850 1.966436  0  0  0  1  0  0  1  0  0  1
8    116 0.989552 2.341179  0  0  0  0  0  1  0  0  1  1
9     85 0.989552 2.042792  0  0  1  0  0  0  0  0  1  1
10    36 0.989552 1.800407  1  0  0  0  0  0  0  0  1  1

The sales pitch for TURF depends on showing only the "best" solution for 3 through 6. Once we look down the list, we find that there are lots of equally good combinations with different products (e.g., the combination in the 7th position yields 99.1% reach with products 4, 7 and 10). With a sample size of 180, I do not need to run a bootstrap to know that the drop from 99.9% to 99.1% reflects random variation or error.
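A quick back-of-the-envelope check, not in the original post, makes the same point: with 180 respondents, the standard error of a reach estimate near 99.5% is roughly half a percentage point, so a 0.8-point gap between the best and the seventh-best triplet is within sampling noise.

n <- 180
p <- 0.995                 # reach in the neighborhood of the top solutions
se <- sqrt(p*(1-p)/n)      # binomial standard error, about 0.0053
round(c(reach=p, std.error=se), 4)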

Of course, the data from turfR is simulated, but I have worked with many clients and many different datasets across a range of categories and I have never found anything but random differences among the top solutions. I have seen solutions where the top several hundred combinations cannot be distinguished based on reach, which is reasonable given that the number of combinations increases rapidly with n and k (e.g., the R function choose(30,5) indicates that there are 142,506 possible combinations of 30 things in sets of 5). You can find an example of what I see over and over again by visiting the TURF website for XLSTAT software.

Obviously, there is no single best item combination that dominates all others. It could have been otherwise. For example, it is possible that the market consists of distinct segments with each wanting one and only one item.

With no overlap in this Venn diagram, it is clear that vanilla is the best single item, followed by vanilla and chocolate as the best pair, and so on had there been more flavors separated in this manner.

However, consumer segments are seldom defined by individual offerings in the market. You do not stop buying toothpaste because your brand has been discontinued. TURF asks the wrong question because consumer segmentation is not item-based.

As a quick example, we can think about credit card reward programs with their categories covering airlines, cash back, gas rebates, hotel, points, shopping and travel. Each category could contain multiple reward offers. A TURF analysis would seek the best individual rewards, ignoring the categories. Yet, comparison websites use categories to organize searches because consumer segments are structured around the benefits offered by each category.

The TURF Analysis procedure from XLSTAT allows you to download an Excel file with purchase intention ratings for 27 items from 185 respondents. A TURF analysis would require that we set a cutoff score to transform the 1 through 5 ratings into a 0/1 binary measure. I prefer to maintain the 5-point scale and treat purchase intent as an intensity score after subtracting one so that the scale now ranges from 0=not at all to 4=quite sure. A nonnegative matrix factorization (NMF) reveals that the 27 items in the columns fall into 8 separable row categories marked by the red indicating a high probability of membership and yellow with values close to zero showing the categories where the product does not belong.

The above heatmap displays the coefficients for each of the 27 products, as the original Excel file names them. Unfortunately, we have only the numbers and no description of the 27 products. Still, it is clear that interest has an underlying structure and that perhaps we ought to consider grouping the products based on shared features, benefits or usages. For example, what do Products 5, 6 and 17 clustered together at the end of this heatmap have in common? Understand, we are looking for stable effects that can be found in the data and in the market where purchases are actually made.

The right question asks about consumer heterogeneity and whether it supports product differentiation. Different product offerings are only needed when the market contains segments seeking different benefits. Those advocating TURF analysis often use ice cream flavors as their example, as I did in the above Venn diagram. What if the benefit driving sales of less common flavors was not the flavor itself but the variety associated with a new flavor or a special occasion when one wants to deviate from the norm? A segmentation, whether NMF or another clustering procedure, would uncover a group interested in less typical flavors (probably many such flavors). This is what I found from the purchase history of whiskey drinkers: a number of segments each buying one of the major brands, plus a special-occasion or variety-seeking segment buying many niche brands. All of this is missed by a TURF analysis that gives us instead a bad answer to the wrong question.

Appendix with R Code needed to generate the heatmap:

First, download the Excel file, convert it to csv format, and set the working directory to the location of the data file.

test <- read.csv("demoTurf.csv")   # purchase intention ratings (1-5) for 27 items
library(NMF)
# drop the first (non-rating) column, shift the 1-5 scale to 0-4,
# and extract 8 latent categories
fit <- nmf(test[,-1]-1, 8, method="lee", nrun=20)
coefmap(fit)   # heatmap of the item coefficients


To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Twitter Pop-up Analytics


(This article was first published on everyday analytics, and kindly contributed to R-bloggers)

Introduction


So I've been thinking a lot lately. Well, that's always true. I should say, I've been thinking a lot lately about the blog. When I started this blog I was very much into the whole quantified self thing, because it was new to me, I liked the data collection and analysis aspect, and I had a lot of time to play around with these little side projects.

When I started the blog I called it "everyday analytics" because that's what I saw it always being; analysis of data on topics that were part of everyday life, the ordinary viewed under the analytical lens, things that everyone can relate to. You can see this in my original about page for the blog which has remained the same since inception.

I was thinking a lot lately about how as my interest in data analysis, visualization and analytics has matured, and so that's not really the case so much anymore. The content of everyday analytics has become a lot less everyday. Analyzing the relative nutritional value of different items on the McDonald's menu (yeesh, looking back now those graphs are pretty bad) is very much something to which most everyone could relate. 2-D Histograms in R? PCA and K-means clustering? Not so much.

So along this line of thinking, for this reason, I thought it's high time to get back into the original spirit of the site when it was started. So I thought I'd do some quick quantified-self type analysis, about something everyone can relate to, nothing fancy. 

Let's look at my Twitter feed.

Background

It wasn't always easy to get data out of Twitter. If you look back at how Twitter's API has changed over the years, there has been considerable uproar about the restrictions they've made in updates; however, they're entitled to do so, as they do hold the keys to the kingdom, after all (it is their product). In fact, I thought it'd be easiest to do this analysis just using the twitteR package, but it appears to be broken since Twitter made said updates to their API.

Luckily I am not a developer. My data needs are simple for some ad hoc analysis. All I need is the data pulled and I am ready to go. Twitter now makes this easy for anyone to do: just go to your settings page:


And then select the 'Download archive' button under 'Your Twitter Archive' (here it is a prompt to resend mine, as I took the screenshot after):


And boom! A CSV of all your tweets is in your inbox ready for analysis. After all this talk about working with "Big Data" and trawling through large datasets, it's nice to take a breather and work with something small and simple.

Analysis

So, as I said, nothing fancy here, just wrote some intentionally hacky R code to do some "pop-up" analytics given Twitter's output CSV. Why did I do it this way, which results in 1990ish looking graphs, instead of in Excel and making it all pretty? Why, for you, of course. Reproducibility. You can take my same R code and run it on your twitter archive (which is probably a lot larger and more interesting than mine) and get the same graphs.

The data set comprises 328 tweets sent by myself between 2012-06-03 and 2014-10-02. The fields I examined were the datetime field (time parting analysis), the tweet source and the text / content.

Time Parting
First let's look at the time trending of my tweeting behaviour:

We can see there is some kind of periodicity, with peaks and valleys in how many tweets I send. The sharp decline near the end is because there are only 2 days of data for October. Also, compared to your average Twitter user, I'd say I don't tweet a lot, generally only once every two days or so on average:

> summary(as.vector(monthly))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   12.00   11.31   15.00   21.00 

Let's take a look and see if there is any rhyme or reason to these peaks and valleys:

Looking at the total counts per month, it looks like I've tweeted less often in March, July and December for whatever reason (for all of this, pardon my eyeballing..)

What about by day of week?

Looks like I've tweeted quite a bit more on Tuesday, and markedly less on the weekend. Now, how does that look over the course of the day?

My peak tweeting time seems to be around 4 PM. Apparently I have sent tweets even in the wee hours of the morning - this was a surprise to me. I took a stab at making a heatmap, but it was quite sparse; however the 4-6 PM peak does persist across the days of the week.
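For anyone who wants to try the same heatmap on their own archive, here is a minimal sketch; it assumes the CSV has a timestamp column formatted like "2014-10-02 16:23:45 +0000" (column names vary by archive version).

tweets <- read.csv("tweets.csv", stringsAsFactors=FALSE)
ts <- strptime(tweets$timestamp, format="%Y-%m-%d %H:%M:%S", tz="UTC")

# count tweets by weekday and hour (weekday names assume an English locale)
counts <- table(factor(weekdays(ts), levels=c("Monday","Tuesday","Wednesday",
                                              "Thursday","Friday","Saturday","Sunday")),
                factor(as.integer(format(ts, "%H")), levels=0:23))

heatmap(as.matrix(counts), Rowv=NA, Colv=NA, scale="none",
        col=heat.colors(16), xlab="Hour of day", ylab="")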

Tweets by Source
Okay, that was interesting. Where am I tweeting from?

Looks like the majority of my tweets are actually sent from the desktop site, followed by my phone, and then sharing on sites. I attribute this to the fact that I mainly use Twitter to share articles, which isn't easy to do on my smartphone.

Content Analysis
Ah, now on to the interesting stuff! What's actually in those tweets?

First let's look at the length of my tweets in a simple histogram:


Looks like my tweets are generally above 70 characters or so, with a large peak close to the absolute limit of 140 characters.

Okay, but what I am actually tweeting about? Using the very awesome tm package it's easy to do some simple text mining and pull out both top frequent terms, as well as hashtags.
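A minimal sketch of that step, again assuming a text column in the archive CSV: tm handles the term frequencies, and a simple regular expression pulls out the hashtags.

library(tm)

corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(corpus)
head(sort(rowSums(as.matrix(tdm)), decreasing=TRUE), 10)    # top frequent terms

hashtags <- unlist(regmatches(tweets$text, gregexpr("#\\w+", tweets$text)))
head(sort(table(tolower(hashtags)), decreasing=TRUE), 10)   # top hashtags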

So apparently I tweet a lot about data, analysis, Toronto and visualization. To anyone who's read the blog this shouldn't be overly surprising. Also, you can see I pass along articles and interact with others, as "via" and "thanks" are in there too. Too bad about that garbage ampersand.


Overwhelmingly the top hashtag I use is #dataviz, followed of course by #rstats. Again, for anyone who knows me (or has seen one of my talks) this should not come as a surprise. You can also see my use of Toronto Open Data in the #opendata and #dataeh hashtags.

Conclusion

That's all for now. As I said, this was just a fun exercise to write some quick, easy R code to do some simple personal analytics on a small dataset. On the plus side the code is generalized, so I invite you to take it and look at your own twitter archive.

Or, you could pull all of someone else's tweets, but that would, of course, require a little more work.

References

code at github

Twitter Help Center: Downloading Your Archive

The R Text Mining (tm) package at CRAN

twitteR package at CRAN

To leave a comment for the author, please follow the link and comment on his blog: everyday analytics.


Modeling Plenitude and Speciation by Jointly Segmenting Consumers and their Preferences


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
In 1993, when music was sold in retail stores, it may have been informative to ask about preference across a handful of music genres. Today, now that the consumer has seized control and the music industry has responded, the market has exploded into more than a thousand different fragmented pairings of artists and their audiences. Grant McCracken, the cultural anthropologist, refers to such proliferation as speciation and the resulting commotion as plenitude. As with movies, genres become microgenres, forcing recommender systems to deal with more choices and narrower segments.

This mapping from the website Every Noise at Once is constantly changing. As the website explains, there is a generating algorithm with some additional adjustments in order to make it all readable, and it all seems to work as an enjoyable learning interface. One clicks on the label to play a music sample. Then, you can continue to a list of artists associated with the category and hear additional samples from each artist. Although the map seems to have interpretable dimensions and reflects similarity among the microgenre, it does not appear to be a statistical model in its present form.

At any given point in time, we are stepping into a dynamic process of artists searching for differentiation and social media seeking to create new communities who share at least some common preferences. Word of mouth is most effective when consumers expect new entries and when spreading the word is its own reward. It is no longer enough for a brand to have a good story if customers do not enjoy telling that story to others. Clearly, this process is common to all product categories even if they span a much smaller scale. Thus, we are looking for a scalable statistical model that captures the dynamics through which buyers and sellers come to a common understanding. 

Borrowing a form of matrix factorization from recommender systems, I have argued in previous posts for implementing this kind of joint clustering of the rows and columns of a data matrix as a replacement for traditional forms of market segmentation. We can try it with a music preference dataset from the R package prefmod. Since I intend to compare my findings with another analysis of the same 1993 music preference data using the new R package RCA and reported in the American Journal of Sociology, we will begin by duplicating the few data modifications that were made in that paper (see the R code at the end of this post).

In previous attempts to account for music preferences, psychologists have focused on the individual and turned to personality theory for an explanation. For the sociologist, there is always the social network. As marketing researchers, we will add the invisible hand of the market. What is available? How do consumers learn about the product category and obtain recommendations? Where is it purchased? When and where is it consumed? Are others involved (public vs private consumption)?

The Internet opens new purchase pathways, encourages new entities, increases choice and transfers control to the consumer. The resulting postmodern market with its plenitude of products, services, and features cannot be contained within a handful of segments. Speciation and micro-segmentation demand a model that reflects the joint evolution where new products and features are introduced to meet the needs of specific audiences and consumers organize their attention around those microgenre. Nonnegative matrix factorization (NMF) represents this process with a single set of latent variables describing both the rows and the columns at the same time.

After attaching the music dataset, NMF will produce a cluster heatmap summarizing the "loadings" of the 17 music genres (columns below) on the five latent features (rows below): Blues/Jazz, Heavy Metal/Rap, Country/Bluegrass, Opera/Classical, and Rock. The dendrogram at the top displays the results of a hierarchical clustering. Although there are five latent features, we could use the dendrogram to extract more than five genre clusters. For example, Big Band and Folk music seem to be grouped together, possibly as a link from classical to country. In addition, Gospel may play a unique role linking country and Blues/Jazz. Whatever we observe in the columns will need to be verified by examining the rows. That is, one might expect to find a segment drawn to country and jazz/blues that also likes gospel.


We would have seen more of the lighter colors with coefficients closer to zero had we found greater separation. Yet, this is not unexpected given the coarseness of music genre. As we get more specific, the columns become increasingly separated by consumers who only listen to or are aware of a subset of the available alternatives. These finer distinctions define today's market for just about everything. In addition, the use of a liking scale forces us to recode missing values to a neutral liking. We would have preferred an intensity scale with missing values coded as zeros because they indicate no interaction with the genre. Recoding missing to zero is not an issue when zero is the value given to "never heard of" or unaware.

Now, a joint segmentation means that listeners in the rows can be profiled using the same latent features accounting for covariation among the columns. Based on the above coefficient map, we expect those who like opera to also like classical music so that we do not require two separate scores for opera and classical but only one latent feature score. At least this is what we found with this data matrix. A second heatmap enables us to take a closer look at over 1500 respondents at the same time.


We already know how to interpret this heatmap because we have had practice with the coefficients. These colors indicate the values of the mixing weights for each respondent. Thus, in the middle of the heatmap you can find a dark red rectangle for latent feature #3, which we have already determined to represent country/bluegrass. These individuals give the lowest possible rating to everything except for the genre loading on this latent feature. We do not observe that much yellow or lighter colors in this heatmap because less than 13% of the responses fell into the lowest box labeled "dislike very much." However, most of the lighter regions are where you might expect them to be, for example, heavy metal/rap (#2), although we do uncover a heavy metal segment at the bottom of the figure.

Measuring Attraction and Ignoring Repulsion

We often think of liking as a bipolar scale, although what determines attraction can be different from what drives repulsion. Music is one of those product categories where satisfiers and dissatisfiers tend to be different. Negative responses can become extreme so that preference is defined by what one dislikes rather than what one likes. In fact, it is being forced to listen to music that we do not like that may be responsible for the lowest scores (e.g., being dragged to the opera or loud music from a nearby car). So, what would we find if we collapsed the bottom three categories and measured only attraction on a 3-point scale with 0=neutral, dislike or dislike very much, 1=like, and 2=like very much?

NMF thrives on sparsity, so increasing the number of zeros in the data matrix does not stress the computational algorithm. Indeed, the latent features become more separated, as we can see in the coefficient heatmap. Gospel stands alone as its own latent feature. Country and bluegrass remain, as do opera/classical, blues/jazz, and rock. When we "remove" dislike for heavy metal and rap, heavy metal moves into rock and rap floats with reggae between jazz and rock. The same is true for folk and easy mood music, only now both are attractive to country and classical listeners.

More importantly, we can now interpret the mixture weights for individual respondents as additive attractors, so that the first few rows are those with interest in all the musical genres. In addition, we can easily identify listeners with specific interests. As we continue to work our way down the heatmap, we find jazz/blues (#4), followed by rock (#5) and a combination of jazz and rock. Continuing, we see country (#2) plus rock and country alone, after which is a variety of gospel (#1) plus some other genres. We end with opera and classical music, by itself and in combination with jazz.

Comparison with the Cultural Omnivore Hypothesis

As mentioned earlier, we can compare our findings to a published study testing whether inclusiveness rules tastes in music (the eclectic omnivore) or whether cultural distinctions between highbrow and lowbrow still dominate. Interestingly, the cluster analysis is approached as a graph-partitioning problem where the affinity matrix is defined as similarity in the score pattern regardless of mean level. Not everyone agrees with this calculation, and we have a pair of dueling R packages using different definitions of similarity (the RCA vs. the CCA).

None of this is news for those of us who perform cluster analysis using the affinity propagation R package apcluster, which enables several different similarity measures including correlations (signed and unsigned). If you wish to learn more, I would suggest starting with the Orange County R User webinar for apcluster. The quality and breadth of the documentation will flatten your learning curve.
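As a minimal sketch, assuming the prefer data frame built in the R code at the end of this post, affinity propagation with a signed correlation similarity would look something like this; the q value is only an illustration.

library(apcluster)

sim <- corSimMat(as.matrix(prefer), signed=TRUE)  # correlation-based similarity between respondents
ap <- apcluster(sim, q=0)                         # lower q favors fewer exemplars (clusters)
length(ap@clusters)                               # number of clusters found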

Both of the dueling R packages argue that preference similarity ought to be defined by the highs and lows in the score profiles ignoring the mean ratings for different individuals. This is a problem for marketing since consumers who do not like anything ought to be treated differently from consumers who like everything. One is a prime target and the other is probably not much of a user at all.

Actually, if I were interested in testing the cultural omnivore hypothesis, I would be better served by collecting familiarity data on a broader range of more specific music genres, perhaps not as detailed as the above map but more revealing than the current broad categories. The earliest signs of preference can be seen in what draws our attention. Recognition tends to be a less obtrusive measure than preference, and we can learn a great deal knowing who visits each region in the music genre map and how long they stayed.

NMF identifies a sizable audience who are familiar with the same subset of music genre. These are the latent features, the building blocks as we have seen in the coefficient heatmaps. The lowbrow and the highbrow each confine themselves to separate latent features, residing in gated communities within the music genre map and knowing little of the other's world. The omnivore travels freely across these borders. Such class distinctions may be even more established in the cosmetics product category (e.g., women's makeup). Replacing genre with brand, you can read how this was handled in a prior post using NMF to analyze brand involvement.

R code to perform all the analyses reported in this post
library(prefmod)
data(music)
 
# keep only the 17 genre used
# in the AMJ Paper (see post)
prefer<-music[,c(1:11,13:18)]
 
# calculate number of missing values for each
# respondent and keep only those with no more
# than 6 missing values
miss<-apply(prefer,1,function(x) sum(is.na(x)))
prefer<-prefer[miss<7,]
 
# run frequency tables for all the variables
apply(prefer,2,function(x) table(x,useNA="always"))
# recode missing to the middle of the 5-point scale
prefer[is.na(prefer)]<-3
# reverse the scale so that larger values are
# associated with more liking and zero is
# the lowest value
prefer<-5-prefer
 
# longer names are easier to interpret
names(prefer)<-c("BigBand",
"Bluegrass",
"Country",
"Blues",
"Musicals",
"Classical",
"Folk",
"Gospel",
"Jazz",
"Latin",
"MoodEasy",
"Opera",
"Rap",
"Reggae",
"ConRock",
"OldRock",
"HvyMetal")
 
library(NMF)
fit<-nmf(prefer, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)
 
# recode bottom three boxes to zero
# and rerun NMF
prefer2<-prefer-2
prefer2[prefer2<0]<-0
# need to remove respondents with all zeros
total<-apply(prefer2,1,sum)
table(total)
prefer2<-prefer2[total>0,]
 
fit<-nmf(prefer2, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Quarterback Completion Heatmap Using dplyr


(This article was first published on Decisions and R, and kindly contributed to R-bloggers)

Several months ago, I found Bryan Povlinkski's (really nicely cleaned) dataset with 2013 NFL play-by-play information, based on data released by Brian Burke at Advanced Football Analytics.

I decided to browse QB completion rates based on Pass Location (Left, Middle, Right), Pass Distance (Short or Deep), and Down. I ended up focusing on the 5 quarterbacks with the most passing attempts.

The plot above (based on code below) shows a heatmap based on completion rate. Darker colors correspond to a better completion percentage.

Because we've only got data from one year, even looking at the really high-volume passers means that the data are pretty sparse for some combinations of these variables. It's a little rough, but in these cases, I decided not to plot anything. This plot could definitely be improved by plotting gray areas instead of white.

There are a few patterns here – first, it's interesting to look at each player's success with Short compared to Deep passes. Every player, as we would expect, has more success with Short rather than Deep passes, but this difference seems especially pronounced for Drew Brees (who seems to have more success with Short passes compared to the other players). Brees seems to have pretty uniform completion rates across the three pass locations at short distance too – most other players have slightly better completion rates to the outside, especially at short distance.

As we would expect, we can also see a fairly pronounced difference in completion rates for deep throws on 3rd down vs. 1st and 2nd down. The sample size is small, so the estimates aren't very precise, but this pattern is definitely there – probably best exemplified by Tom Brady and Peyton Manning's data.

As a next step, it would be interesting to make the same plot with pass attempts rather than completion rates; a sketch of that variant follows the code below.

library(dplyr)
library(ggplot2)
 
# note: change path to the dataset
df = read.csv("C:/Users/Mark/Desktop/RInvest/nflpbp/2013 NFL Play-by-Play Data.csv",
              stringsAsFactors = F)
 
passers = df %>%
  filter(Play.Type == "Pass") %>%
  group_by(Passer) %>%
  summarize(n.obs = length(Play.Type)) %>%
  arrange(desc(n.obs))
top.passers = head(passers$Passer, 5)
 
df %>%
  filter(Play.Type == "Pass",
         Passer %in% top.passers) %>%
  mutate(Pass.Distance = factor(Pass.Distance, levels = c("Short", "Deep"))) %>%
  group_by(Down, Passer, Pass.Location, Pass.Distance) %>%
  summarize(share = (sum(Pass.Result == "Complete") / length(Pass.Result)),
            n.obs = length(Pass.Result)) %>%
  filter(n.obs > 5) %>%
  ggplot(., aes(Pass.Location, Pass.Distance)) +
  geom_tile(aes(fill = share), colour = "white") +
  facet_wrap(Passer ~ Down, ncol = 3) +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
  theme_bw() +
  ggtitle("NFL QB completion by Pass Distance, Location, and Down")

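Building on the idea mentioned above, a sketch of the attempts version reuses the same objects (df and top.passers) from the code block and simply fills the tiles with the number of attempts instead of the completion rate.

df %>%
  filter(Play.Type == "Pass", Passer %in% top.passers) %>%
  mutate(Pass.Distance = factor(Pass.Distance, levels = c("Short", "Deep"))) %>%
  group_by(Down, Passer, Pass.Location, Pass.Distance) %>%
  summarize(n.obs = length(Pass.Result)) %>%
  ggplot(., aes(Pass.Location, Pass.Distance)) +
  geom_tile(aes(fill = n.obs), colour = "white") +
  facet_wrap(Passer ~ Down, ncol = 3) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_bw() +
  ggtitle("NFL QB pass attempts by Pass Distance, Location, and Down")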

To leave a comment for the author, please follow the link and comment on his blog: Decisions and R.


Introducing Stepwise Correlation Rank


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

So in the last post, I attempted to replicate the Flexible Asset Allocation paper. I’d like to offer my thanks to Pat of Intelligent Trading Tech (not updated recently, hopefully this will change) for helping me corroborate the results, so that I have more confidence there isn’t an error in my code.

One of the procedures the authors of the FAA paper used is a correlation rank, which I interpreted as the average correlation of each security to the others.

The issue, pointed out to me in a phone conversation I had with David Varadi, is that when considering correlation, shouldn’t the correlations the investor is concerned about be between instruments within the portfolio, as opposed to simply all the correlations, including those to instruments not in the portfolio? To that end, when selecting assets (or possibly features in general), it conceptually makes more sense to select in a stepwise fashion–that is, start off with a subset of the correlation matrix, and then rank assets in order of their correlation to the heretofore selected assets, as opposed to all of them. This was explained in Mr. Varadi’s recent post.

Here’s a work in progress function I wrote to formally code this idea:

stepwiseCorRank <- function(corMatrix, startNames=NULL, stepSize=1, bestHighestRank=FALSE) {
  if(is.null(startNames)) {
    corSums <- rowSums(corMatrix)
    corRanks <- rank(corSums)
    startNames <- names(corRanks)[corRanks <= stepSize]
  }
  nameList <- list()
  nameList[[1]] <- startNames
  rankList <- list()
  rankCount <- 1
  rankList[[1]] <- rep(rankCount, length(startNames))
  rankedNames <- do.call(c, nameList)
  
  while(length(rankedNames) < nrow(corMatrix)) {
    rankCount <- rankCount+1
    subsetCor <- corMatrix[, rankedNames]
    if(class(subsetCor) != "numeric") {
      subsetCor <- subsetCor[!rownames(corMatrix) %in% rankedNames,]
      if(class(subsetCor) != "numeric") {
        corSums <- rowSums(subsetCor)
        corSumRank <- rank(corSums)
        lowestCorNames <- names(corSumRank)[corSumRank <= stepSize]
        nameList[[rankCount]] <- lowestCorNames
        rankList[[rankCount]] <- rep(rankCount, min(stepSize, length(lowestCorNames)))
      } else { #1 name remaining
        nameList[[rankCount]] <- rownames(corMatrix)[!rownames(corMatrix) %in% names(subsetCor)]
        rankList[[rankCount]] <- rankCount
      }
    } else {  #first iteration, subset on first name
      subsetCorRank <- rank(subsetCor)
      lowestCorNames <- names(subsetCorRank)[subsetCorRank <= stepSize]
      nameList[[rankCount]] <- lowestCorNames
      rankList[[rankCount]] <- rep(rankCount, min(stepSize, length(lowestCorNames)))
    }    
    rankedNames <- do.call(c, nameList)
  }
  
  ranks <- do.call(c, rankList)
  names(ranks) <- rankedNames
  if(bestHighestRank) {
    ranks <- 1+length(ranks)-ranks
  }
  ranks <- ranks[colnames(corMatrix)] #return to original order
  return(ranks)
}

So the way the function works is that it takes in a correlation matrix, a starting name (if provided), and a step size (that is, how many assets to select per step, so that the process doesn’t become extremely long when dealing with larger amounts of assets/features). Then, it iterates–subset the correlation matrix on the starting name, and find the minimum value, and add it to a list of already-selected names. Next, subset the correlation matrix columns on the selected names, and the rows on the not selected names, and repeat, until all names have been accounted for. Due to R’s little habit of wiping out labels when a matrix becomes a vector, I had to write some special case code, which is the reason for two nested if/else statements (the first one being for the first column subset, and the second being for when there’s only one row remaining).

Here’s a test script I wrote to test this function out:

require(quantmod) #for getSymbols and Ad
require(PerformanceAnalytics) #for Return.calculate

#the seven FAA mutual funds
mutualFunds <- c("VTSMX", "FDIVX", "VEIEX", "VFISX", "VBMFX", "QRAAX", "VGSIX")

#mid 1997 to end of 2012
getSymbols(mutualFunds, from="1997-06-30", to="2012-12-31")
tmp <- list()
for(fund in mutualFunds) {
  tmp[[fund]] <- Ad(get(fund))
}

#always use a list when intending to cbind/rbind large quantities of objects
adPrices <- do.call(cbind, args = tmp)
colnames(adPrices) <- gsub(".Adjusted", "", colnames(adPrices))

adRets <- Return.calculate(adPrices)

subset <- adRets["2012"]
corMat <- cor(subset)


tmp <- list()
for(i in 1:length(mutualFunds)) {
  rankRow <- stepwiseCorRank(corMat, startNames=mutualFunds[i])
  tmp[[i]] <- rankRow
}
rankDemo <- do.call(rbind, tmp)
rownames(rankDemo) <- mutualFunds
origRank <- rank(rowSums(corMat))
rankDemo <- rbind(rankDemo, origRank)
rownames(rankDemo)[8] <- "Original (VBMFX)"

heatmap(-rankDemo, Rowv=NA, Colv=NA, col=heat.colors(8), margins=c(6,6))

Essentially, using the 2012 year of returns for the 7 FAA mutual funds, I compared how different starting securities changed the correlation ranking sequence.

Here are the results:

               VTSMX FDIVX VEIEX VFISX VBMFX QRAAX VGSIX
VTSMX              1     6     7     4     2     3     5
FDIVX              6     1     7     4     2     5     3
VEIEX              6     7     1     4     2     3     5
VFISX              2     6     7     1     3     4     5
VBMFX              2     6     7     4     1     3     5
QRAAX              5     6     7     4     2     1     3
VGSIX              5     6     7     4     2     3     1
Non-Sequential     5     6     7     2     1     3     4

In short, the algorithm is rather robust to starting security selection, at least judging by this small example. However, comparing VBMFX start to the non-sequential ranking, we see that VFISX changes from rank 2 in the non-sequential to rank 4, with VTSMX going from rank 5 to rank 2. From an intuitive perspective, this makes sense, as both VBMFX and VFISX are bond funds, which have a low correlation with the other 5 equity-based mutual funds, but a higher correlation with each other, thus signifying that the algorithm seems to be working as intended, at least insofar as this small example demonstrates. Here’s a heatmap to demonstrate this in visual form.

The starting security (that is, the ranking order) is on the vertical axis, the securities are on the horizontal axis, and the color denotes each security's rank, from white being first to red being last. Notice once again that the ranking orders are robust in general (consider each column of colors descending), but each particular ranking order is unique.

So far, this code still has to be tested in terms of its applications to portfolio management and asset allocation, but for those interested in such an idea, it’s my hope that this provides a good reference point.

Thanks for reading.


To leave a comment for the author, please follow the link and comment on his blog: QuantStrat TradeR » R.


Trading The Odds Volatility Risk Premium: Addressing Data Mining and Curve-Fitting


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

Several readers, upon seeing the risk and return ratio along with other statistics in the previous post stated that the result may have been the result of data mining/over-optimization/curve-fitting/overfitting, or otherwise bad practice of creating an amazing equity curve whose performance will decay out of sample.

Fortunately, there’s a way to test that assertion. In their book “Trading Systems: A New Approach to System Development and Portfolio Optimization”, Urban Jaekle and Emilio Tomasini use the concept of the “stable region” to demonstrate a way of visualizing whether or not a parameter specification is indeed overfit. The idea of a stable region is that going forward, how robust is a parameter specification to slight changes? If the system just happened to find one good small point in a sea of losers, the strategy is likely to fail going forward. However, if small changes in the parameter specifications still result in profitable configurations, then the chosen parameter set is a valid configuration.

As Frank’s trading strategy only has two parameters (standard deviation computation period, aka runSD for the R function, and the SMA period), rather than make line graphs, I decided to do a brute force grid search just to see other configurations, and plotted the results in the form of heatmaps.

Here’s the modified script for the computations (no parallel syntax in use for the sake of simplicity):

download("https://dl.dropboxusercontent.com/s/jk6der1s5lxtcfy/XIVlong.TXT",
         destfile="longXIV.txt")

download("https://dl.dropboxusercontent.com/s/950x55x7jtm9x2q/VXXlong.TXT", 
         destfile="longVXX.txt") #requires downloader package

xiv <- xts(read.zoo("longXIV.txt", format="%Y-%m-%d", sep=",", header=TRUE))
vxx <- xts(read.zoo("longVXX.txt", format="%Y-%m-%d", sep=",", header=TRUE))
vxmt <- xts(read.zoo("vxmtdailyprices.csv", format="%m/%d/%Y", sep=",", header=TRUE))

getSymbols("^VIX", from="2004-03-29")

vixvxmt <- merge(Cl(VIX), Cl(vxmt))
vixvxmt[is.na(vixvxmt[,2]),2] <- vixvxmt[is.na(vixvxmt[,2]),1]

xivRets <- Return.calculate(Cl(xiv))
vxxRets <- Return.calculate(Cl(vxx))

getSymbols("^GSPC", from="1990-01-01")
spyRets <- diff(log(Cl(GSPC)))

t1 <- Sys.time()
MARmatrix <- list()
SharpeMatrix <- list()
for(i in 2:21) {
  
  smaMAR <- list()
  smaSharpe <- list()
  for(j in 2:21){
    spyVol <- runSD(spyRets, n=i)
    annSpyVol <- spyVol*100*sqrt(252)
    vols <- merge(vixvxmt[,2], annSpyVol, join='inner')
    
    
    vols$smaDiff <- SMA(vols[,1] - vols[,2], n=j)
    vols$signal <- vols$smaDiff > 0
    vols$signal <- lag(vols$signal, k = 1)
    
    stratRets <- vols$signal*xivRets + (1-vols$signal)*vxxRets
    #charts.PerformanceSummary(stratRets)
    #stratRets[is.na(stratRets)] <- 0
    #plot(log(cumprod(1+stratRets)))
    
    stats <- data.frame(cbind(Return.annualized(stratRets)*100, 
                              maxDrawdown(stratRets)*100, 
                              SharpeRatio.annualized(stratRets)))
    
    colnames(stats) <- c("Annualized Return", "Max Drawdown", "Annualized Sharpe")
    MAR <- as.numeric(stats[1])/as.numeric(stats[2])    
    smaMAR[[j-1]] <- MAR
    smaSharpe[[j-1]] <- stats[,3]
  }
  rm(vols)
  smaMAR <- do.call(c, smaMAR)
  smaSharpe <- do.call(c, smaSharpe)
  MARmatrix[[i-1]] <- smaMAR
  SharpeMatrix[[i-1]] <- smaSharpe
}
t2 <- Sys.time()
print(t2-t1)

Essentially, just wrap the previous script in a nested for loop over the two parameters.

I chose ggplot2 to plot the heatmaps for more control over coloring.

Here’s the heatmap for the MAR ratio (that is, returns over max drawdown):

require(reshape2) #for melt
require(ggplot2)

MARmatrix <- do.call(cbind, MARmatrix)
rownames(MARmatrix) <- paste0("SMA", c(2:21))
colnames(MARmatrix) <- paste0("runSD", c(2:21))
MARlong <- melt(MARmatrix)
colnames(MARlong) <- c("SMA", "runSD", "MAR")
MARlong$SMA <- as.numeric(gsub("SMA", "", MARlong$SMA))
MARlong$runSD <- as.numeric(gsub("runSD", "", MARlong$runSD))
MARlong$scaleMAR <- scale(MARlong$MAR)
ggplot(MARlong, aes(x=SMA, y=runSD, fill=scaleMAR))+geom_tile()+scale_fill_gradient2(high="skyblue", mid="blue", low="red")

Here’s the result:

Immediately, we start to see some answers to questions regarding overfitting. First off, is the parameter set published by TradingTheOdds optimized? Yes. In fact, not only is it optimized, it’s by far and away the best value on the heatmap. However, when discussing overfitting, curve-fitting, and the like, the question to ask isn’t “is this the best parameter set available”, but rather “is the parameter set in a stable region?” The answer to that, in my opinion, is yes, as noted by the differing values of the SMA for the 2-day sample standard deviation. Note that this quantity, due to being the sample standard deviation, is actually the square root of the sum of the two squared residuals of that time period.
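For anyone who wants to verify that last point, here is a quick standalone sanity check in R (my own illustration, not part of the original scripts):

#with only two observations, the sample standard deviation reduces to
#sqrt((x1 - m)^2 + (x2 - m)^2) since the n - 1 denominator equals 1,
#which simplifies further to |x1 - x2|/sqrt(2)
x <- c(0.012, -0.004)
m <- mean(x)
sd(x)                 #0.01131371
sqrt(sum((x - m)^2))  #same value
abs(diff(x))/sqrt(2)  #same value again
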

Here are the MAR values for those configurations:

> MARmatrix[1:10,1]
    SMA2     SMA3     SMA4     SMA5     SMA6     SMA7     SMA8     SMA9    SMA10    SMA11 
2.471094 2.418934 2.067463 3.027450 2.596087 2.209904 2.466055 1.394324 1.860967 1.650588 

In this case, not only is the region stable, but the MAR values are all above 2 until the SMA9 value.

Furthermore, note that aside from the stable region of the 2-day sample standard deviation, a stable region using a standard deviation of ten days with less smoothing from the SMA (because there’s already an average inherent in the sample standard deviation) also exists. Let’s examine those values.

> MARmatrix[2:5, 9:16]
      runSD10  runSD11  runSD12  runSD13  runSD14  runSD15  runSD16   runSD17
SMA3 1.997457 2.035746 1.807391 1.713263 1.803983 1.994437 1.695406 1.0685859
SMA4 2.167992 2.034468 1.692622 1.778265 1.828703 1.752648 1.558279 1.1782665
SMA5 1.504217 1.757291 1.742978 1.963649 1.923729 1.662687 1.248936 1.0837615
SMA6 1.695616 1.978413 2.004710 1.891676 1.497672 1.471754 1.194853 0.9326545

Apparently, a standard deviation between 2 and 3 weeks with minimal SMA smoothing also produced some results comparable to the 2-day variant.

Off to the northeast of the plot, using longer periods for the parameters simply causes the risk-to-reward performance to drop steeply. This is essentially an illustration of the detriments of lag.

Finally, there’s a small rough patch between the two aforementioned stable regions. Here’s the data for that.

> MARmatrix[1:5, 4:8]
       runSD5    runSD6    runSD7   runSD8   runSD9
SMA2 1.928716 1.5825265 1.6624751 1.033216 1.245461
SMA3 1.528882 1.5257165 1.2348663 1.364103 1.510653
SMA4 1.419722 0.9497827 0.8491229 1.227064 1.396193
SMA5 1.023895 1.0630939 1.3632697 1.547222 1.465033
SMA6 1.128575 1.3793244 1.4085513 1.440324 1.964293

As you can see, there are some patches where the MAR is below 1, and many where it’s below 1.5. All of these are pretty detached from the stable regions.

Let’s repeat this process with the Sharpe Ratio heatmap.

SharpeMatrix <- do.call(cbind, SharpeMatrix)
rownames(SharpeMatrix) <- paste0("SMA", c(2:21))
colnames(SharpeMatrix) <- paste0("runSD", c(2:21))
sharpeLong <- melt(SharpeMatrix)
colnames(sharpeLong) <- c("SMA", "runSD", "Sharpe")
sharpeLong$SMA <- as.numeric(gsub("SMA", "", sharpeLong$SMA))
sharpeLong$runSD <- as.numeric(gsub("runSD", "", sharpeLong$runSD))
ggplot(sharpeLong, aes(x=SMA, y=runSD, fill=Sharpe))+geom_tile()+
  scale_fill_gradient2(high="skyblue", mid="blue", low="darkred", midpoint=1.5)

And the result:

Again, the TradingTheOdds parameter configuration lights up, but among a region of strong configurations. This time, we can see that in comparison to the rest of the heatmap, the northern stable region seems to have become clustered around the 10-day standard deviation (or 11) with SMAs of 2, 3, and 4. The regions to the northeast are also more subdued by comparison, with the Sharpe ratio bottoming out around 1.

Let’s look at the numerical values again for the same regions.

Two-day standard deviation region:

> SharpeMatrix[1:10,1]
    SMA2     SMA3     SMA4     SMA5     SMA6     SMA7     SMA8     SMA9    SMA10    SMA11 
1.972256 2.210515 2.243040 2.496178 1.975748 1.965730 1.967022 1.510652 1.963970 1.778401 

Again, numbers the likes of which I myself haven’t been able to achieve with more conventional strategies, and numbers the likes of which I haven’t really seen anywhere for anything on daily data. So either the strategy is fantastic, or something is terribly wrong outside the scope of the parameter optimization.

Two week standard deviation region:

> SharpeMatrix[1:5, 9:16]
      runSD10  runSD11  runSD12  runSD13  runSD14  runSD15  runSD16  runSD17
SMA2 1.902430 1.934403 1.687430 1.725751 1.524354 1.683608 1.719378 1.506361
SMA3 1.749710 1.758602 1.560260 1.580278 1.609211 1.722226 1.535830 1.271252
SMA4 1.915628 1.757037 1.560983 1.585787 1.630961 1.512211 1.433255 1.331697
SMA5 1.684540 1.620641 1.607461 1.752090 1.660533 1.500787 1.359043 1.276761
SMA6 1.735760 1.765137 1.788670 1.687369 1.507831 1.481652 1.318751 1.197707

Again, pretty outstanding numbers.

The rough patch:

> SharpeMatrix[1:5, 4:8]
       runSD5   runSD6   runSD7   runSD8   runSD9
SMA2 1.905192 1.650921 1.667556 1.388061 1.454764
SMA3 1.495310 1.399240 1.378993 1.527004 1.661142
SMA4 1.591010 1.109749 1.041914 1.411985 1.538603
SMA5 1.288419 1.277330 1.555817 1.753903 1.685827
SMA6 1.278301 1.390989 1.569666 1.650900 1.777006

All Sharpe ratios are higher than 1, though some are below 1.5.

So, to conclude this post:

Was the replication using optimized parameters? Yes. However, those optimized parameters were found within a stable (and even strong) region. Furthermore, it isn’t as though the strategy exhibits poor risk-to-return metrics beyond those regions, either. Aside from raising the lookback period on both the moving average and the standard deviation to levels that no longer resemble the original replication, performance was solid to stellar.

Does this necessarily mean that there is nothing wrong with the strategy? No. It could be that the performance is an artifact of “observe the close, enter at the close” optimistic execution assumptions. For instance, quantstrat (the go-to backtest engine in R for more trading-oriented statistics) uses a next-bar execution method that defaults to the *next* day’s close (so if you look back over my quantstrat posts, I use prefer=”open” so as to get the open of the next bar, instead of its close). It could also be that VXMT itself is an instrument that isn’t very well known in the public sphere, either, seeing as how Yahoo finance barely has any data on it. Lastly, it could simply be the fact that although the risk to reward ratios seem amazing, many investors/mutual fund managers/etc. probably don’t want to think “I’m down 40-60% from my peak”, even though it’s arguably easier to adjust a strategy with a good reward to risk ratio with excess risk by adding cash (to use a cooking analogy, think about your favorite spice. Good in small quantities.), than it is to go and find leverage for a good reward to risk strategy with very small returns (not to mention incurring all the other risks that come with leverage to begin with, such as a 50% drawdown wiping out an account leveraged two to one).

However, to address the question of overfitting, through a modified technique from Jaekle and Tomasini (2009), these are the results I found.

Thanks for reading.

Note: I am a freelance consultant in quantitative analysis on topics related to this blog. If you have contract or full time roles available for proprietary research that could benefit from my skills, please contact me through my LinkedIn here.


To leave a comment for the author, please follow the link and comment on his blog: QuantStrat TradeR » R.


A New Volatility Strategy, And A Heuristic For Analyzing Robustness


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

This post is motivated by a discussion that arose when I tested a strategy by Frank of Trading The Odds (post here). One point, brought up by Tony Cooper of Double Digit Numerics, the original author of the paper that Trading The Odds now trades (I consider it a huge honor that my blog is read by authors of original trading strategies), is that my heatmap analysis only looked at cross-sectional performance, as opposed to performance over time–that is, performance that could have been outstanding over the course of the entire backtest could have been the result of a few lucky months. This is a fair point, which I hope this post will address in terms of a heuristic using both visual and analytical outputs.

The strategy for this post is the following, provided to me kindly by Mr. Helmuth Vollmeier (whose help in all my volatility-related investigations cannot be understated):

Consider VXV and VXMT, the three month and six month implied volatility on the usual SP500. Define contango as VXV/VXMT < 1, and backwardation vice versa. Additionally, take an SMA of said ratio. Go long VXX when the ratio is greater than 1 and above its SMA, and go long XIV when the converse holds. Or in my case, get in at the close when that happens and exit at the next day's close after the converse occurs (that is, my replication is slightly off due to using some rather simplistic coding for illustrative purposes).

In any case, here's the script for setting up the strategy, most of which is just downloading the data–the strategy itself is just a few lines of code:

require(downloader)
require(quantmod) #for Cl
require(TTR) #for SMA
require(PerformanceAnalytics)

download("http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vxvdailyprices.csv", 
         destfile="vxvData.csv")
download("http://www.cboe.com/publish/ScheduledTask/MktData/datahouse/vxmtdailyprices.csv", 
         destfile="vxmtData.csv")

vxv <- xts(read.zoo("vxvData.csv", header=TRUE, sep=",", format="%m/%d/%Y", skip=2))
vxmt <- xts(read.zoo("vxmtData.csv", header=TRUE, sep=",", format="%m/%d/%Y", skip=2))
ratio <- Cl(vxv)/Cl(vxmt)


download("https://dl.dropboxusercontent.com/s/jk6der1s5lxtcfy/XIVlong.TXT",
         destfile="longXIV.txt")

download("https://dl.dropboxusercontent.com/s/950x55x7jtm9x2q/VXXlong.TXT", 
         destfile="longVXX.txt") #requires downloader package

xiv <- xts(read.zoo("longXIV.txt", format="%Y-%m-%d", sep=",", header=TRUE))
vxx <- xts(read.zoo("longVXX.txt", format="%Y-%m-%d", sep=",", header=TRUE))

xiv <- merge(xiv, ratio, join='inner')
vxx <- merge(vxx, ratio, join='inner')
colnames(xiv)[5] <- colnames(vxx)[5] <- "ratio"

xivRets <- Return.calculate(Cl(xiv))
vxxRets <- Return.calculate(Cl(vxx))

retsList <- list()
count <- 1
for(i in 10:200) {
  ratioSMA <- SMA(ratio, n=i)
  vxxSig <- lag(ratio > 1 & ratio > ratioSMA)
  xivSig <- lag(ratio < 1 & ratio < ratioSMA)
  rets <- vxxSig*vxxRets + xivSig*xivRets
  colnames(rets) <- i
  retsList[[i]]  <- rets
  count <- count+1  
}
retsList <- do.call(cbind, retsList)
colnames(retsList) <- gsub("X", "", colnames(retsList))
charts.PerformanceSummary(retsList)
retsList <- retsList[!is.na(retsList[,191]),]
retsList <- retsList[-1,]

retsList <- retsList[-c(1538, 1539, 1450),] #for monthly aggregation, remove start of Dec 2014

About as straightforward as things get (the results, as we'll see, are solid, as well). In this case, I tested every SMA between a 10 day and a classic 200 day SMA. And since this strategy is a single-parameter strategy (unless you want to get into adjusting the ratio critical values up and down away from 1), instead of heatmaps, we'll suffice with basic scatter plots and line plots (which make things about as simple as they come).

The heuristic I decided upon was to take some PerfA functions (Return.annualized, SharpeRatio.annualized, for instance), and compare the rank of the average of their monthly ranks (that is, a two-layer rank, very similar to the process in Flexible Asset Allocation) to the aggregate, whole time-period rank. The idea here is that performance based on a few lucky months may have a high aggregate ranking, but a much lower monthly ranking, which would be reflected in a scatter plot. Ideally, the scatter plot would go from lower left to upper right in terms of ranks comparisons, with a correlation of 1, meaning that the strategy with the best overall return will have the best average monthly return rank, and so on down the list. This can also apply to the Sharpe ratio, and so on.

Here's my off-the-cuff implementation of such an idea:

rankComparison <- function(rets, perfAfun="Return.cumulative") {
  fun <- match.fun(perfAfun)
  monthlyFun <- apply.monthly(rets, fun)
  monthlyRank <- t(apply(monthlyFun, MARGIN=1, FUN=rank))
  meanMonthlyRank <- apply(monthlyRank, MARGIN=2, FUN=mean)
  rankMMR <- rank(meanMonthlyRank)
  
  aggFun <- fun(rets)
  aggFunRank <- rank(aggFun)
  plot(aggFunRank~rankMMR, main=perfAfun)
  print(cor(aggFunRank, meanMonthlyRank))
}

So, I get a chart and a correlation of average monthly ranks against a single-pass whole-period rank. Here are the results for cumulative returns and Sharpe ratio:

> rankComparison(retsList)
[1] 0.8485374

Basically, the interpretation is this: the outliers above and to the left of the main cluster can be interpreted as those having the “few lucky months”, while those to the lower right consistently perform somewhat well, but for whatever reason, are just stricken with bad luck. However, the critical result that we’re looking for is that the best overall performers (the highest aggregate rank) are also those with the most *consistent* performance (the highest monthly rank), which is generally what we see.

Furthermore, the correlation of .85 also lends credence that this is a robust strategy.

Here’s the process repeated with the annualized Sharpe ratio:

> rankComparison(retsList, perfAfun="SharpeRatio.annualized")
[1] 0.8647353

In other words, an even clearer relationship here, and again, we see that the best performers overall are also the best monthly performers, so we can feel safe in the robustness of the strategy.

So what’s the punchline? Well, the idea is now that we’ve established that the best results on aggregate are also the strongest results when analyzing the results across time, let’s look to see if the various rankings of risk and reward metrics reveal which configurations those are.

Here’s a chart of the aggregate rankings of annualized return (aka cumulative return), annualized Sharpe, MAR (return over max drawdown), and max drawdown.

aggReturns <- Return.annualized(retsList)
aggSharpe <- SharpeRatio.annualized(retsList)
aggMAR <- Return.annualized(retsList)/maxDrawdown(retsList)
aggDD <- maxDrawdown(retsList)

plot(rank(aggReturns)~as.numeric(colnames(aggReturns)), type="l", ylab="annualized returns rank", xlab="SMA",
     main="Risk and return rank comparison")
lines(rank(aggSharpe)~as.numeric(colnames(aggSharpe)), type="l", ylab="annualized Sharpe rank", xlab="SMA", col="blue")
lines(rank(aggMAR)~as.numeric(colnames(aggMAR)), type="l", ylab="Max return over max drawdown", xlab="SMA", col="red")
lines(rank(-aggDD)~as.numeric(colnames(aggDD)), type="l", ylab="max DD", xlab="SMA", col="green")
legend("bottomright", c("Return rank", "Sharpe rank", "MAR rank", "Drawdown rank"), pch=0, col=c("black", "blue", "red", "green"))

And the resulting plot itself:

So, looking at these results, here are some interpretations, moving from left to right:

At the lower end of the SMA, the results are just plain terrible. Sure, the drawdowns are lower, but the returns are in the basement.
The spike around the 50-day SMA makes me question if there is some sort of behavioral bias at work here.
Next, there’s a region with fairly solid performance between that and the 100-day SMA, but is surrounded on both sides by pretty abysmal performance.
Moving onto the 100-day SMA region, the annualized returns and Sharpe ratios are strong, but get the parameter estimation slightly wrong going forward and there’s a severe risk of incurring heavy drawdowns. The sudden improvement in the drawdown metric is also interesting. Again, is there some sort of bias towards some of the round numbers? (50, 100, etc.)
Lastly, there’s nothing particularly spectacular about the performances until we get to the high 100s and the 200 day SMA, at which point, we see a stable region of configurations with high ranks in all categories.

Let’s look at that region more closely:

truncRets <- retsList[,161:191]
stats <- data.frame(cbind(t(Return.annualized(truncRets)),
                 t(SharpeRatio.annualized(truncRets)),
                 t(maxDrawdown(truncRets))))
colnames(stats) <- c("A.Return", "A.Sharpe", "Worst_Drawdown")
stats$MAR <- stats[,1]/stats[,3]
stats <- round(stats, 3)

And the results:

> stats
    A.Return A.Sharpe Worst_Drawdown   MAR
170    0.729    1.562          0.427 1.709
171    0.723    1.547          0.427 1.693
172    0.723    1.548          0.427 1.694
173    0.709    1.518          0.427 1.661
174    0.711    1.522          0.427 1.665
175    0.711    1.522          0.427 1.665
176    0.711    1.522          0.427 1.665
177    0.711    1.522          0.427 1.665
178    0.696    1.481          0.427 1.631
179    0.667    1.418          0.427 1.563
180    0.677    1.441          0.427 1.586
181    0.677    1.441          0.427 1.586
182    0.677    1.441          0.427 1.586
183    0.675    1.437          0.427 1.582
184    0.738    1.591          0.427 1.729
185    0.760    1.637          0.403 1.886
186    0.794    1.714          0.403 1.970
187    0.798    1.721          0.403 1.978
188    0.802    1.731          0.403 1.990
189    0.823    1.775          0.403 2.042
190    0.823    1.774          0.403 2.041
191    0.823    1.774          0.403 2.041
192    0.819    1.765          0.403 2.031
193    0.822    1.772          0.403 2.040
194    0.832    1.792          0.403 2.063
195    0.832    1.792          0.403 2.063
196    0.802    1.723          0.403 1.989
197    0.810    1.741          0.403 2.009
198    0.782    1.677          0.403 1.941
199    0.781    1.673          0.403 1.937
200    0.779    1.670          0.403 1.934

So starting from SMA 186 through SMA 200, we see some fairly strong performance–returns in the high 70s to the low 80s, and MARs in the high 1s to low 2s. And since this is about a trading strategy, equity curves are, of course, obligatory. Here is what that looks like:

strongRets <- retsList[,177:191]
charts.PerformanceSummary(strongRets)

Basically, on aggregate, some very strong performance. However, it is certainly not *smooth* performance. New equity highs are followed by strong drawdowns, which are then followed by a recovery and new, higher equity highs.

To conclude (for the moment, I’ll have a new post on this next week with a slight wrinkle that gets even better results), I hope that I presented not only a simple but effective strategy, but also a simple but effective (if a bit time consuming, to do all the monthly computations on 191 return streams) heuristic suggested/implied by Tony Cooper of double digit numerics for analyzing the performance and robustness of your trading strategies. Certainly, while many professors and theorists elucidate on robustness (with plenty of math that makes stiff bagels look digestible), I believe not a lot of attention is actually paid to it in more common circles, using more intuitive methods. After all, if someone would want to be an unscrupulous individual selling trading systems or signals (instead of worrying about the strategy’s capacity for capital), it’s easy to show an overfit equity curve while making up some excuse so as to not reveal the (most likely overfit) strategy. One thing I’d hope this post inspires is for individuals to not only look at equity curves, but also at plots such as aggregate against average monthly (or higher frequencies, if the strategies are tested over mere months, for instance, such as intraday trading) metric rankings when performing parameter optimization.

Is this heuristic the most robust and advanced that can be done? Probably not. Would one need to employ even more advanced techniques if computing time becomes an issue? Probably (bootstrapping and sampling come to mind). Can this be built on? Of course. *Will* someone build on it? I certainly plan on revisiting this topic in the future.

Lastly, on the nature of the strategy itself: while Trading The Odds presented a strategy functioning on a very short time frame, I’m surprised that instead, we have a strategy whose parameters are on a much higher end of the numerical spectrum.

Thanks for reading.

NOTE: I am a freelance consultant in quantitative analysis on topics related to this blog. If you have contract or full time roles available for proprietary research that could benefit from my skills, please contact me through my LinkedIn here.


To leave a comment for the author, please follow the link and comment on his blog: QuantStrat TradeR » R.


Contextual Measurement Is a Game Changer


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)



Adding a context can change one's frame of reference:

Are you courteous? 
Are you courteous at work? 





Decontextualized questions tend to activate a self-presentation strategy and retrieve memories of past positioning of oneself (impression management). Such personality inventories can be completed without ever thinking about how we actually behave in real situations. The phrase "at work" may disrupt that process if we do not have a prepared statement concerning our workplace demeanor. Yet, a simple "at work" may not be sufficient, and we may be forced to become more concrete and operationally define what we mean by courteous workplace behavior (performance appraisal). Our measures are still self-reports, but the added specificity requires that we relive the events described by the question (episodic memory) rather than providing inferences concerning the possible causes of our behavior.

We have such a data set in R (verbal in the difR package). The data come from a study of verbal aggression triggered by some event: (S1) a bus fails to stop for me, (S2) I miss a train because a clerk gave faulty information, (S3) the grocery store closes just as I am about to enter, or (S4) the operator disconnects me when I used up my last 10 cents for a call. Obviously, the data were collected during the last millennium when there were still phone booths, but the final item can be updated as "The automated phone support system disconnects me after working my way through the entire menu of options" (which seems even more upsetting than the original wording).

Alright, we are angry. Now, we can respond by shouting, scolding or cursing, and these verbally aggressive behaviors can be real (do) or fantasy (want to). The factorial combination of 4 situations (S1, S2, S3, and S4) by 2 behavioral modes (Want and Do) by 3 actions (Shout, Scold and Curse) yields the 24 items of the contextualized personality questionnaire. Respondents are given each description and asked "yes" or "no" with "perhaps" as an intermediate point on what might be considered an ordinal scale. Our dataset collapses "yes" and "perhaps" to form a dichotomous scale and thus avoids the issue of whether "perhaps" is a true midpoint or another branch of a decision tree.

David Magis et al. provide a rather detailed analysis of this scale as a problem in differential item functioning (DIF) solved using the R package difR. However, I would like to suggest an alternative approach using nonnegative matrix factorization (NMF). My primary concern is scalability. I would like to see a more complete inventory of events that trigger verbal aggression and a more comprehensive set of possible actions. For example, we might begin with a much longer list of upsetting situations that are commonly encountered. We follow up by asking which situations they have experienced and recalling what they did in each situation. The result would be a much larger and sparser data matrix that might overburden a DIF analysis but that NMF could easily handle.

Hopefully, you can see the contrast between the two approaches. Here we have four contextual triggering events (bus, train, store, and phone) crossed with 6 different behaviors (want and do by curse, scold and shout). An item response model assumes that responses to each item reflect each individual's position on a continuous latent variable, in this case, verbal aggression as a personality trait. The more aggressive you are, the more likely you are to engage in more aggressive behaviors. Situations may be more or less aggression-evoking, but individuals maintain their relative standing on the aggression trait.

Nonnegative matrix factorization, on the other hand, searches for a decomposition of the observed data matrix within the constraint that all the matrices contain only nonnegative values. These nonnegative restrictions tend to reproduce the original data matrix by additive parts as if one were layering one component after the other on top of each other. As an illustration, let us say that our sample could be separated into the shouters, the scolders, and those who curse based on their preferred response regardless of the situation. These three components would be the building blocks and those who shout their curses would have their data rows formed by the overlay of shout and curse components. The analysis below will illustrate this point.
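Here is a small self-contained sketch of that layering idea (my own toy data, not the verbal aggression responses), using the same NMF package as the analysis at the end of this post:

library(NMF)

#two "pure" behavior profiles over six items (three shout items, three curse items)
shout <- c(1, 1, 1, 0, 0, 0)
curse <- c(0, 0, 0, 1, 1, 1)

#build respondents additively: shouters, cursers, and people who shout their curses
toy <- rbind(shouter1=shout, shouter2=shout,
             curser1=curse, curser2=curse,
             both1=shout + curse, both2=shout + curse)
colnames(toy) <- c(paste0("shout", 1:3), paste0("curse", 1:3))

fit <- nmf(toy, 2, method="lee", nrun=10, seed=1219)
round(coef(fit), 2)  #the two recovered item profiles (shout block vs. curse block)
round(basis(fit), 2) #how much of each profile is layered into each respondent
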

The NMF R code is presented at the end of this post. You are encouraged to copy and run the analysis after installing difR and NMF. I will limit my discussion to the following coefficient matrix showing the contribution of each of the 24 items after rescaling to fall on a scale from 0 to 1.


                Want to and   Store    Want to and   Want to    Do
                Do Scold      Closing  Do Shout      Curse      Curse
S2DoScold          1.00        0.19       0.00         0.00      0.00
S4WantScold        0.96        0.00       0.00         0.08      0.00
S4DoScold          0.95        0.00       0.00         0.00      0.11
S1DoScold          0.79        0.37       0.02         0.05      0.15

S3WantScold        0.00        1.00       0.00         0.08      0.00
S3DoScold          0.00        0.79       0.00         0.00      0.00
S3DoShout          0.00        0.15       0.14         0.00      0.00

S2WantShout        0.00        0.00       1.00         0.13      0.02
S1WantShout        0.00        0.05       0.91         0.17      0.04
S4WantShout        0.00        0.00       0.76         0.00      0.00
S1DoShout          0.00        0.12       0.74         0.00      0.00
S2DoShout          0.08        0.00       0.59         0.00      0.00
S4DoShout          0.10        0.00       0.39         0.00      0.00
S3WantShout        0.00        0.34       0.36         0.00      0.00

S1wantCurse        0.13        0.18       0.03         1.00      0.09
S2WantCurse        0.34        0.00       0.08         0.92      0.20
S3WantCurse        0.00        0.41       0.00         0.85      0.02
S2WantScold        0.59        0.00       0.00         0.73      0.00
S1WantScold        0.40        0.22       0.01         0.69      0.00
S4WantCurse        0.31        0.00       0.00         0.62      0.48

S1DoCurse          0.24        0.16       0.01         0.17      1.00
S2DoCurse          0.47        0.00       0.00         0.00      0.99
S4DoCurse          0.46        0.00       0.02         0.00      0.95
S3DoCurse          0.00        0.54       0.00         0.00      0.69

As you can see, I extracted five latent features (the columns of the above coefficient matrix). Although there are some indices in the NMF package to assist in determining the number of latent features, I followed the common practice of fitting a number of different solutions and picking the "best" of the lot. It is often informative to learn how the solution changes with the rank of the decomposition. In this case, similar structures were uncovered regardless of the number of latent features. References to a more complete discussion of this question can be found in an August 29th comment from a previous post on NMF.
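For those who want to try those indices, here is a hedged sketch of my own (assuming the nmfEstimateRank helper in the NMF package) that fits a range of ranks and plots the usual quality measures:

library(difR)
library(NMF)

data(verbal)
test <- verbal[, 1:24]
test <- test[rowSums(test) > 0, ] #drop all-zero respondents, as in the analysis below

#fit several ranks and plot quality measures (e.g., cophenetic correlation, RSS);
#nrun is kept small here only to limit run time
estim <- nmfEstimateRank(test, range=2:6, method="lee", nrun=10, seed=1219)
plot(estim)
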

Cursing was the preferred option across all the situations, and the last two columns reveal a decomposition of the data matrix with a concentration of respondents who do curse or want to curse regardless of the trigger. It should be noted that Store Closing (S3) tended to generate less cursing, as well as less scolding and shouting. Evidently there was a smaller group that was upset by the store closing, at least enough to scold. This is why the second latent feature is part of the decomposition; we need to layer store closing for those additional individuals who reacted more than the rest. Finally, we have two latent features for those who shout and those who scold across situations. As in principal component analysis, which is also a matrix factorization, one needs to note the size of the coefficients. For example, the middle latent feature reveals a higher contribution for wanting to shout over actually shouting.

Contextualized Measurement Alters the Response Generation Process

When we describe ourselves or others, we make use of the shared understandings that enable communication (meeting of minds or brain to brain transfer). These inferences concerning the causes of our own and others' behavior are always smoothed or fitted with context ignored, forgotten or never noticed. Statistical models of decontextualized self-reports reflect this organization imposed by the communication process. We believe that our behavior is driven by traits, and as a result, our responses can be fit with an item response model assuming latent traits.

Matrix factorization suggests a different model for contextualized self-reports. The possibilities explode with the introduction of context. Relatively small changes in the details create a flurry of new contexts and an accompanying surge in the alternative actions available. For instance, it makes a difference if the person closing the store as you are about to enter has the option of letting one more person in when you plead that it is for a quick purchase. The determining factor may be an emotional affordance, that is, an immediate perception that one is not valued. Moreover, the response to such a trigger will likely be specific to the situation and appropriately selected from a large repertoire of possible behaviors. Leaving the details out of the description only invites the respondents to fill in the blanks themselves.

You should be able to build on my somewhat limited example and extrapolate to a data matrix with many more situations and behaviors. As we saw here, individuals may have preferred responses that generalize over context (e.g., cursing tends to be overused) or perhaps there will be situation-specific sensitivity (e.g., store closings). NMF builds the data matrix from additive components that simultaneously cluster both the columns (situation-action pairings) and the rows (individuals). These components are latent, but they are not traits in the sense of dimensions over which individuals are rank ordered. Instead of differentiating dimensions, we have uncovered the building blocks that are layered to reproduce the data matrix.

Although we are not assuming an underlying dimension, we are open to the possibility. The row heatmap from the NMF may follow a characteristic Guttman scale pattern, but this is only one of many possible outcomes. The process might unfold as follows. One could expect a relationship between the context and response with some situations evoking more aggressive behaviors. We could then array the situations by increasing ability to evoke aggressive actions in the same way that items on an achievement test can be ordered by difficulty. Aggressiveness becomes a dimension when situations accumulate like correct answers on an exam with those displaying less aggressive behaviors encountering only the less aggression-evoking situations. Individuals become more aggressive by finding themselves in or by actively seeking increasingly more aggression-evoking situations.
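To make that Guttman pattern concrete, here is a small hand-made illustration (hypothetical data of my own, not the verbal aggression responses):

#a perfect Guttman (triangular) pattern: anyone who endorses a more
#aggression-evoking situation also endorses all of the milder ones
guttman <- matrix(c(0, 0, 0, 0,
                    1, 0, 0, 0,
                    1, 1, 0, 0,
                    1, 1, 1, 0,
                    1, 1, 1, 1),
                  nrow=5, byrow=TRUE,
                  dimnames=list(paste0("person", 1:5), paste0("situation", 1:4)))

#blue cells mark endorsed items, red cells mark items not endorsed
heatmap(guttman, Rowv=NA, Colv=NA, scale="none", col=c("red", "blue"))
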


R Code for the NMF Analysis of the Verbal Aggression Data Set

# access the verbal data from difR
library(difR)
data(verbal)
 
# extract the 24 items
test<-verbal[,1:24]
apply(test,2,table)
 
# remove rows with all 0s
none<-apply(test,1,sum)
table(none)
test<-test[none>0,]
 
library(NMF)
# set seed for nmf replication
set.seed(1219)
 
# 5 latent features chosen after
# examining several different solutions
fit<-nmf(test, 5, method="lee", nrun=20)
summary(fit)
basismap(fit)
coefmap(fit)
 
# scales coefficients and sorts
library(psych)
h<-coef(fit)
max_h<-apply(h,1,function(x) max(x))
h_scaled<-h/max_h
fa.sort(t(round(h_scaled,3)))
 
# hard clusters based on max value
W<-basis(fit)
W2<-max.col(W)
 
# profile clusters
table(W2)
t(aggregate(test, by=list(W2), mean))


To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Mapping IPv4 Address (with Hilbert curves) in R


(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

While there’s an unholy affinity in the infosec community with slapping IPv4 addresses onto a world map, that isn’t the only way to spatially visualize IP addresses. A better approach (when tabulation with bar charts, tables or other standard visualization techniques won’t do) is to map IPv4 addresses onto a Hilbert space-filling curve. You can get a good feel for how these work over at The Measurement Factory, which is where this image comes from:

mfhil

This paper [PDF] also is a good primer.

While TMF’s ipv4heatmap command-line software can crank out those visualizations really well, I wanted a way to generate them in R as we explore internet IP space at work. So, I adapted bits of their code to work in a ggplot context and took a stab at an ipv4heatmap package.

The functionality is currently pretty basic. Give ipv4heatmap a vector of IP addresses and you’ll get a heatmap of them. Feed in a CIDR block to boundingBoxFromCIDR and you’ll get a structure suitable for displaying with geom_rect. To get an idea of how it works, here’s a small example.

The following snippet of code reads in a cached copy of an IPv4 block list from blocklist.de and turns the IP addresses into a heatmap (which is mostly one color since there aren’t many blocks per class C). It then grabs the CIDR blocks for China and North Korea since, well, #CHINADPRKHACKSALLTHETHINGS according to “leading” IR firms and the US gov. It then overlays an alpha-filled rectangle over the map to see just how many points fall within those CIDRs.

devtools::install_github("vz-risk/ipv4heatmap")
library(ipv4heatmap)
library(ggplot2) #for the geom_rect overlays below
library(data.table)

# read in cached copy of blocklist.de IPs - orig URL http://www.blocklist.de/en/export.html
hm <- ipv4heatmap(readLines("http://dds.ec/data/all.txt"))

# read in CIDRs for China and North Korea
cn <- read.table("http://www.iwik.org/ipcountry/CN.cidr", skip=1)
kp <- read.table("http://www.iwik.org/ipcountry/KP.cidr", skip=1)

# make bounding boxes for the CIDRs

cn_boxes <- rbindlist(lapply(boundingBoxFromCIDR(cn$V1), data.frame))
kp_box <- data.frame(boundingBoxFromCIDR(kp$V1))

# overlay the bounding boxes for China onto the IPv4 addresses we read in and Hilbertized

gg <- hm$gg
gg <- gg + geom_rect(data=cn_boxes, 
                     aes(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax), 
                     fill="white", alpha=0.2)
gg <- gg + geom_rect(data=kp_box, 
                     aes(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax), 
                     fill="white", alpha=0.2)

gg

You’ll want to download that and open it up in a decent image program. The whole image is 4096x4096, so you can zoom in pretty well to see where evil hides itself.

If you find a cool use for ipv4heatmap definitely drop a note in the comments or on github. One thing we’ve noticed is that wrapping a series of individual images up in animation to see changes over time can be really interesting/illuminating.

One caveat: it uses the Boost libraries, so Windows R folk may need to jump through some hoops to get it going.

Countries Of The Internet

Since I was playing around with IPv4 heatmaps, I thought it might be neat to show how country IP address allocations fit on the “map”. So, I took the top 12 countries (by # of IPv4 addresses assigned), used ipv4heatmap to color in their bounding boxes and then whipped up some javascript to let you see/explore the fragmented allocation landscape we live in.

There’s also a non-framed version of that available. The 2D canvas scaling may be off in some browsers, but not by much. Shift-click once in the image to compensate if it’s cut off at all.

The amount of “micro-allocation” (my term) really surprised me. While I “knew” it was this way, seeing it gives you a whole new perspective.

The more I’ve worked with routing, IP & DNS data over the years, the more I’m amazed that anything on the internet works at all.

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security.


Top 77 R posts for 2014 (+R jobs)


(   if(like) { Please(share, this_post); print(“Thanks!”) }   )

The site R-bloggers.com is now 5 years old. It strives to be an (unofficial) online journal of the R statistical programming environment, written by bloggers who agreed to contribute their R articles to the site, to be read by the R community.

In this post I wish to celebrate R-bloggers’ 5th birth-month by sharing with you:

  1. Links to the top 77 most read R posts of 2014
  2. Statistics on “how well” R-bloggers did this year
  3. A list of top open R jobs for the beginning of 2015

1. Top 77 R posts for 2014

Enjoy:

  1. Using apply, sapply, lapply in R
  2. Basics of Histograms
  3. Box-plot with R – Tutorial
  4. Adding a legend to a plot
  5. Read Excel files from R
  6. In-depth introduction to machine learning in 15 hours of expert videos
  7. Select operations on R data frames
  8. Setting graph margins in R using the par() function and lots of cow milk
  9. R Function of the Day: tapply
  10. Prediction model for the FIFA World Cup 2014
  11. ANOVA and Tukey’s test on R
  12. Model Validation: Interpreting Residual Plots
  13. ggplot2: Cheatsheet for Visualizing Distributions
  14. How to plot a graph in R
  15. Using R: barplot with ggplot2
  16. Color Palettes in R
  17. A million ways to connect R and Excel
  18. Merging Multiple Data Files into One Data Frame
  19. How to become a data scientist in 8 easy steps: the infographic
  20. ROC curves and classification
  21. A Brief Tour of the Trees and Forests
  22. Polynomial regression techniques
  23. To attach() or not attach(): that is the question
  24. R skills attract the highest salaries
  25. The R apply function – a tutorial with examples
  26. Melt
  27. Two sample Student’s t-test #1
  28. High Resolution Figures in R
  29. Datasets to Practice Your Data Mining
  30. Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R
  31. Basic Introduction to ggplot2
  32. Using R: Two plots of principal component analysis
  33. Reorder factor levels
  34. Fitting a Model by Maximum Likelihood
  35. Pivot tables in R
  36. How do I Create the Identity Matrix in R?
  37. A practical introduction to garch modeling
  38. Download and Install R in Ubuntu
  39. Paired Student’s t-test
  40. dplyr: A gamechanger for data manipulation in R
  41. Computing and visualizing PCA in R
  42. Plotting Time Series data using ggplot2
  43. Automatically Save Your Plots to a Folder
  44. Environments in R
  45. Hands-on dplyr tutorial for faster data manipulation in R
  46. Multiple Y-axis in a R plot
  47. Make R speak SQL with sqldf
  48. MySQL and R
  49. paste, paste0, and sprintf
  50. Creating surface plots
  51. R: Using RColorBrewer to colour your figures in R
  52. Five ways to handle Big Data in R
  53. Free books on statistical learning
  54. Export R Results Tables to Excel – Please don’t kick me out of your club
  55. Linear mixed models in R
  56. Running R on an iPhone/iPad with RStudio
  57. When to Use Stacked Barcharts?
  58. Date Formats in R
  59. Making matrices with zeros and ones
  60. Converting a list to a data frame
  61. A Fast Intro to PLYR for R
  62. Using R: common errors in table import
  63. Text Mining the Complete Works of William Shakespeare
  64. Mastering Matrices
  65. Summarising data using box and whisker plots
  66. R : NA vs. NULL
  67. Scatterplot Matrices
  68. Getting Started with Mixed Effect Models in R
  69. The Fourier Transform, explained in one sentence
  70. Import/Export data to and from xlsx files
  71. An R “meta” book
  72. R Function of the Day: table
  73. Facebook teaches you exploratory data analysis with R
  74. Simple Linear Regression
  75. Regular expressions in R vs RStudio
  76. Fitting distributions with R
  77. Drawing heatmaps in R

2. Statistics – how well did R-bloggers do this year?

There are several metrics one can consider when evaluating the success of a website. I’ll present a few of them here and will begin by talking about the visitors to the site.

This year, the site was visited by 2.7 million users, in 7 million sessions with 11.6 million pageviews. People have surfed the site from over 230 countries, with the greatest number of visitors coming from the United States (38%), followed by the United Kingdom (6.7%), Germany (5.5%), India (5.1%), Canada (4%), France (2.9%), and other countries. 62% of the site’s visits came from returning users. R-bloggers has between 15,000 and 20,000 RSS/e-mail subscribers.

The site is aggregating posts from 569 bloggers, and there are over a hundred more which I will add in the next couple of months.

I had to upgrade the site’s server and software several times this year to manage the increase in load, and I believe that the site is now more stable than ever.

I gave an interview about R-bloggers at useR!2014, which you might be interested in:

I am very happy to see that R-bloggers continues to succeed in offering a real service to the global R users community – thank you all for your generosity, professionalism, kindness, and love.

3. Top 10 R jobs from 2014

This year I started a new site for R users to share and find new jobs, called www.R-users.com.

If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds).

If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

Below are the top 10 open jobs (you can see new jobs at R-users.com)

  1. Senior programmer / business analyst for customized solutions (1,497 views)
  2. MatrixBI Data scientist (1,308 views)
  3. Data genius, modeller, creative analyst (1,222 views)
  4. Data Scientist for developing the algorithmic core of Supersonic (1,154 views)
  5. Content Developer-R (1,054 views)
  6. Looking for a partner to code an algorithm which will trade pairs in R (834 views)
  7. Statistician for a six-month project contract in Milano, Italy. (773 views)
  8. R programmer for spatial data – Germany (738 views)
  9. Team leader Data Analysis of wind turbine data (701 views)
  10. Sr. PRedictive Modeler (688 views)


Some Applications of Item Response Theory in R


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
The typical introduction to item response theory (IRT) positions the technique as a form of curve fitting. We believe that a latent continuous variable is responsible for the observed dichotomous or polytomous responses to a set of items (e.g., multiple choice questions on an exam or rating scales from a survey). Literally, once I know your latent score, I can predict your observed responses to all the items. Our task is to estimate that function with one, two or three parameters after determining that the latent trait is unidimensional. In the process of measuring individuals, we gather information about the items. Those one, two or three parameters are assessments of each item's difficulty, discriminability and sensitivity to noise or guessing.

All this has been translated into R by William Revelle, and as a measurement task, our work is done. We have an estimate of each individual's latent position on an underlying continuum defined as whatever determines the item responses. Along the way, we discover which items require more of the latent trait in order to achieve a favorable response (e.g., the difficulty of answering correctly or the extremity of the item and/or the response). We can measure ability with achievement items, political ideology with an opinion survey, and brand perceptions with a list of satisfaction ratings.
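As a concrete illustration of that curve fitting (my own sketch using the ltm package and its bundled LSAT item responses, rather than the psych tools or a marketing data set), the one- and two-parameter models can be fit as follows:

library(ltm)

data(LSAT) #five dichotomously scored test items bundled with ltm

raschFit <- rasch(LSAT)    #1-parameter model: item difficulty only
twoplFit <- ltm(LSAT ~ z1) #2-parameter model: difficulty plus discrimination

coef(twoplFit)                           #item difficulty and discrimination estimates
head(factor.scores(twoplFit)$score.dat)  #latent trait estimates for observed response patterns
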

To be clear, these scales are meant to differentiate among individuals. For example, the R statistical programming language has an underlying structure that orders the learning process so that the more complex concepts are mastered after the simpler material. In this case, learning is shaped by the difficulty of the subject matter with the more demanding content reusing or building onto what has already been learned. When the constraints are sufficient, individuals and their mastery can be arrayed on a common scale. At one end of the continuum are complex concepts that only the more advanced students master. The easier stuff falls toward the bottom of the scale with topics that almost everyone knows. When you take an R programming achievement test, your score tells me how well you performed relative to others who answered similar questions (see normed-referenced testing).

The same reasoning applies to IRT analysis of political ideology (e.g., the R package basicspace). Opinions tend to follow a predictable path from liberal to conservative so that only a limited number of all possible configurations are actually observed. As shown below, legislative voting follows such a pattern with Senators (dark line) and Representatives (light line) separated along the liberal-to-conservative dimension based on their votes in the 113th Congress. Although not shown, all the specific votes can also be placed on this same scale so that Pryor, Landrieu, Baucus and Hagan (in blue) are located toward the right because their votes on various bills and resolutions agreed more often with Republicans (in red). As with achievement testing, an order is imposed on the likely responses of objects so that the response space in p dimensions (where p equals the number of behaviors, items or votes) is reduced to a one-dimensional seriation of both votes and voters on the same scale.

My last example comes from marketing research where brand perceptions tend to be organized as a pattern of strengths and weaknesses defined by the product category. In a previous post, I showed how preference for Subway fast food restaurants is associated with a specific ordering of product and service attribute ratings. Many believe that Subway offers fresh and healthy food. Fewer like the taste or feel it is filling. Fewer still are happy with the ordering or preparation, and even more dislike the menu and the seating arrangements. These perceptions have an order so that if you are satisfied with the menu then you are likely to be satisfied with the taste and the freshness/healthiness of the food. Just as issues can be ordered from liberal to conservative, brand perceptions reflect the strengths and weaknesses promised by the brand's positioning. Subway promises fresh and healthy food but not prepackaged and waiting under the heat lamp for easy bagging. The mean levels of our satisfaction ratings will be consistent with those brand priorities.

We can look at the same data from another perspective. Heatmaps summarize the triangular pattern observed in data matrices that can be modeled by IRT. In a second post analyzing the Subway data, I described the following heatmap showing the results from the 8-item checklist of features associated with the brand. Each row is a different respondent with the blue indicating that the item was checked and red telling us that the item was not checked. As one moves down the heatmap, the overall perceptions become more positive as additional attributes are endorsed. Positive brand perceptions are incremental, but the increments are not more of the same. Tasty and filling gets added to healthy and fresh. That is, greater satisfaction with Subway is reflected in the willingness to endorse additional components of the brand promise. The heatmap is triangular so that those who are happy with the menu are likely to be at least as satisfied with all the attributes to the right.

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Eight New Ideas From Data Visualization Experts

(This article was first published on Plotly, and kindly contributed to R-bloggers)
This post summarizes and visualizes eight key ideas we’ve heard from data visualization experts. Check out our first Case Study to learn more about using Plotly Enterprise on-premise, on your servers. To get started on free online graphing like in this post, check out our tutorials.


Make Interactive Graphs

Pictures of graphs in PowerPoints, dashboards, and emails can be dull. Viewers get more value when they can explore the data with their mouse: zoom, filter, drill down, and study the graphs. Plotly uses D3.js, so all your graphs are interactive. The graph below models the historical temperature record and the associated temperature changes contributed by each factor.

Make Graphs With IPython Widgets In Plotly & Domino

Our friends at Domino wrote a blog post showing how you or a developer on your team can use Domino, Plotly’s APIs, and IPython Notebooks to add sliders, filters, and widgets to graphs and take exploration in a new direction. See our tutorial to learn more.

Reproducible Research with Plotly & Overleaf

Institutional memory is crucial for research. But data is easily lost if it’s on a thumbdrive, in an email, or on a desktop. The team at Overleaf is enabling reproducible, online publication. You can import Plotly graphs into Overleaf papers and write together.

Plotly graphs are mobile-optimized, reproducible, downloadable, and web-based. Just add the URL in your presentation or Overleaf project to share it all, as with the climate graph above.

Use Statistical Graphs

Graphing pros love using statistical graphs like histograms, heatmaps, 2D histograms, and box plots to investigate and explain data. Below, we’re showing a log axis plot, a boxplot, and a histogram. The numbers are Facebook users per country. Curious? We have tutorials.
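As a rough illustration of what those calls look like from R (this sketch assumes the current plotly R package API, which differs from the interface available when this post was written, and uses made-up data in place of the Facebook counts):

library(plotly)

x <- rlnorm(500)                         # skewed toy data standing in for users per country
plot_ly(x = ~x, type = "histogram")      # histogram
plot_ly(y = ~x, type = "box")            # box plot
plot_ly(z = ~volcano, type = "heatmap")  # heatmap of R's built-in volcano matrix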

Use 3D Graphs

Below see the prestige, education, and income of a few professions, sorted by gender. 3D graphing enables a whole new dimension of interactivity. The points are projected on the outside of the graph. Click and hold to flip the plot or toggle to zoom. Click here to visit the graph. Or take a 3D graphing tutorial.
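A minimal 3D scatter sketch from R (again assuming the current plotly package API; the columns are toy stand-ins for the prestige, education and income values plotted above):

library(plotly)

d <- data.frame(education = rnorm(50, mean = 12, sd = 2),
                income    = rnorm(50, mean = 40, sd = 10),
                prestige  = rnorm(50, mean = 50, sd = 15))
plot_ly(d, x = ~education, y = ~income, z = ~prestige,
        type = "scatter3d", mode = "markers")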

Embed Interactive Graphs With Dashboards

In a fast-moving world, it’s crucial to get the most recent data. That’s why we make it easy to embed updating graphs in dashboards, like the temperature graph below of San Francisco and Montréal (live version here). See our tutorials on updating graphs, interactive dashboards or graphing from databases.

Customize Interactive Graphs With JavaScript

For further customizations, use our JavaScript API. You (or a developer on your team) can build custom controls that change anything about an embedded Plotly graph.

Embed Interactive Graphs With Shiny

If you are an R user, you can render and embed interactive ggplot2 graphs in Shiny with Plotly. See our tutorial.
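A minimal sketch of that route, assuming the current plotly R package (ggplotly(), plotlyOutput() and renderPlotly()); this is not the code from the linked tutorial:

library(shiny)
library(ggplot2)
library(plotly)

ui <- fluidPage(plotlyOutput("scatter"))

server <- function(input, output) {
  output$scatter <- renderPlotly({
    p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point()
    ggplotly(p)  # convert the static ggplot2 object into an interactive Plotly graph
  })
}

shinyApp(ui, server)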

If you liked what you read, please consider sharing. We’re at feedback at plot dot ly, and @plotlygraphs.

To leave a comment for the author, please follow the link and comment on his blog: Plotly.


Extracting the original data from a heatmap image

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

The paper Analysis of the Linux Kernel Evolution Using Code Clone Coverage analysed 136 versions of Linux (from 1.0 to 2.6.18.3) and calculated the amount of source code that was shared, going forward, between each pair of these versions. When I saw the heatmap at the end of the paper (see below) I knew it had to appear in my book. The paper was published in 2007, and I knew from experience that the probability of seven-year-old data still being available was small, but it looked so interesting that I had to try. I emailed the authors (Simone Livieri, Yoshiki Higo, Makoto Matsushita and Katsuro Inoue) and received a reply from Makoto Matsushita saying that he had searched for the data and had been able to find the original images created for the paper, which he kindly sent me.

Shared code between Linux releases

I was confident that I could reverse engineer the original values from the png image and that is what I have just done (I have previously reverse engineered the points in a pdf plot by interpreting the pdf commands to figure out relative page locations).

The idea I had was to find the x/y coordinates of the edge of the staircase running from top left to bottom right. Those black lines appear to complicate things, but the RGB representation of black follows the same pattern as white, i.e., all three components are equal (0 for black and 1 for white). All I had to do was locate the first pixel whose three RGB components were not all equal, which proved to be remarkably easy to do using R’s vector operations.
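A minimal sketch of that step (not the author's original script; the file name is hypothetical): read the png into an array and, for each row, find the first pixel whose three color components are not all equal.

library(png)

img <- readPNG("linux-clone-heatmap.png")  # hypothetical local copy of the original image
r <- img[, , 1]; g <- img[, , 2]; b <- img[, , 3]

colored <- (r != g) | (g != b)             # TRUE wherever a pixel is not a shade of gray
first_col <- apply(colored, 1, function(row) {
  w <- which(row)
  if (length(w) > 0) w[1] else NA          # column of the first colored pixel in this row
})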

After reducing duplicate sequences to a single item, I had the x/y coordinates of the colored rectangle for each version pair; extracting an RGB value for each pair of Linux releases was one R statement. Next I needed to map the RGB values onto the zero-to-one scale denoting the amount of shared Linux source. The color scale under the heatmap contains the information needed, and with some trial and error I isolated a vector of RGB pixels from this scale. Passing the offset of each extracted RGB value on this scale to mapvalues (in the plyr package) converted the RGB values to shared-code fractions.
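A hypothetical sketch of that color-to-value step (the pixel coordinates are invented; in practice they come from the trial and error just described, and I use match() where the post used mapvalues()):

library(png)
img <- readPNG("linux-clone-heatmap.png")  # hypothetical local copy, as above

legend_row  <- 900                         # assumed row running through the color scale
legend_cols <- 50:549                      # assumed horizontal extent of the color scale
scale_rgb <- rgb(img[legend_row, legend_cols, 1],
                 img[legend_row, legend_cols, 2],
                 img[legend_row, legend_cols, 3])

i <- 200; j <- 300                         # assumed coordinates of one heatmap cell
cell_rgb <- rgb(img[i, j, 1], img[i, j, 2], img[i, j, 3])
shared <- (match(cell_rgb, scale_rgb) - 1) / (length(scale_rgb) - 1)
# NA here means the cell color is not exactly on the scale (e.g. anti-aliasing)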

The extracted array has 130 rows/columns, which means information on 5 versions has been lost (no history is given for the last version). At the moment I am not too bothered; most of the data has been extracted.

Here is the result of calling the R function readPNG (from the png package) to read the original file, mapping the created array of RGB values to amount of Linux source in each version pair and calling the function image to display this array (I have gone for maximum color impact; the code has no for loops):

Heatmap of extracted data
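The display step might look something like this (random placeholder values stand in for the extracted 130-by-130 array):

shared_mat <- matrix(runif(130 * 130), nrow = 130)  # placeholder for the extracted values
image(1:130, 1:130, shared_mat, col = heat.colors(64),
      xlab = "Linux release", ylab = "Linux release")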

The original varied the width of the staircase, perhaps by some measure of the amount of source code. I have not done that.

It’s suspicious that the letter A is not visible in some form. It’s embedded in the original data, and I would have expected a couple of hits on that black outline.

The above overview has not bored the reader with my stupidities that occurred along the way.

If you improve the code to handle other heatmap data extraction problems, please share the code.

To leave a comment for the author, please follow the link and comment on his blog: The Shape of Code » R.


Color extraction with R

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Given all the attention the internet has given to the colors of this dress, I thought it would be interesting to look at the capabilities for extracting colors in R.

R has a number of packages for importing images in various file formats, including PNG, JPG, TIFF, and BMP. (The readbitmap package works with all of these.) In each case, the result is a 3-dimensional R array containing a 2-D image layer for each of the color channels (for example red, green and blue for color images). You can then manipulate the array as ordinary data to extract color information. For example, Derek Jones used the readPNG function to extract data from published heatmaps when the source data has been lost. 
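For example, reading an image into such an array takes one call (a small sketch; the file name is hypothetical):

library(readbitmap)

img <- read.bitmap("dress.png")  # hypothetical local copy of the image
dim(img)                         # rows, columns and the number of color channels (3 for RGB, 4 for RGBA)
img[1, 1, ]                      # the color components of the top-left pixel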

Photographs typically contain thousands or even millions of unique colors, but a very human question is: what are the major colors in the image? In other words, what is the image's palette? This is a difficult question to answer, but Russell Dinnage used R's k-means clustering capabilities to extract the 3 (or 4 or 6 — you decide) most prominent colors from an image, without including almost-identical shades of the same color and while filtering out low-saturation background colors (like gray shadows). Without any supervision, his script can easily extract 6 colors from the tail of this beautiful peacock spider. In fact, his script generates five representative palettes:

Peacock

I used a similar process to extract the 3 major colors from "that dress":

Dress Dress palette

So I guess it was black and blue after all! (Plus a heavy dose of white in the background)
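A rough sketch of the general k-means approach (this is not Russell Dinnage's script, which also merges near-identical shades and filters background colors; the file name is hypothetical):

library(jpeg)

img <- readJPEG("peacock-spider.jpg")     # hypothetical local copy of the photo
px  <- data.frame(r = as.vector(img[, , 1]),
                  g = as.vector(img[, , 2]),
                  b = as.vector(img[, , 3]))

set.seed(1)
km  <- kmeans(px, centers = 6)            # 6 cluster centers = 6 representative colors
pal <- rgb(km$centers[, "r"], km$centers[, "g"], km$centers[, "b"])
barplot(rep(1, length(pal)), col = pal, border = NA, axes = FALSE)  # quick look at the palette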

Christophe Cariou used a similar palette-extraction process in R and applied it to every cover of Wired magazine since 1993. For each cover he extracted the 4 major colors, and then represented them all on this beautiful scatter diagram arranged on the color wheel:

Wired covers

You can see the individual colors for the last 58 Wired covers here.


To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Extracting Heatmap

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)
Inspired by this tweet, I wanted to try to do something similar in JavaScript.

"Very cool hack: Extracting the original data from a heatmap image with R vector ops #rstats http://t.co/Lbi6FCXdrI pic.twitter.com/LCabkMGjXY" — Gregory Piatetsky (@kdnuggets) March 6, 2015

Fortunately, I had this old post Chart from R + Color from Javascript to serve as a reference, and I got lots of help from these

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.


Interactive Maps for John Snow’s Cholera Data

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation, including maps. And we did have a brief introduction to the leaflet package,

devtools::install_github("rstudio/leaflet")
require(leaflet)

To see what can be done with that package, we will use John Snow’s cholera dataset one more time, discussed in previous posts (one with a visualisation on a Google Maps background, and a second one on an OpenStreetMap background),

library(sp)
library(rgdal)
library(maptools)
setwd("/cholera/")
deaths <- readShapePoints("Cholera_Deaths")
df_deaths <- data.frame(deaths@coords)
coordinates(df_deaths)=~coords.x1+coords.x2
proj4string(df_deaths)=CRS("+init=epsg:27700") 
df_deaths = spTransform(df_deaths,CRS("+proj=longlat +datum=WGS84"))
df=data.frame(df_deaths@coords)
lng=df$coords.x1
lat=df$coords.x2

Once the leaflet package is installed, we can use it at the RStudio console (which is what we will do here), within R Markdown documents, or within Shiny applications. But because of restrictions on this blog (the rules of hypotheses.org), there will only be copies of my screen here. If you run the code in RStudio, you will get interactive maps in the viewer window.

First step. To load a map, centered initially in London, use

m = leaflet()%>% addTiles() 
m %>% fitBounds(-.141,  51.511, -.133, 51.516)

In the viewer window of RStudio, it is just like OpenStreetMap: we can zoom in or out (with the standard + and – in the top left corner).
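As an aside, the same map can be embedded in a Shiny application, as mentioned above. A minimal sketch (not part of the original post), using leafletOutput() and renderLeaflet() from the leaflet package:

library(shiny)
library(leaflet)

ui <- fluidPage(leafletOutput("cholera_map"))

server <- function(input, output) {
  output$cholera_map <- renderLeaflet({
    leaflet() %>% addTiles() %>% fitBounds(-.141, 51.511, -.133, 51.516)
  })
}

shinyApp(ui, server)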

And we can add additional material, such as the location of the deaths from cholera (since we now have the same coordinate representation system here)

rd=.5
op=.8
clr="blue"
m = leaflet() %>% addTiles()
m %>% addCircles(lng,lat, radius = rd,opacity=op,col=clr)

We can also add some heatmap.

library(KernSmooth)   # bkde2D comes from the KernSmooth package
X=cbind(lng,lat)
kde2d <- bkde2D(X, bandwidth=c(bw.ucv(X[,1]),bw.ucv(X[,2])))

But there is no heatmap function (so far) so we have to do it manually,

x=kde2d$x1
y=kde2d$x2
z=kde2d$fhat
CL=contourLines(x , y , z)

We now have a list that contains lists of polygons corresponding to isodensity curves. To visualise one of them, use

m = leaflet() %>% addTiles() 
m %>% addPolygons(CL[[5]]$x,CL[[5]]$y,fillColor = "red", stroke = FALSE)

Of course, we can get the points and the polygon at the same time

m = leaflet() %>% addTiles() 
m %>% addCircles(lng,lat, radius = rd,opacity=op,col=clr) %>%
  addPolygons(CL[[5]]$x,CL[[5]]$y,fillColor = "red", stroke = FALSE)

We can try to plot many polygons on the map, as different layers, to visualise some kind of heatmap, but it does not work that well

m = leaflet() %>% addTiles() 
m %>% addCircles(lng,lat, radius = rd,opacity=op,col=clr) %>%
  addPolygons(CL[[1]]$x,CL[[1]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[3]]$x,CL[[3]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[5]]$x,CL[[5]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[7]]$x,CL[[7]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[9]]$x,CL[[9]]$y,fillColor = "red", stroke = FALSE)

An alternative is to highlight (only) the contour of the polygon

m = leaflet() %>% addTiles() 
m %>% addCircles(lng,lat, radius = rd,opacity=op,col=clr) %>%
  addPolylines(CL[[1]]$x,CL[[1]]$y,color = "red") %>%
  addPolylines(CL[[5]]$x,CL[[5]]$y,color = "red") %>%
  addPolylines(CL[[8]]$x,CL[[8]]$y,color = "red")

Again, the goal of those functions is to get maps we can zoom in and out of (what I call interactive)

And we can get, at the same time, the contours as well as polygons filled with a light red color

m = leaflet() %>% addTiles() 
m %>% addCircles(lng,lat, radius = rd,opacity=op,col=clr) %>%
  addPolygons(CL[[1]]$x,CL[[1]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[3]]$x,CL[[3]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[5]]$x,CL[[5]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[7]]$x,CL[[7]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolygons(CL[[9]]$x,CL[[9]]$y,fillColor = "red", stroke = FALSE) %>%
  addPolylines(CL[[1]]$x,CL[[1]]$y,color = "red") %>%
  addPolylines(CL[[5]]$x,CL[[5]]$y,color = "red") %>%
  addPolylines(CL[[8]]$x,CL[[8]]$y,color = "red")

Copies of my RStudio screen are nice, but visualising the interactive map is just awesome. I will try to find a way to load that map on my blog, but it might be difficult (so far, it is possible to visualise it on http://rpubs.com/freakonometrics/)

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


Another Interactive Map for the Cholera Dataset

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Following my previous post, François (aka @FrancoisKeck) posted a comment mentioning another package I could use to get an interactive map, the rleafmap package. And the heatmap was easy to include here.

This time, we do not use openstreetmap. The first part is still the same, to get the data,

> require(rleafmap)
> library(sp)
> library(rgdal)
> library(maptools)
> library(KernSmooth)
> setwd("/home/arthur/Documents/")
> deaths <- readShapePoints("Cholera_Deaths")
> df_deaths <- data.frame(deaths@coords)
> coordinates(df_deaths)=~coords.x1+coords.x2
> proj4string(df_deaths)=CRS("+init=epsg:27700") 
> df_deaths = spTransform(df_deaths,CRS("+proj=longlat +datum=WGS84"))
> df=data.frame(df_deaths@coords)

To get a first visualisation, use

> stamen_bm <- basemap("stamen.toner")
> j_snow <- spLayer(df_deaths, stroke = FALSE)
> writeMap(stamen_bm, j_snow, width = 1000, height = 750, setView = c( mean(df[,1]),mean(df[,2])), setZoom = 14)

and again, using the + and the – in the top left area, we can zoom in or out. Or we can do it manually,

> writeMap(stamen_bm, j_snow, width = 1000, height = 750, setView = c( mean(df[,1]),mean(df[,2])), setZoom = 16)

To get the heatmap, use

> library(spatstat)
> library(maptools)

> win <- owin(xrange = bbox(df_deaths)[1,] + c(-0.01,0.01), yrange = bbox(df_deaths)[2,] + c(-0.01,0.01))
> df_deaths_ppp <- ppp(coordinates(df_deaths)[,1],  coordinates(df_deaths)[,2], window = win)
> 
> df_deaths_ppp_d <- density.ppp(df_deaths_ppp, 
  sigma = min(bw.ucv(df[,1]),bw.ucv(df[,2])))
 
> df_deaths_d <- as.SpatialGridDataFrame.im(df_deaths_ppp_d)
> df_deaths_d$v[df_deaths_d$v < 10^3] <- NA

> stamen_bm <- basemap("stamen.toner")
> mapquest_bm <- basemap("mapquest.map")
 
> j_snow <- spLayer(df_deaths, stroke = FALSE)
> df_deaths_den <- spLayer(df_deaths_d, layer = "v", cells.alpha = seq(0.1, 0.8, length.out = 12))
> my_ui <- ui(layers = "topright")

> writeMap(stamen_bm, mapquest_bm, j_snow, df_deaths_den, width = 1000, height = 750, interface = my_ui, setView = c( mean(df[,1]),mean(df[,2])), setZoom = 16)

The amazing thing here is the set of options in the top right corner. For instance, we can remove some layers, e.g. the points

or to change the background

To get an html file, instead of a standard visualisation in RStudio, use

> writeMap(stamen_bm, mapquest_bm, j_snow, df_deaths_den, width = 450, height = 350, interface = my_ui, setView = c( mean(df[,1]),mean(df[,2])), setZoom = 16, directView ="viewer")

which will generate the html file above (as well as some additional files, actually). Awesome, isn’t it?

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.
