Quantcast
Channel: Search Results for “heatmap”– R-bloggers
Viewing all articles
Browse latest Browse all 152

Presidential Debates 2012

$
0
0

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

I have been playing with the beta version of qdap utilizing the presidential debates as a data set. qdap is in a beta phase lacking documentation though I’m getting there. In previous blog posts (presidential debate 1 LINK and VP debate LINK) I demonstrated some of the capabilities of qdap. I wanted to further show some of qdap’s capabilities while seeking to provide information about the debates.

In previous posts readers made comments or emailed regarding functionality of qdap . This was extremely helpful in working out bugs that arise on various operating systems. If you have praise or methods you used to run the qdap scripts please leave a comment saying so. However, if you are having difficulty please file an issue at qdap’s home, GitHub (LINK).

In this post we’ll be looking at:

1. A faceted gantt plot for each of the speeches via gantt_plot
2. Various word statistics via word_stats
3. A venn diagram showing the overlap in word usage via trans.venn
4. A dissimilarity matrix indicating closeness in speech via dissimilarity
5. iGraph Visualization of dissimilarity

Installing qdap (note: qdap was updated 10/23/12)
Here’s the github link for qdap (LINK) and install instructions

# install.packages("devtools")
library(devtools)
install_github("qdap", "trinker")

Reading in the data sets and Cleaning

library(qdap) #load qdap
# download transcript of the debate to working directory
url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)   

# load multiple files with read transcript and assign to working directory
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))
dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue"))
dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)
dat2$dialogue <- qprep(dat2$dialogue)
dat3$dialogue <- qprep(dat3$dialogue)

# Split each sentece into it's own line
dat1b <- sentSplit(dat1, "dialogue", stem.col=FALSE) 
dat1$person <- factor(dat1$person , levels = qcv(ROMNEY, OBAMA, LEHRER))
dat2b <- sentSplit(dat2, "dialogue", stem.col=FALSE)  
dat3b <- sentSplit(dat3, "dialogue", stem.col=FALSE) 

# Create a large data frame by the three debates times
L1 <- list(dat1b, dat2b, dat3b)
L1 <- lapply(seq_along(L1), function(i) data.frame(L1[[i]], time = paste("time", i)))
dat4 <- do.call(rbind, L1)

#view a truncated version of the data (see also htruncdf)
truncdf(dat4)

Faceted Gantt Plot

#reorder factor levels
dat4$person <- factor(dat4$person, 
    levels=qcv(terms="OBAMA ROMNEY CROWLEY LEHRER QUESTION SCHIEFFER"))

with(dat4, gantt_plot(dialogue, person, time, xlab = "duration(words)", 
    x.tick=TRUE, minor.line.freq = NULL, major.line.freq = NULL, 
    rm.horiz.lines = FALSE, scale = "free"))

rm3

Basic Word Statistics
This section utilizes the word_stats function in conjunction with ggplot2 to create a heat map for various descriptive word statistics. Below is a list of column names for the function’s default print method.

   column.title description                           
1  n.tot        number of turns of talk               
2  n.sent       number of sentences                   
3  n.words      number of words                       
4  n.char       number of characters                  
5  n.syl        number of syllables                   
6  n.poly       number of polysyllables               
7  sptot        syllables per turn of talk            
8  wps          words per sentence                    
9  cps          characters per sentemce               
10 sps          syllables per sentence                
11 psps         polly syllables per sentence          
12 cpw          characters per word                   
13 spw          syllables per word                    
14 n.state      number of statements                  
15 n.quest      number of questions                   
16 n.incom      number of incomplete satetments       
17 n.hapax      number of hapax legomenon             
18 n.dis        number of dis legomenon               
19 grow.rate    proportion of hapax legomenon to words
20 prop.dis     proportion of dis legomenon to words

z <- with(dat4, word_stats(dialogue, list(person, time), tot))
z$ts
z$gts
(z2 <- colsplit2df(z$gts))    #split a qdap merged column apart
z2$person <- factor(z2$person, levels=          #relevel factor
    qcv(terms="OBAMA ROMNEY CROWLEY LEHRER SCHIEFFER QUESTION"))
x <- with(z2, z2[order(person, time), ])

library(reshape2); library(plyr)
x2 <- melt(x)
x2 <- ddply(x2, .(variable), transform,
   rescale = rescale(value))
x2$var <- as.factor(paste2(x2[, 1:2]))
x3 <- x2[x2$person %in% qcv(ROMNEY, OBAMA), ]
x3$var <- factor(x3$var, levels = rev(levels(x3$var)))

ggplot(x3, aes(variable, var)) + geom_tile(aes(fill = rescale),
    colour = "white") + scale_fill_gradient(low = "white",
    high = "black") + theme_grey() + labs(x = "",
    y = "") + scale_x_discrete(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
    axis.ticks = element_blank(), axis.text.x = element_text(angle = -90, 
        hjust = 0, colour = "grey50"))

heatmap

Venn Diagram
With proper stop word use and small, variable data sets a Venn diagram can be informative. In this case the overlap is fairly strong and less informative though labels are centered. Thus labels closer in proximity are closer in words used.

with(subset(dat4, person == qcv(ROMNEY, OBAMA)), 
    trans.venn(dialogue, list(person, time), 
    title.name = "Presidential Debates Word Overlap 2012")
)

venn

Dissimilarity Matrix

dat5 <- subset(dat4, person == qcv(ROMNEY, OBAMA))
dat5$person <- factor(dat5$person, levels = qcv(OBAMA, ROMNEY))
#a word frequency matrix inspired by the tm package's DocumentTermMatrix
with(dat5, wfm(dialogue, list(person, time)))
#with row and column sums
with(dat5, word.freq.df(dialogue, list(person, time), margins = TRUE))
#dissimilarity (similar to a correlation 
#The default emasure is 1 - binary or proportion overlap between grouping variable
(sim <- with(dat5, dissimilarity(dialogue, list(person, time))))
              OBAMA.time.1 OBAMA.time.2 OBAMA.time.3 ROMNEY.time.1 ROMNEY.time.2
OBAMA.time.2         0.293                                                      
OBAMA.time.3         0.257        0.303                                         
ROMNEY.time.1        0.317        0.261        0.245                            
ROMNEY.time.2        0.273        0.316        0.285         0.317              
ROMNEY.time.3        0.240        0.276        0.311         0.265         0.312

Network Graph
The use of igraph may not always be the best way to view the data but this exercise shows one way this package can be utilized. In this plot the wlabels are sized based on number of words used. The distance measures that label the edges are taken from the dissimilarity function (1 – binary). Colors are based on political party.

library(igraph)
Z <- with(dat5, adjacency_matrix(wfm(dialogue, list(person, time))))
g <- graph.adjacency(Z$adjacency, weighted=TRUE, mode ='undirected')
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

set.seed(3952)
layout1 <- layout.auto(g)
opar <- par()$mar; par(mar=rep(.5, 4)) #Give the graph lots of room
plot(g, layout=layout1)

edge.weight <- 9  #a maximizing thickness constant
z1 <- edge.weight * sim/max(sim)*sim
E(g)$width <- c(z1)[c 1="!=" 2="0" language="(z1)"][/c] #remove 0s: these won't have an edge
numformat <- function(val, digits = 2) { sub("^(-?)0.", "\\1.", sprintf(paste0("%.", digits, "f"), val)) }
z2 <- numformat(round(sim, 3), 3)
E(g)$label <- c(z2)[c 1="!=" 2="0" language="(z2)"][/c]
plot(g, layout=layout1) #check it out! 

label.size <- 15 #a maximizing label size constant
WC <- aggregate(dialogue~person +time, data=dat5, function(x)  sum(word.count(x), na.rm = TRUE))
WC <- WC[order(WC$person, WC$time), 3]
resize <- (log(WC)/max(log(WC)))
V(g)$label.cex <- 5 *(resize - .8)
plot(g, layout=layout1) #check it out!

V(g)$color <- ifelse(substring(V(g)$label, 1, 2)=="OB", "pink", "lightblue")

plot(g, layout=layout1)
tkplot(g)

igr

This blog post is a rough initial analysis of the three presidential debates. It was meant as a means of demonstrating the capabilities of qdap rather than providing in depth analysis of the candidates. Please share your experiences with using qdap in a comment below and suggestions for improvement via the issues page of qdap’s github site(LINK).

For a pdf version of all the graphics created in the blog post -click here-


To leave a comment for the author, please follow the link and comment on his blog: TRinker's R Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Viewing all articles
Browse latest Browse all 152

Trending Articles