Wednesday, November 26, 2014

Tag clouds in R

An easy way to visualise the concepts that are important in a text is to create a tag cloud, in which the most common words are written large and less common words are made progressively smaller. We want to remove the really common words first so that we avoid creating a cloud with only "in", "of", "for", "the", "a", "an", and so on. There are a number of web-based applications that will do it for us, but where is the fun in that when we can do it in R?

The first question is which words to use. It boils down to finding a suitable text that really reflects the research. The best I have come up with is using article titles. First we copy all the article titles into a single file, and then we rearrange them so that there is a single word on each line. This makes it easy to import into R as a matrix using:

> ArticleTitles = as.matrix(read.csv("file-with-title-words.txt", header = FALSE))
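
The rearranging step can also be done in R rather than by hand. The following is only a minimal sketch: the two titles are made up for illustration, and the file is written to a temporary directory rather than wherever the real titles would live. It splits each title on anything that is not a letter, writes one word per line, and reads the result back with header = FALSE so that the first word is not taken as a column name.

```r
# Hypothetical input: two made-up article titles standing in for the real list.
titles <- c("Tag clouds in R", "Drawing word clouds: an R example")

# Lower-case the titles, split on anything that is not a letter,
# and drop the empty strings that splitting can leave behind.
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nzchar(words)]

# Write one word per line to a file in the temporary directory.
out_file <- file.path(tempdir(), "file-with-title-words.txt")
writeLines(words, out_file)

# Read the file back; header = FALSE keeps the first word as data.
ArticleTitles <- as.matrix(read.csv(out_file, header = FALSE,
                                    stringsAsFactors = FALSE))
```

Splitting on "[^a-z]+" also discards digits and accented letters, which may or may not be what we want for real titles.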

The "as.matrix()" is needed since read.csv automatically imports files as data frames, while the function we are going to use expects character data rather than a data frame. That function comes from the wordcloud package, which uses the tm package to clean up the text. We get both by running the following code at the R prompt:

> install.packages(c("wordcloud", "tm"))

and loading them with:

> library(wordcloud)
> library(tm)

Thereafter it is as easy as:

> wordcloud(ArticleTitles)

By default this draws the words that appear at least 3 times, using black text on a white background; the max.words argument caps how many words are drawn. Punctuation and common words are removed automatically. There are a lot of other parameters that we could tweak to get a better-looking cloud, but that is left to the reader to try out. The layout is random, so in order to make the cloud look like a kidney we can just run the code a number of times until something vaguely kidney-like appears, and then import the image into Adobe Illustrator to make it even better. Finally, a light gray outline of a kidney is added as a background to make the shape more obvious.
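
For readers who want to try the parameters, here is a small sketch of the same idea done by hand. The words and the short stop-word list are made up for illustration (tm ships a much longer English list), and the seed value and colours are arbitrary choices. It counts the words in base R, passes the counts to wordcloud explicitly, and fixes the random seed so that a layout we like can be drawn again. Writing to a PDF gives a vector file that a drawing program can edit.

```r
# Made-up words standing in for the title words.
words <- c("kidney", "kidney", "kidney", "renal", "renal",
           "of", "the", "the", "transplant")

# A tiny hand-picked stop-word list (an assumption for this sketch).
stop_words <- c("in", "of", "for", "the", "a", "an")
kept <- words[!words %in% stop_words]

# Count the remaining words, most common first.
freq <- sort(table(kept), decreasing = TRUE)

# Only draw if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  pdf(file.path(tempdir(), "cloud.pdf"))   # vector output in the temp directory
  set.seed(42)                             # same seed, same layout
  wordcloud::wordcloud(names(freq), as.vector(freq),
                       min.freq = 1,          # keep words that appear only once
                       max.words = 100,       # cap on the number of words drawn
                       random.order = FALSE,  # place the most frequent words first
                       colors = c("grey40", "steelblue", "darkred"))
  dev.off()
}
```

Changing the value given to set.seed produces a different layout, so trying a handful of seeds is a repeatable way of hunting for a kidney-like shape.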
