# Blog


My inaugural blog post as a Data Science Consultant for SmartCat. The code that accompanies the analyses presented here is available in the respective GitHub repository. On how to use R to estimate the optimal time of day for aliens to invade Earth, and a few more interesting things.

While mindlessly scraping with {twitteR}, I was planning the analysis, and my thoughts started wandering around all the cool work that people at NASA do... What will they announce today? A new exoplanet, I can bet. People are crazy about exoplanets, and aliens, and the SETI program, and astrobiology, and all that stuff. A Twin Earth! However, nobody realizes, I thought, that the potential discovery of this Earth by some technologically advanced alien civilization could pose a real existential threat to us humans: a true global catastrophic risk. And with all their antennas, golden plates with pictographs, and Bach on their interstellar probes... nobody seems to worry about Sir Stephen Hawking's well-reasoned warning on how intelligent aliens could destroy humanity (not even Hawking himself, cf. "Stephen Hawking: Intelligent Aliens Could Destroy Humanity, But Let's Search Anyway")! Anyway, it was probably around the third or fourth glass of wine in the Belgrade cafe from which I had left {twitteR} to do the job on my netbook, when I realized what I wanted to do this time with R: estimate the optimal time of day for aliens to invade our planet by analyzing the daily oscillation in the sentiment and volume of tweets from NASA accounts.

The assumptions: if aliens somehow figure out where we live, it will be because of these guys with the big radio antennas. Next, whatever alien civilization decides to invade Earth will certainly be technologically advanced enough to immediately discover the very source of our quest for them. Finally, given their technological supremacy, they will be able to analyze all the information necessary to ensure the success of their mission: including our (precious!) tweets.

And here it is, with a little help from {tm.plugin.sentiment}, {dplyr} and {ggplot2}:

```r
library(dplyr)
library(tidyr)    # gather()
library(ggplot2)

emoHours <- tweetsDF %>%
  group_by(Hour) %>%
  summarise(tweets = n(),
            positive = length(which(Polarity > 0)),
            neutral = length(which(Polarity == 0)),
            negative = length(which(Polarity < 0)))
# - counts to proportions of total tweets per hour:
emoHours$positive <- emoHours$positive/emoHours$tweets
emoHours$neutral <- emoHours$neutral/emoHours$tweets
emoHours$negative <- emoHours$negative/emoHours$tweets
emoHours$Hour <- as.numeric(emoHours$Hour)
# - rescale hourly counts to [0, 1] by the maximum count:
emoHours$Volume <- emoHours$tweets/max(emoHours$tweets)
# - long format for {ggplot2}:
emoHours <- emoHours %>%
  gather(key = Measure,
         value = Value,
         positive:Volume)
ggplot(emoHours, aes(x = Hour, y = Value, color = Measure)) +
  geom_path(size = .25) +
  geom_point(size = 1.5) +
  geom_point(size = 1, color = "White") +
  ggtitle("Optimal Time to Invade Earth") +
  scale_x_continuous(breaks = 0:23, labels = as.character(0:23)) +
  theme_bw() +
  theme(plot.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(size = 8, angle = 90))
```

Figure 1. Optimal time to invade Earth w. {tm.plugin.sentiment}, {dplyr}, and {ggplot2}

The tweetsDF data frame becomes available after running the previous chunks of code that you will find in the GitHub repo for this blog post. The Polarity column comes from the application of {tm.plugin.sentiment} functions over a {tm} pre-processed corpus of all 255,241 tweets that were collected from NASA's accounts:

```r
### --- Sentiment Analysis
library(tm)
library(tm.plugin.sentiment)

# - as {tm} VCorpus
nasaCorpus <- VCorpus(VectorSource(tweetsDF$text))
# - Term-Document Matrix
nasaTDM <- TermDocumentMatrix(nasaCorpus,
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             removeWords = list(stopwords("english")),
                                             stripWhitespace = TRUE,
                                             stemDocument = TRUE,
                                             minWordLength = 3,
                                             weighting = weightTf))
# - {tm.plugin.sentiment} polarity score
# - NOTE: that would be (p-n)/(p+n)
nasaPolarity <- polarity(nasaTDM)
# - how many tweets received a non-zero polarity score:
sum(nasaPolarity != 0)
tweetsDF$Polarity <- nasaPolarity
```

The optimal time for an alien invasion is obviously somewhere between 7:00 and 9:00 in the morning (NOTE for the aliens: all times are GMT). All tweets were categorized as neutral, positive, or negative with respect to their polarity, which is given as (p-n)/(p+n), p being the count of positive and n of negative words in the respective tweet. Then, instead of going for a time-series analysis, I simply grouped all tweets by the hour of the day in which they occurred, recalculating the counts of positive, negative, and neutral tweets into proportions of the total tweets per hour. Thus, the vertical axis shows the proportion of tweets per hour. The Volume variable is simply the raw count of tweets per hour rescaled by the maximum hourly count found in the data set, so that it can be conveniently presented on a 0-to-1 scale in the chart. And what we see is that between 7:00 and 9:00, approximately, an anomaly in the hourly distribution of tweets from NASA's accounts takes place: a sudden increase in the proportion of neutral and negative tweets, accompanied by a drop in the volume of tweets. So that's when we're moody and not relaxed, and probably tweeting less under the pressure of the daily work routine before lunch: the ideal time for an alien civilization to invade.
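The grouping step can be sketched in miniature. The following is a toy example in base R (the Hour and Polarity values are hypothetical, not taken from the NASA data) that mirrors the {dplyr} pipeline above: group by hour, turn the counts of positive, neutral, and negative tweets into proportions, and rescale the hourly volume to [0, 1].

```r
# Toy data: hypothetical hours and polarity scores, not from the NASA set.
toyTweets <- data.frame(
  Hour = c(7, 7, 7, 7, 8, 8),
  Polarity = c(0.5, -0.2, 0, 0, 0.1, -0.3)
)

# Group by hour; mean(x > 0) is the proportion of positive tweets, etc.
tweets   <- tapply(toyTweets$Polarity, toyTweets$Hour, length)
positive <- tapply(toyTweets$Polarity, toyTweets$Hour, function(x) mean(x > 0))
neutral  <- tapply(toyTweets$Polarity, toyTweets$Hour, function(x) mean(x == 0))
negative <- tapply(toyTweets$Polarity, toyTweets$Hour, function(x) mean(x < 0))

# Volume: hourly counts rescaled by the maximum count, so they fit on [0, 1]
Volume <- tweets / max(tweets)
```

Here hour 7 has four tweets (one positive, two neutral, one negative), so its proportions are 0.25, 0.5, and 0.25, and its Volume is 1 since it is the busiest hour in the toy set.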

Of course, technologically advanced aliens, who know statistics very well, could just as well ask whether the described phenomenon is simply a by-product of the increased measurement error related to the quite obvious drop in the sample sizes for the respective, invasion-critical hours...
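That worry can be made concrete: the standard error of an estimated proportion is sqrt(p(1 - p)/n), so it grows as the hourly sample size n shrinks. A minimal sketch, with hypothetical hourly counts (not the actual counts from the NASA data):

```r
# Standard error of an estimated proportion: sqrt(p * (1 - p) / n).
# The proportion and the two hourly counts below are hypothetical.
propSE <- function(p, n) sqrt(p * (1 - p) / n)

# The same observed proportion of negative tweets is far noisier
# when estimated from a sparse early-morning hour:
seBusy   <- propSE(0.2, 5000)  # a busy hour
seSparse <- propSE(0.2, 200)   # a sparse, invasion-critical hour
seSparse / seBusy              # five times the measurement error
```

So part of the morning "anomaly" could indeed be noise: with 25 times fewer tweets in an hour, the proportion estimates are five times less precise.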

Putting aside the question of alien invasion, I am really very interested to learn from NASA today what it is that was discovered beyond the limits of the Solar System. To illustrate how popular the discoveries of potentially habitable exoplanets are, the following analysis was conducted. In the first step, we simply concatenate all tweets originating from the same account, while treating the tweets with the #askNASA hashtag as a separate group (i.e., as if it were a Twitter account in itself). Given that I was interested in the account level of analysis here, and provided that individual tweets offer too little information for typical BoW approaches in text mining, this step is justified. Then, I produced a typical Term-Document Matrix from all available tweets, preserving all terms beginning with "@" or "#", and cleaned the matrix of everything else. Finally, the term counts were turned into binary (present/absent) information in order to compute the Jaccard similarity coefficients across the accounts:

```r
library(dplyr)
library(tm)
library(proxy)    # dist() with method = "Jaccard"

tweetTexts <- tweetsDF %>%
  group_by(screenName) %>%
  summarise(text = paste(text, collapse = " "))
# - accNames is to be used later:
accNames <- tweetTexts$screenName
accNames <- append(accNames, "askNASA")
tweetTexts <- tweetTexts$text
# - dT holds the #askNASA tweets (prepared earlier; see the GitHub repo):
askNasaText <- paste(dT$text, collapse = "")
tweetTexts <- append(tweetTexts, askNasaText)
tweetTexts <- enc2utf8(tweetTexts)
tweetTexts <- VCorpus(VectorSource(tweetTexts))
# - Term-Doc Matrix for this; protect "@" and "#" from punctuation removal:
removePunctuationSpecial <- function(x) {
  x <- gsub("#", "HASHCHAR", x)
  x <- gsub("@", "MONKEYCHAR", x)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("HASHCHAR", "#", x)
  x <- gsub("MONKEYCHAR", "@", x)
  return(x)
}
tweetTexts <- tm_map(tweetTexts,
                     content_transformer(removePunctuationSpecial),
                     lazy = TRUE)
tweetsTDM <- TermDocumentMatrix(tweetTexts,
                                control = list(tolower = FALSE,
                                               removePunctuation = FALSE,
                                               removeNumbers = TRUE,
                                               removeWords = list(stopwords("english")),
                                               stripWhitespace = TRUE,
                                               stemDocument = FALSE,
                                               minWordLength = 3,
                                               weighting = weightTf))
# - store TDM object:
saveRDS(tweetsTDM, "tweetsTDM.Rds")
# - extract only mention and hashtag features:
tweetsTDM <- t(as.matrix(tweetsTDM))
w <- which(grepl("^@|^#", colnames(tweetsTDM)))
tweetsTDM <- tweetsTDM[, w]
# - keep only mention and hashtag features w. Freq > 10
wK <- which(colSums(tweetsTDM) > 10)
tweetsTDM <- tweetsTDM[, wK]
# - transform to binary for Jaccard distance
wPos <- which(tweetsTDM > 0, arr.ind = T)
tweetsTDM[wPos] <- 1
# - Jaccard distances for accounts and #askNASA:
simAccounts <- dist(tweetsTDM, method = "Jaccard", by_rows = T)
simAccounts <- as.matrix(simAccounts)
```

The following {igraph} visualization works this way: each account (of which one, be reminded, #askNASA, is not really an account but represents information on all tweets with the respective hashtag) points to the account most similar to it with respect to the Jaccard distance computed from the presence and absence of the mentions and hashtags used. So this is more a proxy of a "social distance" between accounts than a true distributional-semantics measure. It can be readily observed that #askNASA points to @PlanetQuest as its nearest neighbor in this analysis. The Jaccard distance was used since I am not really into using typical term-document count matrices when analyzing tweets; they simply convey too sparse an information for a typical approach to make any sense.
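For a concrete sense of the construction, here is a minimal base-R sketch on a toy binary feature matrix (the account names and feature vectors are hypothetical): each account receives a directed edge to its nearest neighbor under the Jaccard distance, and the resulting edge list is exactly what one would feed to {igraph}.

```r
# Toy binary matrix: rows = accounts, columns = mention/hashtag features.
# Names and values are hypothetical, for illustration only.
feat <- rbind(
  NASA        = c(1, 1, 0, 1, 0),
  PlanetQuest = c(0, 1, 1, 1, 0),
  askNASA     = c(0, 1, 1, 1, 1)
)

# Jaccard distance between two binary vectors: 1 - |intersection| / |union|
jaccard <- function(a, b) 1 - sum(a & b) / sum(a | b)

# Pairwise distance matrix:
d <- outer(seq_len(nrow(feat)), seq_len(nrow(feat)),
           Vectorize(function(i, j) jaccard(feat[i, ], feat[j, ])))
dimnames(d) <- list(rownames(feat), rownames(feat))
diag(d) <- Inf   # an account is not its own neighbor

# Each account points to its nearest neighbor:
edges <- data.frame(from = rownames(d),
                    to   = rownames(d)[apply(d, 1, which.min)])
# The edge list could then be plotted via igraph::graph_from_data_frame(edges).
```

In this toy set, askNASA shares three of its four features with PlanetQuest, so the edge askNASA -> PlanetQuest falls out, just as in the real analysis.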