You shall not pass! This pretty much sums up the main reason why Twitalyzr was made in the first place. Twitalyzr is a real-time filter of tweets with the #cassandra hashtag. It classifies new tweets into light side tweets and dark side tweets. Light side tweets are all about the glory of the Apache Cassandra database and Big Data in general, and dark side tweets are everything else. Apache Cassandra is a distributed NoSQL database. Because the database has a very common name, tweets with #cassandra cover a very diverse set of topics: the database, book/game/movie characters, shopping, soap operas, but also adult content. People who follow the Big Data world usually want to be informed about new blogs, seminars and papers, as well as jobs, questions and problems related to this database. There are around 100-200 tweets with #cassandra and half of them are useless, so looking at all the unnecessary tweets can be a real waste of time, every goddamn day. This task cannot be solved by adding more hashtags or with a few if-then expressions, because that would really narrow down the search.
Twitalyzr is an Apache Spark application that consists of two parts. The first part builds and optimizes a classification model. The second part loads the previously trained model and attaches it to a stream to perform real-time analysis of incoming tweets. After the classification, light side tweets (the ones with porn inside, hehe, just kidding) are sent to the web application for visualization. In the image below we can see the architecture of Twitalyzr.
The purpose of the Training Spark Job is to build the model with the best accuracy on some test cases. First of all, we collected a dataset containing tweets with the hashtags #cassandra, #bigdata, #datastax, #porn, #sex, #spark… The reason why we took all that for training was to feed the model with more data, so it could be more accurate. We collected around 5000 tweets using the Twitter API within a few days. Now comes the horrible part: because 5000 is not a big number, we decided to manually label tweets as light side (1) or dark side (0) and use them in a supervised learning algorithm. Fortunately, we developed a GUI app for fast labeling of JSON tweets (or any other text in JSON format), so this task was finished quickly. Nevertheless, it is a manual process, so errors are unavoidable. We read/watched a lot of crazy stuff. There are so many obscenities containing the word cassandra that we must ask: why is the database named like that? This dirty job is followed by the preprocessing of the training set. The preprocessing step contains several stages, which are applied to both the tweet text and the user description text:
- Cleaning - Removal of non-English letters, numbers, multiple whitespaces, etc.
- Tokenize - Tokenization of text on whitespace
- Stop word removal - Words like: the, a, to, at are useless and thus removed
- N-grams - We used 2-grams (3-grams gave slightly lower accuracy on the test set)
- Term Frequency - Simply counting occurrences of words in text and making a vector model.
- Vector assembly - Grouping all preprocessed vectors from tweet text, user description and hashtags into a single sparse vector. That vector represents the features of one tweet.
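
To make the stages above a bit more concrete, here is a minimal plain-Python sketch of the same idea. It is not our Spark pipeline: the function names, the tiny stop-word list and the choice to count 2-grams (rather than single words) are assumptions made only for illustration, and hashtags are left out of the assembly step to keep it short.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"the", "a", "to", "at", "is", "of", "and", "in"}


def clean(text):
    """Lowercase, turn every run of non-English-letter characters into one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def tokenize(text):
    """Split cleaned text on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Build consecutive n-grams, joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tweet_text, user_description):
    """Count 2-grams per field and assemble them into one sparse mapping.

    Keys are prefixed with the field name, so the same 2-gram in the tweet
    text and in the user description ends up as two separate features.
    """
    features = {}
    for field, text in (("text", tweet_text), ("user", user_description)):
        tokens = remove_stop_words(tokenize(clean(text)))
        for term, count in Counter(ngrams(tokens, 2)).items():
            features[f"{field}:{term}"] = count
    return features
```

A dictionary that stores only the non-zero counts plays the role of the sparse vector here: most 2-grams never occur in a given tweet, so there is no point in storing zeros for them.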
We split the dataset into a training set and a test set. Training was done using k-fold cross-validation. We used the grid search already implemented in Spark ML for parameter optimization. The compared algorithms were Logistic Regression, Naive Bayes and Random Forest. Logistic Regression achieved the best results on the test set, with an accuracy of 0.99.
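
For readers who have not met grid search with k-fold cross-validation before, the idea can be sketched in plain Python. This is a hand-rolled illustration, not the Spark ML implementation we used: `k_fold_splits`, `grid_search`, and the `fit` and `score` callbacks are hypothetical names, and the folds are taken in order without shuffling to keep the sketch deterministic.

```python
from itertools import product


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def grid_search(samples, labels, param_grid, fit, score, k=3):
    """Try every parameter combination; return (best_params, best_mean_score).

    fit(train_samples, train_labels, **params) must return a model and
    score(model, samples, labels) must return a number (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, validation in k_fold_splits(len(samples), k):
            model = fit([samples[i] for i in train],
                        [labels[i] for i in train], **params)
            fold_scores.append(score(model,
                                     [samples[i] for i in validation],
                                     [labels[i] for i in validation]))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Every parameter combination is scored by its average over the k validation folds, so a lucky split cannot make a bad setting look good. The held-out test set is only touched once, at the very end.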
It was fun to find out that we had to train the model in iterations. Due to poor results in the first iterations of training, we printed out the tweets that were classified wrong and realised that the model hadn’t classified them wrong; they were labelled inaccurately, due to the human labeling we mentioned earlier. Basically, some lewd tweets were labeled as light side and some database-related tweets were labeled as dark side. Maybe the person who had labeled it actually watched that porn and it was so great that he had to label it as good? Guess we’ll never know. But it was really annoying to open the dataset again (a humongous number of times) and manually correct the mislabelled tweets. Of course, we could have used some unsupervised learning techniques to try to get the labels automatically, or we could have used text from different sources whose topic we were certain about. But 5000 tweets weren’t that many.
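
The review loop described above boils down to listing every tweet where the model and the human label disagree, and then looking at those by hand. A minimal sketch, assuming the texts, labels and predictions are available as plain parallel lists (the function name `disagreements` is made up for this example):

```python
def disagreements(texts, labels, predictions):
    """Return (index, text, label, prediction) wherever label != prediction."""
    return [
        (i, text, label, prediction)
        for i, (text, label, prediction) in enumerate(zip(texts, labels, predictions))
        if label != prediction
    ]


def print_for_review(rows):
    """Print each disagreement so a human can decide who was wrong."""
    for index, text, label, prediction in rows:
        print(f"#{index} label={label} predicted={prediction}: {text}")
```

Each printed row is either a genuine model error or a labeling mistake. Only a human can tell which, so the index is kept to make the fix in the dataset easy to find.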
The streaming task was very easy. Its job was to load the previously trained model and connect to the Twitter stream via the Twitter API. The stream was processed in mini-batches, e.g. every minute the collected records were transformed through the same pipeline (the trained model). After that, tweets that had been labelled as light side were filtered and sent via REST to the web server, and those tweets were shown in the application. Dark side tweets were sent to our private collection of naughty things :) It sounded cool to run 24/7 on an AWS instance and have a live demo of the results. So we added some Ansible and made it happen. Ansible automated the whole process of creating the instance, configuring Spark on AWS, deploying the application, starting the streaming, etc. Now our baby Twitalyzr shows the results on its page.
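
The per-batch logic is simple enough to sketch without Spark. The snippet below is an illustration only: `process_batch`, `to_payload` and `toy_classifier` are hypothetical names, the payload shape is invented, and the toy classifier is just a stand-in for the trained model.

```python
import json

# Label convention from the training step: 1 = light side, 0 = dark side.
LIGHT_SIDE = 1


def process_batch(tweets, classify):
    """Classify one mini-batch and keep only the light side tweets."""
    return [tweet for tweet in tweets if classify(tweet) == LIGHT_SIDE]


def to_payload(tweets):
    """Serialize tweets as the JSON body of a REST request (shape is assumed)."""
    return json.dumps({"tweets": tweets})


def toy_classifier(tweet):
    """Stand-in for the trained model: light side if the text mentions 'database'."""
    return LIGHT_SIDE if "database" in tweet["text"].lower() else 0


batch = [
    {"text": "Cassandra database 3.0 released"},
    {"text": "Cassandra is my favourite character"},
]
light_side = process_batch(batch, toy_classifier)
body = to_payload(light_side)  # this string would be POSTed to the web server
```

Because the classifier is passed in as a function, the same batch code works with any model that maps a tweet to 0 or 1.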
This project is just a showcase of how Apache Spark can be used for both machine learning and streaming. The logistic regression model showed the best performance, and after deployment on live data it didn’t bow to even the darkest tweets. The idea for future work is to generalize this project so that it works not only on #cassandra. One of the ideas is to have a user subscribe to one search term (it could be a hashtag); Twitalyzr would then automatically start collecting tweets and try to group tweets with similar topics. After that, the user would manually choose a cluster they are interested in, and that kind of tweet would be sent to them. There is another interesting problem, bot detection, because there are a lot of bots on Twitter, and their tweets and retweets mostly don’t have any value.