This is a quick and dirty way of determining the language of Twitter posts. Ideally one would have a naive Bayes model scoring every tweet against a bag of words in order to detect languages, but unfortunately the e1071 package doesn't come with a multinomial Boolean version of naive Bayes, which is the one I feel comfortable using when doing NLP.
Instead, I will use the tm package, which comes with several stopword lists, to quickly determine languages.
Let’s load the packages required for this task:
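For instance (assuming the tm package for stopword lists and twitteR for fetching posts, both of which are referenced in this walkthrough):

```r
# tm provides the stopwords() helper; twitteR is used later to fetch posts
library(tm)
library(twitteR)
```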
According to the tm package documentation, the supported languages are: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish.
I am going to create a list of supported languages and then iterate through this list in order to extract all stopwords.
So here is my list of supported languages:
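A sketch of that vector (the name langList matches the iteration step described below):

```r
# languages supported by tm's stopword lists
langList <- c("danish", "dutch", "english", "finnish", "french", "german",
              "hungarian", "italian", "norwegian", "portuguese", "russian",
              "spanish", "swedish")
```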
and here is the empty list where I'll store languages along with their stopwords (similar to a Python dictionary):
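Something like this, where stopWordsList is a name I'm assuming for illustration:

```r
# empty list: names will be languages, values their stopword vectors
stopWordsList <- list()
```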
and now let’s iterate over langList:
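A minimal version of that loop, assuming the langList vector and the stopWordsList container from the previous steps:

```r
for (lang in langList) {
  # tm::stopwords() returns the stopword vector for the given language
  stopWordsList[[lang]] <- stopwords(kind = lang)
}
```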
Let’s do a quick check:
total stopwords by language:
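One way to sketch this check, assuming the stopWordsList container above:

```r
# number of languages loaded
length(stopWordsList)
# stopword counts per language
sapply(stopWordsList, length)
```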
OK, now we need a function that scores a given piece of text of any length, so here is my function:
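A possible sketch of such a scoring function (whichLanguage is a hypothetical name; it counts stopword hits per language and returns the best match, assuming the stopWordsList built above):

```r
whichLanguage <- function(text) {
  # lower-case the text and split it into words on whitespace and punctuation
  words <- unlist(strsplit(tolower(text), "[[:space:][:punct:]]+"))
  # score each language by how many of the words appear in its stopword list
  scores <- sapply(stopWordsList, function(sw) sum(words %in% sw))
  # return the name of the language with the highest score
  names(which.max(scores))
}
```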
Let’s do some examples:
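For example, using the hypothetical whichLanguage() function (expected results assume tm's stopword lists contain the usual function words like "the" and "el"):

```r
whichLanguage("the quick brown fox jumps over the lazy dog")  # likely "english"
whichLanguage("el perro de san roque no tiene rabo")          # likely "spanish"
```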
OK, my function seems to be working fine, so now let's get some tweets with the #bigdata hashtag.
I will need a data frame with all tweets (I’ll use this df to score posts):
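A sketch of the fetch step, assuming twitteR has already been authenticated (e.g. via setup_twitter_oauth()); the per-language search matches the accuracy check described below, and searchLang is a column I add to record which language was requested:

```r
langs <- c("en", "de", "es", "fr")  # hypothetical subset of language codes
tweetsDf <- do.call(rbind, lapply(langs, function(l) {
  # searchTwitter() returns a list of status objects; twListToDF() flattens it
  df <- twListToDF(searchTwitter("#bigdata", n = 100, lang = l))
  df$searchLang <- l  # remember which language we searched for
  df
}))
```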
Let's look at post counts by language, as we want to check accuracy later on…
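Assuming the data frame carries a searchLang column recording the language requested in each search:

```r
# tweets fetched per requested language
table(tweetsDf$searchLang)
```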
The first tweet in every language is:
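One way to pull that out, again assuming the tweetsDf data frame and its searchLang column:

```r
# first fetched tweet per search language
firstByLang <- do.call(rbind, lapply(split(tweetsDf, tweetsDf$searchLang), head, 1))
firstByLang[, c("searchLang", "text")]
```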
Let’s score the list of tweets:
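Assuming the hypothetical whichLanguage() scorer sketched earlier:

```r
# predicted language for every tweet text
tweetsDf$predictedLang <- sapply(tweetsDf$text, whichLanguage)
```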
And finally our crosstab table looks like:
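With the hypothetical columns above, the crosstab could be built like this:

```r
# rows: language requested in the search; columns: language predicted by stopword scoring
table(searched = tweetsDf$searchLang, predicted = tweetsDf$predictedLang)
```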
There are some posts labeled incorrectly; some of the reasons are:
Even though we have searched by language, that doesn't mean the post is written in that language: it only means the default language used by the Twitter account is German, English, etc.
It's harder to predict correctly when the message is too short.