Pipeline to Preprocess Tweets for Optimized Unigram Feature Selection

In this post I'm going to showcase one of many ways to clean up the text of tweets and make it ready for a unigram feature selection process. I am using Python here, but I will follow up with a Java 8 implementation in another post. In the image below, the component that receives the raw text and transforms it into something digestible for a feature selector is the pipe component (see that Sherlock Holmes-looking pipe?), and that pipe is what we will concentrate on here.

My suggested implementation of the pipe is partially based on the study by Go, Huang and Bhayani (2009) [1]. The researchers found that because many tweets mention other users and embed URLs, the raw text carries extra overhead that makes feature selection a much harder task and the learning process less accurate. As you might know, feature selection in NLP is a much more resource-consuming task than the learning itself. They suggest that the pipe should map every username in a tweet to a string literal like USER and every URL embedded in a tweet's text to URL. We follow this rule of thumb here, among other extra work that we do.
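
To make that convention concrete, here is a made-up example (the handle and URL below are fabricated for illustration). A tweet like

    @jack check out this article https://example.com/post

would come out of the pipe as

    USER check out this article URL

before any further processing.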

Assuming tweet is the raw tweet text fetched from the Twitter Streaming API, my suggested pipeline consists of the following 7 steps (a sketch that combines all of them into one function follows the list):

  1. Encode/Decode the tweet: You might be wondering why this step is necessary, but according to the Twitter Streaming API documentation,

    The body of a streaming API response consists of a series of newline-delimited messages, where “newline” is considered to be \r\n (in hex, 0x0D 0x0A) and “message” is a JSON encoded data structure or a blank line.

    To get rid of these escape characters (I am assuming that in most cases you don't want them represented in your feature list), you simply decode the text and encode it back to ASCII:

    # Python 2: interpret the escape sequences, then drop anything non-ASCII
    result = tweet.decode('unicode_escape').encode('ascii', 'ignore')

  2. Replace usernames mentioned in the text with the string literal USER: For this, you can easily use a regular expression to find all mentioned usernames and replace them:

    import re
    result = re.sub('(?<=^|(?<=[^a-zA-Z0-9-_.]))@([A-Za-z]+[A-Za-z0-9]+)', 'USER', result)

  3. Replace URLs with the string URL: For this, you can also use a simple regular expression:

    result = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', 'URL', result)

  4. Remove Backslash characters: For some strange reason, after decoding to ASCII you will see that the text may contain extra backslashes. It is safe to just remove them:

    result = result.replace('\\', '')

  5. Tokenize: Transform the string to lowercase (or uppercase, as long as you are consistent) and split it into tokens:

    result = result.lower().split()

  6. Remove Stop Tokens: Now it is time to remove the stopwords. I am using the NLTK stopwords corpus, which has a decent list of English stopwords:

    from nltk.corpus import stopwords
    stops = set(stopwords.words("english"))
    tokens = [t for t in result if t not in stops]

  7. Remove everything except hashtags and alphanumeric tokens:

    tokens = [t for t in tokens if "#" in t or t.isalnum()]
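
Putting the steps together, here is a minimal sketch of the whole pipe as a single function. The function name preprocess_tweet and the sample tweet are my own choices for illustration; the body is simply the code from the steps above, written for Python 2 since the decode/encode step relies on Python 2 string semantics:

    import re
    from nltk.corpus import stopwords

    STOPS = set(stopwords.words("english"))

    def preprocess_tweet(tweet):
        # 1. interpret escape sequences, drop anything non-ASCII (Python 2)
        result = tweet.decode('unicode_escape').encode('ascii', 'ignore')
        # 2. map @mentions to USER
        result = re.sub('(?<=^|(?<=[^a-zA-Z0-9-_.]))@([A-Za-z]+[A-Za-z0-9]+)', 'USER', result)
        # 3. map URLs to URL
        result = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', 'URL', result)
        # 4. drop stray backslashes
        result = result.replace('\\', '')
        # 5. lowercase and tokenize
        result = result.lower().split()
        # 6. drop stopwords
        tokens = [t for t in result if t not in STOPS]
        # 7. keep only hashtags and alphanumeric tokens
        return [t for t in tokens if "#" in t or t.isalnum()]

On a made-up tweet, the output looks like this (note that USER and URL are lowercased by step 5):

    print(preprocess_tweet('@jane loved this read https://example.com/article #nlp'))
    # ['user', 'loved', 'read', 'url', '#nlp']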

[1] Go, A., Huang, L., & Bhayani, R. (2009). Twitter sentiment analysis. Entropy, 17. Available online at: http://www-nlp.stanford.edu/courses/cs224n/2009/fp/3.pdf
