- What is NLP?
- Concepts
- How to build a word cloud?
- How to perform sentiment analysis?
- How to perform topic modelling?
- Summary
What is NLP?
Natural Language Processing (NLP), in simple words, is using analytical tools to analyse natural language and speech.
Examples of natural language include tweets, Facebook posts, chat messages, speech videos, etc.
Before we discuss the concepts of NLP, let us try to learn the concepts as we perform the analysis using Orange.
Example 1: Word Cloud
Step 1: Download Orange (from here) and install it on your Windows machine. Orange is open-source software that is easy to learn yet powerful. If you have Anaconda installed, you can install Orange through it as well.
Step 2: Run Orange as administrator. Click on Add-ons, search for “text” and install the Text add-on.
Install Text add-on |
Step 3: You will now see a new section, “Text Mining”, on the left side of the Orange window.
Text Mining panel appears now |
Step 4: Create a folder on your machine and save a text document inside it. I created a folder named “Data”, in which I stored a text file, Sample.txt, containing the first two lines of this chapter.
Sample data (Sample.txt) inside folder (Data) |
Step 5: Now drag the “Import Documents” widget onto the canvas (the right panel).
Import Documents widget |
Step 6: Double-click the “Import Documents” widget and import the documents from the folder you created. You must link to the folder, not to the text file inside it. I imported the folder “Data”.
Import data folder |
Step 7: Now drag the “Preprocess Text” widget and connect it to “Import Documents”.
Preprocess text |
Step 8: Let us understand the concepts now.
Preprocess text window |
Four steps of preprocessing:
Preprocess steps |
Let us discuss these concepts one by one.
- Transformation
- Lowercase: All letters will be converted to lowercase. E.g.
This → this
HERE → here
- Remove accents: All diacritics (accents) will be removed. E.g.
résumé → resume
- Parse html: This will remove the HTML tags and keep only the text.
<a href>nlp</a> → nlp
- Remove urls: URLs will be removed from the text. E.g.
Visit https://orange.biolab.si/ → Visit
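Orange applies these transformations through its GUI; the same four steps can be sketched with Python's standard library (the `transform` helper below is hypothetical, not Orange's code):

```python
import re
import unicodedata

def transform(text):
    """Apply the four text transformations described above."""
    text = text.lower()                                  # This -> this
    # Remove accents: decompose characters, then drop combining marks
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))     # résumé -> resume
    text = re.sub(r"<[^>]+>", "", text)                  # parse (strip) HTML tags
    text = re.sub(r"https?://\S+", "", text)             # remove URLs
    return text

print(transform("Visit <a href>HERE</a>: résumé https://orange.biolab.si/"))
```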
- Tokenization
This is also called lexical analysis or lexing.
In this step, the sequence of characters (in our case, the first two paragraphs of this chapter) is split into tokens. Tokenization can be carried out in the following ways.
- Word & Punctuation
E.g. I am learning NLP. → (I), (am), (learning), (NLP), (.)
This will retain the punctuation symbols and will divide the text into words.
- Whitespace
I am learning NLP. → (I), (am), (learning), (NLP.)
This will split the text into words by using whitespaces in between them.
- Sentence
Example:
I am learning NLP. I am reading it. → (I am learning NLP.), (I am reading it.)
Text will be split into sentences by full stops.
- Regexp
This will split the text using a user-provided regex (regular expression, regexp or rational expression); the default pattern splits it into words, dropping punctuation.
- Tweet
Using the trained Twitter model, text is split into words retaining emoticons and hashtags.
I am learning :-D #nlp → (I), (am), (learning), (:-D), (#nlp)
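The first three tokenization modes can be approximated in plain Python; a rough sketch of the idea, not Orange's actual implementation:

```python
import re

sentence = "I am learning NLP."

# Word & Punctuation: words and punctuation marks become separate tokens
word_punct = re.findall(r"\w+|[^\w\s]", sentence)
print(word_punct)   # ['I', 'am', 'learning', 'NLP', '.']

# Whitespace: split on spaces only, so the full stop stays attached
whitespace = sentence.split()
print(whitespace)   # ['I', 'am', 'learning', 'NLP.']

# Sentence: split the text on full stops
text = "I am learning NLP. I am reading it."
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
print(sentences)    # ['I am learning NLP.', 'I am reading it.']
```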
- Normalization
If you enable normalization, you can see the following options under normalization.
Normalization window |
Lemmatization is closely related to stemming. Both convert running to run, but only lemmatization converts better to good.
Lemmatization uses word meaning and context, while stemming operates only on the surface form of the word. Hence stemming is faster.
- Stemming:

| Raw words | Stemmed words |
| --- | --- |
| ran | run |
| running | run |
| better | better |
- Lemmatization:

| Raw words | After lemmatization |
| --- | --- |
| has | have |
| have | have |
| had | have |
| better | good |
For more on this, see https://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming.
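To make the difference concrete, here is a deliberately tiny, hypothetical stemmer and lemma lookup. Real implementations (e.g. the Porter stemmer or a WordNet lemmatizer) are far more sophisticated, but the contrast is the same: stemming chops suffixes off the surface form, while lemmatization consults a dictionary of word meanings:

```python
def toy_stem(word):
    """Toy suffix-stripping stemmer: operates on the surface form only,
    so irregular forms like 'better' pass through unchanged."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Toy lemma dictionary: lemmatization can map irregular forms
# because it works with word meaning, not just spelling.
TOY_LEMMAS = {"has": "have", "had": "have", "better": "good", "running": "run"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("running"), toy_lemmatize("running"))  # run run
print(toy_stem("better"), toy_lemmatize("better"))    # better good
```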
- Filtering
Additional stopwords stored in a text file (one word per line) |
In the Remove.txt file I saved the words “nlp” and “etc”, so these will be removed from the text.
Lexicon: The opposite of stopwords; it keeps only the words listed in the text file.
Regexp: Filters tokens using a regular expression. By default, it removes punctuation marks from the text.
Document frequency: If the document frequency range is set to (0.10, 0.90), only the tokens that appear in 10% to 90% of the documents are retained.
If it is set to (2, 6), only the tokens that appear in at least 2 and at most 6 documents are kept.
Document frequency (10% to 90%) |
“Most frequent tokens” retains only the given number of most frequent tokens in the document.
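A sketch of what document-frequency filtering does, using absolute counts over a toy corpus (pure Python, not the widget's code):

```python
# Three toy "documents", each represented as the set of its tokens
docs = [
    {"nlp", "is", "fun"},
    {"nlp", "rocks"},
    {"learning", "nlp", "is", "easy"},
]

def df_filter(docs, min_df, max_df):
    """Keep tokens appearing in at least min_df and at most max_df documents."""
    vocab = set().union(*docs)
    return {t for t in vocab if min_df <= sum(t in d for d in docs) <= max_df}

print(sorted(df_filter(docs, 2, 3)))  # ['is', 'nlp']
```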
If you enable N-grams, you can see the following option.
An N-gram is a sequence of N words.
“Thank you” is a 2-gram.
“I am learning” is a 3-gram.
N-grams Range: The default is one-grams and two-grams (a range of 1–2).
N-grams |
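Generating n-grams from a list of tokens is straightforward; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am learning NLP".split()
print(ngrams(tokens, 2))  # ['I am', 'am learning', 'learning NLP']
print(ngrams(tokens, 3))  # ['I am learning', 'am learning NLP']
```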
Word cloud: displays the tokens, with each word's size corresponding to its frequency in the text.
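Under the hood, a word cloud is just a token-frequency table rendered graphically; counting the frequencies takes one line with the standard library:

```python
from collections import Counter

# Toy token list (already lowercased and tokenized)
tokens = "i am learning nlp nlp is fun i like nlp".split()

freq = Counter(tokens)
print(freq.most_common(3))  # [('nlp', 3), ('i', 2), ('am', 1)]
```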
Step 9: In the Preprocess Text widget, disable Normalization and N-grams Range. Connect the Word Cloud widget to Preprocess Text and double-click it to see the pictorial representation.
Word cloud widget |
Word cloud |
Step 10: To compare the effect of pre-processing, I created one more word cloud (Word Cloud 1) without the pre-processing step. You can now see some punctuation symbols in the word cloud.
Compare word clouds with and without preprocessing |
Word cloud without preprocessing |
Sentiment analysis
As the name suggests, sentiment analysis helps us understand the sentiments expressed in text documents, tweets or Facebook posts.
Step 1: Drag the Corpus widget. Double click and select the election tweets.
Load data |
Step 2: You can view the text data using the Corpus Viewer widget. You can see that it contains a total of 6444 tweets from Hillary Clinton and Donald Trump.
View the data |
Step 3: Connect the corpus to the Sentiment Analysis widget. Keep the default Vader method selected. View the results by double-clicking the Data Table widget.
Sentiment Analysis |
You can see four scores based on the Vader method: positive, negative, neutral and compound. Another supported method is the Liu Hu method.
Vader stands for Valence Aware Dictionary and sEntiment Reasoner (read more).
Liu Hu is named after Minqing Hu and Bing Liu (read more).
Vader method |
The compound score ranges from -1 (most negative) to +1 (most positive).
Sentiment analysis results |
In Excel, I have calculated these average scores across authors for comparison.
Average sentiment scores across authors |
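The same per-author averaging can be done in Python instead of Excel; a sketch with made-up compound scores (the real values come from the widget's output):

```python
# (author, compound score) pairs -- toy values, not the actual tweet scores
rows = [("Trump", 0.42), ("Clinton", 0.10), ("Trump", -0.20), ("Clinton", 0.30)]

scores = {}
for author, compound in rows:
    scores.setdefault(author, []).append(compound)

# Average compound score per author
averages = {author: sum(v) / len(v) for author, v in scores.items()}
print(averages)
```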
I calculated the sentiment scores based on Liu Hu method and the results are shown below:
Liu Hu method |
Unlike the Vader method, the Liu Hu method gives a single normalized score: a positive score for positive sentiment, 0 for neutral and a negative score for negative sentiment.
Results of sentiment analysis using Liu Hu method (notice only one score) |
Average sentiment across Hillary Clinton and Trump:
Average sentiment across Hillary Clinton and Trump |
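The Liu Hu idea — count positive and negative lexicon words, then normalize — can be illustrated with a toy scorer. The mini-lexicon below is made up; the real method uses Hu and Liu's opinion lexicon of roughly 6,800 words, and Orange's normalization may differ:

```python
POSITIVE = {"great", "win", "love", "good"}       # toy positive lexicon
NEGATIVE = {"bad", "lose", "hate", "terrible"}    # toy negative lexicon

def liu_hu_style_score(text):
    """Difference of positive and negative word counts, normalized by length."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens) * 100

print(liu_hu_style_score("great rally we will win"))   # positive score
print(liu_hu_style_score("terrible debate we lose"))   # negative score
```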
Topic modelling
A topic model is an unsupervised technique, similar to cluster analysis. Topic modelling groups text documents into different topics; the model only suggests keywords for each topic, so we have to name the topics ourselves.
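Topic models (e.g. the LDA option in the widget) start from a document-term matrix: how often each token occurs in each document. Building that input can be sketched with the standard library:

```python
from collections import Counter

docs = [
    "the king and the queen",
    "the dragon attacked the king",
    "the queen fled",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# One row per document, one column per vocabulary term
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
print(dtm)
```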
Step 1: Load the data grimm-tales-selected.tab
Load the data |
Step 2: View the data
As you can see, there are 44 documents.
Step 3: Run the Topic Modelling widget. Double-click it to view the results.
As you can see, keywords are given for each of the 10 topics. Green-coloured words denote a positive association with the topic, while red-coloured words denote a negative association. You can rename the topics as well.
Link the Topic Modelling widget to a Word Cloud. Link Preprocess Text to the word cloud as well, and view the results together. If you click on a topic in Topic Modelling, the corresponding word cloud appears; and if you click on any word in the word cloud, the rightmost window shows everywhere that word appears, as shown below.
Open three windows side by side to view results together: view topic keywords, corresponding word cloud and documents with those word simultaneously |
Run Topic modelling |
View word clouds and where it appears side by side |
Step 4: Connect data view widget to Topic Modelling.
Connect data view widget to Topic Modelling |
Step 5: View the results of Topic Modelling by double clicking on data table widget.
Results of Topic Modelling |
Topic modelling helps to organize text documents and to discover hidden patterns in them.
Summary
In this chapter, we have explored the following topics:
- What is NLP?
- Concepts in NLP
- How to build a word cloud, perform sentiment analysis and carry out topic modelling