- What is NLP?
- Concepts
- How to build a word cloud?
- How to perform sentiment analysis?
- How to perform topic modelling?
- Summary
What is NLP?
Natural Language Processing (NLP), in simple words, is using analytical tools to analyse natural language and speech.
Examples of natural language include tweets, Facebook posts, chat messages, speech videos, etc.
Before we discuss the concepts of NLP, let us try to learn the concepts as we perform the analysis using Orange.
Example 1: Word Cloud
Step 1: Download Orange (from here) and install it on your Windows machine. Orange is open-source software that is easy to learn yet powerful. If you have Anaconda installed, you can install Orange through it as well.
Step 2: Run Orange as administrator. Click on Add-ons, search for “text” and install the Text add-on.
Install Text add-on |
Step 3: You will now see a new section, “Text Mining”, on the left side of the Orange window.
Text Mining panel appears now |
Step 4: Create a folder on your machine and save a text document inside it. I created a folder named “Data”, in which I stored a text file, Sample.txt, containing the first two lines of this chapter.
Sample data (Sample.txt) inside folder (Data) |
Step 5: Now drag the “Import Documents” widget onto the canvas (the right panel).
Import Documents widget |
Step 6: Double-click the “Import Documents” widget and import the documents from the folder you created. You must link to the folder, not to the text file inside it. I imported the folder “Data”.
Import data folder |
Step 7: Now drag the “Preprocess Text” widget and connect it to “Import Documents”.
Preprocess text |
Step 8: Let us understand the concepts now.
Preprocess text window |
Four steps of preprocessing:
Preprocess steps |
Let us discuss these concepts one by one.
- Transformation
- Lowercase: All letters will be converted to lowercase. E.g.
This → this
HERE → here
- Remove accents: All diacritics (accents) will be removed. E.g.
résumé → resume
- Parse html: This will remove the HTML tags and keep only the text.
<a href>nlp</a> → nlp
- Remove urls: URLs will be removed from the text. E.g.
Visit https://orange.biolab.si/ → Visit
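Orange applies these transformations through its GUI; the same four steps can be sketched with Python's standard library (the `transform` helper below is hypothetical, not Orange's code):

```python
import re
import unicodedata

def transform(text):
    """Apply the four text transformations described above."""
    text = text.lower()                                  # This -> this
    # Remove accents: decompose characters, then drop combining marks
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))     # résumé -> resume
    text = re.sub(r"<[^>]+>", "", text)                  # parse (strip) HTML tags
    text = re.sub(r"https?://\S+", "", text)             # remove URLs
    return text

print(transform("Visit <a href>HERE</a>: résumé https://orange.biolab.si/"))
```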
- Tokenization
This is also called lexical analysis or lexing.
In this step, the sequence of characters (in our case, the first two paragraphs of this chapter) is split into tokens. Tokenization can be carried out in the following ways.
- Word & Punctuation
E.g. I am learning NLP. → (I), (am), (learning), (NLP), (.)
This will retain the punctuation symbols and will divide the text into words.
- Whitespace
I am learning NLP. → (I), (am), (learning), (NLP.)
This will split the text into words by using whitespaces in between them.
- Sentence
Example:
I am learning NLP. I am reading it. → (I am learning NLP.), (I am reading it.)
Text will be split into sentences by full stops.
- Regexp
This will split the text using a user-provided regex (regular expression, regexp or rational expression); the default pattern splits it into words, dropping punctuation.
- Tweet
Using the trained Twitter model, text is split into words retaining emoticons and hashtags.
I am learning :-D #nlp → (I), (am), (learning), (:-D), (#nlp)
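The first three tokenization modes can be approximated in plain Python; a rough sketch of the idea, not Orange's actual implementation:

```python
import re

sentence = "I am learning NLP."

# Word & Punctuation: words and punctuation marks become separate tokens
word_punct = re.findall(r"\w+|[^\w\s]", sentence)
print(word_punct)   # ['I', 'am', 'learning', 'NLP', '.']

# Whitespace: split on spaces only, so the full stop stays attached
whitespace = sentence.split()
print(whitespace)   # ['I', 'am', 'learning', 'NLP.']

# Sentence: split the text on full stops
text = "I am learning NLP. I am reading it."
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
print(sentences)    # ['I am learning NLP.', 'I am reading it.']
```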
- Normalization
If you enable normalization, you can see the following options under normalization.
Normalization window |
Lemmatization is closely related to stemming. Both convert running to run, but only lemmatization converts better to good.
Lemmatization uses word meaning and context, while stemming operates only on the surface form of the word. Hence stemming is faster.
- Stemming:

| Raw words | Stemmed words |
| --- | --- |
| ran | run |
| running | run |
| better | better |
- Lemmatization:

| Raw words | After lemmatization |
| --- | --- |
| has | have |
| have | have |
| had | have |
| better | good |
For more on this, see https://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming.
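To make the difference concrete, here is a deliberately tiny, hypothetical stemmer and lemma lookup. Real implementations (e.g. the Porter stemmer or a WordNet lemmatizer) are far more sophisticated, but the contrast is the same: stemming chops suffixes off the surface form, while lemmatization consults a dictionary of word meanings:

```python
def toy_stem(word):
    """Toy suffix-stripping stemmer: operates on the surface form only,
    so irregular forms like 'better' pass through unchanged."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Toy lemma dictionary: lemmatization can map irregular forms
# because it works with word meaning, not just spelling.
TOY_LEMMAS = {"has": "have", "had": "have", "better": "good", "running": "run"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("running"), toy_lemmatize("running"))  # run run
print(toy_stem("better"), toy_lemmatize("better"))    # better good
```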
- Filtering
Additional stopwords stored in a text file (one word per line) |
In the Remove.txt file I saved the words “nlp” and “etc”, so these will be removed from the text.
Lexicon: The opposite of stopwords; it keeps only the words listed in the text file.
Regexp: Filters tokens using a regular expression. By default, it removes punctuation marks from the text.
Document frequency: If the document frequency range is set to (0.10, 0.90), only the tokens that appear in 10% to 90% of the documents are retained.
If it is set to (2, 6), only the tokens that appear in at least 2 and at most 6 documents are kept.
Document frequency (10% to 90%) |
“Most frequent tokens” retains only the given number of most frequent tokens in the document.
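A sketch of what document-frequency filtering does, using absolute counts over a toy corpus (pure Python, not the widget's code):

```python
# Three toy "documents", each represented as the set of its tokens
docs = [
    {"nlp", "is", "fun"},
    {"nlp", "rocks"},
    {"learning", "nlp", "is", "easy"},
]

def df_filter(docs, min_df, max_df):
    """Keep tokens appearing in at least min_df and at most max_df documents."""
    vocab = set().union(*docs)
    return {t for t in vocab if min_df <= sum(t in d for d in docs) <= max_df}

print(sorted(df_filter(docs, 2, 3)))  # ['is', 'nlp']
```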
If you enable N-grams, you can see the following option.
An N-gram is a sequence of N words.
“Thank you” is a 2-gram.
“I am learning” is a 3-gram.
N-grams Range: The default is one-grams and two-grams (a range of 1–2).
N-grams |
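Generating n-grams from a list of tokens is straightforward; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am learning NLP".split()
print(ngrams(tokens, 2))  # ['I am', 'am learning', 'learning NLP']
print(ngrams(tokens, 3))  # ['I am learning', 'am learning NLP']
```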
Word cloud: displays the tokens, with each word's size corresponding to its frequency in the text.
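Under the hood, a word cloud is just a token-frequency table rendered graphically; counting the frequencies takes one line with the standard library:

```python
from collections import Counter

# Toy token list (already lowercased and tokenized)
tokens = "i am learning nlp nlp is fun i like nlp".split()

freq = Counter(tokens)
print(freq.most_common(3))  # [('nlp', 3), ('i', 2), ('am', 1)]
```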
Step 9: In the Preprocess Text widget, disable Normalization and N-grams Range. Connect the Word Cloud widget to Preprocess Text and double-click it to see the pictorial representation.
Word cloud widget |
Word cloud |
Step 10: To compare the effect of pre-processing, I created one more word cloud (Word Cloud 1) without the pre-processing step. You can now see some punctuation symbols in the word cloud.
Compare word clouds with and without preprocessing |
Word cloud without preprocessing |
Sentiment analysis
As the name suggests, sentiment analysis helps us understand the sentiments expressed in text documents, tweets or Facebook posts.
Step 1: Drag the Corpus widget. Double click and select the election tweets.
Load data |
Step 2: You can view the text data using the Corpus Viewer widget. You can see that it contains a total of 6444 tweets from Hillary Clinton and Donald Trump.
View the data |
Step 3: Connect the corpus to the Sentiment Analysis widget. Keep the default Vader method selected. View the results by double-clicking the Data Table widget.
Sentiment Analysis |
You can see four scores based on the Vader method: positive, negative, neutral and compound. Another supported method is the Liu Hu method.
Vader stands for Valence Aware Dictionary and sEntiment Reasoner (read more).
Liu Hu is named after Minqing Hu and Bing Liu (read more).
Vader method |
The compound score ranges from -1 (most negative) to +1 (most positive).
Sentiment analysis results |
In Excel, I have calculated these average scores across authors for comparison.
Average sentiment scores across authors |
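The same per-author averaging can be done in Python instead of Excel; a sketch with made-up compound scores (the real values come from the widget's output):

```python
# (author, compound score) pairs -- toy values, not the actual tweet scores
rows = [("Trump", 0.42), ("Clinton", 0.10), ("Trump", -0.20), ("Clinton", 0.30)]

scores = {}
for author, compound in rows:
    scores.setdefault(author, []).append(compound)

# Average compound score per author
averages = {author: sum(v) / len(v) for author, v in scores.items()}
print(averages)
```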
I calculated the sentiment scores based on Liu Hu method and the results are shown below:
Liu Hu method |
Unlike the Vader method, the Liu Hu method gives a single normalized score: a positive score for positive sentiment, 0 for neutral and a negative score for negative sentiment.
Results of sentiment analysis using Liu Hu method (notice only one score) |
Average sentiment across Hillary Clinton and Trump:
Average sentiment across Hillary Clinton and Trump |
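The Liu Hu idea — count positive and negative lexicon words, then normalize — can be illustrated with a toy scorer. The mini-lexicon below is made up; the real method uses Hu and Liu's opinion lexicon of roughly 6,800 words, and Orange's normalization may differ:

```python
POSITIVE = {"great", "win", "love", "good"}       # toy positive lexicon
NEGATIVE = {"bad", "lose", "hate", "terrible"}    # toy negative lexicon

def liu_hu_style_score(text):
    """Difference of positive and negative word counts, normalized by length."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens) * 100

print(liu_hu_style_score("great rally we will win"))   # positive score
print(liu_hu_style_score("terrible debate we lose"))   # negative score
```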
Topic modelling
A topic model is an unsupervised technique, similar to cluster analysis. Topic modelling groups text documents into different topics; the model only suggests keywords for each topic, so we have to name the topics ourselves.
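Topic models (e.g. the LDA option in the widget) start from a document-term matrix: how often each token occurs in each document. Building that input can be sketched with the standard library:

```python
from collections import Counter

docs = [
    "the king and the queen",
    "the dragon attacked the king",
    "the queen fled",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# One row per document, one column per vocabulary term
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
print(dtm)
```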
Step 1: Load the data grimm-tales-selected.tab
Load the data |
Step 2: View the data
As you can see, there are 44 documents.
Step 3: Run the Topic Modelling widget. Double-click it to view the results.
As you can see, keywords are given for each of the 10 topics. Green-coloured words denote a positive association with the topic, while red-coloured words denote a negative association. You can rename the topics as well.
Link the Topic Modelling widget to a Word Cloud. Link Preprocess Text to the word cloud as well, and view the results together. If you click on a topic in Topic Modelling, the corresponding word cloud appears; and if you click on any word in the word cloud, the rightmost window shows everywhere that word appears, as shown below.
Open three windows side by side to view results together: view topic keywords, corresponding word cloud and documents with those word simultaneously |
Run Topic modelling |
View word clouds and where it appears side by side |
Step 4: Connect data view widget to Topic Modelling.
Connect data view widget to Topic Modelling |
Step 5: View the results of Topic Modelling by double clicking on data table widget.
Results of Topic Modelling |
Topic modelling helps to organize text documents and to discover hidden patterns in them.
Summary
In this chapter, we have explored the following topics:
- What is NLP?
- Concepts in NLP
- How to build a word cloud, perform sentiment analysis and carry out topic modelling