Text Mining and Sentiment Analysis with NLTK and Pandas in Python

Data Science
Text Mining
Python

Author

Jan Kirenz

Published

May 19, 2022

Modified

June 16, 2022


Data import

import pandas as pd

# Import some Tweets from Barack Obama 
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/twitter-tweepy/main/tweets-obama.csv")
df.head(3)
Unnamed: 0 created_at id author_id text
0 0 2022-05-16T21:24:35.000Z 1526312680226799618 813286 It’s despicable, it’s dangerous — and it needs…
1 1 2022-05-16T21:24:34.000Z 1526312678951641088 813286 We need to repudiate in the strongest terms th…
2 2 2022-05-16T21:24:34.000Z 1526312677521428480 813286 This weekend’s shootings in Buffalo offer a tr…
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5 non-null      int64 
 1   created_at  5 non-null      object
 2   id          5 non-null      int64 
 3   author_id   5 non-null      int64 
 4   text        5 non-null      object
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes
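  • Note: the “Unnamed: 0” column is simply the row index that was stored when the CSV was saved. As an optional variation (not part of the workflow below, which keeps the column), it can be read in as the index instead:
# Optional: use the stored index instead of importing it as a column
df = pd.read_csv(
    "https://raw.githubusercontent.com/kirenz/twitter-tweepy/main/tweets-obama.csv",
    index_col=0)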

Data transformation

df['text'] = df['text'].astype(str).str.lower()
df.head(3)
Unnamed: 0 created_at id author_id text
0 0 2022-05-16T21:24:35.000Z 1526312680226799618 813286 it’s despicable, it’s dangerous — and it needs…
1 1 2022-05-16T21:24:34.000Z 1526312678951641088 813286 we need to repudiate in the strongest terms th…
2 2 2022-05-16T21:24:34.000Z 1526312677521428480 813286 this weekend’s shootings in buffalo offer a tr…

Tokenization

conda install -c anaconda nltk
from nltk.tokenize import RegexpTokenizer

regexp = RegexpTokenizer(r'\w+')

df['text_token'] = df['text'].apply(regexp.tokenize)
df.head(3)
Unnamed: 0 created_at id author_id text text_token
0 0 2022-05-16T21:24:35.000Z 1526312680226799618 813286 it’s despicable, it’s dangerous — and it needs… [it, s, despicable, it, s, dangerous, and, it,…
1 1 2022-05-16T21:24:34.000Z 1526312678951641088 813286 we need to repudiate in the strongest terms th… [we, need, to, repudiate, in, the, strongest, …
2 2 2022-05-16T21:24:34.000Z 1526312677521428480 813286 this weekend’s shootings in buffalo offer a tr… [this, weekend, s, shootings, in, buffalo, off…
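  • The r'\w+' pattern splits shortened links like https://t.co/… into word fragments (e.g. “co”, “0ch2zosmhb”), which is why they show up in the token lists above. A possible variation, not used in the rest of this tutorial, is to strip URLs before tokenizing; the text_no_urls column is only an illustrative helper:
# Optional variation (illustrative only): remove URLs first so that
# link fragments such as 'co' or '0ch2zosmhb' do not become tokens
df['text_no_urls'] = df['text'].str.replace(r'https?://\S+', '', regex=True)
df['text_no_urls'].apply(regexp.tokenize).head(3)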

Stopwords

  • Stop words are words on a predefined stop list which are dropped before analysing natural language data because they carry little valuable information (like “will”, “and”, “or”, “has”, …).
import nltk

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords

# Make a list of English stopwords
stopwords = stopwords.words("english")

# Extend the list with your own custom stopwords
my_stopwords = ['https']
stopwords.extend(my_stopwords)
# Remove stopwords
df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in stopwords])
df.head(3)
Unnamed: 0 created_at id author_id text text_token
0 0 2022-05-16T21:24:35.000Z 1526312680226799618 813286 it’s despicable, it’s dangerous — and it needs… [despicable, dangerous, needs, stop, co, 0ch2z…
1 1 2022-05-16T21:24:34.000Z 1526312678951641088 813286 we need to repudiate in the strongest terms th… [need, repudiate, strongest, terms, politician…
2 2 2022-05-16T21:24:34.000Z 1526312677521428480 813286 this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r…
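  • Besides “https”, other Twitter artifacts often end up in token lists, for example “co” (from t.co short links) or “amp” (from HTML-escaped ampersands). If they are not meaningful for your analysis, they can be dropped the same way; the snippet below is an illustrative alternative to the cell above (the outputs in this tutorial were produced without it):
# Illustrative alternative: also treat common Twitter artifacts as stopwords
my_stopwords = ['https', 'co', 'amp']
stopwords.extend(my_stopwords)
df['text_token'].apply(lambda x: [item for item in x if item not in stopwords]).head(3)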

Remove infrequent words

  • We first join the tokens in text_token back into strings (text_string) and keep only words that are longer than two characters
df['text_string'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if len(item)>2]))
df[['text', 'text_token', 'text_string']].head()
text text_token text_string
0 it’s despicable, it’s dangerous — and it needs… [despicable, dangerous, needs, stop, co, 0ch2z… despicable dangerous needs stop 0ch2zosmhb
1 we need to repudiate in the strongest terms th… [need, repudiate, strongest, terms, politician… need repudiate strongest terms politicians med…
2 this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r… weekend shootings buffalo offer tragic reminde…
3 i’m proud to announce the voyager scholarship … [proud, announce, voyager, scholarship, friend… proud announce voyager scholarship friend bche…
4 across the country, americans are standing up … [across, country, americans, standing, abortio… across country americans standing abortion rig…
  • Create a list of all words
all_words = ' '.join([word for word in df['text_string']])
  • Tokenize all_words (this uses NLTK’s punkt models, so run nltk.download('punkt') first if they are not installed yet)
tokenized_words = nltk.tokenize.word_tokenize(all_words)
  • Create a frequency distribution which records the number of times each word has occurred:

from nltk.probability import FreqDist

fdist = FreqDist(tokenized_words)
fdist
FreqDist({'need': 2, 'americans': 2, 'proud': 2, 'despicable': 1, 'dangerous': 1, 'needs': 1, 'stop': 1, '0ch2zosmhb': 1, 'repudiate': 1, 'strongest': 1, ...})
  • Now we can use the fdist dictionary to drop words that occur fewer than a certain number of times (a threshold of 3 or 4 is common).
  • Since our dataset is really small, we don’t filter out any words and set the threshold to greater than or equal to 1 (otherwise very few words would be left in this particular dataset)
df['text_string_fdist'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] >= 1 ]))
df[['text', 'text_token', 'text_string', 'text_string_fdist']].head()
text text_token text_string text_string_fdist
0 it’s despicable, it’s dangerous — and it needs… [despicable, dangerous, needs, stop, co, 0ch2z… despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb
1 we need to repudiate in the strongest terms th… [need, repudiate, strongest, terms, politician… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med…
2 this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde…
3 i’m proud to announce the voyager scholarship … [proud, announce, voyager, scholarship, friend… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche…
4 across the country, americans are standing up … [across, country, americans, standing, abortio… across country americans standing abortion rig… across country americans standing abortion rig…
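  • With a larger corpus you would raise the threshold (e.g. fdist[item] >= 3). FreqDist also provides hapaxes(), which lists all words that occur exactly once, so you can inspect what a stricter filter would remove:
# Words that occur only once in the corpus (dropped by any threshold > 1)
fdist.hapaxes()[:10]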

Lemmatization

nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
True
from nltk.stem import WordNetLemmatizer

wordnet_lem = WordNetLemmatizer()

df['text_string_lem'] = df['text_string_fdist'].apply(wordnet_lem.lemmatize)
  • Note that in some datasets there may be no words that change during lemmatization. We can check whether the lemmatized strings differ from the input as follows:
# check if the columns are equal
df['is_equal'] = (df['text_string_fdist'] == df['text_string_lem'])
# show value counts
df.is_equal.value_counts()
True    5
Name: is_equal, dtype: int64
df
Unnamed: 0 created_at id author_id text text_token text_string text_string_fdist text_string_lem is_equal
0 0 2022-05-16T21:24:35.000Z 1526312680226799618 813286 it’s despicable, it’s dangerous — and it needs… [despicable, dangerous, needs, stop, co, 0ch2z… despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb True
1 1 2022-05-16T21:24:34.000Z 1526312678951641088 813286 we need to repudiate in the strongest terms th… [need, repudiate, strongest, terms, politician… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med… True
2 2 2022-05-16T21:24:34.000Z 1526312677521428480 813286 this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… True
3 3 2022-05-16T13:16:16.000Z 1526189794665107457 813286 i’m proud to announce the voyager scholarship … [proud, announce, voyager, scholarship, friend… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche… True
4 4 2022-05-14T15:03:07.000Z 1525491905139773442 813286 across the country, americans are standing up … [across, country, americans, standing, abortio… across country americans standing abortion rig… across country americans standing abortion rig… across country americans standing abortion rig… True
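  • A remark on the step above: wordnet_lem.lemmatize() is applied to each text_string_fdist value as a whole, i.e. to the complete string rather than to the individual words, which is one reason the columns stay identical here. A possible variation (not the method used above, and shown here without storing the result) is to lemmatize token by token:
# Variation: lemmatize each word separately instead of the whole string
df['text_string_fdist'].apply(
    lambda x: ' '.join([wordnet_lem.lemmatize(word) for word in x.split()])).head(3)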

Word cloud

conda install -c conda-forge wordcloud
all_words_lem = ' '.join([word for word in df['text_string_lem']])
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=600, 
                     height=400, 
                     random_state=2, 
                     max_font_size=100).generate(all_words_lem)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');

  • Different style:
import numpy as np

x, y = np.ogrid[:300, :300]
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", repeat=True, mask=mask)
wc.generate(all_words_lem)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear");
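  • To reuse a word cloud outside the notebook, it can be written to an image file (the filename below is just an example):
# Save the styled word cloud as a PNG file (example filename)
wc.to_file("wordcloud_obama.png")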

Frequency distributions

nltk.download('punkt')
[nltk_data] Downloading package punkt to /Users/jankirenz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

words = word_tokenize(all_words_lem)
fd = FreqDist(words)

Most common words

fd.most_common(3)
[('need', 2), ('americans', 2), ('proud', 2)]
fd.tabulate(3)
     need americans     proud 
        2         2         2 

Plot common words

# Obtain top 10 words
top_10 = fd.most_common(10)

# Create pandas series to make plotting easier
fdist = pd.Series(dict(top_10))
import seaborn as sns
sns.set_theme(style="ticks")

sns.barplot(y=fdist.index, x=fdist.values, color='blue');

import plotly.express as px

fig = px.bar(y=fdist.index, x=fdist.values)

# sort values
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'})

# show plot
fig.show()

Search specific words

# Show frequency of a specific word
fd["americans"]
2
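  • fd.freq() returns the relative frequency (the share of all tokens) instead of the raw count, which can be easier to compare across texts of different length:
# Relative frequency of a specific word
fd.freq("americans")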

Sentiment analysis

VADER lexicon

  • NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for “Valence Aware Dictionary and Sentiment Reasoner” (Hutto & Gilbert, 2014).
nltk.download('vader_lexicon')
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
True

Sentiment

Sentiment Intensity Analyzer

  • Initialize a SentimentIntensityAnalyzer object named “analyzer”:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

Polarity scores

  • Use the polarity_scores method:
df['polarity'] = df['text_string_lem'].apply(lambda x: analyzer.polarity_scores(x))
df.tail(3)
Unnamed: 0 created_at id author_id text text_token text_string text_string_fdist text_string_lem is_equal polarity
2 2 2022-05-16T21:24:34.000Z 1526312677521428480 813286 this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… True {'neg': 0.247, 'neu': 0.557, 'pos': 0.195, 'co…
3 3 2022-05-16T13:16:16.000Z 1526189794665107457 813286 i’m proud to announce the voyager scholarship … [proud, announce, voyager, scholarship, friend… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche… True {'neg': 0.0, 'neu': 0.573, 'pos': 0.427, 'comp…
4 4 2022-05-14T15:03:07.000Z 1525491905139773442 813286 across the country, americans are standing up … [across, country, americans, standing, abortio… across country americans standing abortion rig… across country americans standing abortion rig… across country americans standing abortion rig… True {'neg': 0.0, 'neu': 0.71, 'pos': 0.29, 'compou…
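  • A note on the input: VADER was designed for raw social-media text and uses cues such as punctuation, emphasis and emojis, most of which are removed by the preprocessing above. An alternative worth trying (shown here only as a quick check, without storing the result) is to score the lower-cased original tweets instead:
# Illustrative check: score the original tweet text, which still contains
# punctuation and emoji cues that VADER can use
df['text'].apply(lambda x: analyzer.polarity_scores(x)).head(3)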

Transform data

# Change data structure
df = pd.concat(
    [df.drop(['Unnamed: 0', 'id', 'author_id', 'polarity'], axis=1), 
     df['polarity'].apply(pd.Series)], axis=1)
df.head(3)
created_at text text_token text_string text_string_fdist text_string_lem is_equal neg neu pos compound
0 2022-05-16T21:24:35.000Z it’s despicable, it’s dangerous — and it needs… [despicable, dangerous, needs, stop, co, 0ch2z… despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb True 0.639 0.361 0.000 -0.6486
1 2022-05-16T21:24:34.000Z we need to repudiate in the strongest terms th… [need, repudiate, strongest, terms, politician… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med… True 0.247 0.458 0.295 0.2263
2 2022-05-16T21:24:34.000Z this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… True 0.247 0.557 0.195 -0.1280
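  • An equivalent way to unpack the score dictionaries is pd.json_normalize(); the sketch below would be run instead of the pd.concat call above, while the polarity column still exists:
# Equivalent alternative to df['polarity'].apply(pd.Series)
scores = pd.json_normalize(df['polarity'])
df = pd.concat([df.drop(['Unnamed: 0', 'id', 'author_id', 'polarity'], axis=1), scores], axis=1)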
# Create a new variable with the sentiment "neutral", "positive" or "negative"
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
df.head(4)
created_at text text_token text_string text_string_fdist text_string_lem is_equal neg neu pos compound sentiment
0 2022-05-16T21:24:35.000Z it’s despicable, it’s dangerous — and it needs… [despicable, dangerous, needs, stop, co, 0ch2z… despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb despicable dangerous needs stop 0ch2zosmhb True 0.639 0.361 0.000 -0.6486 negative
1 2022-05-16T21:24:34.000Z we need to repudiate in the strongest terms th… [need, repudiate, strongest, terms, politician… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med… need repudiate strongest terms politicians med… True 0.247 0.458 0.295 0.2263 positive
2 2022-05-16T21:24:34.000Z this weekend’s shootings in buffalo offer a tr… [weekend, shootings, buffalo, offer, tragic, r… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… weekend shootings buffalo offer tragic reminde… True 0.247 0.557 0.195 -0.1280 negative
3 2022-05-16T13:16:16.000Z i’m proud to announce the voyager scholarship … [proud, announce, voyager, scholarship, friend… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche… proud announce voyager scholarship friend bche… True 0.000 0.573 0.427 0.9313 positive

Analyze data

# Tweet with highest positive sentiment
df.loc[df['compound'].idxmax()].values
array(['2022-05-16T13:16:16.000Z',
       'i’m proud to announce the voyager scholarship with my friend @bchesky. we hope to provide young people with an interest in public service with some financial support to graduate college, exposure to travel, and the networks they need to make a difference. https://t.co/rbtrjalgpe https://t.co/rz7qknmmww',
       list(['proud', 'announce', 'voyager', 'scholarship', 'friend', 'bchesky', 'hope', 'provide', 'young', 'people', 'interest', 'public', 'service', 'financial', 'support', 'graduate', 'college', 'exposure', 'travel', 'networks', 'need', 'make', 'difference', 'co', 'rbtrjalgpe', 'co', 'rz7qknmmww']),
       'proud announce voyager scholarship friend bchesky hope provide young people interest public service financial support graduate college exposure travel networks need make difference rbtrjalgpe rz7qknmmww',
       'proud announce voyager scholarship friend bchesky hope provide young people interest public service financial support graduate college exposure travel networks need make difference rbtrjalgpe rz7qknmmww',
       'proud announce voyager scholarship friend bchesky hope provide young people interest public service financial support graduate college exposure travel networks need make difference rbtrjalgpe rz7qknmmww',
       True, 0.0, 0.573, 0.427, 0.9313, 'positive'], dtype=object)
# Tweet with highest negative sentiment
# (the negative score is mainly driven by words like "despicable" and "dangerous")
df.loc[df['compound'].idxmin()].values
array(['2022-05-16T21:24:35.000Z',
       'it’s despicable, it’s dangerous — and it needs to stop.\nhttps://t.co/0ch2zosmhb',
       list(['despicable', 'dangerous', 'needs', 'stop', 'co', '0ch2zosmhb']),
       'despicable dangerous needs stop 0ch2zosmhb',
       'despicable dangerous needs stop 0ch2zosmhb',
       'despicable dangerous needs stop 0ch2zosmhb', True, 0.639, 0.361,
       0.0, -0.6486, 'negative'], dtype=object)

Visualize data

# Number of tweets per sentiment
sns.countplot(y='sentiment',
              data=df,
              palette=['#b2d8d8', '#008080', '#db3d13']
              );

# Lineplot
g = sns.lineplot(x='created_at', y='compound', data=df)

g.set(xticklabels=[]) 
g.set(title='Sentiment of Tweets')
g.set(xlabel="Time")
g.set(ylabel="Sentiment")
g.tick_params(bottom=False)

g.axhline(0, ls='--', c = 'grey');
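  • The created_at column is still a plain string, which is why the x-axis labels were hidden above. Converting it to a datetime dtype (an optional step, not part of the original workflow) gives a real time axis:
# Optional: use a real datetime axis for the line plot
df['created_at'] = pd.to_datetime(df['created_at'])

g = sns.lineplot(x='created_at', y='compound', data=df)
g.set(title='Sentiment of Tweets', xlabel='Time', ylabel='Sentiment')
g.axhline(0, ls='--', c='grey');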

# Boxplot
sns.boxplot(y='compound',
            x='sentiment',
            palette=['#b2d8d8', '#008080', '#db3d13'],
            data=df);

Literature:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.