Tutorial 4: Public Opinion on the Climate Emergency and Why it Matters
Week 2, Day 3: IPCC Socio-economic Basis
Content creators: Maximilian Puelma Touzel
Content reviewers: Peter Ohue, Derick Temfack, Zahra Khodakaramimaghsoud, Peizhen Yang, Younkap Nina Duplex, Laura Paccini, Sloane Garelick, Abigail Bodner, Manisha Sinha, Agustina Pesce, Dionessa Biton, Cheng Zhang, Jenna Pearson, Chi Zhang, Ohad Zivan
Content editors: Jenna Pearson, Chi Zhang, Ohad Zivan
Production editors: Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan
Our 2023 Sponsors: NASA TOPS and Google DeepMind
Tutorial Objectives#
In this tutorial, we will explore a dataset derived from Twitter, focusing on public sentiment surrounding the Conference of Parties (COP) climate change conferences. We will use data from a published study by Falkenberg et al. Nature Clim. Chg. 2022. This dataset encompasses tweets mentioning the COP conferences, which bring together world governments, NGOs, and businesses to discuss and negotiate on climate change progress. Our main objective is to understand public sentiment about climate change and how it has evolved over time through an analysis of changing word usage on social media. In the process, we will also learn how to manage and analyze large quantities of text data.
The tutorial is divided into sections, where we first delve into loading and inspecting the data, examining the timing and languages of the tweets, and analyzing sentiments associated with specific words, including those indicating ‘hypocrisy’. We’ll also look at sentiments regarding institutions within these tweets and compare the sentiment of tweets containing ‘hypocrisy’-related words versus those without. This analysis is supplemented with visualization techniques like word clouds and distribution plots.
By the end of this tutorial, you will have developed a nuanced understanding of how text analysis can be used to study public sentiment on climate change and other environmental issues, helping us to navigate the intricate and evolving landscape of climate communication and advocacy.
Setup#
# imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# notebook config
from IPython.display import display, HTML
import datetime
import re
import nltk
from nltk.corpus import stopwords
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import urllib.request # the lib that handles the url stuff
from afinn import Afinn
import pooch
import os
import tempfile
Figure settings#
# @title Figure settings
import ipywidgets as widgets # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use(
"https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle"
)
sns.set_style("ticks", {"axes.grid": False})
display(HTML("<style>.container { width:100% !important; }</style>"))
# helper functions
def pooch_load(filelocation=None, filename=None, processor=None):
    shared_location = "/home/jovyan/shared/Data/tutorials/W2D3_FutureClimate-IPCCII&IIISocio-EconomicBasis"  # this is different for each day
    user_temp_cache = tempfile.gettempdir()
    # use the shared copy when it exists, otherwise download into the temp cache
    if os.path.exists(os.path.join(shared_location, filename)):
        file = os.path.join(shared_location, filename)
    else:
        file = pooch.retrieve(
            filelocation,
            known_hash=None,
            fname=os.path.join(user_temp_cache, filename),
            processor=processor,
        )
    return file
Section 1: Data Preprocessing#
We have performed the following preprocessing steps for you (simply follow along; there is no need to execute any commands in this section):
Every Twitter message (hereafter called a tweet) has an ID. The IDs of all tweets mentioning COPx (x=20-26, which refers to the session number of each COP meeting) used in Falkenberg et al. (2022) were placed by the authors in an OSF archive. You can download the 7 .csv files (one for each COP) here.
The twarc2 program serves as an interface with the Twitter API, allowing users to retrieve full tweet content and metadata by providing the tweet ID. Similar to GitHub, you need to create a Twitter API account and configure twarc on your local machine by providing your account authentication keys. To rehydrate a set of tweets using their IDs, you can use the following command: twarc2 hydrate source_file.txt store_file.jsonl. In this command, each line of source_file.txt represents a Twitter ID, and the hydrated tweets will be stored in store_file.jsonl.
First, format the downloaded IDs and split them into separate files (batches) so that each hydration call to the API takes a manageable amount of time (hours rather than days; hydration is slow because of an API-imposed limit of 100 tweets/min.).
# import os
# dir_name='Falkenberg2022_data/'
# if not os.path.exists(dir_name):
#     os.mkdir(dir_name)
# batch_size = int(1e5)
# download_pathname=''#~/projects/ClimateMatch/SocioEconDay/Polarization/COP_Twitter_IDs/
# for copid in range(20,27):
#     df_tweetids=pd.read_csv(download_pathname+'tweet_ids_cop'+str(copid)+'.csv')
#     for batch_id,break_id in enumerate(range(0,len(df_tweetids),batch_size)):
#         file_name="tweetids_COP"+str(copid)+"_b"+str(batch_id)+".txt"
#         # .iloc excludes the end point, so consecutive batches do not share a row (.loc slices include it)
#         df_tweetids['id'].iloc[break_id:break_id+batch_size].to_csv(dir_name+file_name,index=False,header=False)
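For a rough sense of why batching helps, the back-of-the-envelope arithmetic below uses the rate limit and batch size quoted above. The function name `hydration_hours` is ours, for illustration only, and the estimate ignores network overhead and retries.

```python
def hydration_hours(n_tweets, tweets_per_minute=100):
    """Rough lower bound on hydration time, in hours, under a fixed rate limit."""
    return n_tweets / tweets_per_minute / 60


batch_size = int(1e5)  # tweets per batch, as in the cell above
print(f"one batch: {hydration_hours(batch_size):.1f} hours")
```

Under these assumptions a single batch takes well under a day, which is what makes it practical to resume after an interruption.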
Make the hydration calls for COP26 (this took 4 days to download 50GB of data for COP26).
# import glob
# import time
# copid=26
# filename_list = glob.glob('Falkenberg2022_data/'+"tweetids_COP"+str(copid)+"*")
# dir_name='tweet_data/'
# if not os.path.exists(dir_name):
#     os.mkdir(dir_name)
# for itt,tweet_id_batch_filename in enumerate(filename_list):
#     strvars=tweet_id_batch_filename.split('/')[1].split('.')[0].split('_')
#     tweet_store_filename = dir_name+'tweets_'+strvars[1]+'_'+strvars[2]+'.json'
#     if not os.path.exists(tweet_store_filename):  # skip batches that were already hydrated
#         st=time.time()
#         os.system('twarc2 hydrate '+tweet_id_batch_filename+' '+tweet_store_filename)
#         print(str(itt)+' '+str(strvars[2])+" "+str(time.time()-st))
Load the data, then inspect and pick a chunk size. Note, by default, there are 100 tweets per line in the .json files returned by the API. Given we asked for 1e5 tweets/batch, there should be 1e3 lines in these files.
# copid=26
# batch_id = 0
# tweet_store_filename = 'tweet_data/tweets_COP'+str(copid)+'_b'+str(batch_id)+'.json'
# num_lines = sum(1 for line in open(tweet_store_filename))
# num_lines
Now we read in the data, iterating over chunks in each batch and storing only the needed data in a dataframe (takes 10-20 minutes to run). We keep the time each tweet was posted, the language it is in, and the tweet text:
# selected_columns = ['created_at','lang','text']
# st=time.time()
# filename_list = glob.glob('tweet_data/'+"tweets_COP"+str(copid)+"*")
# df=[]
# for tweet_batch_filename in filename_list[:-1]:
#     reader = pd.read_json(tweet_batch_filename, lines=True,chunksize=1)
#     # df.append(pd.DataFrame([item[selected_columns] for sublist in reader.data.values.tolist()[:-1] for item in sublist] )[selected_columns])
#     dfs=[]
#     for chunk in reader:
#         if 'data' in chunk.columns:
#             dfs.append(pd.DataFrame(list(chunk.data.values)[0])[selected_columns])
#     df.append(pd.concat(dfs,ignore_index=True))
#     # df.append(pd.DataFrame(list(reader.data)[0])[selected_columns])
# df=pd.concat(df,ignore_index=True)
# df.created_at=pd.to_datetime(df.created_at)
# print(str(len(df))+' tweets took '+str(time.time()-st))
# df.head()
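To make the file layout concrete, here is a small sketch using made-up records held in memory rather than real API output. It assumes, as the loop above does, that each line is a JSON object whose `data` field holds a list of tweets, and it skips lines without a `data` field. The helper name `extract_rows` and the sample records are hypothetical.

```python
import json


def extract_rows(lines, columns=("created_at", "lang", "text")):
    """Flatten JSON-lines pages into one list of dicts with only the wanted columns."""
    rows = []
    for line in lines:
        page = json.loads(line)
        for tweet in page.get("data", []):  # pages without tweet data contribute nothing
            rows.append({col: tweet.get(col) for col in columns})
    return rows


# two made-up pages: one with two tweets, one with only error metadata
sample_lines = [
    json.dumps({"data": [
        {"id": "1", "created_at": "2021-11-01T10:00:00Z", "lang": "en", "text": "first"},
        {"id": "2", "created_at": "2021-11-01T10:05:00Z", "lang": "fr", "text": "second"},
    ]}),
    json.dumps({"errors": [{"detail": "not found"}]}),
]
rows = extract_rows(sample_lines)
print(len(rows), rows[0]["lang"])
```

Keeping only the selected columns at this stage is what keeps the final dataframe small compared to the raw hydrated files.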
Finally, store the data in the efficiently compressed feather format:
# df.to_feather('stored_tweets')
Section 2: Load and Inspect Data#
Now that we have reviewed the steps that were taken to generate the preprocessed data, we can load the data. It may take a few minutes to download the data.
filename_tweets = "stored_tweets"
url_tweets = "https://osf.io/download/8p52x/"
df = pd.read_feather(
pooch_load(url_tweets, filename_tweets)
) # takes a couple minutes to download
Let’s check the timing of the tweets relative to the COP26 event (duration shaded in blue in the plot you will make) to see how the number of tweets varies over time.
total_tweetCounts = (
    df.created_at.groupby(df.created_at.dt.date)  # group tweets by calendar date
    .count()
    .rename("counts")
)
fig, ax = plt.subplots()
total_tweetCounts.reset_index().plot(
x="created_at", y="counts", figsize=(20, 5), style=".-", ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.set_yscale("log")
COPdates = [
datetime.datetime(2021, 10, 31),
datetime.datetime(2021, 11, 12),
] # shade the duration of the COP26 to guide the eye
ax.axvspan(*COPdates, alpha=0.3)  # shaded region marking the duration of COP26
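The grouping step above amounts to truncating each timestamp to its calendar date and counting how many tweets fall on each date. A minimal standard-library sketch with made-up timestamps (not taken from the dataset):

```python
import datetime
from collections import Counter

timestamps = [
    datetime.datetime(2021, 10, 30, 23, 50),
    datetime.datetime(2021, 10, 31, 0, 5),
    datetime.datetime(2021, 10, 31, 18, 30),
]
# ts.date() (with the call) drops the time of day and keeps the calendar date
daily_counts = Counter(ts.date() for ts in timestamps)
for day, n in sorted(daily_counts.items()):
    print(day, n)
```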
In addition to assessing the number of tweets, we can also explore who was tweeting about this COP. Look at how many tweets were posted in various languages:
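counts = df.lang.value_counts().rename_axis("index").reset_index(name="counts")  # name the columns explicitly so they do not depend on the pandas version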
The language name of the tweet is stored as a code name. We can pull a language code dictionary from the web and use it to translate the language code to the language name.
target_url = "https://gist.githubusercontent.com/carlopires/1262033/raw/c52ef0f7ce4f58108619508308372edd8d0bd518/gistfile1.txt"
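# note: exec runs the downloaded text as Python code (it defines iso_639_choices); only do this with sources you trust
exec(urllib.request.urlopen(target_url).read())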
lang_code_dict = dict(iso_639_choices)
counts = counts.replace({"index": lang_code_dict})
counts
Coding Exercise 2#
Run the following cell to print the dictionary for the language codes:
lang_code_dict
Find your native language code in the dictionary you just printed and use it to select the COP tweets that were written in your language!
language_code = ...
df_tmp = df.loc[df.lang == language_code, :].reset_index(drop=True)
pd.options.display.max_rows = 100 # see up to 100 entries
pd.options.display.max_colwidth = 250 # widen how much text is presented of each tweet
samples = ...
samples
# to_remove solution
language_code = "en"
df_tmp = df.loc[df.lang == language_code, :].reset_index(drop=True)
pd.options.display.max_rows = 100 # see up to 100 entries
pd.options.display.max_colwidth = 250 # widen how much text is presented of each tweet
samples = df_tmp.sample(100)
samples
df = df_tmp
Section 3: Word Set Prevalence#
Falkenberg et al. investigated the hypothesis that public sentiment around the COP conferences has increasingly framed them as hypocritical (“political hypocrisy as a topic of cross-ideological appeal”). The authors operationalized hypocrisy language as any tweet containing any of the following words:
selected_words = [
"hypocrisy",
"hypocrite",
"hypocritical",
"greenwash",
"green wash",
"blah",
] # the last 3 words don't add much. Greta Thunberg's 'blah, blah blah' speech on Sept. 28th 2021.
Questions 3#
How might this matching procedure be limited in its ability to capture this sentiment?
# to_remove explanation
"""
1. Our approach is based on a predetermined dictionary, which limits how accurately it can capture sentiment. It ignores context (e.g., the word "not" can reverse the sentiment of the following word, but a simple matching procedure does not capture this), and it cannot account well for language and cultural differences.
"""
The authors then searched for these words within a distinct dataset across all COP conferences (this dataset was not made openly accessible but the figure using that data is here). They found that hypocrisy has been mentioned more in recent COP conferences.
Here, we will shift our focus to their accessible COP26 dataset and analyze the nature of comments related to specific topics, such as political hypocrisy. First, let’s look through the whole dataset and pull tweets that mention any of the selected words.
selectwords_detector = re.compile(
    r"\b(?:{0})\b".format("|".join(selected_words)), flags=re.IGNORECASE
) # to make a word detector for a wordlist faster to run, compile it! The flag is set here because the second argument of a compiled pattern's search() is a start position, not a flag
df["select_talk"] = df.text.apply(
    lambda x: selectwords_detector.search(x)
) # look through whole dataset, flagging tweets with select_talk (computes in under a minute)
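Two properties of this kind of matching are easy to check on a few made-up sentences. First, for a compiled pattern the second positional argument of `search` is a start position rather than a flag, so case-insensitivity has to be requested when compiling. Second, with word boundaries on both sides, inflected forms such as plurals are not matched. The shortened word list and the example sentences below are illustrative only.

```python
import re

words = ["hypocrisy", "hypocrite", "greenwash"]  # shortened list for illustration
pattern = r"\b(?:{0})\b".format("|".join(map(re.escape, words)))
detector = re.compile(pattern, flags=re.IGNORECASE)  # flag set at compile time

examples = [
    "What a HYPOCRITE!",                 # exact form, upper case
    "Leaders are hypocrites.",           # plural form
    "More greenwashing from sponsors.",  # inflected form
    "This is not hypocrisy at all.",     # negated, yet still flagged
]
matches = [detector.search(text) is not None for text in examples]
for text, matched in zip(examples, matches):
    print(matched, "|", text)

# passing the flag to search() instead sets the start position `pos`
case_sensitive = re.compile(pattern)
print(case_sensitive.search(examples[0], re.IGNORECASE) is not None)
```

If inflected forms should count, one option is to drop the trailing `\b` or to list the variants explicitly. Which is appropriate depends on how many false positives the longer matches bring in.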
Let’s extract these tweets and examine their occurrence statistics in relation to the entire dataset that we calculated above.
selected_tweets = df.loc[~df.select_talk.isnull(), :]
selected_tweet_counts = (
    selected_tweets.created_at.groupby(selected_tweets.created_at.dt.date)
    .count()
    .rename("counts")
)
selected_tweet_fraction = selected_tweet_counts / total_tweetCounts
fig, ax = plt.subplots(figsize=(20, 5))
selected_tweet_fraction.reset_index().plot(
x="created_at", y="counts", style=[".-"], ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.axvspan(*COPdates, alpha=0.3) # shaded region marking the duration of COP26
ax.set_ylabel("fraction talking about hypocrisy")
Please note that these fractions are normalized by the total number of tweets on each day. Close to the COP26 dates (shaded in blue), the total number of tweets is orders of magnitude larger, so a larger fraction there corresponds to a far greater absolute number of tweets talking about hypocrisy.
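A quick numerical illustration, using made-up daily totals rather than values from the dataset, shows how a modest change in fraction can correspond to a very large change in absolute count:

```python
def absolute_count(total_tweets, fraction):
    """Convert a daily fraction back into a number of tweets."""
    return round(total_tweets * fraction)


# hypothetical numbers: (total tweets that day, fraction mentioning hypocrisy)
days = {"quiet day": (1_000, 0.02), "peak day": (500_000, 0.03)}
for label, (total, fraction) in days.items():
    print(f"{label}: {fraction:.0%} of {total:,} tweets = {absolute_count(total, fraction):,} tweets")
```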
Now, let’s examine the content of these tweets by randomly sampling 100 of them.
selected_tweets.text.sample(100).values
Coding Exercise 3#
Please select another topic and provide a list of topic words. We will then conduct the same analysis for that topic. For example, if the topic is “renewable technology,” please provide a list of relevant words.
selected_words_2 = [..., ..., ..., ..., ...]
selectwords_detector_2 = re.compile(
    r"\b(?:{0})\b".format("|".join([str(word) for word in selected_words_2])),
    flags=re.IGNORECASE,  # set the flag when compiling; search() takes a start position, not a flag
)
df["select_talk_2"] = df.text.apply(lambda x: selectwords_detector_2.search(x))
selected_tweets_2 = df.loc[~df.select_talk_2.isnull(), :]
selected_tweet_counts_2 = (
    selected_tweets_2.created_at.groupby(selected_tweets_2.created_at.dt.date)
    .count()
    .rename("counts")
)
selected_tweet_fraction_2 = ...
samples = ...
samples
# to_remove solution
selected_words_2 = ["renewable", "wind", "solar", "geothermal", "biofuel"]
selectwords_detector_2 = re.compile(
    r"\b(?:{0})\b".format("|".join([str(word) for word in selected_words_2])),
    flags=re.IGNORECASE,  # set the flag when compiling; search() takes a start position, not a flag
)
df["select_talk_2"] = df.text.apply(lambda x: selectwords_detector_2.search(x))
selected_tweets_2 = df.loc[~df.select_talk_2.isnull(), :]
selected_tweet_counts_2 = (
    selected_tweets_2.created_at.groupby(selected_tweets_2.created_at.dt.date)
    .count()
    .rename("counts")
)
selected_tweet_fraction_2 = selected_tweet_counts_2 / total_tweetCounts
samples = selected_tweets_2.text.sample(100).values
samples
Section 4: Sentiment Analysis#
Let’s test this hypothesis from Falkenberg et al. (that public sentiment around the COP conferences has increasingly framed them as political hypocrisy). To do so, we can use sentiment analysis, which is a method for computing the proportion of words that have positive connotations, negative connotations or are neutral. Some sentiment analysis systems can measure other word attributes as well. In this case, we will analyze the sentiment of the subset of tweets that mention international organizations central to globalization (e.g., G7), focusing specifically on the tweets related to hypocrisy.
Note: part of the computation flow in what follows is from Neal Caren’s tutorial.
We’ll assign tweets a sentiment score using a dictionary method (i.e. based on the word sentiment scores of words in the tweet that appear in given word-sentiment score dictionary). The particular word-sentiment score dictionary we will use is compiled in the AFINN package and reflects a scoring between -5 (negative connotation) and 5 (positive connotation). The English language dictionary consists of 2,477 coded words.
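As a sketch of the dictionary method, the snippet below sums per-word scores over a tweet. The three-word lexicon and its scores are invented for illustration and are not AFINN's values, and the whitespace tokenization is simpler than what the `afinn` package does.

```python
toy_lexicon = {"good": 3, "bad": -3, "disaster": -2}  # made-up scores


def dictionary_score(text, lexicon):
    """Sum the scores of all lexicon words in the text; other words count as 0."""
    tokens = text.lower().split()
    return sum(lexicon.get(token, 0) for token in tokens)


print(dictionary_score("a good start but a bad disaster overall", toy_lexicon))
```

Because words outside the lexicon contribute nothing, a tweet with no dictionary words scores exactly 0, which is why such scores tend to pile up at zero.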
Let’s initialize the dictionary for the selected language. For example, the language code for English is ‘en’.
afinn = Afinn(language=language_code)
Now we can load the dictionary:
filename_afinn_wl = "AFINN-111.txt"
url_afinn_wl = (
"https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-111.txt"
)
afinn_wl_df = pd.read_csv(
pooch_load(url_afinn_wl, filename_afinn_wl),
header=None, # no column names
sep="\t", # tab separated
names=["term", "value"],
) # new column names
seed = 808 # seed for sample so results are stable
afinn_wl_df.sample(10, random_state=seed)
Let’s look at the distribution of scores over all words in the dictionary
fig, ax = plt.subplots()
afinn_wl_df.value.value_counts().sort_index().plot.bar(ax=ax)
ax.set_xlabel("Finn score")
ax.set_ylabel("dictionary counts")
These scores were assigned to words based on labeled tweets (validation paper).
Before focussing on sentiments about institutions within the hypocrisy tweets, let’s look at the hypocrisy tweets in comparison to non-hypocrisy tweets. This will take some more intensive computation, so let’s only perform it on a 1% subsample of the dataset
smalldf = df.sample(frac=0.01)
smalldf["afinn_score"] = smalldf.text.apply(
afinn.score
) # intensive computation! We have reduced the dataset to frac=0.01 of its size so it takes ~1 min. (the full dataset takes 1 hr 50 min.)
smalldf["afinn_score"].describe() # generate descriptive statistics.
From this, we can see that the maximum score is 24 and the minimum score is -33 (your values may differ somewhat, since the 1% subsample is drawn at random). The score is computed by summing up the scores of all dictionary words present in the tweet, which means that longer tweets tend to have scores of larger magnitude.
To make the scores comparable across tweets of different lengths, a rough approach is to convert them to a per-word score. This is done by normalizing each tweet’s score by its word count. It’s important to note that this per-word score is not specific to the dictionary words used, so this approach introduces a bias that depends on the proportion of dictionary words in each tweet. We will refer to this normalized score as afinn_adjusted.
def word_count(text_string):
"""Calculate the number of words in a string"""
return len(text_string.split())
smalldf["word_count"] = smalldf.text.apply(word_count)
smalldf["afinn_adjusted"] = (
smalldf["afinn_score"] / smalldf["word_count"]
) # note this isn't a percentage
smalldf["afinn_adjusted"].describe()
After normalizing the scores, we find that the maximum score is now 2 and the minimum score is now -1.5.
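The bias mentioned above can be seen in a toy case: two made-up tweets contain the same scored word, but the one padded with unscored words gets a smaller per-word score. The score of 3 assigned to the single dictionary word is hypothetical.

```python
def per_word_score(total_score, text):
    """Divide a tweet's summed sentiment score by its whitespace word count."""
    return total_score / len(text.split())


short_tweet = "great news"
long_tweet = "great news from the conference hall in Glasgow today"
summed_score = 3  # same hypothetical score for both tweets: only "great" is scored

print(per_word_score(summed_score, short_tweet))
print(per_word_score(summed_score, long_tweet))
```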
Now let’s look at the sentiment of tweets with hypocrisy words versus those without those words. For reference, we’ll first make cumulative distribution plots of score distributions for some other possibly negative words: fossil, G7, Boris and Davos.
for sel_words in [["Fossil"], ["G7"], ["Boris"], ["Davos"], selected_words]:
    sel_name = sel_words[0] if len(sel_words) == 1 else "select_talk"
    selectwords_detector = re.compile(
        r"\b(?:{0})\b".format("|".join(sel_words)), flags=re.IGNORECASE
    ) # compile for speed! (the case-insensitive flag must be set here, not in search())
    smalldf[sel_name] = smalldf.text.apply(
        lambda x: selectwords_detector.search(x) is not None
    ) # flag if tweet has word(s)
for sel_words in [["Fossil"], ["G7"], ["Boris"], ["Davos"], selected_words]:
    sel_name = sel_words[0] if len(sel_words) == 1 else "select_talk"
    fig, ax = plt.subplots()
    ax.set_xlim(-1, 1)
    ax.set_xlabel("adjusted Finn score")
    ax.set_ylabel("cumulative probability")
    counts, bins = np.histogram(
        smalldf.loc[smalldf[sel_name], "afinn_adjusted"],
        bins=np.linspace(-1, 1, 101),
        density=True,
    )
    # densities times bin widths are probabilities, so the cumulative sum ends at 1;
    # plot against the right bin edges, up to which the probability has accumulated
    ax.plot(bins[1:], np.cumsum(counts * np.diff(bins)), color="C0", label=sel_name + " tweets")
    counts, bins = np.histogram(
        smalldf.loc[~smalldf[sel_name], "afinn_adjusted"],
        bins=np.linspace(-1, 1, 101),
        density=True,
    )
    ax.plot(
        bins[1:],
        np.cumsum(counts * np.diff(bins)),
        color="C1",
        label="non-" + sel_name + " tweets",
    )
    ax.axvline(0, color=[0.7] * 3, zorder=1)
    ax.legend()
    ax.set_title("cumulative Finn score distribution for " + sel_name + " occurrence")
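For comparison, a cumulative distribution can also be computed without binning, as an empirical CDF: sort the values and step up by 1/n at each one. This is a minimal sketch on made-up scores, offered as an alternative to the histogram-based curves above rather than as the tutorial's method.

```python
def empirical_cdf(values):
    """Return the sorted values and, for each, the fraction of data less than or equal to it."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


scores = [0.2, -0.5, 0.0, -0.1]  # made-up adjusted scores
xs, cdf = empirical_cdf(scores)
print(list(zip(xs, cdf)))
```

A curve that rises earlier (further to the left) indicates a group whose scores are more negative overall.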
Recall from our previous calculations that the tweets containing the selected hypocrisy-associated words have a minimum adjusted score of -1.5. This score is much more negative than the scores of all four reference words we just plotted. So what is the content of these selected tweets that is causing them to be so negative? To explore this, we can use word clouds to assess the usage of specific words.
Section 5: Word Clouds#
To analyze word usage, let’s first vectorize the text data. Vectorization (which builds on tokenization, the splitting of text into words) here means giving each word in the vocabulary an index and transforming each word sequence into its vector representation, i.e. a sequence of elements with the corresponding word indices (e.g. the response ['I','love','icecream'] maps to something like [34823,5937,79345]).
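A minimal sketch of the index mapping described above, using a toy corpus. The indices here are assigned in order of first appearance, which is an arbitrary choice for illustration; scikit-learn's vectorizers assign theirs differently.

```python
def build_vocabulary(documents):
    """Assign each distinct lowercase word an integer index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))  # only new words receive a new index
    return vocab


docs = ["I love icecream", "I love climate science"]
vocab = build_vocabulary(docs)
encoded = [[vocab[word] for word in doc.lower().split()] for doc in docs]
print(vocab)
print(encoded)
```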
We’ll use and compare two methods: term-frequency (\(\mathrm{tf}\)) and term-frequency inverse document frequency (\(\mathrm{Tfidf}\)). Both of these methods measure how important a term is within a document relative to a collection of documents by using vectorization to transform words into numbers.
Term Frequency (\(\mathrm{tf}\)): the number of times the word appears in a document compared to the total number of words in the document.
Inverse Document Frequency (\(\mathrm{idf}\)): reflects the inverse of the proportion of documents in the collection that contain the term. Words found in only a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).
Thus the overall term-frequency inverse document frequency can be calculated by multiplying the term-frequency and the inverse document frequency:
\[\mathrm{Tfidf} = \mathrm{tf} \times \mathrm{idf}\]
\(\mathrm{Tfidf}\) aims to add more discriminability to frequency as a word relevance metric by downweighting words that appear in many documents since these common words are less discriminative. In other words, the importance of a term is high when it occurs a lot in a given document and rarely in others.
If you are interested in learning more about the mathematical equations used to develop these two methods, please refer to the additional details in the “Further Reading” section for this day.
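As a worked example of the product of the two quantities, the sketch below computes tf-idf by hand on three toy documents. It uses the textbook form idf = log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed variant and normalizes each document vector, so its numbers will differ.

```python
import math

docs = [
    "the jet the summit".split(),
    "the summit talks".split(),
    "the private jet".split(),
]


def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(number of docs / number of docs containing the term)."""
    df = sum(term in doc for doc in docs)  # assumes the term occurs in at least one doc
    return math.log(len(docs) / df)


for term in ["the", "jet", "private"]:
    print(term, round(tf(term, docs[2]) * idf(term, docs), 3))
```

In the third document all three terms have the same term frequency, so the differences in the product come entirely from the idf factor: a word present in every document is weighted down to zero, while the rarest word gets the largest weight.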
Let’s run both of these methods and store the vectorized data in a dictionary:
vectypes = ["counts", "Tfidf"]
def vectorize(doc_data, ngram_range=(1, 1), remove_words=None, min_doc_freq=1):
    """Vectorize the documents with both a count and a Tfidf vectorizer"""
    if isinstance(doc_data, pd.DataFrame):
        doc_data = doc_data["text"].values  # accept the tweet dataframe as well as an array of texts
    vectorized_data_dict = {}
    for vectorizer_type in vectypes:
        if vectorizer_type == "counts":
            vectorizer = CountVectorizer(
                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range
            )
        elif vectorizer_type == "Tfidf":
            vectorizer = TfidfVectorizer(
                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range
            )
        # vectorize the documents passed in (not a global variable)
        vectorized_doc_list = vectorizer.fit_transform(doc_data).toarray()
        feature_names = (
            vectorizer.get_feature_names_out()
        ) # or get_feature_names() depending on scikit learn version
        print("vocabulary size:" + str(len(feature_names)))
        wdf = pd.DataFrame(vectorized_doc_list, columns=feature_names)
        vectorized_data_dict[vectorizer_type] = wdf
    return vectorized_data_dict, feature_names
def plot_wordcloud_and_freqdist(wdf, title_str, feature_names):
    """
    Plots the total frequency of the most frequent terms as a bar chart, with a word cloud as an inset
    """
    pixel_size = 600
    x, y = np.ogrid[:pixel_size, :pixel_size]
    mask = (x - pixel_size / 2) ** 2 + (y - pixel_size / 2) ** 2 > (
        pixel_size / 2 - 20
    ) ** 2
    mask = 255 * mask.astype(int)
    wc = WordCloud(
        background_color="rgba(255, 255, 255, 0)", mode="RGBA", mask=mask, max_words=50
    ) # ,relative_scaling=1)
    wordfreqs = wdf.sum(axis=0)  # total frequency of each term over all tweets
    num_show = 50
    sorted_ids = np.argsort(wordfreqs.values)[::-1]  # positions of terms, most frequent first
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(x=range(num_show), height=wordfreqs.values[sorted_ids][:num_show])
    ax.set_xticks(range(num_show))
    ax.set_xticklabels(
        np.asarray(feature_names)[sorted_ids][:num_show], rotation=45, fontsize=8, ha="right"
    )
    ax.set_ylabel("total frequency")
    ax.set_title(title_str + " vectorizer")
    ax.set_ylim(0, 10 * wordfreqs.values[sorted_ids][int(num_show / 2)])
    ax_wc = inset_axes(ax, width="90%", height="90%")
    wc.generate_from_frequencies(wordfreqs.to_dict())
    ax_wc.imshow(wc, interpolation="bilinear")
    ax_wc.axis("off")
nltk.download(
"stopwords"
) # downloads basic stop words, i.e. words with little semantic value (e.g. "the"), to be used as words to be removed
remove_words = stopwords.words("english")
We can now vectorize and look at the word clouds for single word statistics. Let’s explicitly exclude some words and implicitly exclude ones that appear in fewer than some threshold number of tweets.
data = (
selected_tweets["text"].sample(frac=0.1).values
) # reduce size since the vectorization computation transforms the corpus into an array of large size (vocabulary size x number of tweets)
# let's add some more words that we don't want to track (you can generate this kind of list iteratively by looking at the results and adding to this list):
remove_words += [
"cop26",
"http",
"https",
"30",
"000",
"je",
"rt",
"climate",
"limacop20",
"un_climatetalks",
"climatechange",
"via",
"ht",
"talks",
"unfccc",
"peru",
"peruvian",
"lima",
"co",
]
print(str(len(data)) + " tweets")
min_doc_freq = 5 / len(data)
ngram_range = (1, 1) # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    data,  # the subsampled tweet texts defined above
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
plot_wordcloud_and_freqdist(
vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
)
Note in the histograms how the \(\mathrm{Tfidf}\) vectorizer has scaled down the hypocrisy words such that they are less prevalent relative to the count vectorizer.
There are some words here (e.g. private and jet) that look like they likely would appear in pairs. Let’s tell the vectorizer to also look for high frequency pairs of words.
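What the vectorizer does with pairs can be sketched by hand: slide a window of two words over each tokenized tweet and count the resulting bigrams. The example tweets are made up.

```python
from collections import Counter


def bigrams(tokens):
    """Return consecutive word pairs, each joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


tweets = [
    "leaders arrive by private jet",
    "another private jet lands in glasgow",
]
bigram_counts = Counter(bg for tweet in tweets for bg in bigrams(tweet.split()))
print(bigram_counts.most_common(1))
```

With `ngram_range = (1, 2)`, single words and pairs are counted side by side, so a frequent pair shows up as its own entry alongside its two component words.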
ngram_range = (1, 2) # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    data,  # the subsampled tweet texts defined above
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
plot_wordcloud_and_freqdist(
vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
)
The hypocrisy words take up so much frequency that it is hard to see what the remaining words are. To clear this list a bit more, let’s also remove the hypocrisy words altogether.
remove_words += selected_words
ngram_range = (1, 2) # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    data,  # the subsampled tweet texts defined above
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
plot_wordcloud_and_freqdist(
vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
)
Observe that terms we might have expected to be associated with hypocrisy, e.g. “flying”, are still present. Even when allowing for pairs, the semantics are hard to extract from this analysis, which ignores the correlations in usage among multiple words.
To further assess these statistics, one approach is to use a generative model with latent structure.
Topic models (the structural topic model in particular) are a nice modelling framework to start analyzing those correlations.
For a modern introduction to text analysis in the social sciences, I recommend the textbook:
Text as Data: A New Framework for Machine Learning and the Social Sciences (2022) by Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart
Summary#
In this tutorial, you’ve learned how to analyze large amounts of text data from social media to understand public sentiment about climate change. You’ve been introduced to the process of loading and examining Twitter data, specifically relating to the COP climate change conferences. You’ve also gained insights into identifying and analyzing sentiments associated with specific words, with a focus on those indicating ‘hypocrisy’.
We used techniques to normalize sentiment scores and to compare sentiment among different categories of tweets. You have also learned about text vectorization methods, term-frequency (tf) and term-frequency inverse document frequency (tfidf), and their applications in word usage analysis. This tutorial provided you with a valuable stepping stone to further delve into text analysis, which could help deepen our understanding of public sentiment on climate change. Such analysis helps us track how global perceptions and narratives about climate change evolve over time, which is crucial for policy planning and climate communication strategies.
This tutorial therefore not only provided you with valuable tools for text analysis but also demonstrated their potential in contributing to our understanding of climate change perceptions, a key factor in driving climate action.
Resources#
The data for this tutorial can be accessed from Falkenberg et al. Nature Clim. Chg. 2022.