{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9a3312c6",
   "metadata": {
    "execution": {}
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ClimateMatchAcademy/course-content/blob/main/tutorials/W2D3_FutureClimate-IPCCII&amp;IIISocio-EconomicBasis/W2D3_Tutorial4.ipynb)   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/{ORG}/course-content/main/tutorials/W2D3_FutureClimate-IPCCII&amp;IIISocio-EconomicBasis/W2D3_Tutorial4.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "rmt_vWoCB1Qc",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Tutorial 4: Public Opinion on the Climate Emergency and Why it Matters**\n",
    "\n",
    "**Week 2, Day 3: IPCC Socio-economic Basis**\n",
    "\n",
    "**Content creators:** Maximilian Puelma Touzel\n",
    "\n",
    "**Content reviewers:** Peter Ohue, Derick Temfack, Zahra Khodakaramimaghsoud, Peizhen Yang, Younkap Nina Duplex, Laura Paccini, Sloane Garelick, Abigail Bodner, Manisha Sinha, Agustina Pesce, Dionessa Biton, Cheng Zhang, Jenna Pearson, Chi Zhang, Ohad Zivan\n",
    "\n",
    "**Content editors:** Jenna Pearson, Chi Zhang, Ohad Zivan\n",
    "\n",
    "**Production editors:** Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan\n",
    "\n",
    "**Our 2023 Sponsors:** NASA TOPS and Google DeepMind"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "aea4f3f2-e034-4ced-9be3-a44830488f2d",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Tutorial Objectives** \n",
    "In this tutorial, we will explore a dataset derived from Twitter, focusing on public sentiment surrounding the [Conference of Parties (COP) climate change conferences](https://unfccc.int/process/bodies/supreme-bodies/conference-of-the-parties-cop). We will use data from a published study by [Falkenberg et al. *Nature Clim. Chg.* 2022](https://www.nature.com/articles/s41558-022-01527-x). This dataset encompasses tweets mentioning the COP conferences, which bring together world governments, NGOs, and businesses to discuss and negotiate on climate change progress. Our main objective is to understand public sentiment about climate change and how it has evolved over time through an analysis of changing word usage on social media. In the process, we will also learn how to manage and analyze large quantities of text data.\n",
    "\n",
    "The tutorial is divided into sections, where we first delve into loading and inspecting the data, examining the timing and languages of the tweets, and analyzing sentiments associated with specific words, including those indicating 'hypocrisy'. We'll also look at sentiments regarding institutions within these tweets and compare the sentiment of tweets containing 'hypocrisy'-related words versus those without. This analysis is supplemented with visualization techniques like word clouds and distribution plots.\n",
    "\n",
    "By the end of this tutorial, you will have developed a nuanced understanding of how text analysis can be used to study public sentiment on climate change and other environmental issues, helping us to navigate the intricate and evolving landscape of climate communication and advocacy."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fc651270",
   "metadata": {
    "execution": {}
   },
   "source": [
    "\n",
    "# **Setup**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "AM-4KSggB_xR",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 27541,
     "status": "ok",
     "timestamp": 1682441197348,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# imports\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from wordcloud import WordCloud\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# notebook config\n",
    "from IPython.display import display, HTML\n",
    "import datetime\n",
    "import re\n",
    "import nltk\n",
    "from nltk.corpus import stopwords\n",
    "from mpl_toolkits.axes_grid1.inset_locator import inset_axes\n",
    "import urllib.request  # the lib that handles the url stuff\n",
    "from afinn import Afinn\n",
    "import pooch\n",
    "import os\n",
    "import tempfile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Figure settings\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e922e0dc-3a05-4a2f-a359-39a6a18e50c6",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Figure settings\n",
    "import ipywidgets as widgets  # interactive display\n",
    "\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "plt.style.use(\n",
    "    \"https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle\"\n",
    ")\n",
    "\n",
    "sns.set_style(\"ticks\", {\"axes.grid\": False})\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Video 2: A Simple Greenhouse Model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3438b0b-03f3-4fff-b1e1-2e4c688fb60f",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Video 2: A Simple Greenhouse Model\n",
    "# Tech team will add code to format and display the video"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a221310-bbe5-48ed-a01a-68ce33ae08da",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# helper functions\n",
    "\n",
    "\n",
    "def pooch_load(filelocation=None, filename=None, processor=None):\n",
    "    shared_location = \"/home/jovyan/shared/Data/tutorials/W2D3_FutureClimate-IPCCII&IIISocio-EconomicBasis\"  # this is different for each day\n",
    "    user_temp_cache = tempfile.gettempdir()\n",
    "\n",
    "    if os.path.exists(os.path.join(shared_location, filename)):\n",
    "        file = os.path.join(shared_location, filename)\n",
    "    else:\n",
    "        file = pooch.retrieve(\n",
    "            filelocation,\n",
    "            known_hash=None,\n",
    "            fname=os.path.join(user_temp_cache, filename),\n",
    "            processor=processor,\n",
    "        )\n",
    "\n",
    "    return file"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4b726e00",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 1: Data Preprocessing**"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "25ecfd31",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We have performed the following preprocessing steps for you (simply follow along; there is no need to execute any commands in this section):"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d4b7ffe1",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Every Twitter message (hereon called *tweets*) has an ID. IDs of all tweets mentioning `COPx` (x=20-26, which refers to the session number of each COP meeting) used in [Falkenberg et al. (2022)](https://doi.org/10.1038/s41558-022-01527-x) were placed by the authors in an [osf archive](https://osf.io/nu75j). You can download the 7 .csv files (one for each COP) [here](https://osf.io/download/pr29x/) "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1c20cf26",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The `twarc2` program serves as an interface with the Twitter API, allowing users to retrieve full tweet content and metadata by providing the tweet ID. Similar to GitHub, you need to create a Twitter API account and configure `twarc` on your local machine by providing your account authentication keys. To rehydrate a set of tweets using their IDs, you can use the following command: `twarc2 hydrate source_file.txt store_file.jsonl`. In this command, each line of the `source_file.txt` represents a Twitter ID, and the hydrated tweets will be stored in the `store_file.jsonl`.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "14f124e1",
   "metadata": {
    "execution": {}
   },
   "source": [
    "- First, format the downloaded IDs and split them into separate files (*batches*) to make hydration calls to the API more time manageable (hours versus days - this is slow because of an API-imposed limit of 100 tweets/min.). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ceb30ff",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 9,
     "status": "ok",
     "timestamp": 1682441198212,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# import os\n",
    "# dir_name='Falkenberg2022_data/'\n",
    "# if not os.path.exists(dir_name):\n",
    "#     os.mkdir(dir_name)\n",
    "# batch_size = int(1e5)\n",
    "# download_pathname=''#~/projects/ClimateMatch/SocioEconDay/Polarization/COP_Twitter_IDs/\n",
    "# for copid in range(20,27):\n",
    "#     df_tweetids=pd.read_csv(download_pathname+'tweet_ids_cop'+str(copid)+'.csv')\n",
    "#     for batch_id,break_id in enumerate(range(0,len(df_tweetids),batch_size)):\n",
    "#         file_name=\"tweetids_COP\"+str(copid)+\"_b\"+str(batch_id)+\".txt\"\n",
    "#         df_tweetids.loc[break_id:break_id+batch_size,'id'].to_csv(dir_name+file_name,index=False,header=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "63f668a3",
   "metadata": {
    "execution": {}
   },
   "source": [
    "- Make the hydration calls for COP26 (this took 4 days to download 50GB of data for COP26)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e4f4026",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 314,
     "status": "ok",
     "timestamp": 1682441198517,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# import glob\n",
    "# import time\n",
    "# copid=26\n",
    "# filename_list = glob.glob('Falkenberg2022_data/'+\"tweetids_COP\"+str(copid)+\"*\")\n",
    "# dir_name='tweet_data/'\n",
    "# if not os.path.exists(dir_name):\n",
    "#     os.mkdir(dir_name)\n",
    "# file_name=\"tweetids_COP\"+str(copid)+\"_b\"+str(batch_id)+\".txt\"\n",
    "# for itt,tweet_id_batch_filename in enumerate(filename_list):\n",
    "#     strvars=tweet_id_batch_filename.split('/')[1].split('.')[0].split('_')\n",
    "#     tweet_store_filename = dir_name+'tweets_'+strvars[1]+'_'+strvars[2]+'.json'\n",
    "#     if not os.path.exists(tweet_store_filename):\n",
    "#         st=time.time()\n",
    "#         os.system('twarc2 hydrate '+tweet_id_batch_filename+' '+tweet_store_filename)\n",
    "#         print(str(itt)+' '+str(strvars[2])+\" \"+str(time.time()-st))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1ade1f98",
   "metadata": {
    "execution": {}
   },
   "source": [
    "- Load the data, then inspect and pick a chunk size. Note, by default, there are 100 tweets per line in the `.json` files returned by the API. Given we asked for 1e5 tweets/batch, there should be 1e3 lines in these files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdc599db",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 10,
     "status": "ok",
     "timestamp": 1682441198518,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# copid=26\n",
    "# batch_id = 0\n",
    "# tweet_store_filename = 'tweet_data/tweets_COP'+str(copid)+'_b'+str(batch_id)+'.json'\n",
    "# num_lines = sum(1 for line in open(tweet_store_filename))\n",
    "# num_lines"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8985e74d",
   "metadata": {
    "execution": {}
   },
   "source": [
    "- Now we read in the data, iterating over chunks in each batch and only store the needed data in a dataframe (takes 10-20 minutes to run). Let's look at when the tweets were posted, what language they are in, and the tweet text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3c5364c",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 10,
     "status": "ok",
     "timestamp": 1682441198519,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# selected_columns = ['created_at','lang','text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5194e171",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 11,
     "status": "ok",
     "timestamp": 1682441198520,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# st=time.time()\n",
    "# filename_list = glob.glob('tweet_data/'+\"tweets_COP\"+str(copid)+\"*\")\n",
    "# df=[]\n",
    "# for tweet_batch_filename in filename_list[:-1]:\n",
    "#     reader = pd.read_json(tweet_batch_filename, lines=True,chunksize=1)\n",
    "# #     df.append(pd.DataFrame([item[selected_columns] for sublist in reader.data.values.tolist()[:-1] for item in sublist] )[selected_columns])\n",
    "#     dfs=[]\n",
    "#     for chunk in reader:\n",
    "#         if 'data' in chunk.columns:\n",
    "#             dfs.append(pd.DataFrame(list(chunk.data.values)[0])[selected_columns])\n",
    "#     df.append(pd.concat(dfs,ignore_index=True))\n",
    "# #     df.append(pd.DataFrame(list(reader.data)[0])[selected_columns])\n",
    "# df=pd.concat(df,ignore_index=True)\n",
    "# df.created_at=pd.to_datetime(df.created_at)\n",
    "# print(str(len(df))+' tweets took '+str(time.time()-st))\n",
    "# df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "05c4a70a",
   "metadata": {
    "execution": {}
   },
   "source": [
    "- Finally, store the data in the efficiently compressed feather format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c092c48",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 10,
     "status": "ok",
     "timestamp": 1682441198520,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# df.to_feather('stored_tweets')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0xYALOlVOIaq",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 2: Load and Inspect Data**"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b60c41ba",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Now that we have reviewed the steps that were taken to generate the preprocessed data, we can load the data. It may a few minutes to download the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fa12333",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 24315,
     "status": "ok",
     "timestamp": 1682441222825,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "filename_tweets = \"stored_tweets\"\n",
    "url_tweets = \"https://osf.io/download/8p52x/\"\n",
    "df = pd.read_feather(\n",
    "    pooch_load(url_tweets, filename_tweets)\n",
    ")  # takes a couple minutes to download"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4739ce10",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's check the timing of the tweets relative to the COP26 event (duration shaded in blue in the plot you will make) to see how the number of tweets vary over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c9f8eb0",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 13946,
     "status": "ok",
     "timestamp": 1682441236740,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "total_tweetCounts = (\n",
    "    df.created_at.groupby(df.created_at.apply(lambda x: x.date))\n",
    "    .count()\n",
    "    .rename(\"counts\")\n",
    ")\n",
    "fig, ax = plt.subplots()\n",
    "total_tweetCounts.reset_index().plot(\n",
    "    x=\"created_at\", y=\"counts\", figsize=(20, 5), style=\".-\", ax=ax\n",
    ")\n",
    "ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha=\"right\")\n",
    "ax.set_yscale(\"log\")\n",
    "COPdates = [\n",
    "    datetime.datetime(2021, 10, 31),\n",
    "    datetime.datetime(2021, 11, 12),\n",
    "]  # shade the duration of the COP26 to guide the eye\n",
    "ax.axvspan(*COPdates, alpha=0.3)\n",
    "# gray region"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f4284f8e",
   "metadata": {
    "execution": {}
   },
   "source": [
    "In addition to assessing the number of tweets, we can also explore who was tweeting about this COP. Look at how many tweets were posted in various languages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea608af6",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 4,
     "status": "ok",
     "timestamp": 1682441237444,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "counts = df.lang.value_counts().reset_index()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9a3c8365",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The language name of the tweet is stored as a code name. We can pull a language code dictionary from the web and use it to translate the language code to the language name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddcf6f0d",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 172,
     "status": "ok",
     "timestamp": 1682441237888,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "target_url = \"https://gist.githubusercontent.com/carlopires/1262033/raw/c52ef0f7ce4f58108619508308372edd8d0bd518/gistfile1.txt\"\n",
    "exec(urllib.request.urlopen(target_url).read())\n",
    "lang_code_dict = dict(iso_639_choices)\n",
    "counts = counts.replace({\"index\": lang_code_dict})\n",
    "counts"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "714e60db",
   "metadata": {
    "execution": {}
   },
   "source": [
    "### **Coding Exercise 2**\n",
    "Run the following cell to print the dictionary for the language codes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70a69dc4",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 13,
     "status": "ok",
     "timestamp": 1682441237889,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "lang_code_dict"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7b9723df",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Find your native language code in the dictionary you just printed and use it to select the COP tweets that were written in your language! "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b374ce8",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "language_code = ...\n",
    "df_tmp = df.loc[df.lang == language_code, :].reset_index(drop=True)\n",
    "pd.options.display.max_rows = 100  # see up to 100 entries\n",
    "pd.options.display.max_colwidth = 250  # widen how much text is presented of each tweet\n",
    "samples = ...\n",
    "samples\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caf4eb4b",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 6907,
     "status": "ok",
     "timestamp": 1682441245325,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "\n",
    "language_code = \"en\"\n",
    "df_tmp = df.loc[df.lang == language_code, :].reset_index(drop=True)\n",
    "pd.options.display.max_rows = 100  # see up to 100 entries\n",
    "pd.options.display.max_colwidth = 250  # widen how much text is presented of each tweet\n",
    "samples = df_tmp.sample(100)\n",
    "samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7024cc3",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "df = df_tmp"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fc20327d",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 3: Word Set Prevalence**"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "69245b35",
   "metadata": {
    "execution": {}
   },
   "source": [
    "[Falkenberg et al.](https://www.nature.com/articles/s41558-022-01533-z) investigated the hypothesis that *public sentiment* around the COP conferences has increasingly framed them as hypocritical (\"political hypocrisy as a topic of cross-ideological appeal\"). The authors operationalized hypocrisy language as any tweet containing any of the following words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd4a7f67",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 6,
     "status": "ok",
     "timestamp": 1682441245326,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "selected_words = [\n",
    "    \"hypocrisy\",\n",
    "    \"hypocrite\",\n",
    "    \"hypocritical\",\n",
    "    \"greenwash\",\n",
    "    \"green wash\",\n",
    "    \"blah\",\n",
    "]  # the last 3 words don't add much. Greta Thurnberg's 'blah, blah blah' speech on Sept. 28th 2021."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c299519a",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## **Questions 3**\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "59b065ae-9cc7-4c2f-b953-f59804081228",
   "metadata": {
    "execution": {}
   },
   "source": [
    "1. How might this matching procedure be limited in its ability to capture this sentiment?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eec90f5-589a-416a-9905-dbc0da4d65bb",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove explanation\n",
    "\n",
    "\"\"\"\n",
    "1. Our approach is based on a predetermined dictionary, which might have several limitations in accurately capturing sentiment. Contextual ignorance (e.g., the word \"not\" can reverse the sentiment of the following word, but this isn't captured in a simple matching procedure) and language and cultural differences cannot be well captured.\n",
    "\"\"\" \"\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "16c8c539",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The authors then searched for these words within a distinct dataset across all COP conferences (this dataset was not made openly accessible but the figure using that data is [here](https://www.nature.com/articles/s41558-022-01527-x/figures/7)). They found that hypocrisy has been mentioned more in recent COP conferences.\n",
    "\n",
    "Here, we will shift our focus to their accessible COP26 dataset and analyze the nature of comments related to specific topics, such as political hypocrisy. First, let's look through the whole dataset and pull tweets that mention any of the selected words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcf08c45",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 52374,
     "status": "ok",
     "timestamp": 1682441297695,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "selectwords_detector = re.compile(\n",
    "    r\"\\b(?:{0})\\b\".format(\"|\".join(selected_words))\n",
    ")  # to make a word detector for a wordlist faster to run, compile it!\n",
    "df[\"select_talk\"] = df.text.apply(\n",
    "    lambda x: selectwords_detector.search(x, re.IGNORECASE)\n",
    ")  # look through whole dataset, flagging tweets with select_talk (computes in under a minute)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "39cbd30c",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's extract these tweets and examine their occurrence statistics in relation to the entire dataset that we calculated above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3de0fe0",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 1465,
     "status": "ok",
     "timestamp": 1682441299132,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "selected_tweets = df.loc[~df.select_talk.isnull(), :]\n",
    "selected_tweet_counts = (\n",
    "    selected_tweets.created_at.groupby(\n",
    "        selected_tweets.created_at.apply(lambda x: x.date)\n",
    "    )\n",
    "    .count()\n",
    "    .rename(\"counts\")\n",
    ")\n",
    "selected_tweet_fraction = selected_tweet_counts / total_tweetCounts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dynnBOoavDSX",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 1097,
     "status": "ok",
     "timestamp": 1682441300224,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(20, 5))\n",
    "selected_tweet_fraction.reset_index().plot(\n",
    "    x=\"created_at\", y=\"counts\", style=[\".-\"], ax=ax\n",
    ")\n",
    "ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha=\"right\")\n",
    "ax.axvspan(*COPdates, alpha=0.3)  # gray region\n",
    "ax.set_ylabel(\"fraction talking about hypocrisy\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "abf23da0",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Please note that these fractions are normalized, meaning that larger fractions closer to the COP26 dates (shaded in blue) when the total number of tweets are orders of magnitude larger indicate a significantly greater absolute number of tweets talking about hypocrisy.\n",
    "\n",
    "Now, let's examine the content of these tweets by randomly sampling 100 of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07a6dc14",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 12,
     "status": "ok",
     "timestamp": 1682441300225,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "selected_tweets.text.sample(100).values"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a8aa791c",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## **Coding Exercise 3**\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "490211ba-8f6b-4beb-860d-8e97a281458a",
   "metadata": {
    "execution": {}
   },
   "source": [
    "1. Please select another topic and provide a list of topic words. We will then conduct the same analysis for that topic. For example, if the topic is \"renewable technology,\" please provide a list of relevant words."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96542ae9",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "selected_words_2 = [..., ..., ..., ..., ...]\n",
    "\n",
    "selectwords_detector_2 = re.compile(r\"\\b(?:{0})\\b\".format(\"|\".join([str(word) for word in selected_words_2])))\n",
    "df[\"select_talk_2\"] = df.text.apply(\n",
    "    lambda x: selectwords_detector_2.search(x, re.IGNORECASE)\n",
    ")\n",
    "\n",
    "selected_tweets_2 = df.loc[~df.select_talk_2.isnull(), :]\n",
    "selected_tweet_counts_2 = (\n",
    "    selected_tweets_2.created_at.groupby(\n",
    "        selected_tweets_2.created_at.apply(lambda x: x.date)\n",
    "    )\n",
    "    .count()\n",
    "    .rename(\"counts\")\n",
    ")\n",
    "selected_tweet_fraction_2 = ...\n",
    "\n",
    "samples = ...\n",
    "samples\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2e9a3f1",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "\n",
    "selected_words_2 = [\"renewable\", \"wind\", \"solar\", \"geothermal\", \"biofuel\"]\n",
    "\n",
    "selectwords_detector_2 = re.compile(r\"\\b(?:{0})\\b\".format(\n",
    "    \"|\".join([str(word) for word in selected_words_2])))\n",
    "df[\"select_talk_2\"] = df.text.apply(\n",
    "    lambda x: selectwords_detector_2.search(x, re.IGNORECASE)\n",
    ")\n",
    "\n",
    "selected_tweets_2 = df.loc[~df.select_talk_2.isnull(), :]\n",
    "selected_tweet_counts_2 = (\n",
    "    selected_tweets_2.created_at.groupby(\n",
    "        selected_tweets_2.created_at.apply(lambda x: x.date)\n",
    "    )\n",
    "    .count()\n",
    "    .rename(\"counts\")\n",
    ")\n",
    "selected_tweet_fraction_2 = selected_tweet_counts_2 / total_tweetCounts\n",
    "\n",
    "samples = selected_tweets_2.text.sample(100).values\n",
    "samples"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "30352a60",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 4: Sentiment Analysis**"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e8ade89e",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's test this hypothesis from [Falkenberg et al.](https://www.nature.com/articles/s41558-022-01533-z) (that *public sentiment* around the COP conferences has increasingly framed them as political hypocrisy). To do so, we can use **sentiment analysis**, which is a method for computing the proportion of words that have positive connotations, negative connotations or are neutral. Some sentiment analysis systems can measure other word attributes as well. In this case, we will analyze the sentiment of the subset of tweets that mention international organizations central to globalization (e.g., G7), focusing specifically on the tweets related to hypocrisy."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "73f87f57",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Note: part of the computation flow in what follows is from [Caren Neal's tutorial](https://nealcaren.org/lessons/wordlists/)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "11be60bb",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We'll assign tweets a sentiment score using a dictionary method (i.e. based on the word sentiment scores of words in the tweet that appear in given word-sentiment score dictionary). The particular word-sentiment score dictionary we will use is compiled in the [AFINN package](https://pypi.org/project/afinn/) and reflects a scoring between -5 (negative connotation) and 5 (positive connotation). The English language dictionary consists of 2,477 coded words.\n",
    "\n",
    "Let's initialize the dictionary for the selected language. For example, the language code for English is 'en'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7262921",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 7,
     "status": "ok",
     "timestamp": 1682441300225,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "afinn = Afinn(language=language_code)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5e4b56d8",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Now we can load the dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdadf1b6",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 194,
     "status": "ok",
     "timestamp": 1682441300413,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "filename_afinn_wl = \"AFINN-111.txt\"\n",
    "url_afinn_wl = (\n",
    "    \"https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-111.txt\"\n",
    ")\n",
    "\n",
    "afinn_wl_df = pd.read_csv(\n",
    "    pooch_load(url_afinn_wl, filename_afinn_wl),\n",
    "    header=None,  # no column names\n",
    "    sep=\"\\t\",  # tab sepeated\n",
    "    names=[\"term\", \"value\"],\n",
    ")  # new column names\n",
    "seed = 808  # seed for sample so results are stable\n",
    "afinn_wl_df.sample(10, random_state=seed)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "665cf434",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's look at the distribution of scores over all words in the dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf0134fb",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 758,
     "status": "ok",
     "timestamp": 1682441301158,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "afinn_wl_df.value.value_counts().sort_index().plot.bar(ax=ax)\n",
    "ax.set_xlabel(\"Finn score\")\n",
    "ax.set_ylabel(\"dictionary counts\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5ecebdb4",
   "metadata": {
    "execution": {}
   },
   "source": [
    "These scores were assigned to words based on labeled tweets ([validation paper](http://www2.imm.dtu.dk/pubdb/edoc/imm6006.pdf)).  \n",
    "\n",
    "Before focussing on sentiments about institutions within the hypocrisy tweets, let's look at the hypocrisy tweets in comparison to non-hypocrisy tweets. This will take some more intensive computation, so let's only perform it on a 1% subsample of the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "AKou3UjdVrxi",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 470,
     "status": "ok",
     "timestamp": 1682441301624,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "smalldf = df.sample(frac=0.01)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a0c8e54",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 82839,
     "status": "ok",
     "timestamp": 1682441384459,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "smalldf[\"afinn_score\"] = smalldf.text.apply(\n",
    "    afinn.score\n",
    ")  # intensive computation! We have reduced the data set to frac=0.01 it's size so it takes ~1 min. (the full dataset takes 1hrs 50 min.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "614c3c61",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 34,
     "status": "ok",
     "timestamp": 1682441384460,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "smalldf[\"afinn_score\"].describe()  # generate descriptive statistics."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cdb68c87",
   "metadata": {
    "execution": {}
   },
   "source": [
    "From this, we can see that the maximum score is 24 and the minimum score is -33. The score is computed by summing up the scores of all dictionary words present in the tweet, which means that longer tweets tend to have higher scores.\n",
    "\n",
    "To make the scores comparable across tweets of different lengths, a rough approach is to convert them to a per-word score. This is done by normalizing each tweet's score by its word count. It's important to note that this per-word score is not specific to the dictionary words used, so this approach introduces a bias that depends on the proportion of dictionary words in each tweet. We will refer to this normalized score as *afinn_adjusted*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "276d045e",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 328,
     "status": "ok",
     "timestamp": 1682441384757,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "def word_count(text_string):\n",
    "    \"\"\"Calculate the number of words in a string\"\"\"\n",
    "    return len(text_string.split())\n",
    "\n",
    "\n",
    "smalldf[\"word_count\"] = smalldf.text.apply(word_count)\n",
    "smalldf[\"afinn_adjusted\"] = (\n",
    "    smalldf[\"afinn_score\"] / smalldf[\"word_count\"]\n",
    ")  # note this isn't a percentage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf46a307",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 5,
     "status": "ok",
     "timestamp": 1682441384757,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "smalldf[\"afinn_adjusted\"].describe()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c30f2ef3",
   "metadata": {
    "execution": {}
   },
   "source": [
    "After normalizing the scores, we find that the maximum score is now 2 and the minimum score is now -1.5. \n",
    "\n",
    "Now let's look at the sentiment of tweets with hypocrisy words versus those without those words. For reference, we'll first make cumulative distribution plots of score distributions for some other possibly negative words: fossil, G7, Boris and Davos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c0f1052",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 1841,
     "status": "ok",
     "timestamp": 1682441386595,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "for sel_words in [[\"Fossil\"], [\"G7\"], [\"Boris\"], [\"Davos\"], selected_words]:\n",
    "    sel_name = sel_words[0] if len(sel_words) == 1 else \"select_talk\"\n",
    "    selectwords_detector = re.compile(\n",
    "        r\"\\b(?:{0})\\b\".format(\"|\".join(sel_words))\n",
    "    )  # compile for speed!\n",
    "    smalldf[sel_name] = smalldf.text.apply(\n",
    "        lambda x: selectwords_detector.search(x, re.IGNORECASE) is not None\n",
    "    )  # flag if tweet has word(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "403538c4",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 2180,
     "status": "ok",
     "timestamp": 1682441388771,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "for sel_words in [[\"Fossil\"], [\"G7\"], [\"Boris\"], [\"Davos\"], selected_words]:\n",
    "    sel_name = sel_words[0] if len(sel_words) == 1 else \"select_talk\"\n",
    "    fig, ax = plt.subplots()\n",
    "    ax.set_xlim(-1, 1)\n",
    "    ax.set_xlabel(\"adjusted Finn score\")\n",
    "    ax.set_ylabel(\"probabilty\")\n",
    "    counts, bins = np.histogram(\n",
    "        smalldf.loc[smalldf[sel_name], \"afinn_adjusted\"],\n",
    "        bins=np.linspace(-1, 1, 101),\n",
    "        density=True,\n",
    "    )\n",
    "    ax.plot(bins[:-1], np.cumsum(counts), color=\"C0\", label=sel_name + \" tweets\")\n",
    "    counts, bins = np.histogram(\n",
    "        smalldf.loc[~smalldf[sel_name], \"afinn_adjusted\"],\n",
    "        bins=np.linspace(-1, 1, 101),\n",
    "        density=True,\n",
    "    )\n",
    "    ax.plot(\n",
    "        bins[:-1], np.cumsum(counts), color=\"C1\", label=\"non-\" + sel_name + \" tweets\"\n",
    "    )\n",
    "    ax.axvline(0, color=[0.7] * 3, zorder=1)\n",
    "    ax.legend()\n",
    "    ax.set_title(\"cumulative Finn score distribution for \" + sel_name + \" occurence\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d138d86d",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Recall from our previous calculations that the tweets containing the selected *hypocrisy*-associated words have minimum adjusted score of -1.5. This score is much more negative than the scores of all four reference words we just plotted. So what is the content of these selected tweets that is causing them to be so negative? The explore this, we can use word clouds to assess the usage of specific words."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fa7c32ee",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 5: Word Clouds**\n",
    "\n",
    "To analyze word usage, let's first vectorize the text data. Vectorization (also known as tokenization) here means giving each word in the vocabulary an index and transforming each word sequence to its vector representation and creating a sequence of elements with the corresponding word indices (e.g. the response `['I','love','icecream']` maps to something like `[34823,5937,79345]`). \n",
    "\n",
    "We'll use and compare two methods: term-frequency ($\\mathrm{tf}$) and term-frequency inverse document frequency ($\\mathrm{Tfidf}$). Both of these methods measure how important a term is within a document relative to a collection of documents by using vectorization to transform words into numbers."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "51c3dac4",
   "metadata": {
    "execution": {}
   },
   "source": [
    "**Term Frequency** ($\\mathrm{tf}$): the number of times the word appears in a document compared to the total number of words in the document.\n",
    "\n",
    "$$\\mathrm{tf}=\\frac{\\mathrm{number \\; of \\; times \\; the \\; term \\; appears \\; in \\; the \\; document}}{\\mathrm{total \\; number \\; of \\; terms \\; in \\; the \\; document}}$$\n",
    "\n",
    "**Inverse Document Frequency** ($\\mathrm{idf}$): reflects the proportion of documents in the collection of documents that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).\n",
    "\n",
    "$$\\mathrm{idf}=\\frac{\\log(\\mathrm{number \\; of \\; the \\; documents \\; in \\; the \\; collection})}{\\log(\\mathrm{number \\; of \\; documents \\; in \\; the \\; collection \\; containing \\; the \\; term})}$$\n",
    "\n",
    "Thus the overall term-frequency inverse document frequency can be calculated by multiplying the term-frequency and the inverse document frequency:\n",
    "\n",
    "$$\\mathrm{Tfidf}=\\mathrm{Tf} * \\mathrm{idf}$$\n",
    "\n",
    "$\\mathrm{Tfidf}$ aims to add more discriminability to frequency as a word relevance metric by downweighting words that appear in many documents since these common words are less discriminative. In other words, the importance of a term is high when it occurs a lot in a given document and rarely in others.\n",
    "\n",
    "If you are interested in learning more about the mathematical equations used to develop these two methods, please refer to the additional details in the \"Further Reading\" section for this day."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6c053030",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's run both of these methods and store the vectorized data in a dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dcffa1c",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 241,
     "status": "ok",
     "timestamp": 1682444744215,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "vectypes = [\"counts\", \"Tfidf\"]\n",
    "\n",
    "\n",
    "def vectorize(doc_data, ngram_range=(1, 1), remove_words=[], min_doc_freq=1):\n",
    "\n",
    "    vectorized_data_dict = {}\n",
    "    for vectorizer_type in vectypes:\n",
    "        if vectorizer_type == \"counts\":\n",
    "            vectorizer = CountVectorizer(\n",
    "                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range\n",
    "            )\n",
    "        elif vectorizer_type == \"Tfidf\":\n",
    "            vectorizer = TfidfVectorizer(\n",
    "                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range\n",
    "            )\n",
    "\n",
    "        vectorized_doc_list = vectorizer.fit_transform(data).todense().tolist()\n",
    "        feature_names = (\n",
    "            vectorizer.get_feature_names_out()\n",
    "        )  # or  get_feature_names() depending on scikit learn version\n",
    "        print(\"vocabulary size:\" + str(len(feature_names)))\n",
    "        wdf = pd.DataFrame(vectorized_doc_list, columns=feature_names)\n",
    "        vectorized_data_dict[vectorizer_type] = wdf\n",
    "    return vectorized_data_dict, feature_names\n",
    "\n",
    "\n",
    "def plot_wordcloud_and_freqdist(wdf, title_str, feature_names):\n",
    "    \"\"\"\n",
    "    Plots a word cloud\n",
    "    \"\"\"\n",
    "    pixel_size = 600\n",
    "    x, y = np.ogrid[:pixel_size, :pixel_size]\n",
    "    mask = (x - pixel_size / 2) ** 2 + (y - pixel_size / 2) ** 2 > (\n",
    "        pixel_size / 2 - 20\n",
    "    ) ** 2\n",
    "    mask = 255 * mask.astype(int)\n",
    "    wc = WordCloud(\n",
    "        background_color=\"rgba(255, 255, 255, 0)\", mode=\"RGBA\", mask=mask, max_words=50\n",
    "    )  # ,relative_scaling=1)\n",
    "    wordfreqs = wdf.T.sum(axis=1)\n",
    "    num_show = 50\n",
    "    sorted_ids = np.argsort(wordfreqs)[::-1]\n",
    "\n",
    "    fig, ax = plt.subplots(figsize=(10, 5))\n",
    "    ax.bar(x=range(num_show), height=wordfreqs[sorted_ids][:num_show])\n",
    "    ax.set_xticks(range(num_show))\n",
    "    ax.set_xticklabels(\n",
    "        feature_names[sorted_ids][:num_show], rotation=45, fontsize=8, ha=\"right\"\n",
    "    )\n",
    "    ax.set_ylabel(\"total frequency\")\n",
    "    ax.set_title(title_str + \" vectorizer\")\n",
    "    ax.set_ylim(0, 10 * wordfreqs[sorted_ids][int(num_show / 2)])\n",
    "\n",
    "    ax_wc = inset_axes(ax, width=\"90%\", height=\"90%\")\n",
    "    wc.generate_from_frequencies(wordfreqs)\n",
    "    ax_wc.imshow(wc, interpolation=\"bilinear\")\n",
    "    ax_wc.axis(\"off\")\n",
    "\n",
    "\n",
    "nltk.download(\n",
    "    \"stopwords\"\n",
    ")  # downloads basic stop words, i.e. words with little semantic value  (e.g. \"the\"), to be used as words to be removed\n",
    "remove_words = stopwords.words(\"english\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9b007814",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We can now vectorize and look at the wordclouds for single word statistics. Let's explicitly exclude some words and implicity exclude ones that appear in fewer than some threshold number of tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "DAgLmKhVf6YH",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 7,
     "status": "ok",
     "timestamp": 1682444734903,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "data = (\n",
    "    selected_tweets[\"text\"].sample(frac=0.1).values\n",
    ")  # reduce size since the vectorization computation transforms the corpus into an array of large size (vocabulary size x number of tweets)\n",
    "# let's add some more words that we don't want to track (you can generate this kind of list iteratively by looking at the results and adding to this list):\n",
    "remove_words += [\n",
    "    \"cop26\",\n",
    "    \"http\",\n",
    "    \"https\",\n",
    "    \"30\",\n",
    "    \"000\",\n",
    "    \"je\",\n",
    "    \"rt\",\n",
    "    \"climate\",\n",
    "    \"limacop20\",\n",
    "    \"un_climatetalks\",\n",
    "    \"climatechange\",\n",
    "    \"via\",\n",
    "    \"ht\",\n",
    "    \"talks\",\n",
    "    \"unfccc\",\n",
    "    \"peru\",\n",
    "    \"peruvian\",\n",
    "    \"lima\",\n",
    "    \"co\",\n",
    "]\n",
    "print(str(len(data)) + \" tweets\")\n",
    "min_doc_freq = 5 / len(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1884136",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 12260,
     "status": "ok",
     "timestamp": 1682444758286,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "ngram_range = (1, 1)  # start and end number of words\n",
    "vectorized_data_dict, feature_names = vectorize(\n",
    "    selected_tweets,\n",
    "    ngram_range=ngram_range,\n",
    "    remove_words=remove_words,\n",
    "    min_doc_freq=min_doc_freq,\n",
    ")\n",
    "for vectorizer_type in vectypes:\n",
    "    plot_wordcloud_and_freqdist(\n",
    "        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names\n",
    "    )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cd3d679a",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Note in the histograms how the $\\mathrm{Tfidf}$ vectorizer has scaled down the hypocrisy words such that they are less prevalent relative to the count vectorizer. \n",
    "\n",
    "There are some words here  (e.g. `private` and `jet`) that look like they likely would appear in pairs. Let's tell the vectorizer to also look for high frequency *pairs* of words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fde8ddf7",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 30399,
     "status": "ok",
     "timestamp": 1682444823790,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "ngram_range = (1, 2)  # start and end number of words\n",
    "vectorized_data_dict, feature_names = vectorize(\n",
    "    selected_tweets,\n",
    "    ngram_range=ngram_range,\n",
    "    remove_words=remove_words,\n",
    "    min_doc_freq=min_doc_freq,\n",
    ")\n",
    "for vectorizer_type in vectypes:\n",
    "    plot_wordcloud_and_freqdist(\n",
    "        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names\n",
    "    )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cb4fe0fb",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The hypocrisy words take up so much frequency that it is hard to see what the remaining words are. To clear this list a bit more, let's also remove the hypocrisy words altogether."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "MX9q2k-WiM1D",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 153,
     "status": "ok",
     "timestamp": 1682444966342,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "remove_words += selected_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ZyAs5RMXiADq",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 25360,
     "status": "ok",
     "timestamp": 1682444992874,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "ngram_range = (1, 2)  # start and end number of words\n",
    "vectorized_data_dict, feature_names = vectorize(\n",
    "    selected_tweets,\n",
    "    ngram_range=ngram_range,\n",
    "    remove_words=remove_words,\n",
    "    min_doc_freq=min_doc_freq,\n",
    ")\n",
    "for vectorizer_type in vectypes:\n",
    "    plot_wordcloud_and_freqdist(\n",
    "        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names\n",
    "    )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fsgxp-SWh_iO",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Observe that terms we might have expected are associated with hypocrisy, e.g. \"flying\" are still present. Even when allowing for pairs, the semantics are hard to extract from this analysis that ignores the correlations in usage among multiple words. \n",
    "\n",
    "To futher assess statistics, one approach is use a generative model with latent structure.\n",
    "\n",
    "Topic models (the [structural topic model](https://www.structuraltopicmodel.com/) in particular) are a nice modelling framework to start analyzing those correlations.\n",
    "\n",
    "For a modern introduction to text analysis in the social sciences, I recommend the textbook:\n",
    "\n",
    "    Text as Data: A New Framework for Machine Learning and the Social Sciences (2022) by Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0301dbf0-b0f5-4d7e-98b9-761480aad846",
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 8,
     "status": "ok",
     "timestamp": 1682441415561,
     "user": {
      "displayName": "Maximilian Puelma Touzel",
      "userId": "09308600515315501700"
     },
     "user_tz": 240
    }
   },
   "source": [
    "# **Summary**\n",
    "In this tutorial, you've learned how to analyze large amounts of text data from social media to understand public sentiment about climate change. You've been introduced to the process of loading and examining Twitter data, specifically relating to the COP climate change conferences. You've also gained insights into identifying and analyzing sentiments associated with specific words, with a focus on those indicating 'hypocrisy'.\n",
    "\n",
    "We used techniques to normalize sentiment scores and to compare sentiment among different categories of tweets. You have also learned about text vectorization methods, term-frequency (tf) and term-frequency inverse document frequency (tfidf), and their applications in word usage analysis. This tutorial provided you a valuable stepping stone to further delve into text analysis, which could help deeper our understanding of public sentiment on climate change. Such analysis helps us track how global perceptions and narratives about climate change evolve over time, which is crucial for policy planning and climate communication strategies.\n",
    "\n",
    "This tutorial therefore not only provided you with valuable tools for text analysis but also demonstrated their potential in contributing to our understanding of climate change perceptions, a key factor in driving climate action."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8d0b6901",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Resources**\n",
    "\n",
    "The data for this tutorial can be accessed from [Falkenberg et al. *Nature Clim. Chg.* 2022](https://www.nature.com/articles/s41558-022-01527-x). "
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W2D3_Tutorial4",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}