{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ClimateMatchAcademy/course-content/blob/main/tutorials/W1D4_Paleoclimate/instructor/W1D4_Tutorial5.ipynb)   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/{ORG}/course-content/main/tutorials/W1D4_Paleoclimate/instructor/W1D4_Tutorial5.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Tutorial 5: Paleoclimate Data Analysis Tools**\n",
    "**Week 1, Day 4, Paleoclimate**\n",
    "\n",
    "**Content creators:** Sloane Garelick\n",
    "\n",
    "**Content reviewers:** Yosmely Bermúdez, Dionessa Biton, Katrina Dobson, Maria Gonzalez, Will Gregory, Nahid Hasan, Sherry Mi, Beatriz Cosenza Muralles, Brodie Pearson, Jenna Pearson, Chi Zhang, Ohad Zivan \n",
    "\n",
    "**Content editors:** Yosmely Bermúdez, Zahra Khodakaramimaghsoud, Jenna Pearson, Agustina Pesce, Chi Zhang, Ohad Zivan\n",
    "\n",
    "**Production editors:** Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan\n",
    "\n",
    "**Our 2023 Sponsors:** NASA TOPS and Google DeepMind"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Tutorial Objectives**\n",
    "\n",
    "In this tutorial, you will explore various computational analyses for interpreting paleoclimate data and understand why these methods are useful. A common issue in paleoclimate is the presence of uneven time spacing between consecutive observations. `Pyleoclim` includes several methods that can deal with uneven sampling effectively, but there are certain applications and analyses for which it's ncessary to place the records on a uniform time axis. In this tutorial you'll learn a few ways to do this with `Pyleoclim`. Additionally, we will explore another useful paleoclimate data analysis tool, Principal Component Analysis (PCA), which allows us to identify a common signal between various paleoclimate reconstructions. \n",
    "\n",
    "By the end of this tutorial you'll be able to perform the following data analysis techniques on proxy-based climate reconstructions:\n",
    "\n",
    "*   Interpolation\n",
    "*   Binning \n",
    "*   Principal component analysis\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# imports\n",
    "import pandas as pd\n",
    "import cartopy\n",
    "import pyleoclim as pyleo\n",
    "import matplotlib.pyplot as plt\n",
    "import pooch\n",
    "import os\n",
    "import tempfile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Video 1: Speaker Introduction\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Video 1: Speaker Introduction\n",
    "# Tech team will add code to format and display the video"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# helper functions\n",
    "\n",
    "\n",
    "def pooch_load(filelocation=None, filename=None, processor=None):\n",
    "    shared_location = \"/home/jovyan/shared/Data/tutorials/W1D4_Paleoclimate\"  # this is different for each day\n",
    "    user_temp_cache = tempfile.gettempdir()\n",
    "\n",
    "    if os.path.exists(os.path.join(shared_location, filename)):\n",
    "        file = os.path.join(shared_location, filename)\n",
    "    else:\n",
    "        file = pooch.retrieve(\n",
    "            filelocation,\n",
    "            known_hash=None,\n",
    "            fname=os.path.join(user_temp_cache, filename),\n",
    "            processor=processor,\n",
    "        )\n",
    "\n",
    "    return file"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 1: Load the sample dataset for analysis**\n",
    "\n",
    "For this tutorial, we are going to use an example dataset to practice the various data analysis techniques. The dataset we'll be using is a record of hydrogen isotopes of leaf waxes (δD<sub>wax</sub>) from Lake Tanganyika in East Africa [(Tierney et al., 2008)](https://www.science.org/doi/10.1126/science.1160485?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed). Recall from the video that δD<sub>wax</sub> is a proxy that is typically thought to record changes in the amount of precipitation in the tropics via the amount effect. In the previous tutorial, we looked at δD data from high-latitude ice cores. In that case, δD was a proxy for temperature, but in the tropics, δD more commonly reflects rainfall amount, as explained in the introductory video.\n",
    "\n",
    "Let's first read the data from a .csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "filename_tang = \"tanganyika_dD.csv\"\n",
    "url_tang = \"https://osf.io/sujvp/download/\"\n",
    "tang_dD = pd.read_csv(pooch_load(filelocation=url_tang, filename=filename_tang))\n",
    "tang_dD.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We can now create a `Series` in Pyleoclim and assign names to different variables so that we can easily plot the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 736,
     "status": "ok",
     "timestamp": 1681484071869,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "ts_tang = pyleo.Series(\n",
    "    time=tang_dD[\"Age\"],\n",
    "    value=tang_dD[\"dD_IVonly\"],\n",
    "    time_name=\"Age\",\n",
    "    time_unit=\"yr BP\",\n",
    "    value_name=\"dDwax\",\n",
    "    value_unit=\"per mille\",\n",
    "    label=\"Lake Tanganyika dDprecip\",\n",
    ")\n",
    "\n",
    "ts_tang.plot(color=\"C1\", invert_yaxis=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "You may notice that the y-axis is inverted. When we're plotting δD data, we typically invert the y-axis because more negative (\"depleted\") values suggest increased rainfall, whereas more positive (\"enriched\") values suggest decreased rainfall."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 2: Uniform Time-Sampling of the Data**\n",
    "There are a number of different reasons we might want to assign new values to our data. For example, if the data is not evenly spaced, we might need to resample in order to use a sepcific data analysis technique or to more easily compare to other data of a different sampling resolution. \n",
    "\n",
    "First, let's check whether our data is already evenly spaced using the `.is_evenly_spaced()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 138,
     "status": "ok",
     "timestamp": 1681484075182,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "ts_tang.is_evenly_spaced()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Our data is not evenly spaced. There are a few different methods available in `pyleoclim` to place the data on a uniform axis, and in this tutorial, we'll explore two methods: interpolating and binning. In general, these methods use the available data near a chosen time to estimate what the value was at that time, but each method differs in which nearby data points it uses and how it uses them.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## **Section 2.1: Interpolation**\n",
    "To start out, let's try using interpolation to evenly space our data. Interpolation projects the data onto an evenly spaced time axis with a distance between points (step size) of our choosing. There are a variety of different methods by which the data can be interpolated, these being: `linear`, `nearest`, `zero`, `slinear`, `quadratic`, `cubic`, `previous`, and `next`. More on these and their associated key word arguments can be found in the [documentation](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.series.Series.interp). By default, the function `.interp()` implements linear interpolation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "tang_linear = ts_tang.interp()  # default method = 'linear'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 167,
     "status": "ok",
     "timestamp": 1681484080381,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# check whether or not the series is now evenly spaced\n",
    "tang_linear.is_evenly_spaced()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Now that we've interpolated our data, let's compare the original dataset to the linearly interpolated dataset we just created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 443,
     "status": "ok",
     "timestamp": 1681484174043,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()  # assign a new plot axis\n",
    "ts_tang.plot(ax=ax, label=\"Original\", invert_yaxis=True)\n",
    "tang_linear.plot(ax=ax, label=\"Linear\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Notice there are only some minor differences between the original and linearly interpolated data.\n",
    "\n",
    "You can print the data in the original and interpolated time series to see the difference in the ages between the two datasets. The interpolated dataset is now evenly spaced with a δD value every ~290 years."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "ts_tang"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "tang_linear"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's compare a few of the different interpolation methods (e.g., quadratic, next, zero) with one another just to see how they are similar and different:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()  # assign a new plot axis\n",
    "ts_tang.plot(ax=ax, label=\"original\", invert_yaxis=True)\n",
    "for method in [\"linear\", \"quadratic\", \"next\", \"zero\"]:\n",
    "    ts_tang.interp(method=method).plot(\n",
    "        ax=ax, label=method, alpha=0.9\n",
    "    )  # plot all the method we want"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The methods can produce slightly different results, but mostly reproduce the same overall trend. In this case, the quadractic method may be less appropriate than the other methods."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## **Section 2.2: Binning**\n",
    "Another option for resampling our data onto a uniform time axis is binning. Binning is when a set of time intervals is defined and data is grouped or binned with other data in the same interval, then all those points in a \"bin\" are averaged to get a data value for that bin. The defaults for binning pick a bin size at the coarsest time spacing present in the dataset and average data over a uniform sequence of such intervals. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "tang_bin = (\n",
    "    ts_tang.bin()\n",
    ")  # default settings pick the coarsest time spacing in the data as the binning period"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 457,
     "status": "ok",
     "timestamp": 1681484224785,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()  # assign a new plot axis\n",
    "ts_tang.plot(ax=ax, label=\"Original\", invert_yaxis=True)\n",
    "tang_bin.plot(ax=ax, label=\"Binned\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Again, notice that although there are some minor differences between the original and binned data, the records still capture the same overall trend."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## **Coding Exercises 2.2**\n",
    "\n",
    "\n",
    "1. Experiment with different bin sizes to see how this affects the resampling. You can do so by adding  `bin_size = ` to `ts_tang.bin()`. Try a bin size of 500 years and 1,000 years and plot both of them on the same graph as the original data and the binned data using the default bin size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "# bin size of 500\n",
    "tang_bin_500 = ...\n",
    "\n",
    "# bin size of 1000\n",
    "tang_bin_1000 = ...\n",
    "\n",
    "# plot\n",
    "fig, ax = plt.subplots()  # assign a new plot axis\n",
    "_ = ...\n",
    "_ = ...\n",
    "_ = ...\n",
    "_ = ...\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "\n",
    "# bin size of 500\n",
    "tang_bin_500 = ts_tang.bin(bin_size=500)\n",
    "\n",
    "# bin size of 1000\n",
    "tang_bin_1000 = ts_tang.bin(bin_size=1000)\n",
    "\n",
    "# plot\n",
    "fig, ax = plt.subplots()  # assign a new plot axis\n",
    "_ = ts_tang.plot(ax=ax, label=\"Original\", invert_yaxis=True)\n",
    "_ = tang_bin.plot(ax=ax, label=\"Binned Default\")\n",
    "_ = tang_bin_500.plot(ax=ax, label=\"Binned 500yrs\")\n",
    "_ = tang_bin_1000.plot(ax=ax, label=\"Binned 1000yrs\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Section 3: Principal Component Analysis (PCA)**\n",
    "Large datasets, such as global climate datasets, are often difficult to interpret due to their multiple dimensions. Although tools such as Xarray help us to organize large, multidimensional climate datasets, it can still sometimes be difficult to interpret certain aspects of such data. Principal Component Analysis (PCA) is a tool for reducing the dimensionality of such datasets and  increasing interpretability with minimal modification or loss of data. In other words, PCA allows us to reduce the number of variables of a dataset, while preserving as much information as possible.\n",
    "\n",
    "The first step in PCA is to calculate a matrix that summarizes how the variables in a dataset all relate to one another. This matrix is then broken down into new uncorrelated variables that are linear combinations or mixtures of the initial variables. These new variables are the **principal components**. The initial dimensions of the dataset determines the number of principal components calculated. Most of the information within the initial variables is compressed into the first components. Additional details about PCA and the calculations involved can be found [here](https://builtin.com/data-science/step-step-explanation-principal-component-analysis).\n",
    "\n",
    "Applied to paleoclimate, PCA can reduce the dimensionality of large paleoclimate datasets with multiple variables and can help us identify a common signal between various paleoclimate reconstructions. An example of a study that applies PCA to paleoclimate is [Otto-Bliesner et al., 2014](https://www.science.org/doi/full/10.1126/science.1259531). This study applies PCA to rainfall reconstructions from models and proxies from throughout Africa to determine common climate signals in these reconstructions.\n",
    "\n",
    "In this section, we will calculate the PCA of four δD paleoclimate records from Africa to assess common climate signals in the four records.  In order to calculate the PCA of multiple paleoclimate time series, all of the records need to be on a common time-step. To do so, we can apply the resampling tools we've learned in this tutorial."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "So far, we've been looking at δD data from Lake Tanganyika in tropical East Africa. Let's compare this δD record to other existing δD records from lake and marine sediment cores in tropical Africa from the Gulf of Aden [(Tierney and deMenocal, 2017)](https://doi.org/10.1126/science.1240411), Lake Bosumtwi [(Shanahan et al., 2015)](https://doi.org/10.1038/ngeo2329), and the West African Margin [(Tierney et al., 2017)](https://doi.org/10.1126/sciadv.1601503).\n",
    "\n",
    "First, let's load these datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 484,
     "status": "ok",
     "timestamp": 1681484228855,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# Gulf of Aden\n",
    "filename_aden = \"aden_dD.csv\"\n",
    "url_aden = \"https://osf.io/gm2v9/download/\"\n",
    "aden_dD = pd.read_csv(pooch_load(filelocation=url_aden, filename=filename_aden))\n",
    "aden_dD.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 370,
     "status": "ok",
     "timestamp": 1681484235076,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# Lake Bosumtwi\n",
    "filename_Bosumtwi = \"bosumtwi_dD.csv\"\n",
    "url_Bosumtwi = \"https://osf.io/mr7d9/download/\"\n",
    "bosumtwi_dD = pd.read_csv(\n",
    "    pooch_load(filelocation=url_Bosumtwi, filename=filename_Bosumtwi)\n",
    ")\n",
    "bosumtwi_dD.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 354,
     "status": "ok",
     "timestamp": 1681484237400,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "# GC27 (West African Margin)\n",
    "filename_GC27 = \"gc27_dD.csv\"\n",
    "url_GC27 = \"https://osf.io/k6e3a/download/\"\n",
    "gc27_dD = pd.read_csv(pooch_load(filelocation=url_GC27, filename=filename_GC27))\n",
    "gc27_dD.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Next, let's convert each dataset into a `Series` in Pyleoclim."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "ts_tanganyika = pyleo.Series(\n",
    "    time=tang_dD[\"Age\"],\n",
    "    value=tang_dD[\"dD_IVonly\"],\n",
    "    time_name=\"Age\",\n",
    "    time_unit=\"yr BP\",\n",
    "    value_name=\"dDwax\",\n",
    "    label=\"Lake Tanganyika\",\n",
    ")\n",
    "ts_aden = pyleo.Series(\n",
    "    time=aden_dD[\"age_calBP\"],\n",
    "    value=aden_dD[\"dDwaxIVcorr\"],\n",
    "    time_name=\"Age\",\n",
    "    time_unit=\"yr BP\",\n",
    "    value_name=\"dDwax\",\n",
    "    label=\"Gulf of Aden\",\n",
    ")\n",
    "ts_bosumtwi = pyleo.Series(\n",
    "    time=bosumtwi_dD[\"age_calBP\"],\n",
    "    value=bosumtwi_dD[\"d2HleafwaxC31ivc\"],\n",
    "    time_name=\"Age\",\n",
    "    time_unit=\"yr BP\",\n",
    "    value_name=\"dDwax\",\n",
    "    label=\"Lake Bosumtwi\",\n",
    ")\n",
    "ts_gc27 = pyleo.Series(\n",
    "    time=gc27_dD[\"age_BP\"],\n",
    "    value=gc27_dD[\"dDwax_iv\"],\n",
    "    time_name=\"Age\",\n",
    "    time_unit=\"yr BP\",\n",
    "    value_name=\"dDwax\",\n",
    "    label=\"GC27\",\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Now let's set up a `MultipleSeries` using Pyleoclim with all four δD datasets. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "ts_list = [ts_tanganyika, ts_aden, ts_bosumtwi, ts_gc27]\n",
    "ms_africa = pyleo.MultipleSeries(ts_list, label=\"African dDwax\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We can now create a stackplot with all four δD records:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 756,
     "status": "ok",
     "timestamp": 1681484266500,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "fig, ax = ms_africa.stackplot()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "By creating a stackplot, we can visually compare between the datasets. However, the four δD records aren't the same resolution and don't span the same time interval.\n",
    "\n",
    "To better compare the records and assess a common trend, we can use PCA. First, we can use [`.common_time()`] to place the records on a shared time axis with a common sampling frequency. This function takes the argument `method`, which can be either `bin`, `interp`, and `gdkernel`. The binning and interpolation methods are what we just covered in the previous section. Let's set the time step to 500 years, the method to `interp`, and standarize the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 1048,
     "status": "ok",
     "timestamp": 1681484350993,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "africa_ct = ms_africa.common_time(method=\"interp\", step=0.5).standardize()\n",
    "fig, ax = africa_ct.stackplot()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We now have standardized δD records that are the same sampling resolution and span the same time interval. Note this meant trimming the longer time series down to around 20,000 years in length.\n",
    "\n",
    "We can now apply PCA which will allow us to quantitatively identify a common signal between the four δD paleoclimate reconstructions through the [`.pca`](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.multipleseries.MultipleSeries.pca) method. Note that this line of code may take a few minutes to run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "africa_PCA = africa_ct.pca()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The result is an object containing multiple outputs, and with two plotting methods attached to it. The two outputs we'll look at are pctvar (the variance) and pcs (the principal components). \n",
    "\n",
    "First, let's print the percentage of variance accounted for by each mode, which is saved as pctvar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 224,
     "status": "ok",
     "timestamp": 1681484449770,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "print(africa_PCA.pctvar.round())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "This means that 97% of the variance in the four paleoclimate records is explained by the first principal component. The number of datasets in the PCA constrains the number of principal components that can be defined, which is why we only have four components in this example.\n",
    "\n",
    "We can now look at the principal component of the first mode of variance. Let's create a new series for the first mode of variance and plot it against the original datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "africa_pc1 = africa_PCA.pcs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "africa_mode1 = pyleo.Series(\n",
    "    time=africa_ct.series_list[0].time,\n",
    "    value=africa_PCA.pcs[:, 0],\n",
    "    label=r\"$PC_1$\",\n",
    "    value_name=\"PC1\",\n",
    "    time_name=\"age\",\n",
    "    time_unit=\"yr BP\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "executionInfo": {
     "elapsed": 1026,
     "status": "ok",
     "timestamp": 1681484506972,
     "user": {
      "displayName": "Sloane Garelick",
      "userId": "04706287370408131987"
     },
     "user_tz": 240
    }
   },
   "outputs": [],
   "source": [
    "fig, ax1 = plt.subplots()\n",
    "\n",
    "ax1.set_ylabel(\"dDwax\")\n",
    "ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis\n",
    "ax2.set_ylabel(\"PC1\")  # we already handled the x-label with ax1\n",
    "\n",
    "# plt.plot(mode1.time,pc1_scaled)\n",
    "africa_mode1.plot(color=\"black\", ax=ax2, invert_yaxis=True)\n",
    "africa_ct.plot(ax=ax1, linewidth=0.5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The original δD records are shown in the colored lines, and the first principal component (PC1) time series is shown in black. \n",
    "\n",
    "## **Questions 3: Climate Connection**\n",
    " \n",
    "\n",
    "1.   How do the original time series compare to the PC1 time series? Do they record similar trends?\n",
    "2.   Which original δD record most closely resembles the PC1 time series? Which is the most different?\n",
    "3.   What changes in climate does the PC1 time series record over the past 20,000 years? *Hint: remember that more depleted (more negative) δD suggests increased rainfall.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove explanation\n",
    "\n",
    "\"\"\"\n",
    "1. The overall trends are similar. However, there is more variability in the original δD records than in the PC1 time series. This makes sense because the principal component analysis identifies a common signal between original δD reconstructions and therefore is inherently less variable.\n",
    "2. The Gulf of Aden δD record is the most similar to the PC1 time series. The Lake Bosumtwi δD record is the least similar to the PC1 time series. This difference is particularly noticeable between 5,000-10,000 years ago.\n",
    "3. The PC1 time series suggests a drier climate (more positive δD) in Africa over the past 20,000 years. Recall that 20,000 years ago was the last glacial period. African rainfall increased ~15,000 years ago during the deglacial transition. There is a short period of drying ~12,000 years ago, which corresponds to the timing of the Younger Dryas (a millennial-scale, Northern Hemisphere high-latitude cooling event). This is followed by an increase in rainfall associated with what is known as the African Humid Period (a period of wet, humid conditions in northern and equatorial Africa, driven by intensification of the African monsoon). Finally, there is a decrease in rainfall over the past ~8,000 years towards the present. Note: δD only records qualitative changes in rainfall (i.e., wetter or drier relative to another time period), and does not provide quantitative measurements of the amount of rainfall.\n",
    "\"\"\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Summary**\n",
    "In this tutorial, you explored a variety of computational techniques for analyzing paleoclimate data. You learned how to handle irregular data and place these records on a uniform time axis using interpolation and binning. \n",
    "\n",
    "You then explored Principal Component Analysis (PCA), a method that reveals common signals across various paleoclimate reconstructions. To effectively utilize PCA, it's essential to have all records on the same sampling resolution and same time-step, which can be achieved using the resampling tools you learned in this tutorial.\n",
    "\n",
    "For your practical example, we used a dataset of hydrogen isotopes of leaf waxes (δD<sub>wax</sub>) from Lake Tanganyika in East Africa to further enhance your understanding of the discussed techniques, equipping you to better analyze and understand the complexities of paleoclimate data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# **Resources**\n",
    "\n",
    "Code for this tutorial is based on existing notebooks from LinkedEarth for [anlayzing LiPD datasets](https://github.com/LinkedEarth/paleoHackathon/blob/main/notebooks/PaleoHack-nb03_EDA.ipynb) and [resampling data with `Pyleoclim`](https://github.com/LinkedEarth/PyleoTutorials/blob/main/notebooks/L1_uniform_time_sampling.ipynb).\n",
    "\n",
    "Data from the following sources are used in this tutorial:\n",
    "\n",
    "*   Tierney, J.E., et al. 2008. Northern Hemisphere Controls on Tropical Southeast African Climate During the Past 60,000 Years. Science, Vol. 322, No. 5899, pp. 252-255, 10 October 2008. https://doi.org/10.1126/science.1160485.\n",
    "*   Tierney, J.E., and deMenocal, P.. 2013. Abrupt Shifts in Horn of Africa Hydroclimate Since the Last Glacial Maximum. Science, 342(6160), 843-846. https://doi.org/10.1126/science.1240411.\n",
    "*   Tierney, J.E., Pausata, F., deMenocal, P. 2017. Rainfall Regimes of the Green Sahara. Science Advances, 3(1), e1601503. https://doi.org/10.1126/sciadv.1601503. \n",
    "*   Shanahan, T.M., et al. 2015. The time-transgressive termination of the African Humid Period. Nature Geoscience, 8(2), 140-144. https://doi.org/10.1038/ngeo2329."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "machine_shape": "hm",
   "name": "W1D4_Tutorial5",
   "provenance": [
    {
     "file_id": "1l594NT5i1ubNU9d5esRFZvwj87ivhusI",
     "timestamp": 1680610931501
    },
    {
     "file_id": "1lHuVrVtAW4fQzc0dFdlZuwY0i71KWw_t",
     "timestamp": 1677637469824
    }
   ],
   "toc_visible": true
  },
  "gpuClass": "standard",
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}