{ "cells": [ { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ClimateMatchAcademy/course-content/blob/main/tutorials/W1D1_ClimateSystemOverview/W1D1_Tutorial1.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Tutorial 1: Creating DataArrays and Datasets to Assess Global Climate Data\n", "\n", "\n", "**Week 1, Day 1, Climate System Overview**\n", "\n", "**Content creators:** Sloane Garelick, Julia Kent\n", "\n", "**Content reviewers:** Yosmely Bermúdez, Katrina Dobson, Younkap Nina Duplex, Danika Gupta, Maria Gonzalez, Will Gregory, Nahid Hasan, Sherry Mi, Beatriz Cosenza Muralles, Jenna Pearson, Chi Zhang, Ohad Zivan \n", "\n", "**Content editors:** Jenna Pearson, Chi Zhang, Ohad Zivan \n", "\n", "**Production editors:** Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan\n", "\n", "**Our 2023 Sponsors:** NASA TOPS and Google DeepMind" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## ![project pythia](https://projectpythia.org/_static/images/logos/pythia_logo-blue-rtext.svg)\n", "\n", "Pythia credit: Rose, B. E. J., Kent, J., Tyle, K., Clyne, J., Banihirwe, A., Camron, D., May, R., Grover, M., Ford, R. R., Paul, K., Morley, J., Eroglu, O., Kailyn, L., & Zacharias, A. (2023). Pythia Foundations (Version v2023.05.01) https://zenodo.org/record/8065851\n", "\n", "## ![CMIP.png](https://github.com/ClimateMatchAcademy/course-content/blob/main/tutorials/Art/CMIP.png?raw=true)\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Tutorial Objectives\n", "\n", "\n", "As you just learned in the Introduction to Climate video, variations in global climate involve various forcings, feedbacks, and interactions between multiple processes and systems. 
Because of this complexity, global climate datasets are often very large with multiple dimensions and variables.\n", "\n", "One useful computational tool for organizing, analyzing and interpreting large global datasets is [Xarray](https://xarray.pydata.org/en/v2023.05.0/getting-started-guide/why-xarray.html), an open source project and Python package that makes working with labelled multi-dimensional arrays simple and efficient.\n", "\n", "In this first tutorial, we will use the `DataArray` and `Dataset` objects, which are used to represent and manipulate spatial data, to practice organizing large global climate datasets and to understand variations in Earth's climate system." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Setup" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Just as `numpy` is conventionally imported as `np` and `pandas` as `pd`, you will often see `xarray` imported under the shortened name `xr`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "tags": [] }, "outputs": [], "source": [ "# imports\n", "import numpy as np\n", "import pandas as pd\n", "import xarray as xr\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Figure Settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Figure Settings\n", "import ipywidgets as widgets # interactive display\n", "\n", "%config InlineBackend.figure_format = 'retina'\n", "plt.style.use(\n", " \"https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Video 1: Introduction to Climate\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @title 
Video 1: Introduction to Climate\n", "\n", "from ipywidgets import widgets\n", "from IPython.display import YouTubeVideo\n", "from IPython.display import IFrame\n", "from IPython.display import display\n", "\n", "\n", "class PlayVideo(IFrame):\n", "    def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n", "        self.id = id\n", "        if source == 'Bilibili':\n", "            src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n", "        elif source == 'Osf':\n", "            src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n", "        else:\n", "            raise ValueError(f'Unsupported video source: {source}')\n", "        super(PlayVideo, self).__init__(src, width, height, **kwargs)\n", "\n", "\n", "def display_videos(video_ids, W=400, H=300, fs=1):\n", "    tab_contents = []\n", "    for i, video_id in enumerate(video_ids):\n", "        out = widgets.Output()\n", "        with out:\n", "            if video_ids[i][0] == 'Youtube':\n", "                video = YouTubeVideo(id=video_ids[i][1], width=W,\n", "                                     height=H, fs=fs, rel=0)\n", "                print(f'Video available at https://youtube.com/watch?v={video.id}')\n", "            else:\n", "                video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n", "                                  height=H, fs=fs, autoplay=False)\n", "                if video_ids[i][0] == 'Bilibili':\n", "                    print(f'Video available at https://www.bilibili.com/video/{video.id}')\n", "                elif video_ids[i][0] == 'Osf':\n", "                    print(f'Video available at https://osf.io/{video.id}')\n", "            display(video)\n", "        tab_contents.append(out)\n", "    return tab_contents\n", "\n", "\n", "video_ids = [('Youtube', 'mc-DkvYLdOA'), ('Bilibili', 'BV1Th4y1j7SS')]\n", "tab_contents = display_videos(video_ids, W=730, H=410)\n", "tabs = widgets.Tab()\n", "tabs.children = tab_contents\n", "for i in range(len(tab_contents)):\n", "    tabs.set_title(i, video_ids[i][0])\n", "display(tabs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tutorial slides\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " These are the slides for the videos in all tutorials today\n" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "cellView": "form", "execution": {}, "pycharm": { "name": "#%%\n" }, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @title Tutorial slides\n", "# @markdown These are the slides for the videos in all tutorials today\n", "from IPython.display import IFrame\n", "link_id = \"4suf5\"" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Introducing the `DataArray` and `Dataset`\n", "\n", "[Xarray](https://xarray.pydata.org/en/v2023.05.0/getting-started-guide/why-xarray.html) expands on the capabilities of [NumPy](https://numpy.org/doc/stable/user/index.html#user) arrays and streamlines many common data manipulation tasks. It is similar in that respect to [Pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), but whereas Pandas excels at working with tabular data, Xarray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's [Common Data Model (CDM)](https://docs.unidata.ucar.edu/netcdf-java/current/userguide/common_data_model_overview.html). " ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Section 1: Creation of a `DataArray` Object\n", "\n", "The `DataArray` is one of the basic building blocks of Xarray (see docs [here](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataarray)). It provides a `numpy.ndarray`-like object that expands to provide two critical pieces of functionality:\n", "\n", "1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful\n", "2. It has a built-in container for attributes\n", "\n", "Here we'll initialize a `DataArray` object by wrapping a plain NumPy array, and explore a few of its properties." 
] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 1.1: Generate a Random NumPy Array\n", "\n", "For our first example, we'll just create a random array of \"temperature\" data in units of Kelvin:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 151, "status": "ok", "timestamp": 1681570301490, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 }, "tags": [] }, "outputs": [], "source": [ "rand_data = 283 + 5 * np.random.randn(5, 3, 4)\n", "rand_data" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 1.2: Wrap the Array: First Attempt\n", "\n", "Now we create a basic `DataArray` just by passing our plain `rand_data` array as an input:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 154, "status": "ok", "timestamp": 1681570303856, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 }, "tags": [] }, "outputs": [], "source": [ "temperature = xr.DataArray(rand_data)\n", "temperature" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Note two things:\n", "\n", "1. Xarray generates some basic dimension names for us (`dim_0`, `dim_1`, `dim_2`). We'll improve this with better names in the next example.\n", "2. Wrapping the NumPy array in a `DataArray` gives us a rich display in the notebook! (Try clicking the array symbol to expand or collapse the view)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 1.3: Assign Dimension Names\n", "\n", "Much of the power of Xarray comes from making use of named dimensions. So let's add some more useful names! 
We can do that by passing an ordered list of names using the keyword argument `dims`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 511, "status": "ok", "timestamp": 1679942484345, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "temperature = xr.DataArray(rand_data, dims=[\"time\", \"lat\", \"lon\"])\n", "temperature" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "This is already an improvement over a NumPy array because we have names for each of the dimensions (or axes). Even better, we can associate arrays representing the values for the coordinates for each of these dimensions with the data when we create the `DataArray`. We'll see this in the next example." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Section 2: Create a `DataArray` with Named Coordinates\n", "\n", "## Section 2.1: Make Time and Space Coordinates\n", "\n", "Here we will use [Pandas](https://foundations.projectpythia.org/core/pandas.html) to create an array of [datetime data](https://foundations.projectpythia.org/core/datetime.html), which we will then use to create a `DataArray` with a named coordinate `time`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 656, "status": "ok", "timestamp": 1679942588784, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "times_index = pd.date_range(\"2018-01-01\", periods=5)\n", "times_index" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We'll also create arrays to represent sample longitude and latitude:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "tags": [] }, "outputs": [], "source": [ "lons = np.linspace(-120, -60, 4)\n", "lats = np.linspace(25, 55, 3)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "### Section 2.1.1: Initialize the `DataArray` with Complete Coordinate Info\n", "\n", "When we create the `DataArray` instance, we pass in the arrays we just created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 320, "status": "ok", "timestamp": 1679942603438, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "temperature = xr.DataArray(\n", " rand_data, coords=[times_index, lats, lons], dims=[\"time\", \"lat\", \"lon\"]\n", ")\n", "temperature" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "### Section 2.1.2: Set Useful Attributes\n", "\n", "We can also set some attribute metadata, which will help provide clear descriptions of the data. In this case, we can specify that we're looking at 'air_temperature' data and the units are 'kelvin'." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 445, "status": "ok", "timestamp": 1679942614596, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "temperature.attrs[\"units\"] = \"kelvin\"\n", "temperature.attrs[\"standard_name\"] = \"air_temperature\"\n", "\n", "temperature" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "### Section 2.1.3: Attributes Are Not Preserved by Default!\n", "\n", "Notice what happens if we perform a mathematical operation with the `DataArray`: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.\n", "\n", "To illustrate this, we'll do a simple unit conversion from Kelvin to Celsius:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 308, "status": "ok", "timestamp": 1679942626636, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "temperature_in_celsius = temperature - 273.15\n", "temperature_in_celsius" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We usually wish to keep metadata with our dataset, even after manipulating the data. For example, it can tell us the units of a variable of interest. So when you perform operations on your data, make sure to check that all the information you want is carried over. If it isn't, you can add it back by setting the attributes again, as shown in Section 2.1.2. 
For an in-depth discussion of how Xarray handles metadata, see the Xarray documentation [here](http://xarray.pydata.org/en/stable/getting-started-guide/faq.html#approach-to-metadata)." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Section 3: The `Dataset`: a Container for `DataArray`s with Shared Coordinates\n", "\n", "Along with `DataArray`, the other key object type in Xarray is the `Dataset`, a dictionary-like container that holds one or more `DataArray`s, which can optionally share coordinates (see docs [here](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset)).\n", "\n", "The most common way to create a `Dataset` object is to load data from a file (which we will practice in a later tutorial). Here, instead, we will create another `DataArray` and combine it with our `temperature` data.\n", "\n", "This will illustrate how the information about common coordinate axes is used." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 3.1: Create a Pressure `DataArray` Using the Same Coordinates\n", "\n", "For our next `DataArray` example, we'll create a random array of `pressure` data in units of hectopascal (hPa). This code mirrors how we created the `temperature` object above." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 335, "status": "ok", "timestamp": 1679942669187, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)\n", "pressure = xr.DataArray(\n", " pressure_data, coords=[times_index, lats,\n", " lons], dims=[\"time\", \"lat\", \"lon\"]\n", ")\n", "pressure.attrs[\"units\"] = \"hPa\"\n", "pressure.attrs[\"standard_name\"] = \"air_pressure\"\n", "\n", "pressure" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 3.2: Create a `Dataset` Object\n", "\n", "Each `DataArray` in our `Dataset` needs a name! \n", "\n", "The most straightforward way to create a `Dataset` with our `temperature` and `pressure` arrays is to pass a dictionary using the keyword argument `data_vars`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 322, "status": "ok", "timestamp": 1679942691730, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "ds = xr.Dataset(data_vars={\"Temperature\": temperature, \"Pressure\": pressure})\n", "ds" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Notice that the `Dataset` object `ds` is aware that both data arrays sit on the same coordinate axes." 
] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 3.3: Access Data Variables and Coordinates in a `Dataset`\n", "\n", "We can pull out any of the individual `DataArray` objects in a few different ways.\n", "\n", "Using the \"dot\" notation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 678, "status": "ok", "timestamp": 1679942736703, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "ds.Pressure" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "... or using dictionary access like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 610, "status": "ok", "timestamp": 1679942746338, "user": { "displayName": "Yosmely Tamira Bermudez Gutierrez", "userId": "07776907551108334395" }, "user_tz": 180 }, "tags": [] }, "outputs": [], "source": [ "ds[\"Pressure\"]" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We'll return to the `Dataset` object when we start loading data from files in later tutorials today." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Summary\n", "\n", "In this initial tutorial, we used the `DataArray` and `Dataset` objects to create and explore synthetic examples of climate data." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Resources\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "tags": [] }, "source": [ "Code and data for this tutorial are based on existing content from [Project Pythia](https://foundations.projectpythia.org/core/xarray/xarray-intro.html)." 
] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "W1D1_Tutorial1", "provenance": [ { "file_id": "1f2uyMuRNCH2LLG5u4Z4Tdb_OHLHB9saW", "timestamp": 1679941598643 } ], "toc_visible": true }, "kernel": { "display_name": "Python 3", "language": "python", "name": "python3" }, "kernelspec": { "display_name": "climatematch", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "toc-autonumbering": false }, "nbformat": 4, "nbformat_minor": 4 }