Tutorial 8: Comparing Satellite Products With In Situ Data#

Week 1, Day 3, Remote Sensing

Content creators: Douglas Rao

Content reviewers: Katrina Dobson, Younkap Nina Duplex, Maria Gonzalez, Will Gregory, Nahid Hasan, Sherry Mi, Beatriz Cosenza Muralles, Jenna Pearson, Agustina Pesce, Chi Zhang, Ohad Zivan

Content editors: Jenna Pearson, Chi Zhang, Ohad Zivan

Production editors: Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan

Our 2023 Sponsors: NASA TOPS and Google DeepMind

Tutorial Objectives#

In this tutorial, our primary focus will be on the ‘Verification of Satellite Data’. Building on our previous modules where we explored various climate applications of satellite data, we will now delve into the critical task of assessing the quality and reliability of such data.

By the end of this tutorial, you will learn how to use land-based observations to validate satellite climate data. In this process you will:

  • Learn how to access gridded climate data derived from station observations via AWS.

  • Learn how to convert monthly total values to a daily rate.

  • Learn how to correctly evaluate satellite data against land-based observations.

Setup#

# !pip install s3fs --quiet

# properly install cartopy in colab to avoid session crash
# !apt-get install libproj-dev proj-data proj-bin --quiet
# !apt-get install libgeos-dev --quiet
# !pip install cython --quiet
# !pip install cartopy --quiet

# !apt-get -qq install python-cartopy python3-cartopy  --quiet
# !pip uninstall -y shapely  --quiet
# !pip install shapely --no-binary shapely  --quiet
# imports
import s3fs
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
import pooch
import os
import tempfile
import boto3
import botocore
from scipy import stats

Figure settings#

# @title Figure settings
import ipywidgets as widgets  # interactive display

%config InlineBackend.figure_format = 'retina'
plt.style.use(
    "https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle"
)

Video 1: Video 1 Name#

# @title Video 1: Video 1 Name
# Tech team will add code to format and display the video
# helper functions


def pooch_load(filelocation="", filename=""):
    shared_location = "/home/jovyan/shared/data/tutorials/W1D3_RemoteSensingLandOceanandAtmosphere"  # this is different for each day
    user_temp_cache = tempfile.gettempdir()

    if os.path.exists(os.path.join(shared_location, filename)):
        file = os.path.join(shared_location, filename)
    else:
        file = pooch.retrieve(
            filelocation, known_hash=None, fname=os.path.join(user_temp_cache, filename)
        )

    return file

Section 1: Evaluating Satellite Data with Observations#

Satellite data is frequently cross-verified against observations deemed reliable to evaluate its quality. Station-based observations and derived data are typically regarded as a reliable reference. When it comes to oceanic data, measurements taken by ships, buoys, drifters, or gliders are often used as a benchmark to assess the quality of satellite data.

In this tutorial, we will use the nClimGrid dataset, a gridded climate dataset produced by NOAA. This dataset provides daily and monthly temperature and precipitation data, leveraging all available station observations. Note, however, that this dataset covers only the United States. We selected it because it is publicly available on AWS. You are encouraged to explore other station data for evaluating satellite data in your projects.

Section 1.1: Accessing nClimGrid - a Station-Based Gridded Climate Dataset#

The nClimGrid-monthly dataset is a gridded dataset derived by spatially interpolating data from the Global Historical Climatology Network (GHCN). It includes monthly precipitation, monthly temperature average, monthly temperature maximum, and monthly temperature minimum, provided on an approximately 5 km × 5 km lat/lon grid covering the Continental United States. Data are available from 1895 to the present via NOAA NCEI or AWS. We will access the data directly via AWS.

# connect to the AWS S3 bucket for the nClimGrid Monthly Precipitation data

# read in the monthly precipitation data from nClimGrid on AWS
client = boto3.client(
    "s3", config=botocore.client.Config(signature_version=botocore.UNSIGNED)
)  # initialize aws s3 bucket client

file_nclimgrid = "noaa-nclimgrid-monthly-pds/nclimgrid_prcp.nc"
ds = xr.open_dataset(
    pooch_load(
        filelocation="http://s3.amazonaws.com/" + file_nclimgrid,
        filename=file_nclimgrid,
    )
)
ds

The nClimGrid dataset is available from 1895-01-01 to the present. Since our GPCP data is only available between 1979-01-01 and 2022-12-01, extract only that time period from the nClimGrid monthly data.

prcp_obs = ds.sel(time=slice("1979-01-01", "2022-12-31"))
prcp_obs

From the metadata of the nClimGrid monthly dataset, we know its precipitation variable is the monthly total precipitation: the total amount of rainfall a location receives over the entire month, in millimeters.

prcp_obs.prcp

However, the GPCP satellite precipitation variable is a daily precipitation rate, with units of mm/day. It quantifies the average amount of precipitation per day at a given location during the month.

To convert from the monthly total to a precipitation rate, we just need to divide the total by the number of days in that month (e.g., 31 days for January). We can use the .dt.days_in_month accessor to achieve that.

# calculate precipitation rate from nClimGrid
obs_rate = prcp_obs.prcp / prcp_obs.time.dt.days_in_month
obs_rate
# generate the map of precipitation rate from nClimGrid monthly data
obs_rate[0].plot()

In this quick map, we can see that the range of precipitation-rate values appears reasonable compared to the GPCP monthly precipitation CDR data (0-20 mm/day).

Section 1.2: Read GPCP Monthly Precipitation Data#

Now we are ready to compare our land-based observations to the monthly GPCP satellite data from the AWS public data catalog.

# get the list of all data files in the AWS S3 bucket
fs = s3fs.S3FileSystem(anon=True)
file_pattern = "noaa-cdr-precip-gpcp-monthly-pds/data/*/gpcp_v02r03_monthly_*.nc"
file_location = fs.glob(file_pattern)

# open connection to all data files
file_ob = [
    pooch_load(filelocation="http://s3.amazonaws.com/" + file, filename=file)
    for file in file_location
]

# open all the monthly data files and concatenate them along the time dimension.
# this process will take ~ 1 minute to complete due to the number of data files.
ds_gpcp = xr.open_mfdataset(file_ob, combine="nested", concat_dim="time")
ds_gpcp
# get the GPCP precipitation rate
prcp_sat = ds_gpcp.precip
prcp_sat

Section 1.3: Spatial Pattern#

Now, let’s take a quick look at the spatial pattern between these two datasets for a selected month (e.g., 1979-01-01).

# set up the geographical region for the continental US
extent = [-120, -70, 21, 50]
central_lon = np.mean(extent[:2])
central_lat = np.mean(extent[2:])

# extract sat and obs data for the month of 1979-01-01
sat = prcp_sat.sel(time="1979-01-01")
obs = obs_rate.sel(time="1979-01-01")

# initiate the plot for North America using two subplots
fig, axs = plt.subplots(
    2,
    subplot_kw={"projection": ccrs.AlbersEqualArea(central_lon, central_lat)},
    figsize=(9, 12),
    sharex=True,
    sharey=True,
)
axs[0].set_extent(extent)
axs[0].coastlines()
axs[0].set_title("GPCP Monthly")
sat.plot(
    ax=axs[0],
    transform=ccrs.PlateCarree(),
    vmin=0,
    vmax=15,
    cbar_kwargs=dict(shrink=0.5, label="GPCP Precipitation (mm/day)"),
)
axs[1].set_extent(extent)
axs[1].coastlines()
axs[1].set_title("nClimGrid Monthly")
obs.plot(
    ax=axs[1],
    transform=ccrs.PlateCarree(),
    vmin=0,
    vmax=15,
    cbar_kwargs=dict(shrink=0.5, label="nClimGrid Precipitation (mm/day)"),
)

Overall, the two datasets show a similar spatial pattern, but at widely different spatial resolutions (~5 km vs. 2.5°). A rough way to put them on a common grid is sketched below.
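
If you want a more quantitative map comparison, one option is to bin-average the ~5 km nClimGrid field onto 2.5° boxes. The sketch below is a simple box average for illustration (not a formal conservative regridding), and the bin edges are assumptions chosen here just to span the CONUS domain:

# bin-average the ~5 km nClimGrid field onto a 2.5-degree grid
# (a crude box average for illustration, not a conservative regridding)
lat_edges = np.arange(20.0, 52.5, 2.5)  # assumed bin edges spanning CONUS
lon_edges = np.arange(-125.0, -62.5, 2.5)

obs_coarse = (
    obs.groupby_bins("lat", lat_edges)
    .mean(dim="lat")
    .groupby_bins("lon", lon_edges)
    .mean(dim="lon")
)
obs_coarse.plot()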

Section 1.4: Time series comparison#

Let's use New York City as an example: we can examine the time series from the satellite and observation-based datasets to evaluate how well they agree.

The latitude and longitude of NYC are (40.71°N, 74.01°W). We will use these coordinates to extract the time series from GPCP and nClimGrid with the nearest-neighbor selection method you discussed on Day 1, since the two datasets have different resolutions.

# note that GPCP stores longitude on a 0-360 degree grid, so 74.01°W becomes 360 - 74.01 = 285.99
sat = prcp_sat.sel(longitude=285.99, latitude=40.71, method="nearest")
obs = obs_rate.sel(lon=-74.01, lat=40.71, method="nearest")  # precipitation rate
obs_total = prcp_obs.sel(lon=-74.01, lat=40.71, method="nearest")  # total amount
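
As a quick sanity check on that conversion, a longitude in the -180° to 180° convention maps to the 0-360° convention with a modulo (a one-line sketch, independent of any dataset):

# converting -74.01° (i.e., 74.01°W) to the 0-360 convention: -74.01 % 360 ≈ 285.99
print(-74.01 % 360)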
# let's look at the comparison between the precipitation rate from nClimGrid and satellite CDR
fig, ax = plt.subplots(figsize=(12, 6))
obs.plot(label="nClimGrid Monthly Precipitation Rate", ax=ax)
sat.plot(color="k", alpha=0.6, label="GPCP Monthly Precipitation Rate", ax=ax)
ax.set_ylabel("nClimGrid v.s. GPCP (New York City) (mm/day)")
ax.legend()

Now we are going to zoom in to a few years to see how the data compares.

fig, ax = plt.subplots(figsize=(12, 6))
obs.sel(time=slice("2011-01-01", "2015-12-01")).plot(label="nClimGrid", ax=ax)
sat.sel(time=slice("2011-01-01", "2015-12-01")).plot(
    marker="o", label="GPCP Monthly", ax=ax
)
ax.legend()

We see close alignment between the nClimGrid and GPCP precipitation rates when we look at the details over this small chosen window. However, we cannot zoom in to every location at all times to confirm they are a good match. Instead, we can use statistics to quantify the relationship between the two datasets in an automated fashion, and we will do so in the next section.

Section 1.5: Quantify the Difference#

One way to more robustly compare the datasets is through a scatter plot. Ideally the data would follow the 1:1 line, which would indicate that they are closely matched. Let's make this plot and see how the two datasets compare:

# make sure that both observation and satellite data cover the same time period
sat = sat.sel(time=slice("1979-01-01", "2022-12-01"))
obs = obs.sel(time=slice("1979-01-01", "2022-12-01"))
# plot the scatter plot between nClimGrid and GPCP monthly precipitation CDR
fig, ax = plt.subplots(figsize=(12, 6))
fig.suptitle("GPCP Precipitaion v.s. nClimGrid")
ax.scatter(sat, obs, alpha=0.6)
# Add 1:1 line
y_lim = (0, 15)
x_lim = (0, 15)
ax.plot((0, 15), (0, 15), "r-")
ax.set_ylim(y_lim)
ax.set_xlim(x_lim)
ax.set_xlabel("GPCP Precipitation (mm/day)")
ax.set_ylabel("nClimGrid (mm/day)")

By eye, there appears to be a strong correlation between the satellite data and the observations for NYC, with much of the data following the 1:1 line plotted in red. As in the last tutorial, we can calculate the correlation coefficient and corresponding p-value.

r, p = stats.pearsonr(sat, obs)
print("Corr Coef: " + str(r) + ", p-val: " + str(p))

As we expected, the data are significantly correlated.
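
Correlation measures covariation, not agreement in magnitude. To quantify the typical size of the difference, you could also compute, for example, the mean bias and root-mean-square error between the two series (a short sketch, assuming the series share time stamps over the common period and have no missing values):

# mean bias and RMSE between the NYC satellite and observation series
diff = sat - obs  # xarray aligns the two series on their shared time stamps
bias = float(diff.mean())
rmse = float(np.sqrt((diff**2).mean()))
print(f"Mean bias: {bias:.3f} mm/day, RMSE: {rmse:.3f} mm/day")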

Coding Exercises 1.5#

  1. Sometimes we are more interested in the differences between anomalies rather than between the total values. Using the same location, compare the anomaly data.

#################################################
# Students: Fill in missing code (...) and comment or remove the next line
raise NotImplementedError(
    "Student exercise: Compare the anomaly of precipitation rate for NYC."
)
#################################################

# calculate climatology for the 1981-2010 period for both GPCP and nClimGrid
sat_clim = ...
obs_clim = ...

# calculate anomaly of the NYC time series for both GPCP and nClimGrid
sat_clim_anom = ...
obs_clim_anom = ...

# plot time series and scatter plot between two time series
fig, ax = plt.subplots(figsize=(12, 6))
obs_clim_anom.sel(time=slice("2011-01-01", "2015-12-01")).plot(
    label="nClimGrid anomaly", ax=ax
)
sat_clim_anom.sel(time=slice("2011-01-01", "2015-12-01")).plot(
    marker="o", label="GPCP Monthly anomaly", ax=ax
)
ax.legend()

# plot the scatter plot between nClimGrid and GPCP monthly precipitation CDR
fig, ax = plt.subplots(figsize=(12, 6))
fig.suptitle("GPCP Precipitaion v.s. nClimGrid")
ax.scatter(sat_clim_anom, obs_clim_anom, alpha=0.6)
# Add 1:1 line (anomalies can be negative, so the limits should include negative values)
y_lim = (-5, 15)
x_lim = (-5, 15)
ax.plot(x_lim, y_lim, "r-")
ax.set_ylim(y_lim)
ax.set_xlim(x_lim)
ax.set_xlabel("GPCP Precipitation anomaly (mm/day)")
ax.set_ylabel("nClimGrid anomaly (mm/day)")

# calculate and print correlation coefficient and p-value
r, p = ...
print("Corr Coef: " + str(r) + ", p-val: " + str(p))

# to_remove solution

# calculate climatology for the 1981-2010 period for both GPCP and nClimGrid
sat_clim = (
    sat.sel(time=slice("1981-01-01", "2010-12-01"))
    .groupby("time.month")
    .mean(dim="time")
)
obs_clim = (
    obs.sel(time=slice("1981-01-01", "2010-12-01"))
    .groupby("time.month")
    .mean(dim="time")
)

# calculate anomaly of the NYC time series for both GPCP and nClimGrid
sat_clim_anom = sat.groupby("time.month") - sat_clim
obs_clim_anom = obs.groupby("time.month") - obs_clim

# plot time series and scatter plot between two time series
fig, ax = plt.subplots(figsize=(12, 6))
obs_clim_anom.sel(time=slice("2011-01-01", "2015-12-01")).plot(
    label="nClimGrid anomaly", ax=ax
)
sat_clim_anom.sel(time=slice("2011-01-01", "2015-12-01")).plot(
    marker="o", label="GPCP Monthly anomaly", ax=ax
)
ax.legend()

# plot the scatter plot between nClimGrid and GPCP monthly precipitation CDR
fig, ax = plt.subplots(figsize=(12, 6))
fig.suptitle("GPCP Precipitaion v.s. nClimGrid")
ax.scatter(sat_clim_anom, obs_clim_anom, alpha=0.6)
# Add 1:1 line (anomalies can be negative, so the limits should include negative values)
y_lim = (-5, 15)
x_lim = (-5, 15)
ax.plot(x_lim, y_lim, "r-")
ax.set_ylim(y_lim)
ax.set_xlim(x_lim)
ax.set_xlabel("GPCP Precipitation anomaly (mm/day)")
ax.set_ylabel("nClimGrid anomaly (mm/day)")

# calculate and print correlation coefficient and p-value
r, p = stats.pearsonr(sat_clim_anom, obs_clim_anom)
print("Corr Coef: " + str(r) + ", p-val: " + str(p))

Summary#

In this tutorial, you explored how to use station-based observations within the U.S. to evaluate satellite precipitation data. While this isn’t a global comparison, the methodology can be applied to other station or observation data you may wish to utilize.

When carrying out these comparisons, remember the following key points (a small helper sketch consolidating them follows this list):

  • Ensure that both the satellite data and the observations represent the same quantity (for example, total precipitation amount versus precipitation rate).

  • Comparisons should be made for the same geolocation (or very near to it) and the same time period.

  • Be aware of potential spatial scale effects. Satellite data measures over a large area, whereas observations may be narrowly focused, particularly for elements that exhibit substantial spatial variability. This can lead to considerable uncertainty in the satellite data.
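
A minimal sketch of such a helper, assuming two 1-D monthly time series whose time stamps match on their overlapping period and contain no missing values (compare_monthly_series is a hypothetical convenience function, not part of the tutorial datasets):

def compare_monthly_series(sat_da, obs_da):
    """Align two monthly time series on shared dates and report Pearson r."""
    # keep only the time stamps present in both series
    sat_aligned, obs_aligned = xr.align(sat_da, obs_da, join="inner")
    r, p = stats.pearsonr(sat_aligned.values, obs_aligned.values)
    return r, p

# example usage with the NYC series from Sections 1.4-1.5:
# r, p = compare_monthly_series(sat, obs)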

Resources#

Data from this tutorial can be accessed on AWS: the nClimGrid monthly dataset and the GPCP Monthly precipitation CDR.