---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.5
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

+++ {"slideshow": {"slide_type": "slide"}}

# Pandas

+++ {"slideshow": {"slide_type": "-"}, "tags": ["remove-cell"]}

**CS1302 Introduction to Computer Programming**
___

+++ {"slideshow": {"slide_type": "subslide"}}

In this lab, we will analyze COVID19 data using a powerful package called [`pandas`](https://pandas.pydata.org/docs/user_guide/index.html).  
The package name comes from *panel data* and *Python for data analysis*.

+++ {"slideshow": {"slide_type": "slide"}}

## Loading CSV Files with Pandas 

+++ {"slideshow": {"slide_type": "fragment"}}

[DATA.GOV.HK](https://data.gov.hk/en-data/dataset/hk-dh-chpsebcddr-novel-infectious-agent) provides an [API](https://data.gov.hk/en/help/api-spec#historicalAPI) to retrieve historical data on COVID-19 cases in Hong Kong.

+++ {"slideshow": {"slide_type": "fragment"}}

The following uses the `urlencode` function to create the url that links to a csv file containing probable and confirmed cases of COVID-19 by Aug 1st, 2020.

```{code-cell} ipython3
---
slideshow:
  slide_type: '-'
---
from urllib.parse import urlencode

url_data_gov_hk_get = "https://api.data.gov.hk/v1/historical-archive/get-file"
url_covid_csv = "http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv"
time = "20200801-1204"
url_covid = url_data_gov_hk_get + "?" + urlencode({"url": url_covid_csv, "time": time})

print(url_covid)
```

+++ {"slideshow": {"slide_type": "fragment"}}

````{tip}

`urlencode` creates a string `'url=<...>&time=<...>'` with some [special symbols encoded](https://www.w3schools.com/tags/ref_urlencode.ASP), e.g.:
- `:` is replaced by `%3A`, and
- `/` is replaced by `%2F`.

````

+++ {"slideshow": {"slide_type": "subslide"}}

**Exercise** Write a function `simple_encode` that takes in a string and return a string with `:` and `/` encoded as described above.

```{code-cell} ipython3
---
deletable: false
nbgrader:
  cell_type: code
  checksum: fcc5ddee23fc2282a337b2ae4e443334
  grade: false
  grade_id: simple_encode
  locked: false
  schema_version: 3
  solution: true
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output]
---
def simple_encode(string):
    """Returns the string with : and / encoded to %3A and %2F respectively."""
    # YOUR CODE HERE
    raise NotImplementedError()
```

````{hint} 

Use the `replace` method of `str`.

````

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: 69acab1e962a5e893f87449880b0fb7d
  grade: true
  grade_id: test-simple_encode
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
slideshow:
  slide_type: '-'
tags: [hide-input, remove-output]
---
# tests
assert (
    simple_encode("http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv")
    == "http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv"
)
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: f435b2f22350119a1ccbb85254d296bb
  grade: true
  grade_id: htest-simple_encode
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
tags: [remove-cell]
---
# hidden tests
```

+++ {"slideshow": {"slide_type": "subslide"}}

Like the function `open` that loads a file into memory, `pandas` has a function `read_csv` that loads a csv file. The csv file can even reside on the web:

```{code-cell} ipython3
---
slideshow:
  slide_type: fragment
---
import pandas as pd

df_covid = pd.read_csv(url_covid)

print(type(df_covid))
df_covid
```

+++ {"slideshow": {"slide_type": "fragment"}}

````{tip}

The above creates a [`DataFrame` object](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame):  
- The content of the csv file is displayed as an HTML table conveniently.   
- We can control how much information to show by setting the [display options](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).)

````

+++ {"slideshow": {"slide_type": "subslide"}}

**Exercise** Using the function `pd.read_csv`, load `building_list_eng.csv` as `df_building` from the url `url_building`.  

```{code-cell} ipython3
---
deletable: false
nbgrader:
  cell_type: code
  checksum: d45e94c9d1bc90db12e8d75eb74673d3
  grade: false
  grade_id: df_building
  locked: false
  schema_version: 3
  solution: true
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output]
---
url_building_csv = "http://www.chp.gov.hk/files/misc/building_list_eng.csv"
time = "20200801-1203"
url_building = (
    url_data_gov_hk_get + "?" + urlencode({"url": url_building_csv, "time": time})
)
# YOUR CODE HERE
raise NotImplementedError()
df_building
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: f723c297315d804d79a6408f22fd1ea2
  grade: true
  grade_id: test-df_building
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output, hide-input]
---
# tests
assert all(
    df_building.columns
    == [
        "District",
        "Building name",
        "Last date of residence of the case(s)",
        "Related probable/confirmed cases",
    ]
)  # check column names
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: ea66059af48263d1d8b23cbcf48e7637
  grade: true
  grade_id: htest-df_building
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
tags: [remove-cell]
---
# hidden tests
```

+++ {"slideshow": {"slide_type": "subslide"}}

## Selecting and Removing columns

+++ {"slideshow": {"slide_type": "fragment"}}

We can obtain the column labels of a `Dataframe` using its `columns` attribute.

```{code-cell} ipython3
---
slideshow:
  slide_type: '-'
---
df_covid.columns
```

+++ {"slideshow": {"slide_type": "fragment"}}

Using the indexing operator `[]`, a column of a `DataFrame` can be returned as a [`Series` object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), which is essentially a named array.   
We can further use the method `value_counts` to return the counts of different values in another `Series` object.

```{code-cell} ipython3
---
slideshow:
  slide_type: '-'
---
series_gender_counts = df_covid[
    "Gender"
].value_counts()  # return the number of male and female cases

print(type(series_gender_counts))
series_gender_counts
```

+++ {"slideshow": {"slide_type": "subslide"}}

**Exercise** For `df_building`, use the operator `[]` and method `value_counts` to assign `series_district_counts` to a `Series` object that stores the counts of buildings in different district.

```{code-cell} ipython3
---
code_folding: []
deletable: false
nbgrader:
  cell_type: code
  checksum: b9589e1b6a5cec1ed3814a03a688a2df
  grade: false
  grade_id: series_district_counts
  locked: false
  schema_version: 3
  solution: true
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output]
---
# YOUR CODE HERE
raise NotImplementedError()
series_district_counts
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: 4fe69462a446c737e8604c053360475a
  grade: true
  grade_id: test-series_district_counts
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output, hide-input]
---
# tests
assert all(series_district_counts[["Wong Tai Sin", "Kwun Tong"]] == [313, 212])
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: 76ed779a7bcd6dd420ec33444bf1a743
  grade: true
  grade_id: htest-series_district_counts
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
tags: [remove-cell]
---
# hidden tests
```

+++ {"slideshow": {"slide_type": "subslide"}}

In `df_covid`, it appears that the column `Name of hospital admitted` contains no information. We can confirm this by
1. returning the column as a `Series` with `df_covid_cases['Name of hospital admitted']`, and
1. printing an array of unique column values using the method `unique`.

```{code-cell} ipython3
---
slideshow:
  slide_type: '-'
---
df_covid["Name of hospital admitted"].unique()
```

+++ {"slideshow": {"slide_type": "subslide"}}

**Exercise** Drop the column `Name of hospital admitted` from `df_covid` using the `drop` method of the DataFrame. 

```{code-cell} ipython3
---
deletable: false
nbgrader:
  cell_type: code
  checksum: 92ec259cb0de77ebe185cbbc17d58395
  grade: false
  grade_id: drop
  locked: false
  schema_version: 3
  solution: true
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output]
---
# YOUR CODE HERE
raise NotImplementedError()
df_covid
```

````{hint}

Consider reading the documentation of the `drop` method for  
- mutating `df_covid` in place instead of creating a copy of the DataFrame with the column dropped, but
- suppressing error when dropping a column that does not exist or has already been dropped.

````

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: f924628b48e99a070a9ef0f25b297aad
  grade: true
  grade_id: test-drop
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
tags: [hide-input, remove-output]
---
# tests
assert all(
    df_covid.columns
    == [
        "Case no.",
        "Report date",
        "Date of onset",
        "Gender",
        "Age",
        "Hospitalised/Discharged/Deceased",
        "HK/Non-HK resident",
        "Case classification*",
        "Confirmed/probable",
    ]
)
```

+++ {"slideshow": {"slide_type": "slide"}}

## Selecting Rows of DataFrame

+++ {"slideshow": {"slide_type": "fragment"}}

We can select the confirmed male cases using the attribute [`loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) and the indexing operator `[]`.

```{code-cell} ipython3
---
slideshow:
  slide_type: '-'
---
df_confirmed_male = df_covid.loc[
    (df_covid["Confirmed/probable"] == "Confirmed") & (df_covid["Gender"] == "M")
]
print(type(df_covid.loc))
df_confirmed_male
```

````{tip}

`loc` essentially returns an object that implements the [advanced indexing method](https://numpy.org/doc/stable/reference/arrays.indexing.html#advanced-indexing) for `__getitem__`. In particular, the above uses [boolean indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#boolean-array-indexing).

````

+++ {"slideshow": {"slide_type": "subslide"}}

**Exercise** Assign `df_confirmed_local` to a `DataFrame` of confirmed cases that are local or epidemiologically linked with a local case.

```{code-cell} ipython3
---
deletable: false
nbgrader:
  cell_type: code
  checksum: 617372bff993b2f7f7810f2d97a9d106
  grade: false
  grade_id: df_confirmed_local
  locked: false
  schema_version: 3
  solution: true
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output]
---
# YOUR CODE HERE
raise NotImplementedError()
df_confirmed_local
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: 1291e901430037b15777d2fe481cddbb
  grade: true
  grade_id: test-df_confirmed_local
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
slideshow:
  slide_type: fragment
tags: [remove-output, hide-input]
---
# tests
assert set(df_confirmed_local["Case classification*"].unique()) == {
    "Epidemiologically linked with local case",
    "Local case",
}
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: 285c0f24376b876b8f458e07a05563df
  grade: true
  grade_id: htest-df_confirmed_local
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
tags: [remove-cell]
---
# hidden tests
```

+++ {"slideshow": {"slide_type": "fragment"}}

**Exercise** Write a function `case_counts` that 
- takes an argument `district`, and
- returns the number of cases in `district`. 

```{code-cell} ipython3
---
deletable: false
nbgrader:
  cell_type: code
  checksum: 86188136c62558ec15f61d47cfbbdd7a
  grade: false
  grade_id: case_counts
  locked: false
  schema_version: 3
  solution: true
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output]
---
def case_counts(district):
    # YOUR CODE HERE
    raise NotImplementedError()
```

````{hint} 

Be careful that there can be more than one case for each building and there may be multiple buildings associated with one case. You may want to use the `split` and `strip` methods of `str` to obtain a list of cases from the `Dataframe`.

````

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: 721a790b1a8a2344baaa39eab502b96a
  grade: true
  grade_id: test-case_counts
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
slideshow:
  slide_type: '-'
tags: [remove-output, hide-input]
---
# tests
assert case_counts("Kwai Tsing") == 109
```

```{code-cell} ipython3
---
code_folding: [0]
deletable: false
editable: false
nbgrader:
  cell_type: code
  checksum: e73f4d8c0f1732551ea7ffa66af9fa1d
  grade: true
  grade_id: htest-case_counts
  locked: true
  points: 1
  schema_version: 3
  solution: false
  task: false
tags: [remove-cell]
---
# hidden tests
```