--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.11.5 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- +++ {"slideshow": {"slide_type": "slide"}} # Pandas +++ {"slideshow": {"slide_type": "-"}, "tags": ["remove-cell"]} **CS1302 Introduction to Computer Programming** ___ +++ {"slideshow": {"slide_type": "subslide"}} In this lab, we will analyze COVID19 data using a powerful package called [`pandas`](https://pandas.pydata.org/docs/user_guide/index.html). The package name comes from *panel data* and *Python for data analysis*. +++ {"slideshow": {"slide_type": "slide"}} ## Loading CSV Files with Pandas +++ {"slideshow": {"slide_type": "fragment"}} [DATA.GOV.HK](https://data.gov.hk/en-data/dataset/hk-dh-chpsebcddr-novel-infectious-agent) provides an [API](https://data.gov.hk/en/help/api-spec#historicalAPI) to retrieve historical data on COVID-19 cases in Hong Kong. +++ {"slideshow": {"slide_type": "fragment"}} The following uses the `urlencode` function to create the url that links to a csv file containing probable and confirmed cases of COVID-19 by Aug 1st, 2020. ```{code-cell} ipython3 --- slideshow: slide_type: '-' --- from urllib.parse import urlencode url_data_gov_hk_get = "https://api.data.gov.hk/v1/historical-archive/get-file" url_covid_csv = "http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv" time = "20200801-1204" url_covid = url_data_gov_hk_get + "?" + urlencode({"url": url_covid_csv, "time": time}) print(url_covid) ``` +++ {"slideshow": {"slide_type": "fragment"}} ````{tip} `urlencode` creates a string `'url=<...>&time=<...>'` with some [special symbols encoded](https://www.w3schools.com/tags/ref_urlencode.ASP), e.g.: - `:` is replaced by `%3A`, and - `/` is replaced by `%2F`. ```` +++ {"slideshow": {"slide_type": "subslide"}} **Exercise** Write a function `simple_encode` that takes in a string and return a string with `:` and `/` encoded as described above. ```{code-cell} ipython3 --- deletable: false nbgrader: cell_type: code checksum: fcc5ddee23fc2282a337b2ae4e443334 grade: false grade_id: simple_encode locked: false schema_version: 3 solution: true task: false slideshow: slide_type: '-' tags: [remove-output] --- def simple_encode(string): """Returns the string with : and / encoded to %3A and %2F respectively.""" # YOUR CODE HERE raise NotImplementedError() ``` ````{hint} Use the `replace` method of `str`. ```` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: 69acab1e962a5e893f87449880b0fb7d grade: true grade_id: test-simple_encode locked: true points: 1 schema_version: 3 solution: false task: false slideshow: slide_type: '-' tags: [hide-input, remove-output] --- # tests assert ( simple_encode("http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv") == "http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv" ) ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: f435b2f22350119a1ccbb85254d296bb grade: true grade_id: htest-simple_encode locked: true points: 1 schema_version: 3 solution: false task: false tags: [remove-cell] --- # hidden tests ``` +++ {"slideshow": {"slide_type": "subslide"}} Like the function `open` that loads a file into memory, `pandas` has a function `read_csv` that loads a csv file. The csv file can even reside on the web: ```{code-cell} ipython3 --- slideshow: slide_type: fragment --- import pandas as pd df_covid = pd.read_csv(url_covid) print(type(df_covid)) df_covid ``` +++ {"slideshow": {"slide_type": "fragment"}} ````{tip} The above creates a [`DataFrame` object](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame): - The content of the csv file is displayed as an HTML table conveniently. - We can control how much information to show by setting the [display options](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).) ```` +++ {"slideshow": {"slide_type": "subslide"}} **Exercise** Using the function `pd.read_csv`, load `building_list_eng.csv` as `df_building` from the url `url_building`. ```{code-cell} ipython3 --- deletable: false nbgrader: cell_type: code checksum: d45e94c9d1bc90db12e8d75eb74673d3 grade: false grade_id: df_building locked: false schema_version: 3 solution: true task: false slideshow: slide_type: '-' tags: [remove-output] --- url_building_csv = "http://www.chp.gov.hk/files/misc/building_list_eng.csv" time = "20200801-1203" url_building = ( url_data_gov_hk_get + "?" + urlencode({"url": url_building_csv, "time": time}) ) # YOUR CODE HERE raise NotImplementedError() df_building ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: f723c297315d804d79a6408f22fd1ea2 grade: true grade_id: test-df_building locked: true points: 1 schema_version: 3 solution: false task: false slideshow: slide_type: '-' tags: [remove-output, hide-input] --- # tests assert all( df_building.columns == [ "District", "Building name", "Last date of residence of the case(s)", "Related probable/confirmed cases", ] ) # check column names ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: ea66059af48263d1d8b23cbcf48e7637 grade: true grade_id: htest-df_building locked: true points: 1 schema_version: 3 solution: false task: false tags: [remove-cell] --- # hidden tests ``` +++ {"slideshow": {"slide_type": "subslide"}} ## Selecting and Removing columns +++ {"slideshow": {"slide_type": "fragment"}} We can obtain the column labels of a `Dataframe` using its `columns` attribute. ```{code-cell} ipython3 --- slideshow: slide_type: '-' --- df_covid.columns ``` +++ {"slideshow": {"slide_type": "fragment"}} Using the indexing operator `[]`, a column of a `DataFrame` can be returned as a [`Series` object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), which is essentially a named array. We can further use the method `value_counts` to return the counts of different values in another `Series` object. ```{code-cell} ipython3 --- slideshow: slide_type: '-' --- series_gender_counts = df_covid[ "Gender" ].value_counts() # return the number of male and female cases print(type(series_gender_counts)) series_gender_counts ``` +++ {"slideshow": {"slide_type": "subslide"}} **Exercise** For `df_building`, use the operator `[]` and method `value_counts` to assign `series_district_counts` to a `Series` object that stores the counts of buildings in different district. ```{code-cell} ipython3 --- code_folding: [] deletable: false nbgrader: cell_type: code checksum: b9589e1b6a5cec1ed3814a03a688a2df grade: false grade_id: series_district_counts locked: false schema_version: 3 solution: true task: false slideshow: slide_type: '-' tags: [remove-output] --- # YOUR CODE HERE raise NotImplementedError() series_district_counts ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: 4fe69462a446c737e8604c053360475a grade: true grade_id: test-series_district_counts locked: true points: 1 schema_version: 3 solution: false task: false slideshow: slide_type: '-' tags: [remove-output, hide-input] --- # tests assert all(series_district_counts[["Wong Tai Sin", "Kwun Tong"]] == [313, 212]) ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: 76ed779a7bcd6dd420ec33444bf1a743 grade: true grade_id: htest-series_district_counts locked: true points: 1 schema_version: 3 solution: false task: false tags: [remove-cell] --- # hidden tests ``` +++ {"slideshow": {"slide_type": "subslide"}} In `df_covid`, it appears that the column `Name of hospital admitted` contains no information. We can confirm this by 1. returning the column as a `Series` with `df_covid_cases['Name of hospital admitted']`, and 1. printing an array of unique column values using the method `unique`. ```{code-cell} ipython3 --- slideshow: slide_type: '-' --- df_covid["Name of hospital admitted"].unique() ``` +++ {"slideshow": {"slide_type": "subslide"}} **Exercise** Drop the column `Name of hospital admitted` from `df_covid` using the `drop` method of the DataFrame. ```{code-cell} ipython3 --- deletable: false nbgrader: cell_type: code checksum: 92ec259cb0de77ebe185cbbc17d58395 grade: false grade_id: drop locked: false schema_version: 3 solution: true task: false slideshow: slide_type: '-' tags: [remove-output] --- # YOUR CODE HERE raise NotImplementedError() df_covid ``` ````{hint} Consider reading the documentation of the `drop` method for - mutating `df_covid` in place instead of creating a copy of the DataFrame with the column dropped, but - suppressing error when dropping a column that does not exist or has already been dropped. ```` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: f924628b48e99a070a9ef0f25b297aad grade: true grade_id: test-drop locked: true points: 1 schema_version: 3 solution: false task: false tags: [hide-input, remove-output] --- # tests assert all( df_covid.columns == [ "Case no.", "Report date", "Date of onset", "Gender", "Age", "Hospitalised/Discharged/Deceased", "HK/Non-HK resident", "Case classification*", "Confirmed/probable", ] ) ``` +++ {"slideshow": {"slide_type": "slide"}} ## Selecting Rows of DataFrame +++ {"slideshow": {"slide_type": "fragment"}} We can select the confirmed male cases using the attribute [`loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) and the indexing operator `[]`. ```{code-cell} ipython3 --- slideshow: slide_type: '-' --- df_confirmed_male = df_covid.loc[ (df_covid["Confirmed/probable"] == "Confirmed") & (df_covid["Gender"] == "M") ] print(type(df_covid.loc)) df_confirmed_male ``` ````{tip} `loc` essentially returns an object that implements the [advanced indexing method](https://numpy.org/doc/stable/reference/arrays.indexing.html#advanced-indexing) for `__getitem__`. In particular, the above uses [boolean indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#boolean-array-indexing). ```` +++ {"slideshow": {"slide_type": "subslide"}} **Exercise** Assign `df_confirmed_local` to a `DataFrame` of confirmed cases that are local or epidemiologically linked with a local case. ```{code-cell} ipython3 --- deletable: false nbgrader: cell_type: code checksum: 617372bff993b2f7f7810f2d97a9d106 grade: false grade_id: df_confirmed_local locked: false schema_version: 3 solution: true task: false slideshow: slide_type: '-' tags: [remove-output] --- # YOUR CODE HERE raise NotImplementedError() df_confirmed_local ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: 1291e901430037b15777d2fe481cddbb grade: true grade_id: test-df_confirmed_local locked: true points: 1 schema_version: 3 solution: false task: false slideshow: slide_type: fragment tags: [remove-output, hide-input] --- # tests assert set(df_confirmed_local["Case classification*"].unique()) == { "Epidemiologically linked with local case", "Local case", } ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: 285c0f24376b876b8f458e07a05563df grade: true grade_id: htest-df_confirmed_local locked: true points: 1 schema_version: 3 solution: false task: false tags: [remove-cell] --- # hidden tests ``` +++ {"slideshow": {"slide_type": "fragment"}} **Exercise** Write a function `case_counts` that - takes an argument `district`, and - returns the number of cases in `district`. ```{code-cell} ipython3 --- deletable: false nbgrader: cell_type: code checksum: 86188136c62558ec15f61d47cfbbdd7a grade: false grade_id: case_counts locked: false schema_version: 3 solution: true task: false slideshow: slide_type: '-' tags: [remove-output] --- def case_counts(district): # YOUR CODE HERE raise NotImplementedError() ``` ````{hint} Be careful that there can be more than one case for each building and there may be multiple buildings associated with one case. You may want to use the `split` and `strip` methods of `str` to obtain a list of cases from the `Dataframe`. ```` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: 721a790b1a8a2344baaa39eab502b96a grade: true grade_id: test-case_counts locked: true points: 1 schema_version: 3 solution: false task: false slideshow: slide_type: '-' tags: [remove-output, hide-input] --- # tests assert case_counts("Kwai Tsing") == 109 ``` ```{code-cell} ipython3 --- code_folding: [0] deletable: false editable: false nbgrader: cell_type: code checksum: e73f4d8c0f1732551ea7ffa66af9fa1d grade: true grade_id: htest-case_counts locked: true points: 1 schema_version: 3 solution: false task: false tags: [remove-cell] --- # hidden tests ```