Pandas¶

In this lab, we will analyze COVID19 data using a powerful package called pandas.
The package name comes from panel data and Python for data analysis.

Loading CSV Files with Pandas¶

DATA.GOV.HK provides an API to retrieve historical data on COVID-19 cases in Hong Kong.

The following uses the urlencode function to create the url that links to a csv file containing probable and confirmed cases of COVID-19 by Aug 1st, 2020.

from urllib.parse import urlencode

url_data_gov_hk_get = "https://api.data.gov.hk/v1/historical-archive/get-file"
url_covid_csv = "http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv"
time = "20200801-1204"
url_covid = url_data_gov_hk_get + "?" + urlencode({"url": url_covid_csv, "time": time})

print(url_covid)

https://api.data.gov.hk/v1/historical-archive/get-file?url=http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv&time=20200801-1204

Tip

urlencode creates a string 'url=<...>&time=<...>' with some special symbols encoded, e.g.:

: is replaced by %3A, and
/ is replaced by %2F.

Exercise Write a function simple_encode that takes in a string and return a string with : and / encoded as described above.

def simple_encode(string):
    """Returns the string with : and / encoded to %3A and %2F respectively."""
    # YOUR CODE HERE
    raise NotImplementedError()

Hint

Use the replace method of str.

# tests
assert (
    simple_encode("http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv")
    == "http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv"
)

Like the function open that loads a file into memory, pandas has a function read_csv that loads a csv file. The csv file can even reside on the web:

import pandas as pd

df_covid = pd.read_csv(url_covid)

print(type(df_covid))
df_covid

<class 'pandas.core.frame.DataFrame'>

	Case no.	Report date	Date of onset	Gender	Age	Name of hospital admitted	Hospitalised/Discharged/Deceased	HK/Non-HK resident	Case classification*	Confirmed/probable
0	1	23/01/2020	21/01/2020	M	39	NaN	Discharged	Non-HK resident	Imported case	Confirmed
1	2	23/01/2020	18/01/2020	M	56	NaN	Discharged	HK resident	Imported case	Confirmed
2	3	24/01/2020	20/01/2020	F	62	NaN	Discharged	Non-HK resident	Imported case	Confirmed
3	4	24/01/2020	23/01/2020	F	62	NaN	Discharged	Non-HK resident	Imported case	Confirmed
4	5	24/01/2020	23/01/2020	M	63	NaN	Discharged	Non-HK resident	Imported case	Confirmed
...	...	...	...	...	...	...	...	...	...	...
3268	3269	31/07/2020	26/07/2020	M	22	NaN	To be provided	HK Resident	Local case	Confirmed
3269	3270	31/07/2020	28/07/2020	F	31	NaN	To be provided	HK Resident	Epidemiologically linked with local case	Confirmed
3270	3271	31/07/2020	Asymptomatic	F	36	NaN	To be provided	HK Resident	Epidemiologically linked with local case	Confirmed
3271	3272	31/07/2020	Pending	F	22	NaN	To be provided	HK Resident	Local case	Confirmed
3272	3273	31/07/2020	28/07/2020	M	68	NaN	To be provided	HK Resident	Epidemiologically linked with local case	Confirmed

3273 rows × 10 columns

Tip

The above creates a DataFrame object:

The content of the csv file is displayed as an HTML table conveniently.
We can control how much information to show by setting the display options.)

Exercise Using the function pd.read_csv, load building_list_eng.csv as df_building from the url url_building.

url_building_csv = "http://www.chp.gov.hk/files/misc/building_list_eng.csv"
time = "20200801-1203"
url_building = (
    url_data_gov_hk_get + "?" + urlencode({"url": url_building_csv, "time": time})
)
# YOUR CODE HERE
raise NotImplementedError()
df_building

# tests
assert all(
    df_building.columns
    == [
        "District",
        "Building name",
        "Last date of residence of the case(s)",
        "Related probable/confirmed cases",
    ]
)  # check column names

Selecting and Removing columns¶

We can obtain the column labels of a Dataframe using its columns attribute.

df_covid.columns

Index(['Case no.', 'Report date', 'Date of onset', 'Gender', 'Age',
       'Name of hospital admitted', 'Hospitalised/Discharged/Deceased',
       'HK/Non-HK resident', 'Case classification*', 'Confirmed/probable'],
      dtype='object')

Using the indexing operator [], a column of a DataFrame can be returned as a Series object, which is essentially a named array.
We can further use the method value_counts to return the counts of different values in another Series object.

series_gender_counts = df_covid[
    "Gender"
].value_counts()  # return the number of male and female cases

print(type(series_gender_counts))
series_gender_counts

<class 'pandas.core.series.Series'>

F    1648
M    1625
Name: Gender, dtype: int64

Exercise For df_building, use the operator [] and method value_counts to assign series_district_counts to a Series object that stores the counts of buildings in different district.

# YOUR CODE HERE
raise NotImplementedError()
series_district_counts

# tests
assert all(series_district_counts[["Wong Tai Sin", "Kwun Tong"]] == [313, 212])

In df_covid, it appears that the column Name of hospital admitted contains no information. We can confirm this by

returning the column as a Series with df_covid_cases['Name of hospital admitted'], and
printing an array of unique column values using the method unique.

df_covid["Name of hospital admitted"].unique()

array([nan])

Exercise Drop the column Name of hospital admitted from df_covid using the drop method of the DataFrame.

# YOUR CODE HERE
raise NotImplementedError()
df_covid

Hint

Consider reading the documentation of the drop method for

mutating df_covid in place instead of creating a copy of the DataFrame with the column dropped, but
suppressing error when dropping a column that does not exist or has already been dropped.

# tests
assert all(
    df_covid.columns
    == [
        "Case no.",
        "Report date",
        "Date of onset",
        "Gender",
        "Age",
        "Hospitalised/Discharged/Deceased",
        "HK/Non-HK resident",
        "Case classification*",
        "Confirmed/probable",
    ]
)

Selecting Rows of DataFrame¶

We can select the confirmed male cases using the attribute loc and the indexing operator [].

df_confirmed_male = df_covid.loc[
    (df_covid["Confirmed/probable"] == "Confirmed") & (df_covid["Gender"] == "M")
]
print(type(df_covid.loc))
df_confirmed_male

<class 'pandas.core.indexing._LocIndexer'>

	Case no.	Report date	Date of onset	Gender	Age	Name of hospital admitted	Hospitalised/Discharged/Deceased	HK/Non-HK resident	Case classification*	Confirmed/probable
0	1	23/01/2020	21/01/2020	M	39	NaN	Discharged	Non-HK resident	Imported case	Confirmed
1	2	23/01/2020	18/01/2020	M	56	NaN	Discharged	HK resident	Imported case	Confirmed
4	5	24/01/2020	23/01/2020	M	63	NaN	Discharged	Non-HK resident	Imported case	Confirmed
5	6	26/01/2020	21/01/2020	M	47	NaN	Discharged	HK resident	Imported case	Confirmed
7	8	26/01/2020	25/01/2020	M	64	NaN	Discharged	Non-HK resident	Imported case	Confirmed
...	...	...	...	...	...	...	...	...	...	...
3264	3265	31/07/2020	25/07/2020	M	54	NaN	To be provided	HK Resident	Local case	Confirmed
3265	3266	31/07/2020	30/07/2020	M	69	NaN	To be provided	HK Resident	Epidemiologically linked with local case	Confirmed
3267	3268	31/07/2020	23/07/2020	M	61	NaN	To be provided	HK Resident	Epidemiologically linked with local case	Confirmed
3268	3269	31/07/2020	26/07/2020	M	22	NaN	To be provided	HK Resident	Local case	Confirmed
3272	3273	31/07/2020	28/07/2020	M	68	NaN	To be provided	HK Resident	Epidemiologically linked with local case	Confirmed

1624 rows × 10 columns

Tip

loc essentially returns an object that implements the advanced indexing method for __getitem__. In particular, the above uses boolean indexing.

Exercise Assign df_confirmed_local to a DataFrame of confirmed cases that are local or epidemiologically linked with a local case.

# YOUR CODE HERE
raise NotImplementedError()
df_confirmed_local

# tests
assert set(df_confirmed_local["Case classification*"].unique()) == {
    "Epidemiologically linked with local case",
    "Local case",
}

Exercise Write a function case_counts that

takes an argument district, and
returns the number of cases in district.

def case_counts(district):
    # YOUR CODE HERE
    raise NotImplementedError()

Hint

Be careful that there can be more than one case for each building and there may be multiple buildings associated with one case. You may want to use the split and strip methods of str to obtain a list of cases from the Dataframe.

# tests
assert case_counts("Kwai Tsing") == 109

CS1302

Pandas¶

Loading CSV Files with Pandas¶

Selecting and Removing columns¶

Selecting Rows of DataFrame¶