Dictionaries and Sets

CS1302 Introduction to Computer Programming

import random
import matplotlib.pyplot as plt
%matplotlib widget
%reload_ext divewidgets

Motivation for associative container¶

The following code simulates the outcomes of rolling a die multiple times.

dice_rolls = [random.randint(1, 6) for i in range(10)]
print(*dice_rolls)

4 4 6 4 4 3 2 6 3 2

What is the distribution, i.e., fractional counts?

distribution = [dice_rolls.count(i) / len(dice_rolls) for i in range(7)]
plt.stem(range(7), distribution)
plt.xlabel("Outcomes")
plt.title("Distribution")
plt.ylim(0, 1)

Loading...

In the above code, distribution[i] stores the fractional count of outcome i.

However, distribution[0] is 0 because a dice does not have outcome 0. Can we avoid such redundancy?

distinct_outcomes = [
    outcome for outcome in range(1, 7) if dice_rolls.count(outcome) > 0
]
distribution = [
    dice_rolls.count(distinct_outcomes[i]) / len(dice_rolls)
    for i in range(len(distinct_outcomes))
]

plt.stem(distinct_outcomes, distribution)
plt.xlabel("Outcomes")
plt.title("Distribution")
plt.ylim(0, 1)

(0.0, 1.0)

In the above code,

distinct_outcomes stores the list of distinct outcomes, and
distribution[distinct_outcomes[i]] stores the fractional count of the i-th distinct outcome.

What about finding the distribution of characters in an article?
There are over 1 million unicode characters. Can we:

Obtain the distribution efficiently without creating an entry for each unicode character?
Compute the set of distinct characters efficiently without iterating over the set of all unicode characters?
Access the fractional count of a particular character efficiently without searching through the list of distinct outcomes?
Index distribution directly by the distinct characters in the article?

It is desirable to have a composite data type that

can keep a set of unique keys of different types (such as the characters in our example), and
associate to different keys possibly different values of any types (such as the fractional counts of the characters).

Such a data structure is called an associative container.

How to use associative containers in python?

We have already used sets and dictionaries before.

%%optlite -h 400
a = (lambda **kwargs: kwargs)(start=0, stop=5, step=1)
b = set([1, 1, 2, 3, 3, 3])
assert len(a) == len(b)

Loading...

Constructing associative containers¶

How to create a set/dictionary?

Similar to tuple/list, we can use enclosure, constructors, and comprehension.

How to create a set/dict by enumerating its keys/values?

For dict, enclose a comma-separated sequence of key : value pairs by braces { and }.

%%optlite -h 400
empty_dictionary = {}
a = {"a": 0, "b": 1}
b = {**a, "c": 0, "d": 1}

Loading...

For set, omit : value.

%%optlite -h 300
a = {(1, 2.0), print, *range(2), *"23"}
empty_set = {*()}  # Why not use {}?

Loading...

We can also create a set/dictionary from other objects using their constructors set/dict.

%%optlite -l -h 550
empty_set = set()
string2set = set("abc")
range2set = set(range(2))
list2set = set(["abc", range(2)])
set2set = set(list2set)

Loading...

%%optlite -l -h 650
empty_dict = dict()
enumerate2dict = dict(enumerate("abc"))
zip2dict = dict(zip("abc", "123"))
kwargs2dict = dict(one=1, two=2)
dict2dict = dict(kwargs2dict)

Loading...

dict.fromkeys?
### BEGIN SOLUTION
fromkeys_dict = dict.fromkeys(range(100), 0)
### END SOLUTION

# test
assert all(fromkeys_dict[k] == 0 for k in fromkeys_dict)

Signature: dict.fromkeys(iterable, value=None, /)
Docstring: Create a new dictionary with keys from iterable and values set to value.
Type:      builtin_function_or_method

How to use a rule to construct a set/dictionary?

The following function uses a one-line dictionary comprehension to return the distribution of items in a sequence:

def distribute(seq):
    return {k: seq.count(k) / len(seq) for k in set(seq)}

def plot_distribution(seq):
    dist = distribute(seq)
    plt.stem(
        dist.keys(),  # set-like view of the keys
        dist.values(),  # view of the values
    )
    plt.xlabel("Items")
    plt.title("Distribution")
    plt.ylim(0, 1)


plot_distribution("What is the distribution of different characters?")

The object methods keys and values provide a dynamic view of the keys.
Unlike a copy, subsequent changes to the dictionary are also reflected in a previously returned view.
items provides a set-like view of the key-value pairs.

%%optlite -h 500
a = dict(enumerate("abc"))
views = a.keys(), a.values(), a.items()
print(a.pop(1))  # remove the key 1 and its associated value
print(a.popitem())  # remove and return one key-value pair
a.clear()  # clear the dictionary

Loading...

set has pop and clear but not popitem. However, set.pop behaves like dict.popitem instead of dict.pop.

%%optlite -h 250
a = set("abc")
print(a.pop())  # remove and return an element
a.clear()  # clear the set

Loading...

def composite_set(stop):
    ### BEGIN SOLUTION
    return {x for factor in range(2, stop) for x in range(factor * 2, stop, factor)}
    ### END SOLUTION


print(*sorted(composite_set(100)))

4 6 8 9 10 12 14 15 16 18 20 21 22 24 25 26 27 28 30 32 33 34 35 36 38 39 40 42 44 45 46 48 49 50 51 52 54 55 56 57 58 60 62 63 64 65 66 68 69 70 72 74 75 76 77 78 80 81 82 84 85 86 87 88 90 91 92 93 94 95 96 98 99

Hashability¶

Why are some keys missing in the following dictionary/set?

%%optlite -h 350
a = {0: "a", 0.0: "b", 2: "b"}
b = {0j, 0, 0.0, "", False}
assert 0 == 0.0 == 0j == False != ""

Loading...

Associative containers are implemented as hash tables for efficient lookup of keys.

%%html
<iframe width="800" height="450" src="https://www.youtube.com/embed/LPzN8jgbnvA" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Loading...

Python also uses dictionaries to implement global/local frames, and hash collisions can slow down the lookup process.

What are hashable?

def hashable(obj):
    try:
        hash(obj)
    except Exception:
        return False
    return True


for i in 0, 0.0, 0j, "", False, (), [], {}, set(), frozenset(), ([],):
    if hashable(i):
        print("{} may be hashable. E.g., hash({!r}) == {}".format(type(i), i, hash(i)))
    else:
        print("{} may not be hashable.".format(type(i)))

<class 'int'> may be hashable. E.g., hash(0) == 0
<class 'float'> may be hashable. E.g., hash(0.0) == 0
<class 'complex'> may be hashable. E.g., hash(0j) == 0
<class 'str'> may be hashable. E.g., hash('') == 0
<class 'bool'> may be hashable. E.g., hash(False) == 0
<class 'tuple'> may be hashable. E.g., hash(()) == 5740354900026072187
<class 'list'> may not be hashable.
<class 'dict'> may not be hashable.
<class 'set'> may not be hashable.
<class 'frozenset'> may be hashable. E.g., hash(frozenset()) == 133146708735736
<class 'tuple'> may not be hashable.

set has an immutable counterpart called frozenset, which is hashable.

hashable(frozenset())

True

Why most mutable objects are not hashable?

Mutating a key makes it a different key, which is hard to track.

Why dict does not have any immutable counterpart?

While elements of a set must be hashable and, therefore, often immutable, dictionary values may be of mutable types.

assert hash(0) == hash(0.0) == hash(0j) == hash(False) == hash("") and False != ""

Solution to Exercise 3

To avoid duplicate keys occupying different entries in a hash table.
Hash collision can be detected by == and handled by collision resolution techniques. To keep the hash table small, it is unavoidable to have hash collisions.

Accessing keys/values¶

How to traverse a set/dictionary?

Set and dictionaries are iterable:

%%optlite -h 500
s = set("abcde")
d = dict(enumerate("abcde"))
print(*(element for element in s))
print(*((k, d[k]) for k in d))
s[0]  # TypeError

Loading...

The above raises KeyError because -1 is not a key in the dictionary b.

Dictionary implements the __setitem__ method so we can enter a key-value pair to a dictionary using the assignment operator.

%%optlite -h 400
d = {}
d[-1] = d
del d[-1]
d[-1]

Loading...

To avoid KeyError, we can check if a key is in a dictionary efficiently (due to hashing) using the in operator.
The following is a different implementation of distribute.

def distribute(seq):
    dist = {}
    for i in seq:
        dist[i] = (dist[i] if i in dist else 0) + 1 / len(seq)
    return dist


plot_distribution("What is the distribution of different characters?")

Exercise 6

Unlike the previous implementation using one-line dictionary comprehension, the above alternative implementation uses multiple lines of code to build the dictionary incrementally, starting from an empty dictionary.

def distribute(seq):
    return {k : seq.count(k)/len(seq) for k in set(seq)}

Explain whether the alternative implementation is more efficient.

It is more efficient because

the alternative implementation traverses seq once with near constant time lookup of the key, but
the list comprehension can traverse seq a multiple times linear in len(seq), since every call to seq.count has to traverse seq once.

Shorter code needs not be more efficient.

dict.get?
def distribute(seq):
    dist = {}
    for i in seq:
        ### BEGIN SOLUTION
        dist[i] = dist.get(i, 0) + 1 / len(seq)
        ### END SOLUTION
    return dist


plot_distribution("What is the distribution of different characters?")

Signature: dict.get(self, key, default=None, /)
Docstring: Return the value for key if key is in the dictionary, else default.
Type:      method_descriptor

How to traverse in ascending order of the keys?

We can apply the function sorted to a set/dictionary to return a sorted list of the keys.

%%optlite -h 600
a = set(reversed("abcde"))
b = dict(reversed([*enumerate("abcde")]))
sorted_elements = sorted(a)
sorted_keys = sorted(b)

Loading...

def plot_distribution(seq):
    dist = distribute(seq)
    # plt.stem(dist.keys(), dist.values())
    ### BEGIN SOLUTION
    dist_list = sorted(dist.items(), key=lambda p: p[0])
    plt.stem(
        [p[0] for p in dist_list], [p[1] for p in dist_list]
    )
    ### END SOLUTION
    plt.xlabel("Items")
    plt.title("Distribution")
    plt.ylim(0, 1)


plot_distribution("What is the distribution of different characters?")

How to add an element to a set and remove an element from it?

Instead of subscription, set has the add/discard/remove methods for adding/removing elements.

%%optlite -h 400
a = set("abc")
a.add("d")
a.discard("a")
a.remove("b")
a.clear()
a.discard("a")  # no error
a.remove("b")  # KeyError

Loading...

Other operators and methods¶

Unlike str/tuple/list, set and dict do not implement addition + and multiplication *:

any(
    hasattr(container, attr)
    for attr in ("__add__", "__mult__")
    for container in (dict, set, frozenset)
)

False

set1 = set("abc")
set2 = set("cde")
### BEGIN SOLUTION
concatenated_set = {*set1, *set2}
### END SOLUTION
concatenated_set

{'a', 'b', 'c', 'd', 'e'}

dict1 = dict(enumerate("abc"))
dict2 = dict(enumerate("def", start=2))
### BEGIN SOLUTION
concatenated_dict = {**dict1, **dict2}
### END SOLUTION
concatenated_dict

{0: 'a', 1: 'b', 2: 'd', 3: 'e', 4: 'f'}

set overloads many other operators:

%%optlite -h 550
a, b = {1, 2}, {2, 3}

union = a | b
assert all(i in union for i in a) and all(i in union for i in b)

intersection = a & b
assert all(i in a and i in b for i in intersection)

assert intersection <= a <= union  # subset
assert union > b > intersection  # proper superset
assert len(a) + len(b) == len(intersection) + len(union)

symmetric_difference = a ^ b
assert all((i in a or i in b) and not (i in a and i in b) for i in symmetric_difference)
assert symmetric_difference == union - intersection
assert set.isdisjoint(intersection, symmetric_difference)
assert len(union) == len(intersection) + len(symmetric_difference)

Loading...

The following uses & and - to compare the sets of public attributes for set and dict:

set_attributes = {attr for attr in dir(set) if attr[0] != "_"}
dict_attributes = {attr for attr in dir(dict) if attr[0] != "_"}
print("Common attributes:", ", ".join(set_attributes & dict_attributes))
print("dict-specific attributes:", ", ".join(dict_attributes - set_attributes))
print("set-specific attributes:", ", ".join(set_attributes - dict_attributes))

Common attributes: copy, pop, update, clear
dict-specific attributes: keys, fromkeys, popitem, setdefault, items, values, get
set-specific attributes: difference_update, issubset, intersection_update, issuperset, remove, symmetric_difference_update, union, add, discard, symmetric_difference, difference, intersection, isdisjoint

For set, the intersection operation & can also be performed by

the class method intersection, which returns the intersection of its arguments, and
the object method intersection_update, which mutates a set object by intersecting the set with the arguments.

%%optlite -h 300
a = {0, 1, 2}
b = {1, 2, 3}
c = set.intersection(a, b, {2, 3, 4})
a.intersection_update(b, c)

Loading...

All other set-specific methods have an associated operator except isdisjoint as shown below.
The object method for union is update not union_update.

class method	object method	operator
`union`	`update`	`\|`
`intersection`	`intersection_update`	`&`
`symmetric_difference`	`symmetric_difference_update`	`^`
`issubset`		`<=`
`issuperset`		`>=`
`isdisjoint`

dict also has an update method that can update a dictionary using dictionary, iterables, and keyword arguments:

%%optlite -h 300
a = {}
a.update(enumerate("a"), b=2)
b = a.copy()
a.update(b, c=3)

Loading...

def group_by_type(seq):
    group = {}
    for i in seq:
        ### BEGIN SOLUTION
        group.setdefault(repr(type(i)), []).append(i)
        ### END SOLUTION
    return group


group_by_type(
    [
        *range(3),
        *"abc",
        *[i / 2 for i in range(3)],
        *[(i,) for i in range(3)],
        *[[i] for i in range(3)],
        *[{i} for i in range(3)],
        *[{i: i} for i in range(3)],
        print,
        hash,
        int,
        str,
        float,
        set,
        dict,
        (i for i in range(10)),
        enumerate("abc"),
        range(3),
        zip(),
        set.add,
        dict.copy,
    ]
)

{"<class 'int'>": [0, 1, 2],
 "<class 'str'>": ['a', 'b', 'c'],
 "<class 'float'>": [0.0, 0.5, 1.0],
 "<class 'tuple'>": [(0,), (1,), (2,)],
 "<class 'list'>": [[0], [1], [2]],
 "<class 'set'>": [{0}, {1}, {2}],
 "<class 'dict'>": [{0: 0}, {1: 1}, {2: 2}],
 "<class 'builtin_function_or_method'>": [<function print(*args, sep=' ', end='\n', file=None, flush=False)>,
  <function hash(obj, /)>],
 "<class 'type'>": [int, str, float, set, dict],
 "<class 'generator'>": [<generator object <genexpr> at 0x7ffaa8621b10>],
 "<class 'enumerate'>": [<enumerate at 0x7ffab083f9c0>],
 "<class 'range'>": [range(0, 3)],
 "<class 'zip'>": [<zip at 0x7ffaa861a880>],
 "<class 'method_descriptor'>": [<method 'add' of 'set' objects>,
  <method 'copy' of 'dict' objects>]}