Sequence Types

Motivation for composite data types

The following code calculates the average of five numbers:

def average_five_numbers(n1, n2, n3, n4, n5):
    return (n1 + n2 + n3 + n4 + n5) / 5


average_five_numbers(1, 2, 3, 4, 5)
3.0

What about using the above function to compute the average household income in Hong Kong?
The labor force in Hong Kong is close to 4 million.

  • Should we create a variable to store the income of each individual?

  • Should we recursively apply the function to groups of five numbers?

What we need is

  • a composite data type that can keep a variable number of items, so that

  • we can then define a function that takes an object of the composite data type,

  • and returns the average of all items in the object.

How to store a sequence of items in Python?

We learned a composite data type that stores a sequence of characters. What is it?

tuple and list are two other built-in sequence types for ordered collections of objects. Unlike str, they can store items of possibly different types.

Indeed, we have already used tuples and lists before.

%%mytutor -h 300
a_list = "1 2 3".split()
a_tuple = (lambda *args: args)(1, 2, 3)
a_list[0] = 0
a_tuple[0] = 0

What is the difference between tuple and list?

  • List is mutable so programmers can change its items.

  • Tuple is immutable like int, float, and str, so

    • programmers can be certain the content stays unchanged, and

    • Python can preallocate a fixed amount of memory to store its content.
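
The contrast can be checked directly. The following is a minimal sketch (the names scores_list and scores_tuple are made up for illustration): item assignment succeeds on a list but raises a TypeError on a tuple.

```python
scores_list = [1, 2, 3]
scores_tuple = (1, 2, 3)

scores_list[0] = 0  # allowed: a list is mutable
print(scores_list)

try:
    scores_tuple[0] = 0  # not allowed: a tuple is immutable
except TypeError as error:
    print("TypeError:", error)

print(scores_tuple)  # the tuple is unchanged
```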

Constructing sequences

How to create tuple/list?

Mathematicians often represent a set of items in two different ways:

  1. Roster notation, which enumerates the elements in the set, e.g.,

\[ \{0, 1, 4, 9, 16, 25, 36, 49, 64, 81\} \]

  2. Set-builder notation, which describes the content using a rule for constructing the elements, e.g.,

\[ \{x^2| x\in \mathbb{N}, x< 10 \}, \]

namely the set of perfect squares less than 100.

Python also provides two corresponding ways to create a tuple/list:

  1. Enclosure

  2. Comprehension

How to create a tuple/list by enumerating its items?

To create a tuple, we enclose a comma-separated sequence of objects in parentheses:

%%mytutor -h 450
empty_tuple = ()
singleton_tuple = (0,)   # why not (0)?
heterogeneous_tuple = (singleton_tuple, (1, 2.0), print)
enclosed_starred_tuple = (*range(2), *"23")

Note that:

  • If the enclosed sequence has one term, there must be a comma after the term.

  • The elements of a tuple can have different types.

  • The unpacking operator * can unpack an iterable into a sequence in an enclosure.

To create a list, we use square brackets to enclose a comma-separated sequence of objects:

%%mytutor -h 450
empty_list = []
singleton_list = [0]  # no need to write [0,]
heterogeneous_list = [singleton_list, (1, 2.0), print]
enclosed_starred_list = [*range(2), *"23"]

We can also create a tuple/list from other iterables using the constructors tuple/list. As with str, tuples/lists can be concatenated by addition and repeated by multiplication with an integer.

%%mytutor -h 950
str2list = list("Hello")
str2tuple = tuple("Hello")
range2list = list(range(5))
range2tuple = tuple(range(5))
tuple2list = list((1, 2, 3))
list2tuple = tuple([1, 2, 3])
concatenated_tuple = (1,) + (2, 3)
concatenated_list = [1, 2] + [3]
duplicated_tuple = (1,) * 2
duplicated_list = 2 * [1]
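
One caveat is worth checking with a small sketch (the names row, grid_shared and grid_separate are made up for illustration): multiplication repeats references to the same items rather than copying them, which matters when the items are mutable.

```python
row = [0, 0]
grid_shared = 2 * [row]            # both entries refer to the same list object
grid_separate = [[0, 0], [0, 0]]   # two distinct list objects

grid_shared[0][0] = 1
grid_separate[0][0] = 1

print(grid_shared)    # the change is visible through both entries
print(grid_separate)  # only the first entry changes
print(grid_shared[0] is grid_shared[1])
```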

Exercise Explain the difference between the following two expressions. Why must a singleton tuple have a comma after the item?

print((1 + 2) * 2, (1 + 2,) * 2, sep="\n")
6
(3, 3)

(1+2)*2 evaluates to 6 but (1+2,)*2 evaluates to (3,3).

  • The parentheses in (1+2) indicate the addition needs to be performed first, but

  • the parentheses in (1+2,) enclose a tuple, which the trailing comma creates.

Hence, a singleton tuple must have a comma after the item to differentiate these two use cases.
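
As a quick check of this point, the following sketch (with made-up variable names) inspects the types of three expressions. It suggests that the comma, rather than the parentheses, is what makes a tuple.

```python
with_parentheses = (1 + 2)  # parentheses only group: this is the integer 3
with_comma = (1 + 2,)       # the trailing comma makes a tuple containing 3
bare_comma = 1 + 2,         # also a tuple, even without parentheses

print(type(with_parentheses).__name__)
print(type(with_comma).__name__)
print(type(bare_comma).__name__)
```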

How to use a rule to construct a tuple/list?

We can specify the rule using a comprehension,
which we have used in a generator expression.
E.g., the following is a Python one-liner that returns a generator for prime numbers.

all?
prime_sequence = lambda stop: (
    x for x in range(2, stop) if all(x % divisor for divisor in range(2, x))
)
print(*prime_sequence(100))
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97

There are two comprehensions used:

  • In all(x % divisor for divisor in range(2, x)), the comprehension passes a generator of remainders to the function all, which returns True if all the remainders are non-zero and False otherwise.

  • In the return value (x for x in range(2, stop) if ...) of the anonymous function, the comprehension creates a generator of numbers from 2 to stop-1 that satisfy the condition of the if clause.
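
To see what the two comprehensions do, here is a rough equivalent written with explicit loops. It is only a sketch: the name prime_sequence_verbose is made up, and it assumes familiarity with generator functions that use yield.

```python
def prime_sequence_verbose(stop):
    """Yield the primes below stop, mirroring the one-liner step by step."""
    for x in range(2, stop):  # outer comprehension: candidates 2, ..., stop - 1
        has_divisor = False
        for divisor in range(2, x):  # inner comprehension: remainders x % divisor
            if x % divisor == 0:  # a zero remainder is falsy, so all(...) is False
                has_divisor = True
                break
        if not has_divisor:  # plays the role of the if clause
            yield x


print(*prime_sequence_verbose(30))
```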

Exercise Use comprehension to define a function composite_sequence that takes a non-negative integer stop and returns a generator of composite numbers strictly smaller than stop. Use any instead of all to check if a number is composite.

any?
### BEGIN SOLUTION
composite_sequence = lambda stop: (
    x for x in range(2, stop) if any(x % divisor == 0 for divisor in range(2, x))
)
### END SOLUTION

print(*composite_sequence(100))
4 6 8 9 10 12 14 15 16 18 20 21 22 24 25 26 27 28 30 32 33 34 35 36 38 39 40 42 44 45 46 48 49 50 51 52 54 55 56 57 58 60 62 63 64 65 66 68 69 70 72 74 75 76 77 78 80 81 82 84 85 86 87 88 90 91 92 93 94 95 96 98 99

We can construct a list instead of a generator using list comprehension:

[x ** 2 for x in range(10)]  # Enclose comprehension by brackets
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Is the list comprehension the same as applying list to a generator expression?

list(x ** 2 for x in range(10))  # Pass a generator expression to list
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

List comprehension is more efficient as it does not need to create a generator first:

%%timeit
[x ** 2 for x in range(10)]
1.96 µs ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(x ** 2 for x in range(10))
2.15 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
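
Outside IPython, where the %%timeit magic is not available, a similar comparison can be sketched with the standard timeit module. The number of repetitions below is an arbitrary choice, and the measured times depend on the machine and Python version.

```python
from timeit import timeit

# Time each expression over the same number of repetitions.
list_comprehension_time = timeit("[x ** 2 for x in range(10)]", number=10_000)
generator_to_list_time = timeit("list(x ** 2 for x in range(10))", number=10_000)

print("list comprehension:", list_comprehension_time)
print("list(generator):   ", generator_to_list_time)

# Both expressions build the same list; only the running time may differ.
same_result = [x ** 2 for x in range(10)] == list(x ** 2 for x in range(10))
print(same_result)
```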

Exercise The following are two different ways to use comprehension to construct a tuple. Which one is faster? Try predicting the results before running them.

%%timeit
tuple(x for x in range(100))
3.67 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
tuple([x for x in range(100)])
2.49 µs ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The second method is often faster because the list of items can be created faster with a list comprehension than with a generator expression. This benefit appears to outweigh the cost of converting the list to a tuple.

With list comprehension, we can simulate a sequence of biased coin flips.

from random import random as rand

p = rand()  # unknown bias
coin_flips = ["H" if rand() <= p else "T" for i in range(1000)]
print("Chance of head:", p)
print("Coin flips:", *coin_flips)
Chance of head: 0.5861498409916134
Coin flips: H T H H H T H T T H H H H H H H H H T H H H T T H H H T H H H T H T H H H T T T H H T H H H T H H T H H T H T T T H H T H T T H H T H H H T H H H T T H H H H H H T H T T H H T H T T H H H H H H H H H T T H T H T H T T H T H H H H T H H H T H T H T H T T T H H T H H H H T T T T T H H H H T T H T T H H H T H H T H T H H T T T H H H H T T H T H H H T H H T H T H H T H H H H T H T T T T H T H H H H T H T H H H H T T T T H T H T T H T H T T T H H T H T H H T H H H H H H H T T T H H H H T H H H T T H T T T T T T H T H T T T H T T H H H H T H T T T T T H T T H H H H T T H H T H T T H T H T T H H T H H T H H H H H H H H T H T T H H H H T H H T H T H H H T H T H H H T T H T H T T H T T H T T T H T H H H H H H H T H H H H H H T H H H H T H H T T H T H T T H H T T T T H H H T H H T T H H H T T T T H T H H T H T H H T T T H H T T T H T H T T H T T T T H H H H H H H T H T H H H H T H H T H H H H T T T T H T H T H T H H H H H H T H H H H H T H T T T T H T T H H T T T H H T H T T H T H H T T H H T H T T H T H H H T H T H H T T H T T H T T T H H T T H T H T H H T T T H H H H H H T H T T T H T H T T T T T H H H T H H H H H H T T H T H H H H H H T T H H H H T T H H H T H H H T T H H T T T T H H H T T H T T H H H H T T H T H T H T H H T H H H H T H H T H H H T T T T T T H T H H H H H H H H H T H H H H H H T H H T T T H H H T H T H H T T H H H T T H T T H H T H H H T H T T H H T T H T H T H H H T H H T H T H T T H T T T H H H T H T H H T T H H T T T H T H H H H T T H H T H H H H T H H T T H H T H H H T T T T T H H T H T H H T T H T H T H H H H T H H T H H H T H T H T H H T H T H T H T H T T H T T H H H H T H T H H H H T T T T T H H T H T H T H H H T H T H H H H H H H H H H T H T T T H T H H T H H H H H T H H H T H T H H H T H H H T T H T H H H H T T H T H H T H H T H T H H T T H T H H T T T T T H H H T T T H T T H T H T H H H H T T H H H H T T T T H T H T T T H H T H H H H T T H H H T H H H T H T H H H T H T T H T H H H H T T H H H T H T T T T H H T T T T H H T T H H T H T 
H H H H H T

We can then estimate the bias by the fraction of heads coming up.

def average(seq):
    return sum(seq) / len(seq)


head_indicators = [1 if outcome == "H" else 0 for outcome in coin_flips]
fraction_of_heads = average(head_indicators)
print("Fraction of heads:", fraction_of_heads)
Fraction of heads: 0.576

Note that sum and len return the sum and the length of the sequence respectively.
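
As a side note, the number of heads can also be counted without building a list of indicators. The sketch below uses a short made-up sample (sample_flips) instead of the random flips above, and compares sum over a generator expression with the list method count.

```python
sample_flips = ["H", "T", "H", "H", "T"]  # made-up data for illustration

heads_by_sum = sum(1 for outcome in sample_flips if outcome == "H")
heads_by_count = sample_flips.count("H")  # counts the items equal to "H"

print(heads_by_sum / len(sample_flips))
print(heads_by_count / len(sample_flips))
```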

Exercise Define a function variance that takes in a sequence seq and returns the variance of the sequence.

def variance(seq):
    ### BEGIN SOLUTION
    return sum(i ** 2 for i in seq) / len(seq) - average(seq) ** 2
    ### END SOLUTION


delta = (variance(head_indicators) / len(head_indicators)) ** 0.5
print("95% confidence interval: [{:.2f},{:.2f}]".format(p - 2 * delta, p + 2 * delta))
95% confidence interval: [0.55,0.62]
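
One way to sanity-check the variance function is to compare it with pvariance from the standard statistics module, which computes the population variance. The sketch below repeats the two definitions so that it runs on its own, and uses a made-up sample.

```python
from statistics import pvariance


def average(seq):
    return sum(seq) / len(seq)


def variance(seq):
    # Mean of the squares minus the square of the mean.
    return sum(i ** 2 for i in seq) / len(seq) - average(seq) ** 2


sample = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up head indicators
print(variance(sample), pvariance(sample))
```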

Selecting items in a sequence

How to traverse a tuple/list?

Tuples and lists are iterable, so we can use a for loop to iterate over all the items in order, without calling any dunder method directly.

a = (*range(5),)
for item in a:
    print(item, end=" ")
0 1 2 3 4 

To do it in reverse, we can use the reversed function.

reversed?
a = [*range(5)]
for item in reversed(a):
    print(item, end=" ")
4 3 2 1 0 

We can also traverse multiple tuples/lists simultaneously by zipping them.

zip?
a = (*range(5),)
b = reversed(a)
for item1, item2 in zip(a, b):
    print(item1, item2)
0 4
1 3
2 2
3 1
4 0
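
Note that reversed returns an iterator rather than a tuple/list, so it can be traversed only once. The following sketch (with made-up names first_pass and second_pass) zips the same two objects twice to illustrate this.

```python
a = (*range(5),)
b = reversed(a)  # an iterator over a in reverse order, not a tuple

first_pass = [*zip(a, b)]
second_pass = [*zip(a, b)]  # b is exhausted, so zip stops immediately

print(first_pass)
print(second_pass)
```

If the reversed items are needed more than once, they can be stored first, e.g., with tuple(reversed(a)).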

How to select an item in a sequence?

Sequence objects such as str/tuple/list implement the getter method __getitem__ to return their items.

We can select an item of a sequence a by subscription

a[i]

where a is a sequence and i is an integer index.
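
As a small illustration, the sketch below compares subscription with a direct call to __getitem__. Calling the dunder method directly is shown only to make the connection visible; subscription is the usual way.

```python
a = (*range(10),)

print(a[3], a.__getitem__(3))          # the same item either way
print(a[-1], a.__getitem__(-1))        # negative indexes are handled too
print("abc"[1], "abc".__getitem__(1))  # str is also a sequence
```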

A non-negative index indicates the distance from the beginning.

\[\boldsymbol{a} = (a_0, ... , a_{n-1})\]
a = (*range(10),)
print(a)
print("Length:", len(a))
print("First element:", a[0])
print("Second element:", a[1])
print("Last element:", a[len(a) - 1])
print(a[len(a)])  # IndexError
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Length: 10
First element: 0
Second element: 1
Last element: 9
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-22-2b19badaedfe> in <module>
      5 print("Second element:", a[1])
      6 print("Last element:", a[len(a) - 1])
----> 7 print(a[len(a)])  # IndexError

IndexError: tuple index out of range

a[i] with i >= len(a) results in an IndexError.

A negative index represents a negative offset from an imaginary element one past the end of the sequence.

\[\begin{split}\begin{aligned} \boldsymbol{a} &= (a_0, ... , a_{n-1})\\ & = (a_{-n}, ..., a_{-1}) \end{aligned}\end{split}\]
a = [*range(10)]
print(a)
print("Last element:", a[-1])
print("Second last element:", a[-2])
print("First element:", a[-len(a)])
print(a[-len(a) - 1])  # IndexError
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Last element: 9
Second last element: 8
First element: 0
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-23-738e893c8a70> in <module>
      4 print("Second last element:", a[-2])
      5 print("First element:", a[-len(a)])
----> 6 print(a[-len(a) - 1])  # IndexError

IndexError: list index out of range

a[i] with i < -len(a) results in an IndexError.
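
In other words, a valid negative index i refers to the same item as the non-negative index i + len(a). The following sketch checks this on a made-up list.

```python
a = [*range(10, 20)]
n = len(a)

# Every valid negative index i selects the same item as i + n.
for i in range(-n, 0):
    assert a[i] == a[i + n]

print(a[-1], a[n - 1])  # last item
print(a[-n], a[0])      # first item
```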

How to select multiple items?

We can use slicing to select a range of items as follows:

a[start:stop]
a[start:stop:step]

The selected items correspond to those indexed using range (for non-negative indexes within bounds):

(a[i] for i in range(start, stop))
(a[i] for i in range(start, stop, step))
a = (*range(10),)
print(a[1:4])
print(a[1:4:2])
(1, 2, 3)
(1, 3)

Unlike range, the parameters for slicing take their default values if missing or equal to None:

a = [*range(10)]
print(a[:4])  # start defaults to 0
print(a[1:])  # stop defaults to len(a)
print(a[1:4:])  # step defaults to 1
[0, 1, 2, 3]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3]
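
The defaults can also be requested explicitly with None. The sketch below additionally uses the built-in slice to build the same selections as objects; slice is not introduced above and is used here only for illustration.

```python
a = [*range(10)]

print(a[None:4:None] == a[:4])          # None stands for the default value
print(a[slice(1, None)] == a[1:])       # slice(start, stop) with a default stop
print(a[slice(1, 4, None)] == a[1:4:])  # slice(start, stop, step) with a default step
```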

The parameters can also take negative values:

print(a[-1:])
print(a[:-1])
print(a[::-1])  # What are the default values used here?
[9]
[0, 1, 2, 3, 4, 5, 6, 7, 8]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

A mixture of negative and positive values is also okay:

print(a[-1:1])      # equal [a[-1], a[0]]?
print(a[1:-1])      # equal []?
print(a[1:-1:-1])   # equal [a[1], a[0]]?
print(a[-100:100])  # result in IndexError like subscription?
[]
[1, 2, 3, 4, 5, 6, 7, 8]
[]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Exercise (Challenge) Complete the following function to return a tuple (start, stop, step) such that range(start, stop, step) gives the non-negative indexes of the sequence of elements selected by a[i:j:k].

Hint: See notes 3-5 in the Python documentation.

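def sss(a, i=None, j=None, k=None):
    ### BEGIN SOLUTION
    l = len(a)
    step = 1 if k is None else k
    # Bounds for start/stop: [0, l] for a positive step, [-1, l - 1] for a negative step.
    low, high = (0, l) if step > 0 else (-1, l - 1)
    # Shift a negative index by l, then clip the result to [low, high].
    clip = lambda x: min(max(x + l if x < 0 else x, low), high)
    # A missing start/stop defaults to the end the traversal begins/finishes at.
    start = (low if step > 0 else high) if i is None else clip(i)
    stop = (high if step > 0 else low) if j is None else clip(j)
    ### END SOLUTION
    return start, stop, step


a = [*range(10)]
assert sss(a, -1, 1) == (9, 1, 1)
assert sss(a, 1, -1) == (1, 9, 1)
assert sss(a, 1, -1, -1) == (1, 9, -1)
assert sss(a, -100, 100) == (0, 10, 1)
assert sss(a, 0, 5) == (0, 5, 1)
assert sss(a, k=-1) == (9, -1, -1)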
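
For comparison, slice objects provide a method indices that performs this kind of normalization for a given length. The sketch below is independent of the solution above and checks that the returned triple reproduces the slicing results for a few made-up test cases.

```python
a = [*range(10)]
cases = [(-1, 1, None), (1, -1, None), (1, -1, -1), (-100, 100, None), (None, None, -1)]

for i, j, k in cases:
    # indices(len(a)) returns (start, stop, step) suitable for range.
    start, stop, step = slice(i, j, k).indices(len(a))
    selected = [a[index] for index in range(start, stop, step)]
    assert selected == a[i:j:k]
    print((i, j, k), "->", (start, stop, step))
```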

Exercise With slicing, we can now implement a practical sorting algorithm called quicksort to sort a sequence. Explain how the code works:

import random


def quicksort(seq):
    """Return a sorted list of items from seq."""
    if len(seq) <= 1:
        return list(seq)
    i = random.randint(0, len(seq) - 1)
    pivot, others = seq[i], [*seq[:i], *seq[i + 1 :]]
    left = quicksort([x for x in others if x < pivot])
    right = quicksort([x for x in others if x >= pivot])
    return [*left, pivot, *right]


seq = [random.randint(0, 99) for i in range(10)]
print(seq, quicksort(seq), sep="\n")
[28, 5, 42, 34, 18, 71, 17, 92, 0, 52]
[0, 5, 17, 18, 28, 34, 42, 52, 71, 92]

The above recursion creates a sorted list as [*left, pivot, *right] where

  • pivot is a randomly selected item in seq,

  • left is the sorted list of items smaller than pivot, and

  • right is the sorted list of items no smaller than pivot.

The base case happens when seq contains at most one item, in which case seq is already sorted.
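
To make one level of the recursion concrete, the following sketch carries out a single partition step on the sequence printed above. The pivot index is fixed to 0 for reproducibility, whereas quicksort picks it at random; the names smaller and no_smaller are made up for illustration.

```python
seq = [28, 5, 42, 34, 18, 71, 17, 92, 0, 52]
i = 0  # fixed pivot index for illustration only

# Split seq into the pivot and the remaining items using slicing.
pivot, others = seq[i], [*seq[:i], *seq[i + 1 :]]
smaller = [x for x in others if x < pivot]      # items that go to the left
no_smaller = [x for x in others if x >= pivot]  # items that go to the right

print("pivot:", pivot)
print("smaller:", smaller)
print("no smaller:", no_smaller)
```

Sorting smaller and no_smaller recursively and placing pivot in between then gives the sorted list.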