Data structures and performance

ITEC 4400/2160 Python Programming for Data Analysis,
Cengiz Günay

Prev - Python basics refresher, Next - Errors and Exceptions

Time complexity, big-O notation

General idea: time it takes for an algorithm to run.
Complexity: the number of steps involved in a process or pieces that make up a system ?

Complexity makes a difference in everyday use:

Fast and high quality services vs
those that lag, crash, and are unreliable

Big-O is calculated based on input size

In data analytics, we will often work with large amounts of data
Performance of algorithms matter more with large data

Does this algorithm scale up when it is given a large input?

Therefore, $O()$ notation indicates growth of time with an input, n

Definitions: Big-O notation

$O(1)$: constant time, independent of input size, $n$
(e.g. getting an item from an array with an index)
$O(\log(n))$: less than linear, exponent of $n$’s growth
(e.g., binary tree search; $log_2(4)=2$, $log_2(16)=4$)
$O(\sqrt{n})$: more than log, less than $n$ (e.g., process one row of matrix data)
$O(n)$: linear time with size (e.g. summation of every element in list)
$O(n^k)$: polynomial time ($k$: constant) (e.g., nested loops over whole input)
$O(2^n)$: exponential time (worst case; e.g., finding best route between two points)

Constant time: The holy grail

Why does indexing an array take constant time, $O(1)$?

In homogeneous arrays, each item takes the same memory space!
You can multiply the index by item size to find offset of item: $$ \mathsf{Item~at~index~} i = \mathsf{array~starting~address} + i * \mathsf{size~of~one~item}$$

Hint: In Python, list can contain heterogeneous items, but each item is an object reference that take up equal space.

Popular algorithms

Index lookup: $O(1)$
Search in array: $O(n)$
Hash lookup: $O(1)$ (in Python: dict and sets)

About sets

Pros:

faster lookup times
ensures uniqueness of items
awesome methods for determining what elements two sets do and do not share

Cons:

uses more memory
slower to build and append to than a list

Big-O can also apply memory size

Same idea: $O(1)$, $O(n)$, etc
Hashing usually requires additional space
Do not use complex data structures unnecessarily!

What’s hashing?

hash is a math function that can calculate a number out anything
the number would be unique (or close to) for each input
```
hash("hello") -> 46487
```
hash number can be used (with some tricks) as an index
then you can use the index instead of searching with $O(1)$

Other uses:

Comparing complex objects
Sorting (only for values)
Checksumming internet download packages

Exercise time!

Compare speed of two operations:

Searching an item in list versus set
Appending an item in list versus set

To find speed:

Initialize a long list, then create a set from it
Start a timer
Loop a large number of times (e.g. 100,000)
Put only one operation inside the loop to measure (e.g., "hello" in my_list)
Stop the timer and find elapsed time
Report time per operation by dividing with repeat multiplier you selected above

OR use defbench that GGC graduate Shaun Mitchell made!

Example program to measure performance

Copy-paste locally because cloud runs are not consistent: