Data structures and performance

ITEC 3160 Python Programming for Data Analysis,
Cengiz Günay

(License: CC BY-SA 4.0)

Prev - Recursion, Next - Working with vector data

Time complexity, big-O notation

  • General idea: time it takes for an algorithm to run.
  • Complexity: the number of steps involved in a process or pieces that make up a system ?

Complexity makes a difference in everyday use:

Fast and high quality services vs
those that lag, crash, and are unreliable

Big-O is calculated based on input size

  • In data analytics, we will often work with large amounts of data
  • Performance of algorithms matter more with large data

Does this algorithm scale up when it is given a large input?

  • Therefore, $O()$ notation indicates growth of time with an input, n

Definitions: Big-O notation

  • $O(1)$: constant time, independent of input size, $n$
    (e.g. getting an item from an array with an index)
  • $O(\log(n))$: less than linear, exponent of $n$’s growth
    (e.g., binary tree search; $log_2(4)=2$, $log_2(16)=4$)
  • $O(\sqrt{n})$: more than log, less than $n$ (e.g., process one row of matrix data)
  • $O(n)$: linear time with size (e.g. summation of every element in list)
  • $O(n^k)$: polynomial time ($k$: constant) (e.g., nested loops over whole input)
  • $O(2^n)$: exponential time (worst case; e.g., finding best route between two points)

Common algorithms

  • Index lookup: $O(1)$
  • Hash lookup: $O(1)$ (in Python: dict and sets)
  • Binary search: $O(\log_2 n)$
  • Search in array: $O(n)$
  • Find prime numbers: $O(\sqrt{n})$
  • Nested loops: $O(n^k)$
  • Find diFFeRent cAPiTAlizations of words: $O(2^n)$

Practice!

  • Select one complexity option from below
  • Write an example program with your teammates
  • Each team must work on a different option
  • First come first serve
  • One person submits team work

Options:

  • O(1)
  • O(log(n))
  • O(sqrt(n))
  • O(n)
  • O(n^2)
  • O(n^3)
  • O(2^n)

Bitwise operators

Can you solve word capitalizations using binary numbers?

Take advantage of binary representation of computers!

Bitwise operators in Python :

  • & AND
  • | OR
  • >>, << shifting bits to right or left, resp
  • ~ complement (negate)
  • ^ exclusive-OR (XOR)

Examples:

x & 4  # will be non-zero only if x has 3rd bit on
x | 15 # will turn on lower 4 bits
x >> 4 # shift x by 4 bits to the right (faster than dividing by 2)
x & ~x # will always be 0
x ^ x  # will always be 0

Big-O can also apply memory size

  • Same idea: $O(1)$, $O(n)$, etc
  • Hashing usually requires additional space
  • Do not use complex data structures unnecessarily!

Constant time: The holy grail

Why does indexing an array take constant time, $O(1)$?

  • In homogeneous arrays, each item takes the same memory space!
  • You can multiply the index by item size to find offset of item: $$ \mathsf{Item~{}at~{}index~{}} i = \mathsf{array~{}starting~{}address} + i * \mathsf{size~{}of~{}one~{}item} $$

Hint: In Python, list can contain heterogeneous items, but each item is an object reference that take up equal space.

What’s hashing?

  • hash is a math function that can calculate a number out anything
  • the number would be unique (or close to) for each input
    hash("hello") -> 46487
    
  • hash number can be used (with some tricks) as an index
  • then you can use the index instead of searching with $O(1)$

Other uses:

  • Comparing complex objects
  • Sorting (only for values)
  • Checksumming internet download packages

About sets

Pros:

  • faster lookup times
  • ensures uniqueness of items
  • awesome methods (union, intersection, difference)

Cons:

  • uses more memory
  • slower to build and append to than a list

Exercise time!

Step 1: Choose one option:

Start by selecting what to compare:

  1. the programs you wrote in last exercise; compare different input sizes
  2. list vs set operations:
    1. Searching an item OR
    2. Appending an item
  3. finding common elements between two lists by:
    1. writing an $O(n^2)$ algorithm AND
    2. using set operations (hint: intersection)

Continue below to complete exercise!

Step 2: Measure speed:

  1. Initialize a long list, then create a set from it
  2. Start a timer
  3. Loop a large number of times (e.g. 100,000)
  4. Put only one operation inside the loop to measure (e.g., "hello" in my_list)
  5. Stop the timer and find elapsed time
  6. Report time per operation by dividing with repeat multiplier you selected above

OR use defbench that GGC graduate Shaun Mitchell made!

Step 3: Compare performance

Use local computer because cloud run times vary (individual submission only)

Instructions:

  • Copy the code below to a file on your computer and download Python to run it
  • Change the operation to compare as described on slides
  • Run the code locally and note the timing of the two different operations
  • In your submission, include your resulting times with an explanation

Examples on how to create a long list:

  • [ "test" + str(x) for x in range (10) ]
  • paste a document as a string and then use .split(" ") to find words as a list

Hints:

  • Use ("text" in list) operator to search in list and set
  • No need to write full if statements - instead just execute the condition
  • Increase max_repeat to millions to see a difference
  • Seach for items that are towards the end of the list for worst case
import time

start = time.time()
repeat = 0
max_repeat = 1000
while repeat < max_repeat:
    #################
    # do something you want to measure here
    a = 200**2
    #################
    repeat += 1
print("It took", "{:.10}".format((time.time() - start)), "seconds total")
print("It took", "{:.10}".format((time.time() - start) / max_repeat), "seconds per operation")

Prev - Recursion, Next - Working with vector data

Home