Data wrangling and aggregation

ITEC 4400/2160 Python Programming for Data Analysis,
Cengiz Günay

(License: CC BY-SA 4.0)


Data wrangling

  • See Python for Data Analysis, Chapter 8

Hierarchical indexing:

  • partial indexing
  • unstack() moves the inner index level into columns, converting a Series to a DataFrame
  • stack() is the reverse
  • swaplevel() for reordering hierarchical indices
  • sort_index() for sorting by one index level
  • Summary statistics by level, such as groupby(level=).sum() (sum(level=, axis=) in older pandas; removed in pandas 2.0)
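
The hierarchical indexing operations above can be sketched as follows. This is a minimal illustration, not taken from the course materials: the Series, its values, and the level names "key" and "num" are made up for the example, and the per-level summary uses groupby(level=...), which works on current pandas versions.

```python
import pandas as pd

# Hypothetical data: a Series with a two-level (hierarchical) index.
index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]],
    names=["key", "num"],
)
s = pd.Series([10, 20, 30, 40], index=index)

# Partial indexing: selecting on the outer level returns the inner level.
part = s["a"]

# unstack() moves the inner level into columns, giving a DataFrame.
wide = s.unstack()

# stack() reverses it, giving back a hierarchically indexed Series.
long_again = wide.stack()

# swaplevel() exchanges the two levels; sort_index(level=...) sorts by one.
swapped = s.swaplevel("key", "num").sort_index(level="num")

# Summary statistics per level, here the sum for each outer key.
totals = s.groupby(level="key").sum()

print(wide)
print(totals)
```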

Combining and merging

  • merge() combines rows by matching keys (columns or indices), like the SQL join operator
    • inner, left, right, and outer joins possible
  • concat() for stacking objects
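
A short sketch of the join types and of concat(), assuming two small made-up tables that share a column named "key". The table contents and column names are hypothetical and chosen only so that each join type gives a different result.

```python
import pandas as pd

# Hypothetical tables sharing the column "key".
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Inner join: only keys present in both tables (b, c).
inner = pd.merge(left, right, on="key", how="inner")

# Left join: every key from the left table; missing rval becomes NaN.
left_join = pd.merge(left, right, on="key", how="left")

# Right join: every key from the right table; missing lval becomes NaN.
right_join = pd.merge(left, right, on="key", how="right")

# Outer join: the union of keys from both tables (a, b, c, d).
outer = pd.merge(left, right, on="key", how="outer")

# concat() stacks objects; by default along the rows (axis=0).
stacked = pd.concat([left, left], ignore_index=True)

print(outer)
```

Passing axis=1 to concat() would instead place the objects side by side, aligned on their index.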

Reshape and pivot

  • stack() vs unstack()
  • reshape
  • pivot vs melt
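
To make the pivot vs melt contrast concrete, here is a small sketch assuming a made-up "long" table with columns "date", "item", and "value" (all hypothetical names). pivot() spreads the items across columns, and melt() gathers them back into rows.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) pair.
long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "item": ["x", "y", "x", "y"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# pivot(): long to wide, one row per date and one column per item.
wide = long.pivot(index="date", columns="item", values="value")

# melt(): wide to long again, one row per (date, item) pair.
back = wide.reset_index().melt(
    id_vars="date", var_name="item", value_name="value"
)

print(wide)
print(back)
```

Note that melt() returns the rows grouped by item rather than by date, so the row order of the result differs from the original table even though the content is the same.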

Data cleaning

  • Missing data with N/A, NaN, and NULL values
  • Filtering missing data out
  • Filling in missing data values
  • Eliminating duplicates
  • Replacing values
  • Adding new calculated columns
  • Cosmetics (axis labels, etc.)
  • Discretization
  • Outliers
  • Random sampling and shuffling
  • String manipulation and regular expressions
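
Several of the cleaning steps above can be sketched on one small table. Everything in this example is an assumption made for illustration: the data, the column names, the use of -999 as a missing-value sentinel, the age bins, and the choice to fill missing ages with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: untidy names, a duplicate row, a missing age,
# and -999 used as a sentinel for "unknown".
df = pd.DataFrame({
    "name": [" Ann ", "bob", "bob", "Cy", "Dee"],
    "age": [25.0, 32.0, 32.0, np.nan, -999.0],
})

# Replacing values: turn the sentinel into a proper missing value.
df["age"] = df["age"].replace(-999.0, np.nan)

# String manipulation: trim whitespace and normalize capitalization.
df["name"] = df["name"].str.strip().str.title()

# Eliminating duplicates: the second "Bob" row is dropped.
df = df.drop_duplicates()

# Filtering missing data out: keep only complete rows.
dropped = df.dropna()

# Filling in missing data: here with the column mean (one possible choice).
filled = df.fillna({"age": df["age"].mean()})

# Adding a new calculated column.
filled["age_months"] = filled["age"] * 12

# Discretization: bin ages into labeled intervals (0, 30] and (30, 120].
groups = pd.cut(filled["age"], bins=[0, 30, 120], labels=["young", "older"])

# Outliers: cap values above a chosen threshold.
capped = filled["age"].clip(upper=30)

# Regular expressions: names starting with a letter from A to C.
starts_a_to_c = filled["name"].str.contains(r"^[A-C]", regex=True)

# Random sampling and shuffling (random_state makes the result repeatable).
sampled = filled.sample(n=2, random_state=0)
shuffled = filled.sample(frac=1, random_state=0)

print(filled)
```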