Data input/output and cleaning

ITEC 3160 Python Programming for Data Analysis,
Cengiz Günay

(License: CC BY-SA 4.0)

Data formats

  • See Python for Data Analysis, Chapters 6 and 7

Some topics:

  • Loading and saving in different data formats
  • Common options for loading
  • Handling exceptions in formatting
  • Selecting index columns
  • Reading from URLs
  • Reading from databases
  • Binary formats (e.g., HDF5)
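
A minimal sketch of several of the loading and saving topics above, assuming pandas is installed. The CSV text, the column names (id, city, temp_c), and the table name "weather" are made up for illustration. An in-memory string stands in for a file path; pd.read_csv also accepts a URL in the same position. HDF5 is left out because pandas needs the optional PyTables package for it.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV text standing in for a file on disk or a URL.
raw_csv = """# exported from a made-up weather log
id;city;temp_c
1;Atlanta;21.5
2;Boston;N/A
3;Chicago;missing
"""

# Common loading options: a custom separator, skipping comment lines,
# choosing the index column, and declaring an extra missing-value marker
# ("N/A" is already recognized by default, "missing" is not).
df = pd.read_csv(
    io.StringIO(raw_csv),
    sep=";",
    comment="#",
    index_col="id",
    na_values=["missing"],
)

# Save to CSV and load it back (a StringIO buffer replaces a real file).
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
df_csv = pd.read_csv(buffer, index_col="id")

# Write to and read from a database, here an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
df.to_sql("weather", conn)
df_sql = pd.read_sql("SELECT * FROM weather", conn, index_col="id")
conn.close()

print(df)
print(df_sql)
```

Both round trips should preserve the two missing temperatures: to_csv writes them as empty fields, and SQLite stores them as NULL.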

Data cleaning

  • Missing data with N/A, NaN, and NULL values
  • Filtering missing data out
  • Filling in missing data values
  • Eliminating duplicates
  • Replacing values
  • Adding new calculated columns
  • Cosmetics (axis labels, etc.)
  • Discretization
  • Outliers
  • Random sampling and shuffling
  • String manipulation and regular expressions
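
The cleaning steps above can be sketched on a small table, assuming pandas and NumPy are installed. The data, the column names, the -999 sentinel, the age cutoff of 120, and the bin edges are all invented for illustration. The choice to fill with the median or mean is one option among several, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up records with a duplicate row, a missing age, a sentinel score
# (-999), an implausible age (230), and untidy name strings.
df = pd.DataFrame(
    {
        "name": [" alice smith", "Bob Jones ", "Bob Jones ", "carol white", "Dan Brown"],
        "age": [23.0, 35.0, 35.0, np.nan, 230.0],
        "score": [88.0, -999.0, -999.0, 92.0, 75.0],
    }
)

# Eliminate duplicates (the first occurrence of each row is kept).
df = df.drop_duplicates().reset_index(drop=True)

# Replace values: turn the sentinel into a proper missing value.
df["score"] = df["score"].replace(-999.0, np.nan)

# Outliers: treat ages above an assumed cutoff as missing.
df.loc[df["age"] > 120, "age"] = np.nan

# Filter missing data out: keep only fully populated rows.
dropped = df.dropna()

# Fill in missing data: per-column fill values (median age, mean score).
filled = df.fillna({"age": df["age"].median(), "score": df["score"].mean()})

# String manipulation and a regular expression: tidy the names, then
# extract the last word of each name as the last name.
filled["name"] = filled["name"].str.strip().str.title()
filled["last_name"] = filled["name"].str.extract(r"(\w+)$", expand=False)

# Add a new calculated column.
filled["score_fraction"] = filled["score"] / 100

# Discretization: bin ages into labeled, right-inclusive intervals.
filled["age_group"] = pd.cut(
    filled["age"], bins=[0, 30, 60, 120], labels=["<=30", "31-60", "61+"]
)

# Random sampling and shuffling: sampling every row without replacement
# shuffles the table; random_state makes the order reproducible.
shuffled = filled.sample(frac=1, random_state=0)

print(filled)
```

Note the difference between the two strategies: dropna keeps a single complete row here, while fillna keeps all four rows at the cost of introducing estimated values.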