## ITEC 4220 - Advanced Data Analytics

### Module 3 - Data Scrubbing in Command Line (with regular expressions)

#### Credits: Cengiz Gunay, Rick Price

#### Reading: Ch 1 & 2 from [Book's website](https://www.datascienceatthecommandline.com/), [Safari Books/O'Reilly](https://learning.oreilly.com/library/view/data-science-at/9781491947845/) or [free source](https://github.com/jeroenjanssens/data-science-at-the-command-line)

#### [Conquering the Command Line](http://conqueringthecommandline.com/)

### Data Science is OSEMN: Anatomy of a project

- Start with a question or hypothesis that is testable with existing data

1. (O)btain data
2. (S)crub it for relevant parts
3. (E)xplore data to understand what can be done
4. Convert question into statistical (M)odel
5. Select and use a technique to optimize or test model with data
6. i(N)terpret results: visualize, summarize, make a recommendation
7. Go back to step 1 and revisit/modify/repeat

### Obtaining data

- Download
- Query database
- Extract from sources (e.g., HTML crawl/parse)
- Generate yourself (reading sensors)

### Scrubbing data

Prepping data for analysis:

- Filtering lines
- Extracting only some columns
- Replacing values
- Extracting words
- Handling missing values
- Converting data formats

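
Several of these scrubbing steps can be sketched with standard Unix tools. This is only an illustration: the file `people.csv`, its columns, and its values are invented for the example and are not part of the course data.

```bash
# Create a small hypothetical dataset (names and values are made up).
cat > people.csv <<'EOF'
name,city,age
Ana,Atlanta,34
Bo,N/A,27
Cy,Atlanta,
Di,Boston,41
EOF

# Filter lines: keep only rows that mention Atlanta.
grep 'Atlanta' people.csv

# Extract only some columns: name (field 1) and age (field 3).
cut -d, -f1,3 people.csv

# Replace values: turn the placeholder N/A into the word unknown.
# The | delimiter is used because the search text contains a slash.
sed 's|N/A|unknown|' people.csv

# Handle missing values: drop rows whose age field is empty.
awk -F, '$3 != ""' people.csv > people-clean.csv

# Convert formats: comma-separated to tab-separated.
tr ',' '\t' < people-clean.csv > people-clean.tsv
```

Each command reads a file and writes to the screen or to a new file, so the original `people.csv` is never modified.
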
### Exploring data

To understand the nature of the data and what can be done with it:

- Browse and look at data
- Derive statistics
- Visualize

### Modeling data

To predict from data:

- Example 1: testing whether global warming exists; check for correlation between time and temperature
- Example 2: if person A buys book X, would person B also buy it?

### Interpret data

- Draw conclusions
- Evaluate meaning of results
- Communicate results

### Apache Tools

- **Hadoop:** MapReduce
- **Pig:** High-level language that generates sequences of MapReduce programs
- **Hive:** Data warehouse software. Allows SQL on distributed storage
- **Spark:** Unified analytics engine
  - Provides high-level APIs in Java, Scala, Python and R
  - Many libraries
- **Cassandra:** Scalable, distributed database

Most require the **command line**.

### What is the command line?

- Endless text flow
- Allows streaming operations
- More flexible than UI
- Many tools are already included
- Multiple ways to install/access

### Why command line?

- Agile
  - REPL: Read, Evaluate, Print, Loop
  - Can rapidly start, stop, and adjust jobs
- Use it in addition to UI
- Scalable
- Extensible: 40 years running
- Ubiquitous: found on 95% of the top 500 supercomputers

### How to practice?

#### On Windows:

Do ONE of the following:

- Open [REPL.it](https://replit.com/languages/bash)
- Install [Git](https://git-scm.com/) and use _Git Bash_
- Run textbook image using _Docker_ (next slide down)
- Run your own Linux virtual machine
- Connect to ITECLAB VPN to use our own servers (next slide down)

#### On Mac/Linux:

- Open Terminal

### Docker image from textbook

- Container technology
- **Has all the tools used in the book**
- Allows running multiple OSes and technologies in containers
- Many pre-built containers that you can download
- [Install Docker](https://www.docker.com/get-started), and run:

```
docker run --rm -it datasciencetoolbox/dsatcl2e
```

- This launches a Docker container, created by the author, that runs Unix

### Connect to GGC servers via the ITECLAB VPN

- Open `vpn2.ggc.edu` in browser
- Submit form to download and install Cisco client
- Run client and **uncheck option** to "Block unsafe connections"
- Connect to **`vpn.ggc.edu`** (not vpn2)
- Select **ITECLAB** from pull down menu
- Enter GGC ID and Password
- Once VPN established, connect to our server with:

```bash
ssh USERNAME@IP
```

(ask for USERNAME and IP in class)

### How to get started and learn more?

Two resources:

1. [Conquering the command line](http://conqueringthecommandline.com/) by Mark Bates (free online book for starters)
2. Ch 1-2 of course textbook **Data Science at the Command Line** (see links on [cover slide](#/))

### Basic Commands (Unix)

- `cd` - Change directory
- `pwd` - Display the current directory path
- `ls` - List files in current directory
- `ls -la` - List all files with their attributes
- `clear` - Clear screen
- `rm` - Delete a file
- `mkdir` - Make a directory
- `mv` - Move a file or rename it
- `cp` - Copy a file
- `echo` - Display a string

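
A short practice session that strings these commands together might look like the following sketch. The directory and file names (`demo-dir`, `note.txt`, `copy.txt`, `renamed.txt`) are arbitrary choices for illustration, and the `>` used to create the file is a redirection.

```bash
mkdir demo-dir                # make a new directory
cd demo-dir                   # move into it
pwd                           # show where we are now
echo 'hello' > note.txt       # create a file containing one line
cp note.txt copy.txt          # copy it
mv copy.txt renamed.txt       # rename the copy
ls -la                        # list everything, with attributes
rm note.txt                   # delete the original file
cd ..                         # go back up one level
```

After the session, `demo-dir` contains only `renamed.txt`.
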
### More Advanced Commands (Unix)

- `awk, nawk, gawk` - text stream processing
- `grep` - find lines based on keyword
- `cat` - stream file
- `head/tail` - get beginning and end of file/stream
- `sudo` - run command with administrative power
- `curl` - download file from web

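### awk

An awk program is made of up to three kinds of blocks:

- `BEGIN {awk-commands}` - runs once, before any input is read
- Body block `{awk-commands}` - applies the awk-commands on every input line
- `END {awk-commands}` - runs once, after all input has been read

Download [marks.txt](files/marks.txt) and run one by one:

```bash
# Print a header line, then every line of the file
awk 'BEGIN{printf "No Name Subject Marks\n"} {print}' marks.txt

# Sum the fourth column and print the total at the end
awk 'BEGIN{printf "Total Marks\n"; marks = 0} {marks = marks+$4} END{print marks}' marks.txt
```

- [TutorialsPoint](https://www.tutorialspoint.com/awk/awk_overview.htm)
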
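
To see all three block types working together without downloading anything, here is a self-contained sketch. The file `scores.txt` and its contents are hypothetical and are not the course's `marks.txt`.

```bash
# Hypothetical whitespace-separated data: name, subject, score.
cat > scores.txt <<'EOF'
Ana Math 80
Bo Physics 90
Cy Math 70
EOF

# BEGIN prints a header, the body prints two fields per line and
# accumulates a running total, and END prints the average.
# NR is awk's built-in count of lines read so far.
awk 'BEGIN { print "Name Score" }
     { print $1, $3; total += $3 }
     END { print "Average:", total / NR }' scores.txt
# Name Score
# Ana 80
# Bo 90
# Cy 70
# Average: 80
```
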
### grep

- Searches the named input file for lines matching a pattern

```bash
grep "Kedar" marks.txt
```

[TutorialsPoint](https://www.tutorialspoint.com/unix_commands/grep.htm)

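
A few commonly used `grep` options are sketched below on a made-up file; `fruits.txt` and its contents are assumptions for illustration only.

```bash
# Hypothetical sample file.
cat > fruits.txt <<'EOF'
apple pie
Banana split
apple tart
cherry cake
EOF

grep 'apple' fruits.txt        # lines containing "apple"
grep -i 'banana' fruits.txt    # -i ignores case, so "Banana split" matches
grep -v 'apple' fruits.txt     # -v inverts the match: lines without "apple"
grep -c 'apple' fruits.txt     # -c counts matching lines: 2
grep -n 'cake' fruits.txt      # -n adds the line number: 4:cherry cake
```
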
### cat

- Concatenate file(s) to standard output (stream)
- Download [marks2.txt](files/marks2.txt) and run:

```bash
cat *.txt
```

[TutorialsPoint](https://www.tutorialspoint.com/unix_commands/cat.htm)

### sudo

- "super user do"
- Allows you to have root control of your Unix system
- `sudo` _command_

[TutorialsPoint](https://www.tutorialspoint.com/unix_commands/sudo.htm)

### curl

- Downloads file or HTML page from web
- Allows you to use different protocols

```bash
curl https://www.google.com/
```

[TutorialsPoint](https://www.tutorialspoint.com/unix_commands/curl.htm)

### Reading and writing files: redirection

- Use `<` for reading a file:

```bash
cat < file.txt
```

- Use `>` for writing into a file:

```bash
cat file.txt > newfile.txt
cat another-file.txt >> newfile.txt # appends
```

- Use `|` for cascading commands, streaming, piping:

```bash
grep word file.txt | head
```

- They can be combined!

```bash
grep word < file.txt | head > top-finds.txt
```

[More info in DSCL Ch 2.3.5](https://jeroenjanssens.com/dsatcl/chapter-2-getting-started.html#redirecting-input-and-output)

### An example

Finding out when the next Fashion Week is, using the New York Times web API:

```bash
$ cd ~/book/ch01/data
$ parallel -j1 --progress --delay 0.1 --results results "curl -sL "\
> "'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=New+York+'"\
> "'Fashion+Week&begin_date={1}0101&end_date={1}1231&page={2}&api-key='"\
> "'
'" ::: {2009..2013} ::: {0..99} > /dev/null

Computers / CPU cores / Max jobs to run
1:local / 4 / 1

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:1/9/100%/0.4s
```

Ran 500 parallel queries to receive 1000 articles in JSON format.

### Parsing
Looking at the output
directories under `results`:
Combine and process:
```bash
$ cat results/1/*/2/*/stdout |
> jq -c '.response.docs[] | {date: .pub_date, type: .document_type, '\
> 'title: .headline.main }' | json2csv -p -k date,type,title > fashion.csv
```

1. `cat`: Outputs contents of files
2. `jq`: Parses JSON content
3. `json2csv`: Converts to CSV

### Exploring

Count number of rows in result:

```bash
$ wc -l fashion.csv
4856 fashion.csv
```

Select and browse some columns:

```bash
$ < fashion.csv cols -c date cut -dT -f1 | head | csvlook
|------------+------------+-----------------------------------------|
| date       | type       | title                                   |
|------------+------------+-----------------------------------------|
| 2009-02-15 | multimedia | Michael Kors                            |
| 2009-02-20 | multimedia | Recap: Fall Fashion Week, New York      |
| 2009-09-17 | multimedia | UrbanEye: Backstage at Marc Jacobs      |
| 2009-02-16 | multimedia | Bill Cunningham on N.Y. Fashion Week    |
| 2009-02-12 | multimedia | Alexander Wang                          |
| 2009-09-17 | multimedia | Fashion Week Spring 2010                |
| 2009-09-11 | multimedia | Of Color | Diversity Beyond the Runway  |
| 2009-09-14 | multimedia | A Designer Reinvents Himself            |
| 2009-09-12 | multimedia | On the Street | Catwalk                 |
|------------+------------+-----------------------------------------|
```

### Plotting

Create a plot using R, Rio, and ggplot2:

```bash
$ < fashion.csv Rio -ge 'g + geom_freqpoly(aes(as.Date(date), color=type), '\
> 'binwidth=7) + scale_x_date() + labs(x="date", title="Coverage of New York'\
> ' Fashion Week in New York Times")' | display
```

### Regular Expressions

- A [regular expression](https://en.wikipedia.org/wiki/Regular_expression) (or regex) is a language for forming search/replace patterns for text processing.
- Succinct and powerful tool to extract information from text input
- Fast implementations in many languages/frameworks/platforms
- Practice on [RegExr](https://regexr.com/)

### [Commands in a regexp pattern](https://www.regular-expressions.info/quickstart.html)

- `/` default quoting character at the start and end of a regex pattern
- `.` matches any single character
- `[]` matches a list of characters like `[abc]`, ranges like `[0-9]`, or a combination like `[a-zA-Z_]`
- `\X` shortcut character lists: `\w` all word characters, `\s` all whitespace characters, and so on

#### Example: `/[HY][ea]y./` will match strings such as "Heya", "Yay!", and "Hey!" (the final `.` requires one more character, so "Hey" alone does not match)

#### Question: `/.e..\w/` will it match "Hello"?

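
One way to try these pattern pieces outside a browser is `grep -E`, which takes the pattern without the surrounding slashes. The sketch below uses an invented word list (`words.txt`). Bracket lists are used instead of `\w` because shortcut support differs between tools.

```bash
# Hypothetical word list, one word per line.
printf 'cat\ncot\ncut\nc9t\nct\n' > words.txt

grep -E 'c.t' words.txt           # any single character in the middle: cat cot cut c9t
grep -E 'c[aou]t' words.txt       # only a, o, or u in the middle: cat cot cut
grep -E 'c[0-9]t' words.txt       # a digit in the middle: c9t
grep -E 'c[a-z0-9_]t' words.txt   # combined ranges: cat cot cut c9t
```

The line `ct` never matches, because each of these patterns requires exactly one character between `c` and `t`.
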
### Repetition

- `*` 0 or more matches of _previous_ character/pattern: `a*` will match empty string or `a`, `aa`, and so on.
- `+` is like `*`, but will match 1 or more: `a+` will not match empty string
- `{}` specifies number of matches of previous pattern: `a{1,3}` will match `a`, `aa`, and `aaa`
- `?` is 0 or 1 matches, meaning it's optional (same as `{0,1}`).

#### Example: `/Gun.*/` will match text "Gunay"

#### Question: `/Guna*/` will it match text "Gunay"?

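
The repetition operators can be compared side by side with `grep -E`. This is a sketch on an invented file (`reps.txt`); the `^` and `$` anchors pin each pattern to the whole line so that partial matches do not hide the differences.

```bash
# Hypothetical test lines: c, then zero to four a's, then t.
printf 'ct\ncat\ncaat\ncaaat\ncaaaat\n' > reps.txt

grep -E '^ca*t$' reps.txt       # zero or more a: all five lines
grep -E '^ca+t$' reps.txt       # one or more a: every line except ct
grep -E '^ca?t$' reps.txt       # optional a: ct and cat
grep -E '^ca{2,3}t$' reps.txt   # two or three a: caat and caaat
```
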
### Escaping, grouping, and anchoring

- `^` and `$` will match beginning and end of line (or string), respectively
- `\` will help you _escape_ the commands to search literal characters: `\+` will match `+` and `\$` will match `$`
- `()` groups subsets for repeating or replacing
- `|` logical OR operation, matches one of the possible expressions

#### Example: `/^(gun|ay)+$/` will match both "gungungungun" and "ayayayay"

#### Question: `/^(Price|is|right)*/` will it match text "Priceisright"? will it match text " rightisright"?

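
Anchors, escapes, groups, and alternation can be tried the same way with `grep -E`. The file `misc.txt` and its lines are made up for this sketch.

```bash
# Hypothetical test lines (single quotes keep $ and + literal in the shell).
printf 'abab\nabxab\n1+1=2\n112\ncost: $5\n' > misc.txt

grep -E '^(ab)+$' misc.txt    # whole line is "ab" repeated: abab
grep -E '1\+1' misc.txt       # \+ is a literal plus sign: 1+1=2
grep -E '\$[0-9]' misc.txt    # \$ is a literal dollar sign: cost: $5
grep -E '^(ab|11)' misc.txt   # line starts with ab or 11: abab, abxab, 112
```

Without the backslash, `1+1` would mean "one or more 1s followed by a 1", which matches `112` instead of `1+1=2`.
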
### Replacing text

Very useful feature, general form: `s/search/replace/flags`

- `search` is the regexp that may contain grouping `()`
- `replace` is the replacement text, and can insert matched groups using `\1`, `\2`, and so on, according to order of matched groups
- `flags` are single characters indicating:
  - `i`: case insensitive match
  - `g`: replace multiple times, globally
  - `m`: multi-line mode; `^` and `$` match the beginning and end of each line within a multi-line string
  - `s`: dot-all mode (in JavaScript and Perl); `.` can also match end of line (`\n`) characters

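
On the command line, the `s/search/replace/flags` form is what `sed` accepts. The following is a minimal sketch on made-up input strings; which flags are available varies between tools, so only `g` is shown here.

```bash
# -E enables extended regular expressions, so ( ) group without backslashes.
# \1 and \2 insert the first and second matched groups, here in swapped order.
echo 'Price, Rick' | sed -E 's/([A-Za-z]+), ([A-Za-z]+)/\2 \1/'
# Rick Price

# Without g, only the first match on each line is replaced.
echo 'a-b-c' | sed 's/-/+/'
# a+b-c

# With g, every match on the line is replaced.
echo 'a-b-c' | sed 's/-/+/g'
# a+b+c
```
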
### Next steps

- Look at assignment
- [Practice](https://regexr.com)
- Read more about [regexp](https://www.regular-expressions.info/quickstart.html) or [Javascript Regexp](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Exp)