### CNS*2020 Tutorial #6

## Methods from Data Science for Model Simulation, Analysis, and Visualization

**Organizers:** Cengiz Gunay and Anca Doloc-Mihu

**Schedule:** 7-10pm Berlin time on July 18, 2020

1. *C Gunay*, "From High Performance Computing to Hadoop and Spark"
2. *A Doloc-Mihu*, "High-dimensional data visualizations"
3. *H Dinh, J Walton, and A Morariu*, "Analysim.tech: A data sharing site for crowdsourcing analysis of parameter-search datasets"

(each hands-on session is 50 min, followed by a 10 min break)
### CNS*2020 Tutorial #6 - Session 1

## From High Performance Computing to Hadoop and Spark

#### Cengiz Gunay, July 18, 2020
### Data!

- Prediction: the world will have 44 _zettabytes_ by year 2020`$^*$`
- Scale: mega, giga, tera, peta, exa, zetta (`$10^{21}$`)

Data producers:

- NY Stock Exchange: 4-5 terabytes/day
- Facebook: 7 petabytes/month
- Ancestry.com: 10 petabytes
- Internet Archive: 18 petabytes
- Computational neuroscience: ???

`$^*$` Hadoop: The Definitive Guide (2015); seems to match [current data](https://techjury.net/blog/big-data-statistics/#gref)
### Large-scale simulations and data analysis in computational neuroscience

Workflows in historical order:

1. High-performance computing clusters (SGE, PBS, NSG, ...)
2. Cloud deployment of virtual machines (AWS, GCP, Azure, ...)
3. Specialized distributed computing platforms (Hadoop, Spark, ...)
### High-performance computing (HPC)

Classical *scientific computing* environment:

- Cluster of similar computers
- Submit "jobs"; scheduler software runs them on available nodes
- Results can be collected on network drives, or gathered at the end

Cons:

- Cluster is usually administered by a central authority
- Can't install custom programs into the OS, except in your home folder
- Must compete with other users and wait your turn in the queue
- Errors are reported with a delay
### Cloud deployment of virtual machines

Popular and "easy" method offered by several vendors:

- Amazon Web Services, Google Cloud Platform, Microsoft Azure, etc.
- Freedom to deploy your own virtual machine (VM) "images"
- Can quickly create copies of the same machine
- No waiting, but you have to pay

Cons:

- More expensive than the HPC approach
- More administrative work needed to orchestrate VMs
- Not efficient, because it involves creating whole machines
### Data science field has parallel computing solutions

Not unique to computational neuroscience:

- Large data requires more than one machine
- Parallelization is painful
- Load balancing is a problem; otherwise you wait for the slowest node

A general solution:

- The **MapReduce** algorithm, implemented in **Apache Hadoop**
- Has been widely adopted in industry
- Hadoop comes with an ecosystem of tools: YARN, HDFS, Pig, Spark
### Your turn!

Type your research use case for MapReduce into the Zoom chat. Identify steps for:

- **map**: extract information from each item
- **reduce**: combine/process mapped information to produce output

**Example:** *map* to select a parameter set, and *reduce* to average it (see the sketch below)
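For instance, the parameter-averaging example above might look like this as a plain, single-machine Python sketch (the `records` list and field meanings are made up for illustration):

```python
from functools import reduce

# Hypothetical parameter-search results: (parameter set id, measured spike rate)
records = [(1, 12.0), (2, 7.5), (1, 14.0), (3, 9.1), (1, 13.0)]

# map: extract the values that belong to parameter set 1
mapped = [rate for pset, rate in records if pset == 1]

# reduce: combine the mapped values into a single output (here, their average)
total = reduce(lambda a, b: a + b, mapped)
print(total / len(mapped))  # 13.0
```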
### What is MapReduce? An example

A weather dataset from White, Chapter 2 (*Hadoop: The Definitive Guide*):

- Live streaming data from weather stations all around the world
- Each row is one reading from one station at one point in time
### Back to the command-line: inspect the data

```bash
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
```
- Many small files; can be analyzed sequentially with `awk`:

```bash
% ./max_temperature.sh
1901    317
1902    244
1903    289
1904    256
1905    283
...
```
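For comparison, the same sequential scan could be sketched in plain Python; the fixed-width NCDC column offsets follow the Java mapper shown later in this session, and the paths are hypothetical:

```python
import glob
import gzip

def max_temperature(year_dir):
    """Scan every gzipped station file for one year and return the maximum reading."""
    max_temp = None
    for path in glob.glob(year_dir + "/*.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                temp = int(line[87:92])   # signed temperature, tenths of a degree C
                quality = line[92]
                if temp != 9999 and quality in "01459":
                    max_temp = temp if max_temp is None else max(max_temp, temp)
    return max_temp

# e.g.: print(1990, max_temperature("raw/1990"))
```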
### Enter Map+Reduce

- Can be partitioned to run on parallel hardware:
### Map in Java: (ID, row of text) `$\Rightarrow$` (year, temp)

```java
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
```
### Reduce: (year, [temps]) `$\Rightarrow$` (year, max temp)

```java
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```
### Putting it all together

```java
public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
### A sample run

```bash
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
14/09/16 09:48:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/16 09:48:40 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/16 09:48:40 INFO input.FileInputFormat: Total input paths to process : 1
14/09/16 09:48:40 INFO mapreduce.JobSubmitter: number of splits:1
14/09/16 09:48:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local26392882_0001
14/09/16 09:48:40 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/09/16 09:48:40 INFO mapreduce.Job: Running job: job_local26392882_0001
14/09/16 09:48:40 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/09/16 09:48:40 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Waiting for map tasks
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Starting task: attempt_local26392882_0001_m_000000_0
14/09/16 09:48:40 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
14/09/16 09:48:40 INFO mapred.LocalJobRunner:
14/09/16 09:48:40 INFO mapred.Task: Task:attempt_local26392882_0001_m_000000_0 is done. And is in the process of committing
14/09/16 09:48:40 INFO mapred.LocalJobRunner: map
14/09/16 09:48:40 INFO mapred.Task: Task 'attempt_local26392882_0001_m_000000_0' done.
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Finishing task: attempt_local26392882_0001_m_000000_0
14/09/16 09:48:40 INFO mapred.LocalJobRunner: map task executor complete.
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Waiting for reduce tasks
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Starting task: attempt_local26392882_0001_r_000000_0
14/09/16 09:48:40 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
14/09/16 09:48:40 INFO mapred.LocalJobRunner: 1 / 1 copied.
14/09/16 09:48:40 INFO mapred.Merger: Merging 1 sorted segments
14/09/16 09:48:40 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 50 bytes
14/09/16 09:48:40 INFO mapred.Merger: Merging 1 sorted segments
14/09/16 09:48:40 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 50 bytes
14/09/16 09:48:40 INFO mapred.LocalJobRunner: 1 / 1 copied.
14/09/16 09:48:40 INFO mapred.Task: Task:attempt_local26392882_0001_r_000000_0 is done. And is in the process of committing
14/09/16 09:48:40 INFO mapred.LocalJobRunner: 1 / 1 copied.
14/09/16 09:48:40 INFO mapred.Task: Task attempt_local26392882_0001_r_000000_0 is allowed to commit now
14/09/16 09:48:40 INFO output.FileOutputCommitter: Saved output of task 'attempt...local26392882_0001_r_000000_0' to file:/Users/tom/book-workspace/hadoop-book/output/_temporary/0/task_local26392882_0001_r_000000
14/09/16 09:48:40 INFO mapred.LocalJobRunner: reduce > reduce
14/09/16 09:48:40 INFO mapred.Task: Task 'attempt_local26392882_0001_r_000000_0' done.
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Finishing task: attempt_local26392882_0001_r_000000_0
14/09/16 09:48:40 INFO mapred.LocalJobRunner: reduce task executor complete.
14/09/16 09:48:41 INFO mapreduce.Job: Job job_local26392882_0001 running in uber mode : false
14/09/16 09:48:41 INFO mapreduce.Job:  map 100% reduce 100%
14/09/16 09:48:41 INFO mapreduce.Job: Job job_local26392882_0001 completed successfully
14/09/16 09:48:41 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=377168
		FILE: Number of bytes written=828464
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=5
		Map output records=5
		Map output bytes=45
		Map output materialized bytes=61
		Input split bytes=129
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=61
		Reduce input records=5
		Reduce output records=2
		Spilled Records=10
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=39
		Total committed heap usage (bytes)=226754560
	File Input Format Counters
		Bytes Read=529
	File Output Format Counters
		Bytes Written=29
```
### HDFS: Hadoop Distributed Filesystem

- Optimized to work on parallel hardware:
### HDFS _local_ data replication

- Data shipped out to each machine via _input splits_ for Map tasks
### HDFS input _splits_ merged into output _part_
### Workflow with multiple Reduce tasks

- Output of *map* can be used for different *reduce* steps:
### Reduce step can be omitted

- Sometimes *map* alone is enough for "embarrassingly parallel" simulations! (see the sketch below)
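A purely local illustration of the idea (outside Hadoop): an embarrassingly parallel parameter sweep needs only a *map*, one independent job per parameter set. The `run_simulation` function and parameter grid below are hypothetical stand-ins:

```python
from multiprocessing import Pool

def run_simulation(params):
    """Stand-in for one independent model run; returns (params, result)."""
    g_na, g_k = params
    return (params, g_na / g_k)  # placeholder computation

if __name__ == "__main__":
    param_grid = [(g_na, g_k) for g_na in (100, 120) for g_k in (20, 36)]
    with Pool() as pool:
        # map only: each simulation is independent, so there is nothing to reduce
        results = pool.map(run_simulation, param_grid)
    print(results)
```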
### Hadoop is dead. Long live Hadoop!

- Hadoop created an ecosystem of projects:
  - Avro: Data serialization system
  - Flume: Work with data streams
  - Sqoop: Interface with traditional relational DBs
  - Hive: SQL queries converted to MapReduce
  - [Pig](https://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009/): Hadoop processing with a custom high-level language
  - Parquet: Columnar storage for nested data
  - Crunch: High-level API for using Hadoop
  - [Kafka](https://kafka.apache.org/): Distributed streaming platform
  - [Spark](https://spark.apache.org/): Another distributed computing framework
- We will give a brief intro to Spark
### Spark

- Around since 2009
- 10x-20x faster than Hadoop
- Written in Scala, runs on the JVM
- Can be programmed in Scala, Java, and Python
### Practice time! Connect to our Spark server

- See our [Sched page](https://cns2020online.sched.com/event/a1f62dac60f1e6d93a34724500d3ff66) for the IP address and downloads

On Mac and Linux:

- Download `spark-key` from [Sched](https://cns2020online.sched.com/event/a1f62dac60f1e6d93a34724500d3ff66)
- Open a terminal and run:

```bash
$ ssh -i spark-key cnsuser@<IP address>
```

On Windows:

- Download `spark-windows.ppk` from [Sched](https://cns2020online.sched.com/event/a1f62dac60f1e6d93a34724500d3ff66)
- Download [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/) and follow these [instructions](https://devops.ionos.com/tutorials/use-ssh-keys-with-putty-on-windows/#connect-to-server-with-private-key) to use the ppk key file
- [Open an SSH connection](https://www.ssh.com/ssh/putty/windows/) to the IP address with username "cnsuser"
### Practice time! (Cont)

Once connected to the Spark server, run:

```bash
$ docker exec -it klt-spark-wwym bin/pyspark
```

which should result in:

```bash
Python 3.6.11 (default, Jul 15 2020, 18:47:56)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 3.6.11 (default, Jul 15 2020 18:47:56)
SparkSession available as 'spark'.
>>>
```
### Spark works with _resilient distributed datasets_ (RDDs)

- Try these on the `pyspark` command line
- A line count program example:

```python
>>> lines = sc.textFile("data/tutorial.md")  # Create an RDD called lines
>>> lines.count()  # Count the number of items in this RDD
67
>>> lines.first()  # First item in this RDD, i.e. first line of tutorial.md
'---'
```

- Parallel operation is completely transparent!
- `sc` is the `SparkContext` _driver_ to access Spark and create RDDs
- The RDD is broken into pieces to run `count()` in parallel
### Inside Spark
- Try another example with a custom filter function:

```python
>>> lines = sc.textFile("data/tutorial.md")
>>> sparkLines = lines.filter(lambda line: "Spark" in line)
>>> sparkLines.first()
'tools, such as [Apache Spark](https://spark.apache.org/). These tools'
```
### Map-Reduce operations generalize to Transformation-Action with RDDs

_Transformations_ create new RDDs; e.g.:

```python
>>> sparkLines = lines.filter(lambda line: "Spark" in line)
```

_Actions_ calculate results from RDDs; e.g.:

```python
>>> sparkLines.first()
```
### RDD transformations: `map()` and `filter()`
```python
>>> inputRDD = sc.parallelize([1, 2, 3, 4])  # Create a numeric input RDD
```

```python
>>> outputRDD = inputRDD.map(lambda x: x * x)  # MAP
```

```python
>>> outputRDD = inputRDD.filter(lambda x: x != 1)  # FILTER
```

```python
>>> outputRDD.collect()  # Show all results
```
### RDD transformations (Continued)

- If `map()` needs to produce multiple outputs for each input, use `flatMap()` instead
- Set operations (see the examples after this list):
  - `distinct()`
  - `union()`
  - `intersection()`
  - `subtract()`
  - `cartesian()` -> expensive
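A quick illustration of these transformations at the `pyspark` prompt (the values are arbitrary, and result ordering may differ since RDDs are unordered):

```python
>>> a = sc.parallelize([1, 2, 3, 3])
>>> b = sc.parallelize([3, 4, 5])
>>> a.flatMap(lambda x: [x, -x]).collect()  # multiple outputs per input
[1, -1, 2, -2, 3, -3, 3, -3]
>>> a.distinct().collect()
[1, 2, 3]
>>> a.union(b).collect()  # keeps duplicates
[1, 2, 3, 3, 3, 4, 5]
>>> a.intersection(b).collect()
[3]
>>> a.subtract(b).collect()
[1, 2]
```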
### RDD actions

- The most common is `reduce()`: takes two elements and returns one

```python
sum = inputRDD.reduce(lambda x, y: x + y)
```

- `fold()` also takes a _zero_ value for initialization

```python
sum = inputRDD.fold(0, lambda x, y: x + y)
```

- `aggregate()` asks for accumulation and combine functions. Example that keeps a running sum and count of elements to calculate an average value:

```python
sumCount = inputRDD.aggregate((0, 0),
                              (lambda acc, value: (acc[0] + value, acc[1] + 1)),
                              (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
average = sumCount[0] / float(sumCount[1])
```
### RDD actions (continued)

- `collect()`: Return all elements
- `count()`: Number of elements
- `countByValue()`: Number of times each element occurs in the RDD (a histogram)
- and more... (see the examples below)
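For example, again at the `pyspark` prompt (output ordering may vary):

```python
>>> rdd = sc.parallelize(["a", "b", "a", "c", "a"])
>>> rdd.collect()
['a', 'b', 'a', 'c', 'a']
>>> rdd.count()
5
>>> rdd.countByValue()
defaultdict(<class 'int'>, {'a': 3, 'b': 1, 'c': 1})
```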
### RDDs can act as DataFrames!

- Can read/write data in JSON, CSV, Hive tables, and Parquet
- A _DataFrame_ holds semi-structured data and supports SQL-like querying

```python
df = spark.read.json("examples/src/main/resources/people.json")  # also see read.csv()

# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

# Select only the "name" column
df.select("name").show()

# Select people older than 21
df.filter(df['age'] > 21).show()
```
### Spark can run SQL queries in parallel

- The SQL interface can be used inside or outside Spark (e.g. via a JDBC connection or from Tableau)

```python
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
```
### A Spark-ling challenge for you!

Load our CSV file with measurements from model neuron simulations:

```python
df = spark.read.csv("data/AnalySim.csv")  # watch for capitalization

# show info
df.printSchema()
df.show()
```

Question:

- Use [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html) and [DataFrame](https://spark.apache.org/docs/latest/sql-getting-started.html) operations to calculate one of min, max, average, or standard deviation for one column, or an arithmetic operation between multiple columns to find a ratio, slope, etc. (a starter sketch follows below)
- You can also filter rows based on column constraints.
- Post your code and results on [NeuroStars](https://neurostars.org/t/cns-2020-tutorial-t6-methods-from-data-science-for-model-simulation-analysis-and-visualization/13337) or Zoom.
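As a starting point, a DataFrame-based aggregation might look like the sketch below. The column name `"col_name"` is a placeholder for one of the actual columns in the CSV, and you may need the `header=True` / `inferSchema=True` options depending on how the file is formatted:

```python
from pyspark.sql import functions as F

df = spark.read.csv("data/AnalySim.csv", header=True, inferSchema=True)

# Aggregate a single (hypothetical) numeric column
df.select(F.min("col_name"), F.max("col_name"),
          F.avg("col_name"), F.stddev("col_name")).show()

# Filter rows on a column constraint, then aggregate
df.filter(df["col_name"] > 0).agg(F.avg("col_name")).show()
```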
### If you want to install Spark locally

- Installation can get complex; [Hadoop and Spark](https://spark.apache.org/docs/latest/#downloading) come bundled together
- You can also just download the Docker image from [bitnami/spark](https://hub.docker.com/r/bitnami/spark/) and run it like in our example above.
### Closing notes for Spark
Spark also provides:

- Real-time processing via streaming (Chapter 10)
- Machine learning library _MLlib_ (Chapter 11)

More resources:

- [Book's GitHub repo with examples](https://github.com/databricks/learning-spark)
- [Spark API documentation](https://spark.apache.org/docs/latest/)
- [Spark Quick Start](https://spark.apache.org/docs/latest/quick-start.html)
- [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
### Questions and comments?

- The conversation can continue if you post on our [NeuroStars page](https://neurostars.org/t/cns-2020-tutorial-t6-methods-from-data-science-for-model-simulation-analysis-and-visualization/13337).
- Please fill out [our survey](https://ggc.az1.qualtrics.com/jfe/form/SV_023P8dEuPo8G8N7) before leaving the tutorial
- Thank you for coming to our tutorial! :)