class: center, middle, inverse, title-slide

# Lec 21 - Apache Arrow
## Statistical Computing and Computation
### Sta 663 | Spring 2022
### Dr. Colin Rundel

---
exclude: true

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy
import pymc3 as pm
import arviz as az
import os

plt.rcParams['figure.dpi'] = 200

np.set_printoptions(
  edgeitems=30, linewidth=200,
  precision = 5, suppress=True
  #formatter=dict(float=lambda x: "%.5g" % x)
)

pd.set_option("display.width", 130)
pd.set_option("display.max_columns", 10)
pd.set_option("display.precision", 6)
```

```r
knitr::opts_chunk$set(
  fig.align="center",
  cache=FALSE
)
library(lme4)
```

```
## Loading required package: Matrix
```

```r
local({
  hook_err_old <- knitr::knit_hooks$get("error")  # save the old hook
  knitr::knit_hooks$set(error = function(x, options) {
    # now do whatever you want to do with x, and pass
    # the new x to the old hook
    x = sub("## \n## Detailed traceback:\n.*$", "", x)
    x = sub("Error in py_call_impl\\(.*?\\)\\: ", "", x)
    #x = stringr::str_wrap(x, width = 100)
    hook_err_old(x, options)
  })

  hook_warn_old <- knitr::knit_hooks$get("warning")  # save the old hook
  knitr::knit_hooks$set(warning = function(x, options) {
    x = sub("<string>:1: ", "", x)
    #x = stringr::str_wrap(x, width = 100)
    hook_warn_old(x, options)
  })

  hook_msg_old <- knitr::knit_hooks$get("output")  # save the old hook
  knitr::knit_hooks$set(output = function(x, options) {
    x = stringr::str_replace(x, "(## ).* ([A-Za-z]+Warning:)", "\\1\\2")
    x = stringr::str_split(x, "\n")[[1]]
    #x = stringr::str_wrap(x, width = 120, exdent = 3)
    x = stringr::str_remove_all(x, "\r")
    x = stringi::stri_wrap(x, width=120, exdent = 3, normalize=FALSE)
    x = paste(x, collapse="\n")
    #x = stringr::str_wrap(x, width = 100)
    hook_msg_old(x, options)
  })
})
```

---
## Apache Arrow

> Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
>
> A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.

<img src="imgs/arrow_cols.png" width="40%" style="display: block; margin: auto;" />

---
## Language support

.pull-left-narrow[
Core implementations in:
* C
* C++
* C#
* Go
* Java
* JavaScript
* Julia
* Rust
* MATLAB
* Python
* R
* Ruby
]

.pull-right-wide[
<img src="imgs/arrow_memory.png" width="80%" style="display: block; margin: auto;" />
]

---
## pyarrow

```python
import pyarrow as pa
```

--

The basic building blocks of Arrow are `array` and `table` objects; arrays are collections of data of a uniform type.
.pull-left[
```python
num = pa.array([1, 2, 3, 2], type=pa.int8())
num
```

```
## <pyarrow.lib.Int8Array object at 0x2a2f7d880>
## [
##   1,
##   2,
##   3,
##   2
## ]
```

```python
year = pa.array([2019,2020,2021,2022])
year
```

```
## <pyarrow.lib.Int64Array object at 0x2a2f7da60>
## [
##   2019,
##   2020,
##   2021,
##   2022
## ]
```
]

--

.pull-right[
```python
name = pa.array(
  ["Alice", "Bob", "Carol", "Dave"],
  type=pa.string()
)
name
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7dac0>
## [
##   "Alice",
##   "Bob",
##   "Carol",
##   "Dave"
## ]
```
]

---
## Tables

A table is created by combining multiple arrays together to form the columns while also attaching names for each column.

```python
t = pa.table(
  [num, year, name],
  names = ["num", "year", "name"]
)
t
```

```
## pyarrow.Table
## num: int8
## year: int64
## name: string
## ----
## num: [[1,2,3,2]]
## year: [[2019,2020,2021,2022]]
## name: [["Alice","Bob","Carol","Dave"]]
```

.footnote[`table` is part of pyarrow but not part of the Arrow standard; more on this in a bit]

---
## Array indexing

Elements of an array can be selected using `[]` with an integer index or a slice; the former returns a typed scalar while the latter returns an array.

.pull-left[
```python
name[0]
```

```
## <pyarrow.StringScalar: 'Alice'>
```

```python
name[0:3]
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7dbe0>
## [
##   "Alice",
##   "Bob",
##   "Carol"
## ]
```

```python
name[:]
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7db80>
## [
##   "Alice",
##   "Bob",
##   "Carol",
##   "Dave"
## ]
```
]

--

.pull-right[
```python
name[-1]
```

```
## <pyarrow.StringScalar: 'Dave'>
```

```python
name[::-1]
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7dca0>
## [
##   "Dave",
##   "Carol",
##   "Bob",
##   "Alice"
## ]
```

```python
name[4]
```

```
## IndexError: index out of bounds
```

```python
name[0] = "Patty"
```

```
## TypeError: 'pyarrow.lib.StringArray' object does not support item assignment
```
]

.footnote[Arrow arrays are immutable and as such do not allow for element assignment.]

---
## Data Types

The following types are language-agnostic for the purpose of portability; however, some differ slightly from what is available in NumPy and Pandas (or R):

* **Fixed-length primitive types**: numbers, booleans, dates and times, fixed size binary, decimals, and other values that fit into a fixed number of bits
  * Examples: `bool_()`, `uint64()`, `timestamp()`, `date64()`, and many more
* **Variable-length primitive types**: binary, string
* **Nested types**: list, map, struct, and union
* **Dictionary type**: An encoded categorical type

.footnote[See [here](https://arrow.apache.org/docs/python/api/datatypes.html#api-types) for the full list of types.]

---
## Schemas

A data structure that contains information on the names and types of columns for a table (or record batch):

.pull-left[
```python
t.schema
```

```
## num: int8
## year: int64
## name: string
```
]

.pull-right[
```python
pa.schema([
  ('num', num.type),
  ('year', year.type),
  ('name', name.type)
])
```

```
## num: int8
## year: int64
## name: string
```
]

---
## Schema metadata

Schemas can also store additional metadata (e.g.
codebook-like textual descriptions) in the form of a string:string dictionary:

```python
new_schema = t.schema.with_metadata({
  'num': "Favorite number",
  'year': "Year expected to graduate",
  'name': "First name"
})
new_schema
```

```
## num: int8
## year: int64
## name: string
## -- schema metadata --
## num: 'Favorite number'
## year: 'Year expected to graduate'
## name: 'First name'
```

```python
t.cast(new_schema).schema
```

```
## num: int8
## year: int64
## name: string
## -- schema metadata --
## num: 'Favorite number'
## year: 'Year expected to graduate'
## name: 'First name'
```

---
## Missing values / None / NaNs

.pull-left[
```python
pa.array([1,2,None,3])
```

```
## <pyarrow.lib.Int64Array object at 0x2a2f7dca0>
## [
##   1,
##   2,
##   null,
##   3
## ]
```

```python
pa.array([1.,2.,None,3.])
```

```
## <pyarrow.lib.DoubleArray object at 0x2a2f7db80>
## [
##   1,
##   2,
##   null,
##   3
## ]
```
]

.pull-right[
```python
pa.array(["alice","bob",None,"dave"])
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7dca0>
## [
##   "alice",
##   "bob",
##   null,
##   "dave"
## ]
```

```python
pa.array([1,2,np.nan,3])
```

```
## <pyarrow.lib.DoubleArray object at 0x2a2f7db20>
## [
##   1,
##   2,
##   nan,
##   3
## ]
```
]

--

<div>
.pull-left[
```python
pa.array([1,2,None,3])[2]
```

```
## <pyarrow.Int64Scalar: None>
```

```python
pa.array([1.,2.,None,3.])[2]
```

```
## <pyarrow.DoubleScalar: None>
```
]

.pull-right[
```python
pa.array(["alice","bob",None,"dave"])[2]
```

```
## <pyarrow.StringScalar: None>
```

```python
pa.array([1,2,np.nan,3])[2]
```

```
## <pyarrow.DoubleScalar: nan>
```
]
</div>

---
## Nested type arrays

.pull-left[
list type:

```python
pa.array([[1,2], [3,4], None, [5,None]])
```

```
## <pyarrow.lib.ListArray object at 0x2a2f7db80>
## [
##   [
##     1,
##     2
##   ],
##   [
##     3,
##     4
##   ],
##   null,
##   [
##     5,
##     null
##   ]
## ]
```
]

--

.pull-right[
struct type:

```python
pa.array([
  {'x': 1, 'y': True, 'z': "Alice"},
  {'x': 2,            'z': "Bob"  },
  {'x': 3, 'y': False             }
])
```

```
## <pyarrow.lib.StructArray object at 0x2a2f7dca0>
## -- is_valid: all not null
## -- child 0 type: int64
##   [
##     1,
##     2,
##     3
##   ]
## -- child 1 type: bool
##   [
##     true,
##     null,
##     false
##   ]
## -- child 2 type: string
##   [
##     "Alice",
##     "Bob",
##     null
##   ]
```
]

.footnote[See also [map](https://arrow.apache.org/docs/python/data.html#map-arrays) and [union](https://arrow.apache.org/docs/python/data.html#union-arrays) arrays.]
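The map arrays mentioned in the footnote can be constructed similarly; a minimal sketch, where each element is a list of key-value pairs (the keys and values here are made up for illustration):

```python
# map type: each element is a list of (key, value) tuples
pa.array(
  [[('x', 1), ('y', 2)], [('z', 3)]],
  type=pa.map_(pa.string(), pa.int64())
)
```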
---
## Dictionary array

A dictionary array is the equivalent of a factor in R or `pd.Categorical` in Pandas:

.pull-left[ .small[
```python
levels = pa.array(['sun', 'rain', 'clouds', 'snow'])
values = pa.array([0,0,2,1,3,None])

dict_array = pa.DictionaryArray.from_arrays(values, levels)
dict_array
```

```
## <pyarrow.lib.DictionaryArray object at 0x2a2f25970>
## 
## -- dictionary:
## [
##   "sun",
##   "rain",
##   "clouds",
##   "snow"
## ]
## -- indices:
## [
##   0,
##   0,
##   2,
##   1,
##   3,
##   null
## ]
```

```python
dict_array.type
```

```
## DictionaryType(dictionary<values=string, indices=int64, ordered=0>)
```
] ]

--

.pull-right[ .small[
```python
dict_array.dictionary_decode()
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7ddc0>
## [
##   "sun",
##   "sun",
##   "clouds",
##   "rain",
##   "snow",
##   null
## ]
```

```python
levels.dictionary_encode()
```

```
## <pyarrow.lib.DictionaryArray object at 0x2a2f257b0>
## 
## -- dictionary:
## [
##   "sun",
##   "rain",
##   "clouds",
##   "snow"
## ]
## -- indices:
## [
##   0,
##   1,
##   2,
##   3
## ]
```
] ]

---
## Record Batches

In between a table and an array, Arrow has the concept of a Record Batch, which represents a chunk of a larger table. Record Batches are composed of a named collection of equal-length arrays.

```python
batch = pa.RecordBatch.from_arrays(
  [num, year, name],
  ["num", "year", "name"]
)
batch
```

```
## pyarrow.RecordBatch
## num: int8
## year: int64
## name: string
```

.pull-left[
```python
batch.num_columns
```

```
## 3
```

```python
batch.num_rows
```

```
## 4
```
]

.pull-right[
```python
batch.nbytes
```

```
## 69
```

```python
batch.schema
```

```
## num: int8
## year: int64
## name: string
```
]

---
## Batch indexing

`[]` can be used with a Record Batch to select columns (by name or index) or rows (by slice); additionally, the `slice()` method can be used to select rows.

.pull-left[
```python
batch[0]
```

```
## <pyarrow.lib.Int8Array object at 0x2a2f7de80>
## [
##   1,
##   2,
##   3,
##   2
## ]
```

```python
batch["name"]
```

```
## <pyarrow.lib.StringArray object at 0x2a2f7de20>
## [
##   "Alice",
##   "Bob",
##   "Carol",
##   "Dave"
## ]
```
]

--

.pull-right[
```python
batch[0:2].to_pandas()
```

```
##    num  year   name
## 0    1  2019  Alice
## 1    2  2020    Bob
```

```python
batch.slice(0,2).to_pandas()
```

```
##    num  year   name
## 0    1  2019  Alice
## 1    2  2020    Bob
```
]

---
## Tables vs Record Batches

As mentioned previously, `table` objects are not part of the Arrow specification; rather, they are a convenience tool provided to help with the wrangling of multiple Record Batches.

```python
table = pa.Table.from_batches([batch] * 3)
table
```

```
## pyarrow.Table
## num: int8
## year: int64
## name: string
## ----
## num: [[1,2,3,2],[1,2,3,2],[1,2,3,2]]
## year: [[2019,2020,2021,2022],[2019,2020,2021,2022],[2019,2020,2021,2022]]
## name: [["Alice","Bob","Carol","Dave"],["Alice","Bob","Carol","Dave"],["Alice","Bob","Carol","Dave"]]
```

--

.pull-left[
```python
table.num_columns
```

```
## 3
```

```python
table.num_rows
```

```
## 12
```
]

.pull-right[
```python
table.to_pandas()
```

```
##     num  year   name
## 0     1  2019  Alice
## 1     2  2020    Bob
## 2     3  2021  Carol
## 3     2  2022   Dave
## 4     1  2019  Alice
## 5     2  2020    Bob
## 6     3  2021  Carol
## 7     2  2022   Dave
## 8     1  2019  Alice
## 9     2  2020    Bob
## 10    3  2021  Carol
## 11    2  2022   Dave
```
]

---
## Chunked Array

The columns of a `table` are therefore composed of the columns of each of its batches; these are stored as ChunkedArrays instead of Arrays to reflect this.
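A minimal sketch of inspecting this chunking directly, using the `table` built above:

```python
col = table["name"]
col.num_chunks         # 3 - one chunk per original record batch
col.chunk(0)           # the underlying Array for the first chunk
col.combine_chunks()   # flatten the chunks into a single contiguous array
```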
.pull-left[
```python
table["name"]
```

```
## <pyarrow.lib.ChunkedArray object at 0x2a2fbcd60>
## [
##   [
##     "Alice",
##     "Bob",
##     "Carol",
##     "Dave"
##   ],
##   [
##     "Alice",
##     "Bob",
##     "Carol",
##     "Dave"
##   ],
##   [
##     "Alice",
##     "Bob",
##     "Carol",
##     "Dave"
##   ]
## ]
```
]

.pull-right[
```python
table[1]
```

```
## <pyarrow.lib.ChunkedArray object at 0x2a2fbce50>
## [
##   [
##     2019,
##     2020,
##     2021,
##     2022
##   ],
##   [
##     2019,
##     2020,
##     2021,
##     2022
##   ],
##   [
##     2019,
##     2020,
##     2021,
##     2022
##   ]
## ]
```
]

---
## Arrow + NumPy

Conversion between NumPy arrays and Arrow arrays is straightforward:

```python
np.linspace(0,1,11)
```

```
## array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
```

```python
pa.array( np.linspace(0,1,6) )
```

```
## <pyarrow.lib.DoubleArray object at 0x2a2fa8880>
## [
##   0,
##   0.2,
##   0.4,
##   0.6000000000000001,
##   0.8,
##   1
## ]
```

```python
pa.array(range(10)).to_numpy()
```

```
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

---
## NumPy & data copies

```python
pa.array(["hello", "world"]).to_numpy()
```

```
## ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```

```python
pa.array(["hello", "world"]).to_numpy(zero_copy_only=False)
```

```
## array(['hello', 'world'], dtype=object)
```

--

```python
pa.array([1,2,None,4]).to_numpy()
```

```
## ArrowInvalid: Needed to copy 1 chunks with 1 nulls, but zero_copy_only was True
```

```python
pa.array([1,2,None,4]).to_numpy(zero_copy_only=False)
```

```
## array([ 1.,  2., nan,  4.])
```

--

```python
pa.array([[1,2], [3,4], [5,6]]).to_numpy()
```

```
## ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```

```python
pa.array([[1,2], [3,4], [5,6]]).to_numpy(zero_copy_only=False)
```

```
## array([array([1, 2]), array([3, 4]), array([5, 6])], dtype=object)
```

---
## Arrow + Pandas

We've already seen some basic conversion of Arrow table objects to Pandas; the conversions here are a bit more complex than with NumPy, due in large part to how Pandas handles missing data.

.pull-left[ .small[

| Source (Pandas)       | Destination (Arrow)   |
|-----------------------|-----------------------|
| `bool`                | `BOOL`                |
| `(u)int{8,16,32,64}`  | `(U)INT{8,16,32,64}`  |
| `float32`             | `FLOAT`               |
| `float64`             | `DOUBLE`              |
| `str / unicode`       | `STRING`              |
| `pd.Categorical`      | `DICTIONARY`          |
| `pd.Timestamp`        | `TIMESTAMP(unit=ns)`  |
| `datetime.date`       | `DATE`                |
| `datetime.time`       | `TIME64`              |

] ]

.pull-right[ .small[

| Source (Arrow)                  | Destination (Pandas)                            |
|---------------------------------|-------------------------------------------------|
| `BOOL`                          | `bool`                                          |
| `BOOL` with nulls               | `object` (with values `True`, `False`, `None`)  |
| `(U)INT{8,16,32,64}`            | `(u)int{8,16,32,64}`                            |
| `(U)INT{8,16,32,64}` with nulls | `float64`                                       |
| `FLOAT`                         | `float32`                                       |
| `DOUBLE`                        | `float64`                                       |
| `STRING`                        | `str`                                           |
| `DICTIONARY`                    | `pd.Categorical`                                |
| `TIMESTAMP(unit=*)`             | `pd.Timestamp` (`np.datetime64[ns]`)            |
| `DATE`                          | `object` (with `datetime.date` objects)         |
| `TIME64`                        | `object` (with `datetime.time` objects)         |

] ]

.footnote[From the [Type differences](https://arrow.apache.org/docs/python/pandas.html#type-differences) documentation]

---
## Series & data copies

Due to these discrepancies, it is much more likely that converting from an Arrow array to a Pandas series will require a type to be changed, in which case the data will need to be copied. Like `to_numpy()`, the `to_pandas()` method accepts the `zero_copy_only` argument; however, it defaults to `False`.
.pull-left[ ```python pa.array([1,2,3,4]).to_pandas() ``` ``` ## 0 1 ## 1 2 ## 2 3 ## 3 4 ## dtype: int64 ``` ```python pa.array(["hello", "world"]).to_pandas() ``` ``` ## 0 hello ## 1 world ## dtype: object ``` ```python pa.array(["hello", "world"]).dictionary_encode().to_pandas() ``` ``` ## 0 hello ## 1 world ## dtype: category ## Categories (2, object): ['hello', 'world'] ``` ] .pull-right[ ```python pa.array([1,2,3,4]).to_pandas(zero_copy_only=True) ``` ``` ## 0 1 ## 1 2 ## 2 3 ## 3 4 ## dtype: int64 ``` ```python pa.array(["hello", "world"]).to_pandas(zero_copy_only=True) ``` ``` ## ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True ``` ```python pa.array(["hello", "world"]).dictionary_encode().to_pandas(zero_copy_only=True) ``` ``` ## ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True ``` ] .footnote[Note that Arrow arrays are converted to Series while tables & batches are converted to DataFrames] --- ## Zero Copy Series Conversions > Zero copy conversions from `Array` or `ChunkedArray` to NumPy arrays or pandas Series are possible in certain narrow cases: > > * The Arrow data is stored in an integer (signed or unsigned `int8` through `int64`) or floating point type (`float16` through `float64`). This includes many numeric types as well as timestamps. > > * The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas). > > * For `ChunkedArray`, the data consists of a single chunk, i.e. `arr.num_chunks == 1`. Multiple chunks will always require a copy because of pandas’s contiguousness requirement. > > In these scenarios, `to_pandas` or `to_numpy` will be zero copy. In all other scenarios, a copy will be required. --- ## DataFrame & data copies ```python table.to_pandas() ``` ``` ## num year name ## 0 1 2019 Alice ## 1 2 2020 Bob ## 2 3 2021 Carol ## 3 2 2022 Dave ## 4 1 2019 Alice ## 5 2 2020 Bob ## 6 3 2021 Carol ## 7 2 2022 Dave ## 8 1 2019 Alice ## 9 2 2020 Bob ## 10 3 2021 Carol ## 11 2 2022 Dave ``` ```python table.to_pandas(zero_copy_only=True) ``` ``` ## ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block ``` ```python table.drop(['name']).to_pandas(zero_copy_only=True) ``` ``` ## ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block ``` ```python pa.table([num,year], names=["num","year"]).to_pandas(zero_copy_only=True) ``` ``` ## ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block ``` .footnote[[Source](https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions)] --- ## Pandas -> Arrow To convert from a Pandas DataFrame to an Arrow Table we can use the `from_pandas()` method (schemas can also be inferred from DataFrames) ```python df = pd.DataFrame({ 'x': np.random.normal(size=5), 'y': ["A","A","B","C","C"], 'z': [1,2,3,4,5] }) pa.Table.from_pandas(df) ``` ``` ## pyarrow.Table ## x: double ## y: string ## z: int64 ## ---- ## x: [[0.7312616306514687,0.7492374224202655,0.016214128758116807,-0.11221364659471317,-1.068060983676372]] ## y: [["A","A","B","C","C"]] ## z: [[1,2,3,4,5]] ``` ```python pa.Schema.from_pandas(df) ``` ``` ## x: double ## y: string ## z: int64 ## -- schema metadata -- ## pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562 ``` .footnote[The import of Pandas indexes is governed by the `preserve_index` argument] --- class: center, middle ## An aside on tabular file formats --- ## Comma Separated Values This and other text & 
delimiter-based file formats are the most common and are generally considered the most portable; however, they have a number of significant drawbacks:

* no explicit schema or other metadata
* column types must be inferred from the data
* numerical values stored as text (efficiency and precision issues)
* limited compression options

---
## (Apache) Parquet

> ... provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO.

Core features:

> The values in each column are physically stored in contiguous memory locations and this columnar storage provides the following benefits:
>
> * Column-wise compression is efficient and saves storage space
> * Compression techniques specific to a type can be applied as the column values tend to be of the same type
> * Queries that fetch specific column values need not read the entire row data thus improving performance
> * Different encoding techniques can be applied to different columns

---
## Feather

> ... is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. Feather was created early in the Arrow project as a proof of concept for fast, language-agnostic data frame storage for Python (pandas) and R.

Core features:

* Direct columnar serialization of Arrow tables
* Supports all Arrow data types and compression
* Language agnostic
* Metadata makes it possible to read only the necessary columns for an operation

---
class: middle

.center[
## Example - File Format Performance
]

.footnote[
Based on [Apache Arrow: Read DataFrame With Zero Memory](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a)
]

---
## Building a large dataset

```python
np.random.seed(1234)

df = (
  pd.read_csv("https://sta663-sp22.github.io/slides/data/penguins.csv")
  .sample(1000000, replace=True)
  .reset_index(drop=True)
)

num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
df[num_cols] = df[num_cols] + np.random.normal(size=(df.shape[0],len(num_cols)))

df
```

```
##            species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
## 0        Chinstrap      Dream       48.519017      17.136138         200.642256  3800.527739    male  2008
## 1           Gentoo     Biscoe       49.819091      15.317831         223.507723  5550.240852    male  2008
## 2        Chinstrap      Dream       45.174882      17.558060         188.966751  3449.851303  female  2007
## 3           Adelie     Biscoe       43.043647      18.630037         201.280925  4048.906267    male  2008
## 4           Gentoo     Biscoe       44.926619      13.633214         209.714220  4400.958077  female  2008
## ...            ...        ...             ...            ...                ...          ...     ...   ...
## 999995 Chinstrap Dream 48.692352 18.184983 200.579977 3799.702510 male 2009 ## 999996 Adelie Biscoe 43.046500 20.182105 197.283516 4776.263332 male 2009 ## 999997 Adelie Torgersen 43.192871 17.278273 196.408173 4251.476411 male 2008 ## 999998 Gentoo Biscoe 53.208097 16.119915 230.537089 5498.259372 male 2009 ## 999999 Gentoo Biscoe 47.719684 13.876822 214.583392 4399.060693 female 2007 ## ## [1000000 rows x 8 columns] ``` --- ## Create output files ```python import os os.makedirs("scratch/", exist_ok=True) df.to_csv("scratch/penguins-large.csv") df.to_parquet("scratch/penguins-large.parquet") import pyarrow.feather pyarrow.feather.write_feather( pa.Table.from_pandas(df), "scratch/penguins-large.feather" ) pyarrow.feather.write_feather( pa.Table.from_pandas(df.dropna()), "scratch/penguins-large_nona.feather" ) ``` --- ## File Sizes ```python def file_size(f): x = os.path.getsize(f) print(f, "\t\t", round(x / (1024 * 1024),2), "MB") ``` ```python file_size( "scratch/penguins-large.csv" ) ``` ``` ## scratch/penguins-large.csv 100.91 MB ``` ```python file_size( "scratch/penguins-large.parquet" ) ``` ``` ## scratch/penguins-large.parquet 32.44 MB ``` ```python file_size( "scratch/penguins-large.feather" ) ``` ``` ## scratch/penguins-large.feather 48.93 MB ``` ```python file_size( "scratch/penguins-large_nona.feather" ) ``` ``` ## scratch/penguins-large_nona.feather 50.93 MB ``` --- ## Read Performance ```python %timeit pd.read_csv("scratch/penguins-large.csv") ``` ``` ## 566 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` ```python %timeit pd.read_parquet("scratch/penguins-large.parquet") ``` ``` ## 116 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` -- ```python %timeit pyarrow.csv.read_csv("scratch/penguins-large.csv") ``` ``` ## 34.2 ms ± 99 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` ```python %timeit pyarrow.parquet.read_table("scratch/penguins-large.parquet") ``` ``` ## 55.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` -- ```python %timeit pyarrow.feather.read_table("scratch/penguins-large.feather") ``` ``` ## 7.12 ms ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` ```python %timeit pyarrow.feather.read_table("scratch/penguins-large_nona.feather") ``` ``` ## 7.4 ms ± 263 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` --- ## Read Performance (Arrow -> Pandas) ```python %timeit pyarrow.csv.read_csv("scratch/penguins-large.csv").to_pandas() ``` ``` ## 88.9 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` -- ```python %timeit pyarrow.parquet.read_table("scratch/penguins-large.parquet").to_pandas() ``` ``` ## 109 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` -- ```python %timeit pyarrow.feather.read_feather("scratch/penguins-large.feather") ``` ``` ## 61.5 ms ± 333 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` ```python %timeit pyarrow.feather.read_feather("scratch/penguins-large_nona.feather") ``` ``` ## 60.6 ms ± 70.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` --- ## Column subset calculations - CSV & Parquet ```python %timeit pd.read_csv("scratch/penguins-large.csv")["flipper_length_mm"].mean() ``` ``` ## 568 ms ± 9.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` ```python %timeit pd.read_parquet("scratch/penguins-large.parquet", columns=["flipper_length_mm"]).mean() ``` ``` ## 9.74 ms ± 53.9 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each) ``` -- ```python %timeit pyarrow.parquet.read_table("scratch/penguins-large.parquet", columns=["flipper_length_mm"]).to_pandas().mean() ``` ``` ## 8.46 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` ```python %timeit pyarrow.parquet.read_table("scratch/penguins-large.parquet")["flipper_length_mm"].to_pandas().mean() ``` ``` ## 60.7 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` --- ## Column subset calculations - Feather ```python %timeit pyarrow.feather.read_table("scratch/penguins-large.feather", columns=["flipper_length_mm"]).to_pandas().mean() ``` ``` ## 8.1 ms ± 53.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` ```python %timeit pyarrow.feather.read_table("scratch/penguins-large.feather")["flipper_length_mm"].to_pandas().mean() ``` ``` ## 14.4 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` -- ```python %timeit pyarrow.feather.read_table("scratch/penguins-large_nona.feather", columns=["flipper_length_mm"]).to_pandas().mean() ``` ``` ## 4.5 ms ± 76.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` ```python %timeit pyarrow.feather.read_table("scratch/penguins-large_nona.feather")["flipper_length_mm"].to_pandas().mean() ``` ``` ## 10 ms ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` --- class: middle .center[ ## Demo 1 - Remote Files + Datasets ]
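As a preview, a minimal sketch of the `pyarrow.dataset` API applied to the local Parquet file from above (the demo itself covers remote files; the filter shown here is an arbitrary example):

```python
import pyarrow.dataset as ds

# open the parquet file written earlier as a dataset - no data is read yet
dataset = ds.dataset("scratch/penguins-large.parquet", format="parquet")

# read only the needed column, filtering rows during the scan
tbl = dataset.to_table(
  columns=["flipper_length_mm"],
  filter=ds.field("flipper_length_mm") > 200
)

tbl["flipper_length_mm"].to_pandas().mean()
```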