Lec 08 - pandas

class: center, middle, inverse, title-slide

# Lec 08 - pandas
## <br/> Statistical Computing and Computation
### Sta 663 | Spring 2022
### <br/> Dr. Colin Rundel

---

exclude: true

```python
import scipy
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

np.set_printoptions(edgeitems=3, linewidth=180)
```

```r
local({
  hook_old <- knitr::knit_hooks$get("error")  # save the old hook
  knitr::knit_hooks$set(error = function(x, options) {
    # now do whatever you want to do with x, and pass
    # the new x to the old hook
    x = sub("## \n## Detailed traceback:\n.*$", "", x)
    x = sub("Error in py_call_impl\$.*?\$\\: ", "", x)
    hook_old(x, options)
  })
})
```

---

## pandas?

pandas is an implementation of data frames in Python - it takes much of its inspiration from R and NumPy.

> pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

Key features:

* DataFrame object class

* Reading and writing tabular data

* Data munging (filtering, grouping, summarizing, joining, etc.)

* Data reshaping

---

## DataFrame

```python
import pandas as pd
```

.small[
.pull-left[

```python
iris = pd.read_csv("data/iris.csv")
iris
```

```
##      Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
## 0             5.1          3.5           1.4          0.2     setosa
## 1             4.9          3.0           1.4          0.2     setosa
## 2             4.7          3.2           1.3          0.2     setosa
## 3             4.6          3.1           1.5          0.2     setosa
## 4             5.0          3.6           1.4          0.2     setosa
## ..            ...          ...           ...          ...        ...
## 145           6.7          3.0           5.2          2.3  virginica
## 146           6.3          2.5           5.0          1.9  virginica
## 147           6.5          3.0           5.2          2.0  virginica
## 148           6.2          3.4           5.4          2.3  virginica
## 149           5.9          3.0           5.1          1.8  virginica
## 
## [150 rows x 5 columns]
```
]

.pull-right[

```python
print( type(iris) )
```

```
## <class 'pandas.core.frame.DataFrame'>
```

* Just like R a DataFrame is a collection of vectors with a common length

* Column types can be heterogeneous

* Both columns and rows can have names

]
]

---

## Series

The columns of a DataFrame are constructed as Series - a 1d array like object containing values of the same type (similar to an ndarray).

.pull-left[

```python
pd.Series([1,2,3,4])
```

```
## 0    1
## 1    2
## 2    3
## 3    4
## dtype: int64
```

```python
pd.Series(["C","B","A"])
```

```
## 0    C
## 1    B
## 2    A
## dtype: object
```

```python
pd.Series([True])
```

```
## 0    True
## dtype: bool
```
]

.pull-right[

```python
pd.Series(range(5))
```

```
## 0    0
## 1    1
## 2    2
## 3    3
## 4    4
## dtype: int64
```

```python
pd.Series([1,"A",True])
```

```
## 0       1
## 1       A
## 2    True
## dtype: object
```
]

---

## Series methods

Once constructed the components of a series can be accessed via `array()` and `index()` methods.

```python
s = pd.Series([4,2,1,3])
s.array
```

```
## <PandasArray>
## [4, 2, 1, 3]
## Length: 4, dtype: int64
```

```python
s.index
```

```
## RangeIndex(start=0, stop=4, step=1)
```

An index can also be explicitly provided when constructing a Series,

```python
t = pd.Series([4,2,1,3], index=["a","b","c","d"])
```

.pull-left[

```python
t
```

```
## a    4
## b    2
## c    1
## d    3
## dtype: int64
```
]

.pull-right[

```python
t.array
```

```
## <PandasArray>
## [4, 2, 1, 3]
## Length: 4, dtype: int64
```

```python
t.index
```

```
## Index(['a', 'b', 'c', 'd'], dtype='object')
```
]

---

## Series + NumPy

Series objects are compatible with NumPy like functions (vectorized)

```python
t = pd.Series([4,2,1,3], index=["a","b","c","d"])
```

```python
t + 1
```

```
## a    5
## b    3
## c    2
## d    4
## dtype: int64
```

```python
np.log(t)
```

```
## a    1.386294
## b    0.693147
## c    0.000000
## d    1.098612
## dtype: float64
```

```python
np.exp(-t**2/2)
```

```
## a    0.000335
## b    0.135335
## c    0.606531
## d    0.011109
## dtype: float64
```

---

## Series indexing

Series can be indexed in the same was as NumPy arrays with the addition of being able to use label(s) when selecting elements.

```python
t = pd.Series([4,2,1,3], index=["a","b","c","d"])
```

.pull-left[

```python
t[1]
```

```
## 2
```

```python
t[[1,2]]
```

```
## b    2
## c    1
## dtype: int64
```

```python
t["c"]
```

```
## 1
```

```python
t[["a","d"]]
```

```
## a    4
## d    3
## dtype: int64
```
]

.pull-right[

```python
t[t == 3]
```

```
## d    3
## dtype: int64
```

```python
t[t % 2 == 0]
```

```
## a    4
## b    2
## dtype: int64
```

```python
t["d"] = 6
t
```

```
## a    4
## b    2
## c    1
## d    6
## dtype: int64
```
]

---

## Index alignment

When performing (arithmetic) operations on series, they will attempt to align by their index,

```python
m = pd.Series([1,2,3,4], index = ["a","b","c","d"])
n = pd.Series([4,3,2,1], index = ["d","c","b","a"])
o = pd.Series([1,1,1,1,1], index = ["b","d","a","c","e"])
```

.pull-left[

```python
m + n
```

```
## a    2
## b    4
## c    6
## d    8
## dtype: int64
```

```python
n + m
```

```
## a    2
## b    4
## c    6
## d    8
## dtype: int64
```
]

.pull-right[

```python
n + o
```

```
## a    2.0
## b    3.0
## c    4.0
## d    5.0
## e    NaN
## dtype: float64
```
]

---

## Series and dicts

Series can also be constructed from dicts, in which case the keys are used to form the index,

```python
d = {"anna": "A+", "bob": "B-", "carol": "C", "dave": "D+"}
pd.Series(d)
```

```
## anna     A+
## bob      B-
## carol     C
## dave     D+
## dtype: object
```

Index order will follow key order, unless overriden by `index`,

```python
pd.Series(d, index = ["dave","carol","bob","anna"])
```

```
## dave     D+
## carol     C
## bob      B-
## anna     A+
## dtype: object
```

---

## Missing values

Pandas encodes missing values using NaN (mostly),

.pull-left[

```python
s = pd.Series(
  {"anna": "A+", "bob": "B-", "carol": "C", "dave": "D+"}, 
  index = ["erin","dave","carol","bob","anna"]
)
```

```python
s
```

```
## erin     NaN
## dave      D+
## carol      C
## bob       B-
## anna      A+
## dtype: object
```

```python
pd.isna(s)
```

```
## erin      True
## dave     False
## carol    False
## bob      False
## anna     False
## dtype: bool
```

]

.pull-right[

```python
s = pd.Series(
  {"anna": 97, "bob": 82, "carol": 75, "dave": 68}, 
  index = ["erin","dave","carol","bob","anna"],
  dtype = 'int64'
)
```

```python
s
```

```
## erin      NaN
## dave     68.0
## carol    75.0
## bob      82.0
## anna     97.0
## dtype: float64
```

```python
pd.isna(s)
```

```
## erin      True
## dave     False
## carol    False
## bob      False
## anna     False
## dtype: bool
```
]

---

## Aside - why `np.isna()`?

```python
s = pd.Series([1,2,3,None])
s
```

```
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## dtype: float64
```

.pull-left[

```python
pd.isna(s)
```

```
## 0    False
## 1    False
## 2    False
## 3     True
## dtype: bool
```

```python
s == np.nan
```

```
## 0    False
## 1    False
## 2    False
## 3    False
## dtype: bool
```
]

.pull-right[

```python
np.nan == np.nan
```

```
## False
```

```python
np.nan != np.nan
```

```
## True
```
]

---

## Native NAs

Recent versions of pandas have attempted to adopt a more native missing value, particularly for integer and boolean types,

.pull-left[

```python
pd.Series([1,2,3,None])
```

```
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## dtype: float64
```

```python
pd.Series([True,False,None])
```

```
## 0     True
## 1    False
## 2     None
## dtype: object
```
]

.pull-right[

```python
pd.isna( pd.Series([1,2,3,None]) )
```

```
## 0    False
## 1    False
## 2    False
## 3     True
## dtype: bool
```

```python
pd.isna( pd.Series([True,False,None]) )
```

```
## 0    False
## 1    False
## 2     True
## dtype: bool
```
]

We can force things by setting the Series dtype,

<div>
.pull-left[

```python
pd.Series([1,2,3,None], dtype = pd.Int64Dtype())
```

```
## 0       1
## 1       2
## 2       3
## 3    <NA>
## dtype: Int64
```
]

.pull-right[

```python
pd.Series([True, False,None], dtype = pd.BooleanDtype())
```

```
## 0     True
## 1    False
## 2     <NA>
## dtype: boolean
```
]
</div>

---

## String series

Series containing strings can be accessed via the `str` attribute,

.pull-left[

```python
s = pd.Series(["the quick", "brown fox", "jumps over", "a lazy dog"])
s
```

```
## 0     the quick
## 1     brown fox
## 2    jumps over
## 3    a lazy dog
## dtype: object
```

```python
s.str.upper()
```

```
## 0     THE QUICK
## 1     BROWN FOX
## 2    JUMPS OVER
## 3    A LAZY DOG
## dtype: object
```

```python
s.str.split(" ")
```

```
## 0      [the, quick]
## 1      [brown, fox]
## 2     [jumps, over]
## 3    [a, lazy, dog]
## dtype: object
```
]

.pull-right[

```python
s.str.split(" ").str[1]
```

```
## 0    quick
## 1      fox
## 2     over
## 3     lazy
## dtype: object
```

```python
pd.Series([1,2,3]).str
```

```
## AttributeError: Can only use .str accessor with string values!
```
]

---

## Categorical Series

.pull-left[ .small[

```python
pd.Series(["Mon", "Tue", "Wed", "Thur", "Fri"])
```

```
## 0     Mon
## 1     Tue
## 2     Wed
## 3    Thur
## 4     Fri
## dtype: object
```
] ]

.pull-right[ .small[

```python
pd.Series(["Mon", "Tue", "Wed", "Thur", "Fri"], dtype="category")
```

```
## 0     Mon
## 1     Tue
## 2     Wed
## 3    Thur
## 4     Fri
## dtype: category
## Categories (5, object): ['Fri', 'Mon', 'Thur', 'Tue', 'Wed']
```
] ]

.small[

```python
pd.Series(["Mon", "Tue", "Wed", "Thur", "Fri"], dtype=pd.CategoricalDtype(ordered=True))
```

```
## 0     Mon
## 1     Tue
## 2     Wed
## 3    Thur
## 4     Fri
## dtype: category
## Categories (5, object): ['Fri' < 'Mon' < 'Thur' < 'Tue' < 'Wed']
```

```python
pd.Series(
  ["Tue", "Thur", "Mon", "Sat"], 
  dtype=pd.CategoricalDtype(categories=["Mon", "Tue", "Wed", "Thur", "Fri"], ordered=True)
)
```

```
## 0     Tue
## 1    Thur
## 2     Mon
## 3     NaN
## dtype: category
## Categories (5, object): ['Mon' < 'Tue' < 'Wed' < 'Thur' < 'Fri']
```

]

---

## Constructing DataFrames

We just saw reading a DataFrame in via `read_csv()`, but data frames can also be constructed via `DataFrame()`, general this is done via dict of columns:

.footnote[
See more io functions [here](https://pandas.pydata.org/docs/reference/io.html)
]

```python
n = 5
d = {
  "id":     np.random.randint(100, 999, n),
  "weight": np.random.normal(70, 20, n),
  "height": np.random.normal(170, 15, n),
  "date":   pd.date_range(start='2/1/2022', periods=n, freq='D')
}
d
```

```
## {'id': array([168, 615, 346, 390, 556]), 'weight': array([102.91553493,  71.7673641 ,  76.66605866,  74.73546509,  50.53848772]), 'height': array([188.67776869, 155.90780119, 171.38683943, 173.15100756, 183.08340686]), 'date': DatetimeIndex(['2022-02-01', '2022-02-02', '2022-02-03', '2022-02-04',
##                '2022-02-05'],
##               dtype='datetime64[ns]', freq='D')}
```

```python
df = pd.DataFrame(d)
df
```

```
##     id      weight      height       date
## 0  168  102.915535  188.677769 2022-02-01
## 1  615   71.767364  155.907801 2022-02-02
## 2  346   76.666059  171.386839 2022-02-03
## 3  390   74.735465  173.151008 2022-02-04
## 4  556   50.538488  183.083407 2022-02-05
```

---

## DataFrame from nparray

For 2d ndarrays it is also possible to construct a DataFrame - generally it is a good idea to provide column names and row names (indexes)

.pull-left[

```python
pd.DataFrame(
  np.diag([1,2,3]),
  columns = ["x","y","z"]
)
```

```
##    x  y  z
## 0  1  0  0
## 1  0  2  0
## 2  0  0  3
```

```python
pd.DataFrame(
  np.diag([1,2,3]),
  columns = ["x","y","z"]
)
```

```
##    x  y  z
## 0  1  0  0
## 1  0  2  0
## 2  0  0  3
```
]

.pull-right[

```python
pd.DataFrame(
  np.tri(5,3,-1),
  columns = ["x","y","z"],
  index = ["a","b","c","d","e"]
)
```

```
##      x    y    z
## a  0.0  0.0  0.0
## b  1.0  0.0  0.0
## c  1.0  1.0  0.0
## d  1.0  1.0  1.0
## e  1.0  1.0  1.0
```
]

---

## DataFrame indexing

.small[

```python
df
```

.pull-left[ 
.midi[Selecting a column,]
.small[

```python
df[0]
```

```
## KeyError: 0
```

```python
df["id"]
```

```
## 0    168
## 1    615
## 2    346
## 3    390
## 4    556
## Name: id, dtype: int64
```

```python
df.id
```

```
## 0    168
## 1    615
## 2    346
## 3    390
## 4    556
## Name: id, dtype: int64
```
] ]

.pull-right[ 
.midi[Selecting rows, a single slice is assumed to refer to the rows]
.small[

```python
df[1:3]
```

```
##     id     weight      height       date
## 1  615  71.767364  155.907801 2022-02-02
## 2  346  76.666059  171.386839 2022-02-03
```

```python
df[0::2]
```

```
##     id      weight      height       date
## 0  168  102.915535  188.677769 2022-02-01
## 2  346   76.666059  171.386839 2022-02-03
## 4  556   50.538488  183.083407 2022-02-05
```
] ]

---

## Index by position

.small[

```python
df
```

.pull-left[ .small[

```python
df.iloc[1]
```

```
## id                        615
## weight              71.767364
## height             155.907801
## date      2022-02-02 00:00:00
## Name: 1, dtype: object
```

```python
df.iloc[[1]]
```

```
##     id     weight      height       date
## 1  615  71.767364  155.907801 2022-02-02
```

```python
df.iloc[0:2]
```

```
##     id      weight      height       date
## 0  168  102.915535  188.677769 2022-02-01
## 1  615   71.767364  155.907801 2022-02-02
```

```python
df.iloc[lambda x: x.index % 2 != 0]
```

```
##     id     weight      height       date
## 1  615  71.767364  155.907801 2022-02-02
## 3  390  74.735465  173.151008 2022-02-04
```
] ]

.pull-right[ .small[

```python
df.iloc[1:3,1:3]
```

```
##       weight      height
## 1  71.767364  155.907801
## 2  76.666059  171.386839
```

```python
df.iloc[0:3, [0,3]]
```

```
##     id       date
## 0  168 2022-02-01
## 1  615 2022-02-02
## 2  346 2022-02-03
```

```python
df.iloc[0:3, [True, True, False, False]]
```

```
##     id      weight
## 0  168  102.915535
## 1  615   71.767364
## 2  346   76.666059
```
] ]

---

## Index by name

.small[

```python
df.index = (["anna","bob","carol", "dave", "erin"])
df
```

```
##         id      weight      height       date
## anna   168  102.915535  188.677769 2022-02-01
## bob    615   71.767364  155.907801 2022-02-02
## carol  346   76.666059  171.386839 2022-02-03
## dave   390   74.735465  173.151008 2022-02-04
## erin   556   50.538488  183.083407 2022-02-05
```
]

.pull-left[ .small[

```python
df.loc["anna"]
```

```
## id                        168
## weight             102.915535
## height             188.677769
## date      2022-02-01 00:00:00
## Name: anna, dtype: object
```

```python
df.loc[["anna"]]
```

```
##        id      weight      height       date
## anna  168  102.915535  188.677769 2022-02-01
```

```python
df.loc["bob":"dave"]
```

```
##         id     weight      height       date
## bob    615  71.767364  155.907801 2022-02-02
## carol  346  76.666059  171.386839 2022-02-03
## dave   390  74.735465  173.151008 2022-02-04
```

```python
df.loc[df.id < 300]
```

```
##        id      weight      height       date
## anna  168  102.915535  188.677769 2022-02-01
```
] ]

.pull-right[ .small[

```python
df.loc[:, "date"]
```

```
## anna    2022-02-01
## bob     2022-02-02
## carol   2022-02-03
## dave    2022-02-04
## erin    2022-02-05
## Name: date, dtype: datetime64[ns]
```

```python
df.loc[["bob","erin"], "weight":"height"]
```

```
##          weight      height
## bob   71.767364  155.907801
## erin  50.538488  183.083407
```

```python
df.loc[0:2, "weight":"height"]
```

```
## TypeError: cannot do slice indexing on Index with these indexers [0] of type int
```
] ]

---

## Views vs. Copies

In general most pandas operations will generate a new object but some will return views, mostly the later occurs with subsetting.

.pull-left[ .small[

```python
d = pd.DataFrame(np.arange(6).reshape(3,2), columns = ["x","y"])
d
```

```
##    x  y
## 0  0  1
## 1  2  3
## 2  4  5
```

```python
v = d.iloc[0:2,0:2]
v
```

```
##    x  y
## 0  0  1
## 1  2  3
```

```python
d.iloc[0,1] = -1
v
```

```
##    x  y
## 0  0 -1
## 1  2  3
```
] ]

.pull-right[ .small[

```python
v.iloc[0,0] = np.pi
v
```

```
##           x  y
## 0  3.141593 -1
## 1  2.000000  3
```

```python
d
```

```
##    x  y
## 0  0 -1
## 1  2  3
## 2  4  5
```
] ]

.footnote[See the documetation [here](http://pandas-docs.github.io/pandas-docs-travis/user_guide/indexing.html#indexing-view-versus-copy) for more details]
---

## Filtering rows

The `query()` method can be used for filtering rows, it evaluates a string expression in the context of the data frame.

.small[

```python
df
```

.pull-left[ .small[

```python
df.query('date == "2022-02-01"')
```

```
##        id      weight      height       date
## anna  168  102.915535  188.677769 2022-02-01
```

```python
df.query('weight > 50')
```

.pull-right[ .small[

```python
df.query('weight > 50 & height < 165')
```

```
##       id     weight      height       date
## bob  615  71.767364  155.907801 2022-02-02
```

```python
qid = 414
df.query('id == @qid')
```

```
## Empty DataFrame
## Columns: [id, weight, height, date]
## Index: []
```
] ]

.footnote[For more details on query syntax see [here](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-query)]

---

## Element access

.small[

```python
df
```

.pull-left[ .small[

```python
df[0,0]
```

```
## KeyError: (0, 0)
```

```python
df.iat[0,0]
```

```
## 168
```

```python
df.id[0]
```

```
## 168
```

```python
df[0:1].id[0]
```

```
## 168
```
] ]

.pull-right[ .small[

```python
df["anna", "id"]
```

```
## KeyError: ('anna', 'id')
```

```python
df.at["anna", "id"]
```

```
## 168
```

```python
df["id"]["anna"]
```

```
## 168
```

```python
df["id"][0]
```

```
## 168
```
] ]

---

## DataFrame properties

```python
df
```

.pull-left[

```python
df.size
```

```
## 20
```

```python
df.shape
```

```
## (5, 4)
```

```python
df.info()
```

```
## <class 'pandas.core.frame.DataFrame'>
## Index: 5 entries, anna to erin
## Data columns (total 4 columns):
##  #   Column  Non-Null Count  Dtype         
## ---  ------  --------------  -----         
##  0   id      5 non-null      int64         
##  1   weight  5 non-null      float64       
##  2   height  5 non-null      float64       
##  3   date    5 non-null      datetime64[ns]
## dtypes: datetime64[ns](1), float64(2), int64(1)
## memory usage: 372.0+ bytes
```
]

.pull-right[

```python
df.dtypes
```

```
## id                 int64
## weight           float64
## height           float64
## date      datetime64[ns]
## dtype: object
```

```python
df.describe()
```

```
##                id      weight      height
## count    5.000000    5.000000    5.000000
## mean   415.000000   75.324582  174.441365
## std    177.676673   18.644111   12.568296
## min    168.000000   50.538488  155.907801
## 25%    346.000000   71.767364  171.386839
## 50%    390.000000   74.735465  173.151008
## 75%    556.000000   76.666059  183.083407
## max    615.000000  102.915535  188.677769
```
]

---

## Selecting Columns

Beyond the use of `loc()` and `iloc()` there is also the `filter()` method which can be used to select columns (or indices) by name with pattern matching

.small[

```python
df
```

.pull-left[ .small[

```python
df.filter(items=["id","weight"])
```

```
##         id      weight
## anna   168  102.915535
## bob    615   71.767364
## carol  346   76.666059
## dave   390   74.735465
## erin   556   50.538488
```

```python
df.filter(like = "i")
```

```
##         id      weight      height
## anna   168  102.915535  188.677769
## bob    615   71.767364  155.907801
## carol  346   76.666059  171.386839
## dave   390   74.735465  173.151008
## erin   556   50.538488  183.083407
```
] ]

.pull-right[ .small[

```python
df.filter(regex="ght$")
```

```
##            weight      height
## anna   102.915535  188.677769
## bob     71.767364  155.907801
## carol   76.666059  171.386839
## dave    74.735465  173.151008
## erin    50.538488  183.083407
```

```python
df.filter(like="o", axis=0)
```

```
##         id     weight      height       date
## bob    615  71.767364  155.907801 2022-02-02
## carol  346  76.666059  171.386839 2022-02-03
```
] ]

---

## Adding columns

Indexing with assignment allows for inplace modification of a DataFrame, while `assign()` creates a new object (but is chainable)

.small[

```python
df
```

.pull-left[ .small[

```python
df['student'] = [True, True, True, False, None]
df['age'] = [19, 22, 25, None, None]
df
```

```
##         id      weight      height       date student   age
## anna   168  102.915535  188.677769 2022-02-01    True  19.0
## bob    615   71.767364  155.907801 2022-02-02    True  22.0
## carol  346   76.666059  171.386839 2022-02-03    True  25.0
## dave   390   74.735465  173.151008 2022-02-04   False   NaN
## erin   556   50.538488  183.083407 2022-02-05    None   NaN
```
] ]

.pull-right[ .small[

```python
df.assign(
  student = lambda x: np.where(x.student, "yes", "no"),
  rand = np.random.rand(5)
)
```

```
##         id      weight      height       date student   age      rand
## anna   168  102.915535  188.677769 2022-02-01     yes  19.0  0.607726
## bob    615   71.767364  155.907801 2022-02-02     yes  22.0  0.541657
## carol  346   76.666059  171.386839 2022-02-03     yes  25.0  0.562392
## dave   390   74.735465  173.151008 2022-02-04      no   NaN  0.849173
## erin   556   50.538488  183.083407 2022-02-05      no   NaN  0.440295
```

```python
df
```

---

## Removing columns (and rows)

Columns can be dropped via the `drop()` method,

.small[

```python
df
```

.pull-left[ .small[

```python
df.drop(['student'])
```

```
## KeyError: "['student'] not found in axis"
```

```python
df.drop(['student'], axis=1)
```

```
##         id      weight      height       date   age
## anna   168  102.915535  188.677769 2022-02-01  19.0
## bob    615   71.767364  155.907801 2022-02-02  22.0
## carol  346   76.666059  171.386839 2022-02-03  25.0
## dave   390   74.735465  173.151008 2022-02-04   NaN
## erin   556   50.538488  183.083407 2022-02-05   NaN
```

```python
df.drop(['anna','dave'])
```

```
##         id     weight      height       date student   age
## bob    615  71.767364  155.907801 2022-02-02    True  22.0
## carol  346  76.666059  171.386839 2022-02-03    True  25.0
## erin   556  50.538488  183.083407 2022-02-05    None   NaN
```
] ]

.pull-right[ .small[

```python
df.drop(columns = df.columns == "age")
```

```
## KeyError: '[False, False, False, False, False, True] not found in axis'
```

```python
df.drop(columns = df.columns[df.columns == "age"])
```

```
##         id      weight      height       date student
## anna   168  102.915535  188.677769 2022-02-01    True
## bob    615   71.767364  155.907801 2022-02-02    True
## carol  346   76.666059  171.386839 2022-02-03    True
## dave   390   74.735465  173.151008 2022-02-04   False
## erin   556   50.538488  183.083407 2022-02-05    None
```

```python
df.drop(columns = df.columns[df.columns.str.contains("ght")])
```

```
##         id       date student   age
## anna   168 2022-02-01    True  19.0
## bob    615 2022-02-02    True  22.0
## carol  346 2022-02-03    True  25.0
## dave   390 2022-02-04   False   NaN
## erin   556 2022-02-05    None   NaN
```
] ]

---

## Dropping missing values

Columns can be dropped via the `drop()` method,

.small[

```python
df
```

.pull-left[ .small[

```python
df.dropna()
```

```python
df.dropna(how="all")
```

.pull-right[ .small[

```python
df.dropna(axis=1)
```

```python
df.dropna(axis=1, thresh=4)
```

---

## Sorting

DataFrames can be sorted on one or more axes via `sort_values()`,

```python
df
```

```python
df.sort_values(by=["student","id"], ascending=[True,False])
```

```
##         id      weight      height       date student   age
## dave   390   74.735465  173.151008 2022-02-04   False   NaN
## bob    615   71.767364  155.907801 2022-02-02    True  22.0
## carol  346   76.666059  171.386839 2022-02-03    True  25.0
## anna   168  102.915535  188.677769 2022-02-01    True  19.0
## erin   556   50.538488  183.083407 2022-02-05    None   NaN
```

---

## Row binds

DataFrames can have their rows joined via the the `concat()` function (`append()` is also available but deprecated),

.pull-left[ .small[

```python
df1 = pd.DataFrame(np.arange(6).reshape(3,2), columns=list("xy"))
df1
```

```
##    x  y
## 0  0  1
## 1  2  3
## 2  4  5
```
] ]

.pull-right[ .small[

```python
df2 = pd.DataFrame(np.arange(12,6,-1).reshape(3,2), columns=list("xy"))
df2
```

```
##     x   y
## 0  12  11
## 1  10   9
## 2   8   7
```
] ]

<div>

.pull-left[ .small[

```python
pd.concat([df1,df2])
```

```
##     x   y
## 0   0   1
## 1   2   3
## 2   4   5
## 0  12  11
## 1  10   9
## 2   8   7
```
] ]

.pull-right[ .small[

```python
pd.concat([df1.loc[:,["y","x"]],df2])
```

```
##     y   x
## 0   1   0
## 1   3   2
## 2   5   4
## 0  11  12
## 1   9  10
## 2   7   8
```
] ]

</div>

---

## Imputing columns

```python
df3 = pd.DataFrame(np.ones((3,3)), columns=list("xbz"))
df3
```

```
##      x    b    z
## 0  1.0  1.0  1.0
## 1  1.0  1.0  1.0
## 2  1.0  1.0  1.0
```

```python
pd.concat([df1,df3,df2])
```

```
##       x     y    b    z
## 0   0.0   1.0  NaN  NaN
## 1   2.0   3.0  NaN  NaN
## 2   4.0   5.0  NaN  NaN
## 0   1.0   NaN  1.0  1.0
## 1   1.0   NaN  1.0  1.0
## 2   1.0   NaN  1.0  1.0
## 0  12.0  11.0  NaN  NaN
## 1  10.0   9.0  NaN  NaN
## 2   8.0   7.0  NaN  NaN
```

---

## Column binds

Similarly, columns can be joined with `concat()` where `axis=1`,

.pull-left[ .small[

```python
df1 = pd.DataFrame(np.arange(6).reshape(3,2), columns=list("xy"), index=list("abc"))
df1
```

```
##    x  y
## a  0  1
## b  2  3
## c  4  5
```
] ]

.pull-right[ .small[

```python
df2 = pd.DataFrame(np.arange(10,6,-1).reshape(2,2), columns=list("mn"), index=list("ac"))
df2
```

```
##     m  n
## a  10  9
## c   8  7
```
] ]

<div>

.pull-left[ .small[

```python
pd.concat([df1,df2], axis=1)
```

```
##    x  y     m    n
## a  0  1  10.0  9.0
## b  2  3   NaN  NaN
## c  4  5   8.0  7.0
```
] ]

.pull-right[ .small[

```python
pd.concat([df1,df2], axis=1, join="inner")
```

```
##    x  y   m  n
## a  0  1  10  9
## c  4  5   8  7
```
] ]

</div>

---

## Joining DataFrames

Table joins are implemented via the `merge()` function or method,

.pull-left[ .small[

```python
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df1
```

```
##      a  b
## 0  foo  1
## 1  bar  2
```
] ]

.pull-right[ .small[

```python
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
df2
```

```
##      a  c
## 0  foo  3
## 1  baz  4
```
] ]

<div>

.pull-left[ .small[

```python
pd.merge(df1,df2, how="inner")
```

```
##      a  b  c
## 0  foo  1  3
```

```python
pd.merge(df1,df2, how="outer", on="a")
```

```
##      a    b    c
## 0  foo  1.0  3.0
## 1  bar  2.0  NaN
## 2  baz  NaN  4.0
```
] ]

.pull-right[ .small[

```python
df1.merge(df2, how="left")
```

```
##      a  b    c
## 0  foo  1  3.0
## 1  bar  2  NaN
```

```python
df1.merge(df2, how="right")
```

```
##      a    b  c
## 0  foo  1.0  3
## 1  baz  NaN  4
```
] ]

</div>

---

## join vs merge vs concat

All three can be used to accomplish the same thing, in terms of "column bind" type operations.

* `concat()` stacks DataFrames on either axis, with basic alignment based on (row) indexes. `join` argument only supports "inner" and "outer".

* `merge()` aligns based on one or more shared columns. `how` supports "inner", "outer", "left", "right", and "cross".

* `join()` uses `merge()` behind the scenes, but prefers to join based on (row) indexes. Also has different default `how` compared to `merge()`, "left" vs "inner".

---

## groupby and agg

Groups can be created within a DataFrame via `groupby()` - these groups are then used by the standard summary methods (e.g. `sum()`, `mean()`, `std()`, etc.).

```python
df
```

```python
df.groupby("student")
```

```
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x14c653e20>
```

.pull-left[

```python
df.groupby("student").groups
```

```
## {False: ['dave'], True: ['anna', 'bob', 'carol']}
```

```python
df.groupby("student").mean()
```

```
##                  id     weight      height   age
## student                                         
## False    390.000000  74.735465  173.151008   NaN
## True     376.333333  83.782986  171.990803  22.0
```
]

.pull-right[

```python
df.groupby("student", dropna=False).groups
```

```
## ValueError: Categorical categories cannot be null
```

```python
df.groupby("student", dropna=False).mean()
```

```
##                  id     weight      height   age
## student                                         
## False    390.000000  74.735465  173.151008   NaN
## True     376.333333  83.782986  171.990803  22.0
## NaN      556.000000  50.538488  183.083407   NaN
```
]

---

## Selecting groups

```python
df
```

```python
df.groupby("student").get_group(True)
```

```python
df.groupby("student").get_group(False)
```

```
##        id     weight      height       date student  age
## dave  390  74.735465  173.151008 2022-02-04   False  NaN
```

```python
df.groupby("student", dropna=False).get_group(np.nan)
```

```
## KeyError: nan
```

---

## Aggregation

.small[

```python
df = df.drop("date", axis=1)
```
]

.small[

```python
df.groupby("student").agg("mean")
```

```
##                  id     weight      height   age
## student                                         
## False    390.000000  74.735465  173.151008   NaN
## True     376.333333  83.782986  171.990803  22.0
```

```python
df.groupby("student").agg([np.mean, np.std])
```

```
##                  id                 weight  ...    height   age     
##                mean         std       mean  ...       std  mean  std
## student                                     ...                     
## False    390.000000         NaN  74.735465  ...       NaN   NaN  NaN
## True     376.333333  225.038515  83.782986  ...  16.39333  22.0  3.0
## 
## [2 rows x 8 columns]
```
]

--
.small[

```python
df.groupby("student").agg([np.mean, np.std]).columns
```

```
## MultiIndex([(    'id', 'mean'),
##             (    'id',  'std'),
##             ('weight', 'mean'),
##             ('weight',  'std'),
##             ('height', 'mean'),
##             ('height',  'std'),
##             (   'age', 'mean'),
##             (   'age',  'std')],
##            )
```
]

.footnote[More on multindexes and other aggregation/summary methods next time.]