class: center, middle, inverse, title-slide # Lec 09 - more pandas ##
Statistical Computing and Computation ### Sta 663 | Spring 2022 ###
Dr. Colin Rundel --- exclude: true ```python import scipy import numpy as np import matplotlib.pyplot as plt import pandas as pd np.set_printoptions(edgeitems=3, linewidth=180) ``` ```r local({ hook_err_old <- knitr::knit_hooks$get("error") # save the old hook knitr::knit_hooks$set(error = function(x, options) { # now do whatever you want to do with x, and pass # the new x to the old hook x = sub("## \n## Detailed traceback:\n.*$", "", x) x = sub("Error in py_call_impl\\(.*?\\)\\: ", "", x) hook_err_old(x, options) }) hook_warn_old <- knitr::knit_hooks$get("warning") # save the old hook knitr::knit_hooks$set(warning = function(x, options) { x = sub("<string>:1: ", "", x) hook_warn_old(x, options) }) }) ``` --- class: center, middle # Index objects --- ## Columns and Indexes When constructing a DataFrame we can specify the indexes for both the rows (`index`) and columns (`index`), .pull-left[ ```python df = pd.DataFrame( np.random.randn(5, 3), columns=['A', 'B', 'C'] ) df ``` ``` ## A B C ## 0 -0.099711 -0.381847 -2.392402 ## 1 0.544186 1.175953 1.237503 ## 2 -0.605081 1.869954 0.618847 ## 3 -0.988612 0.656876 -0.179668 ## 4 -0.400453 0.555089 1.098572 ``` ```python df.columns ``` ``` ## Index(['A', 'B', 'C'], dtype='object') ``` ```python df.index ``` ``` ## RangeIndex(start=0, stop=5, step=1) ``` ] .pull-right[ ```python df = pd.DataFrame( np.random.randn(3, 3), index=['x','y','z'], columns=['A', 'B', 'C'] ) df ``` ``` ## A B C ## x 0.612553 1.120081 -0.891120 ## y 0.990790 -1.405966 0.544711 ## z -1.958477 0.750223 0.655311 ``` ```python df.columns ``` ``` ## Index(['A', 'B', 'C'], dtype='object') ``` ```python df.index ``` ``` ## Index(['x', 'y', 'z'], dtype='object') ``` ] --- ## Index objects pandas' `Index` class and its subclasses provide the infrastructure necessary for lookups, data alignment, and other related tasks. You can think of them as being an immutable *multiset* (duplicate values are allowed). ```python pd.Index(['A','B','C']) ``` ``` ## Index(['A', 'B', 'C'], dtype='object') ``` ```python pd.Index(['A','B','C','A']) ``` ``` ## Index(['A', 'B', 'C', 'A'], dtype='object') ``` ```python pd.Index(range(5)) ``` ``` ## RangeIndex(start=0, stop=5, step=1) ``` ```python pd.Index(list(range(5))) ``` ``` ## Int64Index([0, 1, 2, 3, 4], dtype='int64') ``` --- ## Indexes as sets While it is not something you will need to do very often, since Indexs are "sets" the various set operations and methods are available. ```python a = pd.Index(['c', 'b', 'a']) b = pd.Index(['c', 'e', 'd']) ``` .pull-left[ ```python a.union(b) ``` ``` ## Index(['a', 'b', 'c', 'd', 'e'], dtype='object') ``` ```python a.intersection(b) ``` ``` ## Index(['c'], dtype='object') ``` ] .pull-right[ ```python a.difference(b) ``` ``` ## Index(['a', 'b'], dtype='object') ``` ```python a.symmetric_difference(b) ``` ``` ## Index(['a', 'b', 'd', 'e'], dtype='object') ``` ] -- <div> .pull-left[ ```python c = pd.Index([1.0, 1.5, 2.0]) d = pd.Index(range(5)) c.union(d) ``` ``` ## Float64Index([0.0, 1.0, 1.5, 2.0, 3.0, 4.0], dtype='float64') ``` ] .pull-right[ ```python e = pd.Index(["A","B","C"]) f = pd.Index(range(5)) e.union(f) ``` ``` ## Index(['A', 'B', 'C', 0, 1, 2, 3, 4], dtype='object') ``` ] </div> --- ## Index metadata You can attach names to an index, which will then show when displaying the DataFrame or Index, .pull-left[ ```python df = pd.DataFrame( np.random.randn(3, 3), index=pd.Index(['x','y','z'], name="rows"), columns=pd.Index(['A', 'B', 'C'], name="cols") ) df ``` ``` ## cols A B C ## rows ## x 0.300764 0.557300 0.067900 ## y -0.445863 -2.123754 -0.430039 ## z -0.963107 0.451464 -1.386723 ``` ```python df.columns ``` ``` ## Index(['A', 'B', 'C'], dtype='object', name='cols') ``` ```python df.index ``` ``` ## Index(['x', 'y', 'z'], dtype='object', name='rows') ``` ] -- .pull-right[ ```python df.columns.rename("m") ``` ``` ## Index(['A', 'B', 'C'], dtype='object', name='m') ``` ```python df.index.set_names("n") ``` ``` ## Index(['x', 'y', 'z'], dtype='object', name='n') ``` ```python df ``` ``` ## cols A B C ## rows ## x 0.300764 0.557300 0.067900 ## y -0.445863 -2.123754 -0.430039 ## z -0.963107 0.451464 -1.386723 ``` ```python df.columns.name = "o" df.index.rename("p", inplace=True) df ``` ``` ## o A B C ## p ## x 0.300764 0.557300 0.067900 ## y -0.445863 -2.123754 -0.430039 ## z -0.963107 0.451464 -1.386723 ``` ] --- ## Indexes and missing values It is possible for an index to contain missing values (e.g. `np.nan`) but this is generally a bad idea and should be avoided. ```python pd.Index([1,2,3,np.nan,5]) ``` ``` ## Float64Index([1.0, 2.0, 3.0, nan, 5.0], dtype='float64') ``` ```python pd.Index(["A","B",np.nan,"D"]) ``` ``` ## Index(['A', 'B', nan, 'D'], dtype='object') ``` Missing values can be replaced via the `fillna()` method, ```python pd.Index([1,2,3,np.nan,5]).fillna(0) ``` ``` ## Float64Index([1.0, 2.0, 3.0, 0.0, 5.0], dtype='float64') ``` ```python pd.Index(["A","B",np.nan,"D"]).fillna("Z") ``` ``` ## Index(['A', 'B', 'Z', 'D'], dtype='object') ``` --- ## Changing a DataFrame's index Existing columns can used as an index via `set_index()` and removed via `reset_index()`, ```python data ``` ``` ## a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` .pull-left[ ```python data.set_index('a') ``` ``` ## b c d ## a ## bar one z 1 ## bar two y 2 ## foo one x 3 ## foo two w 4 ``` ```python data.set_index('c', drop=False) ``` ``` ## a b c d ## c ## z bar one z 1 ## y bar two y 2 ## x foo one x 3 ## w foo two w 4 ``` ] .pull-right[ ```python data.set_index('a').reset_index() ``` ``` ## a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` ```python data.set_index('c').reset_index(drop=True) ``` ``` ## a b d ## 0 bar one 1 ## 1 bar two 2 ## 2 foo one 3 ## 3 foo two 4 ``` ] --- ## Creating a new index New index values can be attached to a DataFrame via `reindex()`, ```python data ``` ``` ## a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` .pull-left[ ```python data.reindex(["w","x","y","z"]) ``` ``` ## a b c d ## w NaN NaN NaN NaN ## x NaN NaN NaN NaN ## y NaN NaN NaN NaN ## z NaN NaN NaN NaN ``` ```python data.reindex(range(5,-1,-1)) ``` ``` ## a b c d ## 5 NaN NaN NaN NaN ## 4 NaN NaN NaN NaN ## 3 foo two w 4.0 ## 2 foo one x 3.0 ## 1 bar two y 2.0 ## 0 bar one z 1.0 ``` ] -- .pull-right[ ```python data.reindex(columns = ["a","b","c","d","e"]) ``` ``` ## a b c d e ## 0 bar one z 1 NaN ## 1 bar two y 2 NaN ## 2 foo one x 3 NaN ## 3 foo two w 4 NaN ``` ```python data.index = ["w","x","y","z"] data ``` ``` ## a b c d ## w bar one z 1 ## x bar two y 2 ## y foo one x 3 ## z foo two w 4 ``` ] --- ## Renaming levels Alternatively, row or column index levels can be renamed via `rename()`, ```python data ``` ``` ## a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` .pull-left[ ```python data.rename(index = pd.Series(["m","n","o","p"])) ``` ``` ## a b c d ## m bar one z 1 ## n bar two y 2 ## o foo one x 3 ## p foo two w 4 ``` ```python data.rename_axis(index="rows") ``` ``` ## a b c d ## rows ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` ] .pull-right[ ```python data.rename(columns = {"a":"w", "b":"x", "c":"y", "d":"z"}) ``` ``` ## w x y z ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` ```python data.rename_axis(columns="cols") ``` ``` ## cols a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` ] --- class: center, middle # MultiIndexes --- ## MultiIndex objects These are a hierarchical analog of standard Index objects, there are a number of methods for constructing them based on the initial object .pull-left[ ```python tuples = [('A','x'), ('A','y'), ('B','x'), ('B','y'), ('C','x'), ('C','y')] pd.MultiIndex.from_tuples(tuples, names=["1st","2nd"]) ``` ``` ## MultiIndex([('A', 'x'), ## ('A', 'y'), ## ('B', 'x'), ## ('B', 'y'), ## ('C', 'x'), ## ('C', 'y')], ## names=['1st', '2nd']) ``` ```python pd.MultiIndex.from_product([["A","B","C"],["x","y"]], names=["1st","2nd"]) ``` ``` ## MultiIndex([('A', 'x'), ## ('A', 'y'), ## ('B', 'x'), ## ('B', 'y'), ## ('C', 'x'), ## ('C', 'y')], ## names=['1st', '2nd']) ``` ] .pull-right[ ```python idx = pd.MultiIndex.from_tuples(tuples, names=["1st","2nd"]) pd.DataFrame(np.random.rand(6,2), index = idx, columns=["m","n"]) ``` ``` ## m n ## 1st 2nd ## A x 0.756919 0.890441 ## y 0.337214 0.719306 ## B x 0.422858 0.619788 ## y 0.299592 0.991226 ## C x 0.810381 0.157312 ## y 0.098790 0.996070 ``` ] --- ## Column MultiIndex ```python cidx = pd.MultiIndex.from_product([["A","B"],["x","y"]], names=["c1","c2"]) pd.DataFrame(np.random.rand(4,4), columns = cidx) ``` ``` ## c1 A B ## c2 x y x y ## 0 0.929432 0.824582 0.621965 0.072779 ## 1 0.136231 0.818250 0.100441 0.326754 ## 2 0.120870 0.580064 0.506232 0.764804 ## 3 0.639309 0.478697 0.318579 0.790524 ``` ```python ridx = pd.MultiIndex.from_product([["m","n"],["l","p"]], names=["r1","r2"]) pd.DataFrame(np.random.rand(4,4), index= ridx, columns = cidx) ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.182841 0.847452 0.770413 0.103425 ## p 0.494563 0.004254 0.297815 0.282648 ## n l 0.821943 0.999260 0.413331 0.046545 ## p 0.337158 0.984532 0.418133 0.358786 ``` --- ## MultiIndex indexing .pull-left[ ```python data ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ```python data["A"] ``` ``` ## c2 x y ## r1 r2 ## m l 0.019149 0.519056 ## p 0.219535 0.537471 ## n l 0.020447 0.817611 ## p 0.432398 0.854118 ``` ```python data["x"] ``` ``` ## KeyError: 'x' ``` ```python data["m"] ``` ``` ## KeyError: 'm' ``` ] -- .pull-right[ ```python data["m","A"] ``` ``` ## KeyError: ('m', 'A') ``` ```python data["A","x"] ``` ``` ## r1 r2 ## m l 0.019149 ## p 0.219535 ## n l 0.020447 ## p 0.432398 ## Name: (A, x), dtype: float64 ``` ```python data["A"]["x"] ``` ``` ## r1 r2 ## m l 0.019149 ## p 0.219535 ## n l 0.020447 ## p 0.432398 ## Name: x, dtype: float64 ``` ] --- ## MultiIndex indexing via `iloc` .small[ ```python data ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ] .pull-left[ .small[ ```python data.iloc[0] ``` ``` ## c1 c2 ## A x 0.019149 ## y 0.519056 ## B x 0.924092 ## y 0.996320 ## Name: (m, l), dtype: float64 ``` ```python data.iloc[(0,1)] ``` ``` ## 0.519055710819791 ``` ```python data.iloc[[0,1]] ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ``` ] ] .pull-right[ .small[ ```python data.iloc[:,0] ``` ``` ## r1 r2 ## m l 0.019149 ## p 0.219535 ## n l 0.020447 ## p 0.432398 ## Name: (A, x), dtype: float64 ``` ```python data.iloc[0,1] ``` ``` ## 0.519055710819791 ``` ```python data.iloc[0,[0,1]] ``` ``` ## c1 c2 ## A x 0.019149 ## y 0.519056 ## Name: (m, l), dtype: float64 ``` ] ] .footnote[Note that tuples and lists are not treated the same by pandas when it comes to indexing] --- ## MultiIndex indexing via `loc` .small[ ```python data ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ] .pull-left[ .small[ ```python data.loc["m"] ``` ``` ## c1 A B ## c2 x y x y ## r2 ## l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ``` ```python data.loc["l"] ``` ``` ## KeyError: 'l' ``` ```python data.loc[:,"A"] ``` ``` ## c2 x y ## r1 r2 ## m l 0.019149 0.519056 ## p 0.219535 0.537471 ## n l 0.020447 0.817611 ## p 0.432398 0.854118 ``` ] ] .pull-right[ .small[ ```python data.loc[("m","l")] ``` ``` ## c1 c2 ## A x 0.019149 ## y 0.519056 ## B x 0.924092 ## y 0.996320 ## Name: (m, l), dtype: float64 ``` ```python data.loc[:,("A","y")] ``` ``` ## r1 r2 ## m l 0.519056 ## p 0.537471 ## n l 0.817611 ## p 0.854118 ## Name: (A, y), dtype: float64 ``` ] ] --- ## Fancier indexing with `loc` Index slices can also be used with combinations of indexes and index tuples, .small[ ```python data ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ] .pull-left[ .small[ ```python data.loc["m":"n"] ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ```python data.loc[("m","l"):("n","l")] ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ``` ] ] .pull-right[ .small[ ```python data.loc[("m","p"):"n"] ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ```python data.loc[[("m","p"),("n","l")]] ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ``` ] ] --- ## Selecting nested levels The previous methods don't give easy access to indexing on nested index levels, this is possible via the cross-section method `xs()`, .small[ ```python data ``` ``` ## c1 A B ## c2 x y x y ## r1 r2 ## m l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ## n l 0.020447 0.817611 0.493241 0.632190 ## p 0.432398 0.854118 0.774252 0.838321 ``` ] .pull-left[ .small[ ```python data.xs("p", level="r2") ``` ``` ## c1 A B ## c2 x y x y ## r1 ## m 0.219535 0.537471 0.962619 0.968074 ## n 0.432398 0.854118 0.774252 0.838321 ``` ```python data.xs("m", level="r1") ``` ``` ## c1 A B ## c2 x y x y ## r2 ## l 0.019149 0.519056 0.924092 0.996320 ## p 0.219535 0.537471 0.962619 0.968074 ``` ] ] .pull-right[ .small[ ```python data.xs("y", level="c2", axis=1) ``` ``` ## c1 A B ## r1 r2 ## m l 0.519056 0.996320 ## p 0.537471 0.968074 ## n l 0.817611 0.632190 ## p 0.854118 0.838321 ``` ```python data.xs("B", level="c1", axis=1) ``` ``` ## c2 x y ## r1 r2 ## m l 0.924092 0.996320 ## p 0.962619 0.968074 ## n l 0.493241 0.632190 ## p 0.774252 0.838321 ``` ] ] --- ## Setting MultiIndexes It is also possible to construct a MultiIndex or modify an existing one using `set_index()` and `reset_index()`, ```python data ``` ``` ## a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` .pull-left[ ```python data.set_index(['a','b']) ``` ``` ## c d ## a b ## bar one z 1 ## two y 2 ## foo one x 3 ## two w 4 ``` ```python data.set_index('c', append=True) ``` ``` ## a b d ## c ## 0 z bar one 1 ## 1 y bar two 2 ## 2 x foo one 3 ## 3 w foo two 4 ``` ] .pull-right[ ```python data.set_index(['a','b']).reset_index() ``` ``` ## a b c d ## 0 bar one z 1 ## 1 bar two y 2 ## 2 foo one x 3 ## 3 foo two w 4 ``` ```python data.set_index(['a','b']).reset_index(level=1) ``` ``` ## b c d ## a ## bar one z 1 ## bar two y 2 ## foo one x 3 ## foo two w 4 ``` ] --- class: center, middle # Reshaping data --- ## Long to wide (pivot) .pull-left[ ```python df ``` ``` ## country year type count ## 0 A 1999 cases 0.7K ## 1 A 1999 pop 19M ## 2 A 2000 cases 2K ## 3 A 2000 pop 20M ## 4 B 1999 cases 37K ## 5 B 1999 pop 172M ## 6 B 2000 cases 80K ## 7 B 2000 pop 174M ## 8 C 1999 cases 212K ## 9 C 1999 pop 1T ## 10 C 2000 cases 213K ## 11 C 2000 pop 1T ``` ] -- .pull-right[ ```python df_wide = df.pivot( index=["country","year"], columns="type", values="count" ) df_wide ``` ``` ## type cases pop ## country year ## A 1999 0.7K 19M ## 2000 2K 20M ## B 1999 37K 172M ## 2000 80K 174M ## C 1999 212K 1T ## 2000 213K 1T ``` ] -- <div> .pull-left[ .small[ ```python df_wide.index ``` ``` ## MultiIndex([('A', 1999), ## ('A', 2000), ## ('B', 1999), ## ('B', 2000), ## ('C', 1999), ## ('C', 2000)], ## names=['country', 'year']) ``` ```python df_wide.columns ``` ``` ## Index(['cases', 'pop'], dtype='object', name='type') ``` ] ] -- .pull-right[ ```python df_wide.reset_index().rename_axis(columns=None) ``` ``` ## country year cases pop ## 0 A 1999 0.7K 19M ## 1 A 2000 2K 20M ## 2 B 1999 37K 172M ## 3 B 2000 80K 174M ## 4 C 1999 212K 1T ## 5 C 2000 213K 1T ``` ] </div> --- ## Wide to long (melt) .pull-left[ ```python df ``` ``` ## country 1999 2000 ## 0 A 0.7K 2K ## 1 B 37K 80K ## 2 C 212K 213K ``` ] -- .pull-right[ ```python df_long = df.melt( id_vars="country", var_name="year" ) df_long ``` ``` ## country year value ## 0 A 1999 0.7K ## 1 B 1999 37K ## 2 C 1999 212K ## 3 A 2000 2K ## 4 B 2000 80K ## 5 C 2000 213K ``` ] --- ## Separate Example - splits and explosions .pull-left[ .small[ ```python df ``` ``` ## country year rate ## 0 A 1999 0.7K/19M ## 1 A 2000 2K/20M ## 2 B 1999 37K/172M ## 3 B 2000 80K/174M ## 4 C 1999 212K/1T ## 5 C 2000 213K/1T ``` ] ] -- .pull-right[ .small[ ```python df.assign( rate = lambda d: d.rate.str.split("/") ) ``` ``` ## country year rate ## 0 A 1999 [0.7K, 19M] ## 1 A 2000 [2K, 20M] ## 2 B 1999 [37K, 172M] ## 3 B 2000 [80K, 174M] ## 4 C 1999 [212K, 1T] ## 5 C 2000 [213K, 1T] ``` ] ] -- <div> .pull-left[ .small[ ```python ( df .assign( rate = lambda d: d.rate.str.split("/") ) .explode("rate") .assign( type = lambda d: ["cases", "pop"] * int(d.shape[0]/2) ) ) ``` ``` ## country year rate type ## 0 A 1999 0.7K cases ## 0 A 1999 19M pop ## 1 A 2000 2K cases ## 1 A 2000 20M pop ## 2 B 1999 37K cases ## 2 B 1999 172M pop ## 3 B 2000 80K cases ## 3 B 2000 174M pop ## 4 C 1999 212K cases ## 4 C 1999 1T pop ## 5 C 2000 213K cases ## 5 C 2000 1T pop ``` ] ] -- .pull-right[ .small[ ```python ( df .assign( rate = lambda d: d.rate.str.split("/") ) .explode("rate") .assign( type = lambda d: ["cases", "pop"] * int(d.shape[0]/2) ) .pivot(index=["country","year"], columns="type", values="rate") .reset_index() ) ``` ``` ## type country year cases pop ## 0 A 1999 0.7K 19M ## 1 A 2000 2K 20M ## 2 B 1999 37K 172M ## 3 B 2000 80K 174M ## 4 C 1999 212K 1T ## 5 C 2000 213K 1T ``` ] ] </div> --- ## Separate Example - A better way .pull-left[ ```python df ``` ``` ## country year rate ## 0 A 1999 0.7K/19M ## 1 A 2000 2K/20M ## 2 B 1999 37K/172M ## 3 B 2000 80K/174M ## 4 C 1999 212K/1T ## 5 C 2000 213K/1T ``` ] -- .pull-right[ ```python df.assign( counts = lambda d: d.rate.str.split("/").str[0], pop = lambda d: d.rate.str.split("/").str[1] ) ``` ``` ## country year rate counts pop ## 0 A 1999 0.7K/19M 0.7K 19M ## 1 A 2000 2K/20M 2K 20M ## 2 B 1999 37K/172M 37K 172M ## 3 B 2000 80K/174M 80K 174M ## 4 C 1999 212K/1T 212K 1T ## 5 C 2000 213K/1T 213K 1T ``` ] -- If you dont want to repeat the split, <div> .pull-left[ ```python df.assign( rate = lambda d: d.rate.str.split("/"), counts = lambda d: d.rate.str[0], pop = lambda d: d.rate.str[1] ).drop("rate", axis=1) ``` ``` ## country year counts pop ## 0 A 1999 0.7K 19M ## 1 A 2000 2K 20M ## 2 B 1999 37K 172M ## 3 B 2000 80K 174M ## 4 C 1999 212K 1T ## 5 C 2000 213K 1T ``` ] </div> --- ## Exercise 1 Create a DataFrame from the data available at https://sta663-sp22.github.io/slides/data/us_rent.csv using `pd.read_csv()`. These data come from the 2017 American Community Survey and reflect the following values: * `name` - name of state * `variable` - Variable name: income = median yearly income, rent = median monthly rent * `estimate` - Estimated value * `moe` - 90% margin of error Using these data find the state(s) with the lowest income to rent ratio. --- class: center, middle ## Split-Apply-Combine --- ## groupby Groups can be created within a DataFrame via `groupby()` - these groups are then used by the standard summary methods (e.g. `sum()`, `mean()`, `std()`, etc.). .small[ ```python cereal = pd.read_csv("https://sta663-sp22.github.io/slides/data/cereal.csv") cereal ``` ``` ## name mfr ... sugars rating ## 0 100% Bran Nabisco ... 6 68.402973 ## 1 100% Natural Bran Quaker Oats ... 8 33.983679 ## 2 All-Bran Kellogg's ... 5 59.425505 ## 3 All-Bran with Extra Fiber Kellogg's ... 0 93.704912 ## 4 Almond Delight Ralston Purina ... 8 34.384843 ## .. ... ... ... ... ... ## 72 Triples General Mills ... 3 39.106174 ## 73 Trix General Mills ... 12 27.753301 ## 74 Wheat Chex Ralston Purina ... 3 49.787445 ## 75 Wheaties General Mills ... 3 51.592193 ## 76 Wheaties Honey Gold General Mills ... 8 36.187559 ## ## [77 rows x 6 columns] ``` ```python cereal.groupby("type") ``` ``` ## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x143e2a460> ``` ] -- .pull-left[ .small[ ```python cereal.groupby("type").groups ``` ``` ## {'Cold': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76], 'Hot': [20, 43, 57]} ``` ```python cereal.groupby("type").mean() ``` ``` ## calories sugars rating ## type ## Cold 107.162162 7.175676 42.095218 ## Hot 100.000000 1.333333 56.737708 ``` ] ] -- .pull-right[ .small[ ```python cereal.groupby("mfr").groups ``` ``` ## {'General Mills': [5, 7, 11, 12, 13, 14, 18, 22, 31, 36, 40, 42, 47, 51, 59, 69, 70, 71, 72, 73, 75, 76], 'Kellogg's': [2, 3, 6, 16, 17, 19, 21, 24, 25, 26, 28, 38, 39, 46, 48, 49, 50, 53, 58, 60, 62, 66, 67], 'Maltex': [43], 'Nabisco': [0, 20, 63, 64, 65, 68], 'Post': [9, 27, 29, 30, 32, 33, 34, 37, 52], 'Quaker Oats': [1, 10, 35, 41, 54, 55, 56, 57], 'Ralston Purina': [4, 8, 15, 23, 44, 45, 61, 74]} ``` ```python cereal.groupby("mfr").size() ``` ``` ## mfr ## General Mills 22 ## Kellogg's 23 ## Maltex 1 ## Nabisco 6 ## Post 9 ## Quaker Oats 8 ## Ralston Purina 8 ## dtype: int64 ``` ] ] --- ## Selecting and iterating groups Groups can be accessed via `get_group()` or the DataFrameGroupBy can be iterated over, .pull-left[ .small[ ```python cereal.groupby("type").get_group("Hot") ``` ``` ## name mfr type calories sugars rating ## 20 Cream of Wheat (Quick) Nabisco Hot 100 0 64.533816 ## 43 Maypo Maltex Hot 100 3 54.850917 ## 57 Quaker Oatmeal Quaker Oats Hot 100 1 50.828392 ``` ```python cereal.groupby("mfr").get_group("Post") ``` ``` ## name mfr ... sugars rating ## 9 Bran Flakes Post ... 5 53.313813 ## 27 Fruit & Fibre Dates; Walnuts; and Oats Post ... 10 40.917047 ## 29 Fruity Pebbles Post ... 12 28.025765 ## 30 Golden Crisp Post ... 15 35.252444 ## 32 Grape Nuts Flakes Post ... 5 52.076897 ## 33 Grape-Nuts Post ... 3 53.371007 ## 34 Great Grains Pecan Post ... 4 45.811716 ## 37 Honey-comb Post ... 11 28.742414 ## 52 Post Nat. Raisin Bran Post ... 14 37.840594 ## ## [9 rows x 6 columns] ``` ] ] -- .pull-right[ .small[ ```python for name, group in cereal.groupby("type"): print(name) print(group) print("") ``` ``` ## Cold ## name mfr ... sugars rating ## 0 100% Bran Nabisco ... 6 68.402973 ## 1 100% Natural Bran Quaker Oats ... 8 33.983679 ## 2 All-Bran Kellogg's ... 5 59.425505 ## 3 All-Bran with Extra Fiber Kellogg's ... 0 93.704912 ## 4 Almond Delight Ralston Purina ... 8 34.384843 ## .. ... ... ... ... ... ## 72 Triples General Mills ... 3 39.106174 ## 73 Trix General Mills ... 12 27.753301 ## 74 Wheat Chex Ralston Purina ... 3 49.787445 ## 75 Wheaties General Mills ... 3 51.592193 ## 76 Wheaties Honey Gold General Mills ... 8 36.187559 ## ## [74 rows x 6 columns] ## ## Hot ## name mfr type calories sugars rating ## 20 Cream of Wheat (Quick) Nabisco Hot 100 0 64.533816 ## 43 Maypo Maltex Hot 100 3 54.850917 ## 57 Quaker Oatmeal Quaker Oats Hot 100 1 50.828392 ``` ] ] --- ## Aggregation The `aggregate()` function or `agg()` method can be used to compute summary statistics for each group, .pull-left[ .small[ ```python cereal.groupby("mfr").agg("mean") ``` ``` ## calories sugars rating ## mfr ## General Mills 111.363636 7.954545 34.485852 ## Kellogg's 108.695652 7.565217 44.038462 ## Maltex 100.000000 3.000000 54.850917 ## Nabisco 86.666667 1.833333 67.968567 ## Post 108.888889 8.777778 41.705744 ## Quaker Oats 95.000000 5.500000 42.915990 ## Ralston Purina 115.000000 6.125000 41.542997 ``` ```python cereal.groupby("mfr").agg([np.mean, np.std]) ``` ``` ## calories sugars rating ## mean std mean std mean std ## mfr ## General Mills 111.363636 10.371873 7.954545 3.872704 34.485852 8.946704 ## Kellogg's 108.695652 22.218818 7.565217 4.500768 44.038462 14.457434 ## Maltex 100.000000 NaN 3.000000 NaN 54.850917 NaN ## Nabisco 86.666667 10.327956 1.833333 2.857738 67.968567 5.509326 ## Post 108.888889 10.540926 8.777778 4.576510 41.705744 10.047647 ## Quaker Oats 95.000000 29.277002 5.500000 4.780914 42.915990 16.797673 ## Ralston Purina 115.000000 22.677868 6.125000 3.563205 41.542997 6.080841 ## ## <string>:1: FutureWarning: ['name', 'type'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning. ``` ] ] .pull-right[ .small[ ```python cereal.groupby("mfr").agg({ "calories": ['min', 'max'], "sugars": ['mean', 'median'], "rating": ['sum', 'count'] }) ``` ``` ## calories sugars rating ## min max mean median sum count ## mfr ## General Mills 100 140 7.954545 8.5 758.688737 22 ## Kellogg's 50 160 7.565217 7.0 1012.884634 23 ## Maltex 100 100 3.000000 3.0 54.850917 1 ## Nabisco 70 100 1.833333 0.0 407.811403 6 ## Post 90 120 8.777778 10.0 375.351697 9 ## Quaker Oats 50 120 5.500000 6.0 343.327919 8 ## Ralston Purina 90 150 6.125000 5.5 332.343977 8 ``` ] ] .footnote[ Think `summarize()` from dplyr. ] --- ## Named aggregation It is also possible to use special syntax to aggregate specific columns into a named output column, ```python cereal.groupby("mfr", as_index=False).agg( min_cal = ("calories", "min"), max_cal = ("calories", "max"), med_sugar = ("sugars", "median"), avg_rating = ("rating", "mean") ) ``` ``` ## mfr min_cal max_cal med_sugar avg_rating ## 0 General Mills 100 140 8.5 34.485852 ## 1 Kellogg's 50 160 7.0 44.038462 ## 2 Maltex 100 100 3.0 54.850917 ## 3 Nabisco 70 100 0.0 67.968567 ## 4 Post 90 120 10.0 41.705744 ## 5 Quaker Oats 50 120 6.0 42.915990 ## 6 Ralston Purina 90 150 5.5 41.542997 ``` .footnote[Tuples can also be passed using `pd.NamedAgg()` but this offers no additional functionality.] --- ## Transformation The `transform()` method returns a DataFrame with the aggregated result matching the size (or length 1) of the input group(s), .pull-left[ ```python cereal.groupby("mfr").transform(np.mean) ``` ``` ## calories sugars rating ## 0 86.666667 1.833333 67.968567 ## 1 95.000000 5.500000 42.915990 ## 2 108.695652 7.565217 44.038462 ## 3 108.695652 7.565217 44.038462 ## 4 115.000000 6.125000 41.542997 ## .. ... ... ... ## 72 111.363636 7.954545 34.485852 ## 73 111.363636 7.954545 34.485852 ## 74 115.000000 6.125000 41.542997 ## 75 111.363636 7.954545 34.485852 ## 76 111.363636 7.954545 34.485852 ## ## [77 rows x 3 columns] ## ## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function. ``` ] .pull-right[ ```python cereal.groupby("type").transform("mean") ``` ``` ## calories sugars rating ## 0 107.162162 7.175676 42.095218 ## 1 107.162162 7.175676 42.095218 ## 2 107.162162 7.175676 42.095218 ## 3 107.162162 7.175676 42.095218 ## 4 107.162162 7.175676 42.095218 ## .. ... ... ... ## 72 107.162162 7.175676 42.095218 ## 73 107.162162 7.175676 42.095218 ## 74 107.162162 7.175676 42.095218 ## 75 107.162162 7.175676 42.095218 ## 76 107.162162 7.175676 42.095218 ## ## [77 rows x 3 columns] ## ## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function. ``` ] .footnote[Note that we have lost the non-numeric columns] --- ## Practical transformation `transform()` will generally be most useful via a user defined function, the lambda argument is each column of each group. ```python ( cereal .groupby("mfr") .transform( lambda x: (x - np.mean(x))/np.std(x) ) ) ``` ``` ## calories sugars rating ## 0 -1.767767 1.597191 0.086375 ## 1 0.912871 0.559017 -0.568474 ## 2 -1.780712 -0.582760 1.088220 ## 3 -2.701081 -1.718649 3.512566 ## 4 -0.235702 0.562544 -1.258442 ## .. ... ... ... ## 72 -0.134568 -1.309457 0.528580 ## 73 -0.134568 1.069190 -0.770226 ## 74 -0.707107 -0.937573 1.449419 ## 75 -1.121403 -1.309457 1.957022 ## 76 -0.134568 0.012013 0.194681 ## ## [77 rows x 3 columns] ## ## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function. ``` .footnote[ Above we are standardizing each numerical column of each manufacturer ] --- ## Filtering groups `filter()` also respects groups and allows for the inclusion / exclusion of groups based on user specified criteria, .small[ ```python cereal.groupby("mfr").size() ``` ``` ## mfr ## General Mills 22 ## Kellogg's 23 ## Maltex 1 ## Nabisco 6 ## Post 9 ## Quaker Oats 8 ## Ralston Purina 8 ## dtype: int64 ``` ] -- .pull-left[ .small[ ```python cereal.groupby("mfr").filter(lambda x: len(x) > 10) ``` ``` ## name mfr ... sugars rating ## 2 All-Bran Kellogg's ... 5 59.425505 ## 3 All-Bran with Extra Fiber Kellogg's ... 0 93.704912 ## 5 Apple Cinnamon Cheerios General Mills ... 10 29.509541 ## 6 Apple Jacks Kellogg's ... 14 33.174094 ## 7 Basic 4 General Mills ... 8 37.038562 ## 11 Cheerios General Mills ... 1 50.764999 ## 12 Cinnamon Toast Crunch General Mills ... 9 19.823573 ## 13 Clusters General Mills ... 7 40.400208 ## 14 Cocoa Puffs General Mills ... 13 22.736446 ## 16 Corn Flakes Kellogg's ... 2 45.863324 ## 17 Corn Pops Kellogg's ... 12 35.782791 ## 18 Count Chocula General Mills ... 13 22.396513 ## 19 Cracklin' Oat Bran Kellogg's ... 7 40.448772 ## 21 Crispix Kellogg's ... 3 46.895644 ## 22 Crispy Wheat & Raisins General Mills ... 10 36.176196 ## 24 Froot Loops Kellogg's ... 13 32.207582 ## 25 Frosted Flakes Kellogg's ... 11 31.435973 ## 26 Frosted Mini-Wheats Kellogg's ... 7 58.345141 ## 28 Fruitful Bran Kellogg's ... 12 41.015492 ## 31 Golden Grahams General Mills ... 9 23.804043 ## 36 Honey Nut Cheerios General Mills ... 10 31.072217 ## 38 Just Right Crunchy Nuggets Kellogg's ... 6 36.523683 ## 39 Just Right Fruit & Nut Kellogg's ... 9 36.471512 ## 40 Kix General Mills ... 3 39.241114 ## 42 Lucky Charms General Mills ... 12 26.734515 ## 46 Mueslix Crispy Blend Kellogg's ... 13 30.313351 ## 47 Multi-Grain Cheerios General Mills ... 6 40.105965 ## 48 Nut&Honey Crunch Kellogg's ... 9 29.924285 ## 49 Nutri-Grain Almond-Raisin Kellogg's ... 7 40.692320 ## 50 Nutri-grain Wheat Kellogg's ... 2 59.642837 ## 51 Oatmeal Raisin Crisp General Mills ... 10 30.450843 ## 53 Product 19 Kellogg's ... 3 41.503540 ## 58 Raisin Bran Kellogg's ... 12 39.259197 ## 59 Raisin Nut Bran General Mills ... 8 39.703400 ## 60 Raisin Squares Kellogg's ... 6 55.333142 ## 62 Rice Krispies Kellogg's ... 3 40.560159 ## 66 Smacks Kellogg's ... 15 31.230054 ## 67 Special K Kellogg's ... 3 53.131324 ## 69 Total Corn Flakes General Mills ... 3 38.839746 ## 70 Total Raisin Bran General Mills ... 14 28.592785 ## 71 Total Whole Grain General Mills ... 3 46.658844 ## 72 Triples General Mills ... 3 39.106174 ## 73 Trix General Mills ... 12 27.753301 ## 75 Wheaties General Mills ... 3 51.592193 ## 76 Wheaties Honey Gold General Mills ... 8 36.187559 ## ## [45 rows x 6 columns] ``` ] ] .pull-right[ .small[ ```python ( cereal .groupby("mfr") .filter(lambda x: len(x) > 10) .groupby("mfr") .size() ) ``` ``` ## mfr ## General Mills 22 ## Kellogg's 23 ## dtype: int64 ``` ] ]