Table of Contents

1 Building Criteria

  • Dynamically reduce a list of criteria using a binary operators
In [19]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ....:
Out[19]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [20]: Crit1 = df.AAA <= 5.5

In [21]: Crit2 = df.BBB == 10.0

In [22]: Crit3 = df.CCC > -40.0
In [24]: CritList = [Crit1,Crit2,Crit3]

In [25]: AllCrit = functools.reduce(lambda x,y: x & y, CritList)

In [26]: df[AllCrit]
Out[26]:
   AAA  BBB  CCC
0    4   10  100

2 Selection

2.1 Using both row labels and value conditionals

In [27]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ....:
Out[27]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [28]: df[(df.AAA <= 6) & (df.index.isin([0,2,4]))]
Out[28]:
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30

There are 2 explicit slicing methods, with a third general case

Positional-oriented (Python slicing style : exclusive of end) Label-oriented (Non-Python slicing style : inclusive of end) General (Either slicing style : depends on if the slice contains labels or positions)

In [35]: df2.iloc[1:3] #Position-oriented
Out[35]:
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

In [36]: df2.loc[1:3] #Label-oriented
Out[36]:
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

-New Columns Efficiently and dynamically creating new columns using applymap.

In [48]: df = pd.DataFrame(
   ....:      {'AAA' : [1,2,1,3], 'BBB' : [1,1,2,2], 'CCC' : [2,1,3,1]}); df
   ....:
Out[48]:
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

In [49]: source_cols = df.columns # or some subset would work too.

In [50]: new_cols = [str(x) + "_cat" for x in source_cols]

In [51]: categories = {1 : 'Alpha', 2 : 'Beta', 3 : 'Charlie' }

In [52]: df[new_cols] = df[source_cols].applymap(categories.get);df
Out[52]:
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha
  • append a new row:
df_analysis = df_analysis.append(
        {
            'datetime': datetime,
            'score': score
        }, ignore_index=True)

3 MultiIndexing

Creating a multi-index from a labeled frame

In [56]: df = pd.DataFrame({'row' : [0,1,2],
   ....:                    'One_X' : [1.1,1.1,1.1],
   ....:                    'One_Y' : [1.2,1.2,1.2],
   ....:                    'Two_X' : [1.11,1.11,1.11],
   ....:                    'Two_Y' : [1.22,1.22,1.22]}); df
   ....:
Out[56]:
   One_X  One_Y  Two_X  Two_Y  row
0    1.1    1.2   1.11   1.22    0
1    1.1    1.2   1.11   1.22    1
2    1.1    1.2   1.11   1.22    2

# As Labelled Index
In [57]: df = df.set_index('row');df
Out[57]:
     One_X  One_Y  Two_X  Two_Y
row
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# With Hierarchical Columns
In [58]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns]);df
Out[58]:
     One        Two
       X    Y     X     Y
row
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22
# Now stack & Reset
In [59]: df = df.stack(0).reset_index(1);df
Out[59]:
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22
# And fix the labels (Notice the label 'level_1' got added automatically)
In [60]: df.columns = ['Sample','All_X','All_Y'];df
Out[60]:
    Sample  All_X  All_Y
row
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22
  • Slicing

Slicing a multi-index with xs

In [64]: coords = [('AA','one'),('AA','six'),('BB','one'),('BB','two'),('BB','six')]

In [65]: index = pd.MultiIndex.from_tuples(coords)

In [66]: df = pd.DataFrame([11,22,33,44,55],index,['MyData']); df
Out[66]:
        MyData
AA one      11
   six      22
BB one      33
   two      44
   six      55

To take the cross section of the 1st level and 1st axis the index:

In [67]: df.xs('BB',level=0,axis=0)  #Note : level and axis are optional, and default to zero
Out[67]:
     MyData
one      33
two      44
six      55

Slicing a multi-index with xs, method #2

In [69]: index = list(itertools.product(['Ada','Quinn','Violet'],['Comp','Math','Sci']))

In [70]: headr = list(itertools.product(['Exams','Labs'],['I','II']))

In [71]: indx = pd.MultiIndex.from_tuples(index,names=['Student','Course'])

In [72]: cols = pd.MultiIndex.from_tuples(headr) #Notice these are un-named

In [73]: data = [[70+x+y+(x*y)%3 for x in range(4)] for y in range(9)]

In [74]: df = pd.DataFrame(data,indx,cols); df
Out[74]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81

In [75]: All = slice(None)

In [76]: df.loc['Violet']
Out[76]:
       Exams     Labs
           I  II    I  II
Course
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81

In [77]: df.loc[(All,'Math'),All]
Out[77]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80

# get I
df.loc[:, (slice(None), 'I')]
In [78]: df.loc[(slice('Ada','Quinn'),'Math'),All]
Out[78]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77

In [79]: df.loc[(All,'Math'),('Exams')]
Out[79]:
                 I  II
Student Course
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79

In [80]: df.loc[(All,'Math'),(All,'II')]
Out[80]:
               Exams Labs
                  II   II
Student Course
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80
  • Sorting

Sort by specific column or an ordered list of columns, with a multi-index.


  • pivoting a table to multi-index dataframe:
# get a pivot table, setting industry and symbol for two levels on the column
df_pivot_industries_asset_weights = pd.pivot_table(
    df_industries_asset_weight, values='value', index=['date'],
    columns=['industry', 'symbol'])

# pivot the original dataframe to multi-index dataframe
# level 0 value: industry
# level 1 value: assets, the order of assets are changed.
df_pivot_industries_asset_weights = df_pivot_industries_asset_weights.fillna(0)
idx_level_0_value = df_pivot_industries_asset_weights.columns.get_level_values(0)
idx_level_0_value = idx_level_0_value.drop_duplicates()
idx_level_1_value = df_pivot_industries_asset_weights.columns.get_level_values(1)

4 Grouping

split-apply-combine

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • ##Combining## the results into a data structure

4.1 Splitting an object into groups

# default is axis=0
>>> grouped = obj.groupby(key)
>>> grouped = obj.groupby(key, axis=1)
>>> grouped = obj.groupby([key1, key2])

5 MultiIndex / Advanced Indexing

5.1 Hierarchical indexing (MultiIndex)

5.1.1 Creating a MultiIndex (hierarchical index) object

A MultiIndex can be created from

  • a list of arrays (using MultiIndex.from_arrays)
  • an array of tuples (using MultiIndex.from_tuples)
  • a crossed set of iterables (using MultiIndex.from_product).
level0 a a b b
level1 aa ab bb ba

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. The method get_level_values will return a vector of the labels for each location at a particular level:

  1. Basic indexing on axis with MultiIndex
    In [26]: df['bar', 'one']
    Out[26]:
    A    0.895717
    B    0.410835
    C   -1.413681
    Name: (bar, one), dtype: float64
    

5.2 Advanced indexing

In [38]: df = df.T

In [39]: df
Out[39]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [40]: df.loc['bar']
Out[40]:
               A         B         C
second
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

In [41]: df.loc['bar', 'two']
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64
In [43]: df.loc[('baz', 'two'):('qux', 'one')]
Out[43]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

5.2.1 Using slicers

In [51]: dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :]
Out[51]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

5.2.2 group by

5.2.3 pivoting

5.2.4 reshaping data

6 Panels

  • Extend a panel frame by transposing, adding a new dimension, and transposing back to the original dimensions.
In [43]: pf = pd.Panel({'df1':df1,'df2':df2,'df3':df3});pf
Out[43]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 4 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to D
  • Create panels from dictionary:
panel_model = pd.Panel({target_date: get_factor_exposure(model, factors, target_date,
                        asset_weight.columns).T for target_date in asset_weight.index})
  • panel multipying:
datetimeindex = multiplicand.index.intersection(multiplier_panel.items)

product = pd.DataFrame(data=np.nan, index=datetimeindex, columns=multiplier_panel.major_axis)
for target_date in datetimeindex:
product.ix[target_date] = multiplier_panel[target_date].loc[:,multiplicand.columns].fillna(0).dot(multiplicand.ix[target_date].fillna(0))
  • selcting&slicing:

usually: items: datetimeindex

Major_axis: factors

Minor_axis: symbols

panel.loc[datetime]
panel.major_xs(panel.major_axis[2:5])

panel.minor_axis

panel.minor_xs(['A', 'B', 'C'])

7 xarray

7.1 Overview: Why xarray?

7.1.1 Features

Adding dimensions names and coordinate indexes to numpy’s ndarray.

  • Apply operations over dimensions by name: x.sum('time').
  • Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
  • Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
  • Database like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.

7.1.2 Core data structures

xarray has two core data structures. Both are fundamentally N-dimensional:

  • DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
  • Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.

The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.

7.1.3 Goals and aspirations

pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in.

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data.

Importantly, xarray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.

7.2 Examples

7.2.1 Quick overview

import xarray as xr
  1. Create a DataArray

    You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:

    In [4]: xr.DataArray(np.random.randn(2, 3))
    Out[4]:
    <xarray.DataArray (dim_0: 2, dim_1: 3)>
    array([[ 1.643563, -1.469388,  0.357021],
           [-0.6746  , -1.776904, -0.968914]])
    Dimensions without coordinates: dim_0, dim_1
    
    In [5]: data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y'))
    
    In [6]: data
    Out[6]:
    <xarray.DataArray (x: 2, y: 3)>
    array([[-1.294524,  0.413738,  0.276662],
           [-0.472035, -0.01396 , -0.362543]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
    Dimensions without coordinates: y
    

    the key properties for a DataArray:

    # like in pandas, values is a numpy array that you can modify in-place
    In [8]: data.values
    Out[8]:
    array([[-1.295,  0.414,  0.277],
           [-0.472, -0.014, -0.363]])
    
    In [9]: data.dims
    Out[9]: ('x', 'y')
    
    In [10]: data.coords
    Out[10]:
    Coordinates:
      * x        (x) <U1 'a' 'b'
    
    # you can use this dictionary to store arbitrary metadata
    In [11]: data.attrs
    Out[11]: OrderedDict()
    
  2. Indexing
    # positional and by integer label, like numpy
    In [12]: data[[0, 1]]
    Out[12]:
    <xarray.DataArray (x: 2, y: 3)>
    array([[-1.294524,  0.413738,  0.276662],
           [-0.472035, -0.01396 , -0.362543]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
    Dimensions without coordinates: y
    
    # positional and by coordinate label, like pandas
    In [13]: data.loc['a':'b']
    Out[13]:
    <xarray.DataArray (x: 2, y: 3)>
    array([[-1.294524,  0.413738,  0.276662],
           [-0.472035, -0.01396 , -0.362543]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
    Dimensions without coordinates: y
    
    # by dimension name and integer label
    In [14]: data.isel(x=slice(2))
    Out[14]:
    <xarray.DataArray (x: 2, y: 3)>
    array([[-1.294524,  0.413738,  0.276662],
           [-0.472035, -0.01396 , -0.362543]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
    Dimensions without coordinates: y
    
    # by dimension name and coordinate label
    In [15]: data.sel(x=['a', 'b'])
    Out[15]:
    <xarray.DataArray (x: 2, y: 3)>
    array([[-1.294524,  0.413738,  0.276662],
           [-0.472035, -0.01396 , -0.362543]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
    Dimensions without coordinates: y
    
  3. Computation
  4. GroupBy
    In [28]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
    
    In [29]: labels
    Out[29]:
    <xarray.DataArray 'labels' (y: 3)>
    array(['E', 'F', 'E'],
          dtype='<U1')
    Coordinates:
      * y        (y) int64 0 1 2
    
    In [30]: data.groupby(labels).mean('y')
    Out[30]:
    <xarray.DataArray (x: 2, labels: 2)>
    array([[-0.508931,  0.413738],
           [-0.417289, -0.01396 ]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
      * labels   (labels) object 'E' 'F'
    
    In [31]: data.groupby(labels).apply(lambda x: x - x.min())
    Out[31]:
    <xarray.DataArray (x: 2, y: 3)>
    array([[ 0.      ,  0.427698,  1.571185],
           [ 0.822489,  0.      ,  0.931981]])
    Coordinates:
      * x        (x) <U1 'a' 'b'
      * y        (y) int64 0 1 2
        labels   (y) <U1 'E' 'F' 'E'
    
  5. pandas
  6. Datasets
  7. NetCDF

7.2.2 Toy weather data

  1. Examine a dataset with pandas and seaborn
  2. Probability of freeze by calendar month
  3. Monthly averaging
  4. Calculate monthly anomalies
  5. Fill missing values with climatology

7.2.3 Calculating Seasonal Averages from Timeseries of Monthly Means

  1. Some calendar information so we can support any netCDF calendar.
  2. A few calendar functions to determine the number of days in each month
  3. Open the Dataset
  4. Now for the heavy lifting:

7.2.4 Working with Multidimensional Coordinates

  1. Plotting
  2. Multidimensional Groupby

7.2.5 Recipes

7.3 Installation

7.3.1 For accelerating xarray

bottleneck: speeds up NaN-skipping and rolling window aggregations by a large factor (1.0 or later)

7.3.2 For parallel computing

dask.array (0.9.0 or later): required for Out of core computation with dask.

7.4 Data Structures

7.5 Indexing and selecting data

Dimension lookup Index lookup DataArray syntax Dataset syntax
Positional By integer arr[:, 0] not available
Positional By label arr.loc[:, 'IA'] not available
By name By integer arr.isel(space=0) or arr[dict(space=0)] ds.isel(space=0) or ds[dict(space=0)]
By name By label arr.sel(space='IA') or arr.loc[dict(space='IA')] ds.sel(space='IA') or ds.loc[dict(space='IA')]

7.6 Computation

7.6.1 Rolling window operations

7.7 GroupBy: split-apply-combine

7.8 Reshaping and reorganizing data

7.9 Combining data

7.10 Time series data

7.10.1 Datetime indexing

In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24)

In [9]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})

In [10]: ds.sel(time='2000-01')
Out[10]:
<xarray.Dataset>
Dimensions:  (time: 744)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
    foo      (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...

In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10'))
Out[11]:
<xarray.Dataset>
Dimensions:  (time: 240)
Coordinates:
  * time     (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ...
Data variables:
    foo      (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...

7.11 Working with pandas

7.12 Serialization and IO

7.13 Out of core computation with dask

7.14 Plotting

7.15 API reference

7.16 xarray Internals