Table of Contents
- 1. Building Criteria
- 2. Selection
- 3. MultiIndexing
- 4. Grouping
- 5. MultiIndex / Advanced Indexing
- 6. Panels
- 7. xarray
- 7.1. Overview: Why xarray?
- 7.2. Examples
- 7.3. Installation
- 7.4. Data Structures
- 7.5. Indexing and selecting data
- 7.6. Computation
- 7.7. GroupBy: split-apply-combine
- 7.8. Reshaping and reorganizing data
- 7.9. Combining data
- 7.10. Time series data
- 7.11. Working with pandas
- 7.12. Serialization and IO
- 7.13. Out of core computation with dask
- 7.14. Plotting
- 7.15. API reference
- 7.16. xarray Internals
1 Building Criteria
- Dynamically reduce a list of criteria using a binary operators
In [19]: df = pd.DataFrame( ....: {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df ....: Out[19]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [20]: Crit1 = df.AAA <= 5.5 In [21]: Crit2 = df.BBB == 10.0 In [22]: Crit3 = df.CCC > -40.0 In [24]: CritList = [Crit1,Crit2,Crit3] In [25]: AllCrit = functools.reduce(lambda x,y: x & y, CritList) In [26]: df[AllCrit] Out[26]: AAA BBB CCC 0 4 10 100
2 Selection
2.1 Using both row labels and value conditionals
In [27]: df = pd.DataFrame( ....: {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df ....: Out[27]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [28]: df[(df.AAA <= 6) & (df.index.isin([0,2,4]))] Out[28]: AAA BBB CCC 0 4 10 100 2 6 30 -30
There are 2 explicit slicing methods, with a third general case
Positional-oriented (Python slicing style : exclusive of end) Label-oriented (Non-Python slicing style : inclusive of end) General (Either slicing style : depends on if the slice contains labels or positions)
In [35]: df2.iloc[1:3] #Position-oriented Out[35]: AAA BBB CCC 2 5 20 50 3 6 30 -30 In [36]: df2.loc[1:3] #Label-oriented Out[36]: AAA BBB CCC 1 4 10 100 2 5 20 50 3 6 30 -30
-New Columns Efficiently and dynamically creating new columns using applymap.
In [48]: df = pd.DataFrame( ....: {'AAA' : [1,2,1,3], 'BBB' : [1,1,2,2], 'CCC' : [2,1,3,1]}); df ....: Out[48]: AAA BBB CCC 0 1 1 2 1 2 1 1 2 1 2 3 3 3 2 1 In [49]: source_cols = df.columns # or some subset would work too. In [50]: new_cols = [str(x) + "_cat" for x in source_cols] In [51]: categories = {1 : 'Alpha', 2 : 'Beta', 3 : 'Charlie' } In [52]: df[new_cols] = df[source_cols].applymap(categories.get);df Out[52]: AAA BBB CCC AAA_cat BBB_cat CCC_cat 0 1 1 2 Alpha Alpha Beta 1 2 1 1 Beta Alpha Alpha 2 1 2 3 Alpha Beta Charlie 3 3 2 1 Charlie Beta Alpha
- append a new row:
df_analysis = df_analysis.append( { 'datetime': datetime, 'score': score }, ignore_index=True)
3 MultiIndexing
Creating a multi-index from a labeled frame
In [56]: df = pd.DataFrame({'row' : [0,1,2], ....: 'One_X' : [1.1,1.1,1.1], ....: 'One_Y' : [1.2,1.2,1.2], ....: 'Two_X' : [1.11,1.11,1.11], ....: 'Two_Y' : [1.22,1.22,1.22]}); df ....: Out[56]: One_X One_Y Two_X Two_Y row 0 1.1 1.2 1.11 1.22 0 1 1.1 1.2 1.11 1.22 1 2 1.1 1.2 1.11 1.22 2 # As Labelled Index In [57]: df = df.set_index('row');df Out[57]: One_X One_Y Two_X Two_Y row 0 1.1 1.2 1.11 1.22 1 1.1 1.2 1.11 1.22 2 1.1 1.2 1.11 1.22 # With Hierarchical Columns In [58]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns]);df Out[58]: One Two X Y X Y row 0 1.1 1.2 1.11 1.22 1 1.1 1.2 1.11 1.22 2 1.1 1.2 1.11 1.22 # Now stack & Reset In [59]: df = df.stack(0).reset_index(1);df Out[59]: level_1 X Y row 0 One 1.10 1.20 0 Two 1.11 1.22 1 One 1.10 1.20 1 Two 1.11 1.22 2 One 1.10 1.20 2 Two 1.11 1.22 # And fix the labels (Notice the label 'level_1' got added automatically) In [60]: df.columns = ['Sample','All_X','All_Y'];df Out[60]: Sample All_X All_Y row 0 One 1.10 1.20 0 Two 1.11 1.22 1 One 1.10 1.20 1 Two 1.11 1.22 2 One 1.10 1.20 2 Two 1.11 1.22
- Slicing
Slicing a multi-index with xs
In [64]: coords = [('AA','one'),('AA','six'),('BB','one'),('BB','two'),('BB','six')] In [65]: index = pd.MultiIndex.from_tuples(coords) In [66]: df = pd.DataFrame([11,22,33,44,55],index,['MyData']); df Out[66]: MyData AA one 11 six 22 BB one 33 two 44 six 55
To take the cross section of the 1st level and 1st axis the index:
In [67]: df.xs('BB',level=0,axis=0) #Note : level and axis are optional, and default to zero Out[67]: MyData one 33 two 44 six 55
Slicing a multi-index with xs, method #2
In [69]: index = list(itertools.product(['Ada','Quinn','Violet'],['Comp','Math','Sci'])) In [70]: headr = list(itertools.product(['Exams','Labs'],['I','II'])) In [71]: indx = pd.MultiIndex.from_tuples(index,names=['Student','Course']) In [72]: cols = pd.MultiIndex.from_tuples(headr) #Notice these are un-named In [73]: data = [[70+x+y+(x*y)%3 for x in range(4)] for y in range(9)] In [74]: df = pd.DataFrame(data,indx,cols); df Out[74]: Exams Labs I II I II Student Course Ada Comp 70 71 72 73 Math 71 73 75 74 Sci 72 75 75 75 Quinn Comp 73 74 75 76 Math 74 76 78 77 Sci 75 78 78 78 Violet Comp 76 77 78 79 Math 77 79 81 80 Sci 78 81 81 81 In [75]: All = slice(None) In [76]: df.loc['Violet'] Out[76]: Exams Labs I II I II Course Comp 76 77 78 79 Math 77 79 81 80 Sci 78 81 81 81 In [77]: df.loc[(All,'Math'),All] Out[77]: Exams Labs I II I II Student Course Ada Math 71 73 75 74 Quinn Math 74 76 78 77 Violet Math 77 79 81 80 # get I df.loc[:, (slice(None), 'I')] In [78]: df.loc[(slice('Ada','Quinn'),'Math'),All] Out[78]: Exams Labs I II I II Student Course Ada Math 71 73 75 74 Quinn Math 74 76 78 77 In [79]: df.loc[(All,'Math'),('Exams')] Out[79]: I II Student Course Ada Math 71 73 Quinn Math 74 76 Violet Math 77 79 In [80]: df.loc[(All,'Math'),(All,'II')] Out[80]: Exams Labs II II Student Course Ada Math 73 74 Quinn Math 76 77 Violet Math 79 80
- Sorting
Sort by specific column or an ordered list of columns, with a multi-index.
- pivoting a table to multi-index dataframe:
# get a pivot table, setting industry and symbol for two levels on the column df_pivot_industries_asset_weights = pd.pivot_table( df_industries_asset_weight, values='value', index=['date'], columns=['industry', 'symbol']) # pivot the original dataframe to multi-index dataframe # level 0 value: industry # level 1 value: assets, the order of assets are changed. df_pivot_industries_asset_weights = df_pivot_industries_asset_weights.fillna(0) idx_level_0_value = df_pivot_industries_asset_weights.columns.get_level_values(0) idx_level_0_value = idx_level_0_value.drop_duplicates() idx_level_1_value = df_pivot_industries_asset_weights.columns.get_level_values(1)
4 Grouping
split-apply-combine
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- ##Combining## the results into a data structure
4.1 Splitting an object into groups
# default is axis=0 >>> grouped = obj.groupby(key) >>> grouped = obj.groupby(key, axis=1) >>> grouped = obj.groupby([key1, key2])
5 MultiIndex / Advanced Indexing
5.1 Hierarchical indexing (MultiIndex)
5.1.1 Creating a MultiIndex (hierarchical index) object
A MultiIndex can be created from
- a list of arrays (using MultiIndex.from_arrays)
- an array of tuples (using MultiIndex.from_tuples)
- a crossed set of iterables (using MultiIndex.from_product).
level0 | a | a | b | b |
level1 | aa | ab | bb | ba |
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. The method get_level_values will return a vector of the labels for each location at a particular level:
5.2 Advanced indexing
In [38]: df = df.T In [39]: df Out[39]: A B C first second bar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747 In [40]: df.loc['bar'] Out[40]: A B C second one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 In [41]: df.loc['bar', 'two'] Out[41]: A 0.805244 B 0.813850 C 1.607920 Name: (bar, two), dtype: float64 In [43]: df.loc[('baz', 'two'):('qux', 'one')] Out[43]: A B C first second baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466
5.2.1 Using slicers
In [51]: dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :] Out[51]: lvl0 a b lvl1 bar foo bah foo A1 B0 C1 D0 73 72 75 74 D1 77 76 79 78 C3 D0 89 88 91 90 D1 93 92 95 94 B1 C1 D0 105 104 107 106 D1 109 108 111 110 C3 D0 121 120 123 122 ... ... ... ... ... A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254 [24 rows x 4 columns]
5.2.2 group by
5.2.3 pivoting
5.2.4 reshaping data
6 Panels
- Extend a panel frame by transposing, adding a new dimension, and transposing back to the original dimensions.
In [43]: pf = pd.Panel({'df1':df1,'df2':df2,'df3':df3});pf Out[43]: <class 'pandas.core.panel.Panel'> Dimensions: 3 (items) x 100 (major_axis) x 4 (minor_axis) Items axis: df1 to df3 Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00 Minor_axis axis: A to D
- Create panels from dictionary:
panel_model = pd.Panel({target_date: get_factor_exposure(model, factors, target_date, asset_weight.columns).T for target_date in asset_weight.index})
- panel multipying:
datetimeindex = multiplicand.index.intersection(multiplier_panel.items) product = pd.DataFrame(data=np.nan, index=datetimeindex, columns=multiplier_panel.major_axis) for target_date in datetimeindex: product.ix[target_date] = multiplier_panel[target_date].loc[:,multiplicand.columns].fillna(0).dot(multiplicand.ix[target_date].fillna(0))
- selcting&slicing:
usually: items: datetimeindex
Major_axis: factors
Minor_axis: symbols
panel.loc[datetime] panel.major_xs(panel.major_axis[2:5]) panel.minor_axis panel.minor_xs(['A', 'B', 'C'])
7 xarray
7.1 Overview: Why xarray?
7.1.1 Features
Adding dimensions names and coordinate indexes to numpy’s ndarray.
- Apply operations over dimensions by name: x.sum('time').
- Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
- Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
- Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
- Database like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
- Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
7.1.2 Core data structures
xarray has two core data structures. Both are fundamentally N-dimensional:
- DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.
7.1.3 Goals and aspirations
pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in.
xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data.
Importantly, xarray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.
7.2 Examples
7.2.1 Quick overview
import xarray as xr
- Create a DataArray
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
In [4]: xr.DataArray(np.random.randn(2, 3)) Out[4]: <xarray.DataArray (dim_0: 2, dim_1: 3)> array([[ 1.643563, -1.469388, 0.357021], [-0.6746 , -1.776904, -0.968914]]) Dimensions without coordinates: dim_0, dim_1 In [5]: data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y')) In [6]: data Out[6]: <xarray.DataArray (x: 2, y: 3)> array([[-1.294524, 0.413738, 0.276662], [-0.472035, -0.01396 , -0.362543]]) Coordinates: * x (x) <U1 'a' 'b' Dimensions without coordinates: y
the key properties for a DataArray:
# like in pandas, values is a numpy array that you can modify in-place In [8]: data.values Out[8]: array([[-1.295, 0.414, 0.277], [-0.472, -0.014, -0.363]]) In [9]: data.dims Out[9]: ('x', 'y') In [10]: data.coords Out[10]: Coordinates: * x (x) <U1 'a' 'b' # you can use this dictionary to store arbitrary metadata In [11]: data.attrs Out[11]: OrderedDict()
- Indexing
# positional and by integer label, like numpy In [12]: data[[0, 1]] Out[12]: <xarray.DataArray (x: 2, y: 3)> array([[-1.294524, 0.413738, 0.276662], [-0.472035, -0.01396 , -0.362543]]) Coordinates: * x (x) <U1 'a' 'b' Dimensions without coordinates: y # positional and by coordinate label, like pandas In [13]: data.loc['a':'b'] Out[13]: <xarray.DataArray (x: 2, y: 3)> array([[-1.294524, 0.413738, 0.276662], [-0.472035, -0.01396 , -0.362543]]) Coordinates: * x (x) <U1 'a' 'b' Dimensions without coordinates: y # by dimension name and integer label In [14]: data.isel(x=slice(2)) Out[14]: <xarray.DataArray (x: 2, y: 3)> array([[-1.294524, 0.413738, 0.276662], [-0.472035, -0.01396 , -0.362543]]) Coordinates: * x (x) <U1 'a' 'b' Dimensions without coordinates: y # by dimension name and coordinate label In [15]: data.sel(x=['a', 'b']) Out[15]: <xarray.DataArray (x: 2, y: 3)> array([[-1.294524, 0.413738, 0.276662], [-0.472035, -0.01396 , -0.362543]]) Coordinates: * x (x) <U1 'a' 'b' Dimensions without coordinates: y
- Computation
- GroupBy
In [28]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels') In [29]: labels Out[29]: <xarray.DataArray 'labels' (y: 3)> array(['E', 'F', 'E'], dtype='<U1') Coordinates: * y (y) int64 0 1 2 In [30]: data.groupby(labels).mean('y') Out[30]: <xarray.DataArray (x: 2, labels: 2)> array([[-0.508931, 0.413738], [-0.417289, -0.01396 ]]) Coordinates: * x (x) <U1 'a' 'b' * labels (labels) object 'E' 'F' In [31]: data.groupby(labels).apply(lambda x: x - x.min()) Out[31]: <xarray.DataArray (x: 2, y: 3)> array([[ 0. , 0.427698, 1.571185], [ 0.822489, 0. , 0.931981]]) Coordinates: * x (x) <U1 'a' 'b' * y (y) int64 0 1 2 labels (y) <U1 'E' 'F' 'E'
- pandas
- Datasets
- NetCDF
7.2.2 Toy weather data
7.2.3 Calculating Seasonal Averages from Timeseries of Monthly Means
7.2.5 Recipes
7.3 Installation
7.3.1 For accelerating xarray
bottleneck: speeds up NaN-skipping and rolling window aggregations by a large factor (1.0 or later)
7.3.2 For parallel computing
dask.array (0.9.0 or later): required for Out of core computation with dask.
7.4 Data Structures
7.5 Indexing and selecting data
Dimension lookup | Index lookup | DataArray syntax | Dataset syntax |
Positional | By integer | arr[:, 0] | not available |
Positional | By label | arr.loc[:, 'IA'] | not available |
By name | By integer | arr.isel(space=0) or arr[dict(space=0)] | ds.isel(space=0) or ds[dict(space=0)] |
By name | By label | arr.sel(space='IA') or arr.loc[dict(space='IA')] | ds.sel(space='IA') or ds.loc[dict(space='IA')] |
7.6 Computation
7.6.1 Rolling window operations
7.7 GroupBy: split-apply-combine
7.8 Reshaping and reorganizing data
7.9 Combining data
7.10 Time series data
7.10.1 Datetime indexing
In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24) In [9]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time}) In [10]: ds.sel(time='2000-01') Out[10]: <xarray.Dataset> Dimensions: (time: 744) Coordinates: * time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ... Data variables: foo (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10')) Out[11]: <xarray.Dataset> Dimensions: (time: 240) Coordinates: * time (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ... Data variables: foo (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...