这是一个简短的为新手入门的pandas教程,更高级的用法请查看ccokbook
这里安装numpy, mayplotlit 和 pandas省略.首先导入如下包
1 2 3 4 5
| In [1]: import pandas as pd In [2]: import numpy as np In [3]: import matplotlib.pyplot as plt
|
对象创建
传递一个list来创建一个 Series 对象, pandas会自动创建索引
1 2 3 4 5 6 7 8 9 10 11
| In [4]: s = pd.Series([1,3,5,np.nan, 6, 8]) In [5]: s Out[5]: 0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64
|
传递一个numpy的array创建一个DataFrame对象, 同时以datetime为索引和带label的列
DataFrame 是有多个列的数据表,每个列拥有一个 label,当然,DataFrame 也有索引
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
| In [6]: dates = pd.date_range('20170608', periods=6) In [7]: dates Out[7]: DatetimeIndex(['2017-06-08', '2017-06-09', '2017-06-10', '2017-06-11', '2017-06-12', '2017-06-13'], dtype='datetime64[ns]', freq='D') In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) In [9]: df Out[9]: A B C D 2017-06-08 -0.561773 0.881191 -2.697783 -0.034672 2017-06-09 -0.053409 0.814811 0.294231 -1.753744 2017-06-10 -1.699722 0.971518 -2.592852 1.088473 2017-06-11 -2.411028 0.312229 -1.879164 1.388484 2017-06-12 -0.173929 -0.572149 2.044024 -0.101223 2017-06-13 -0.906777 -0.207889 -0.776134 2.327599
|
传递一个dict来创建一个DataFrame, 每个 dict 的 value 会被转化成一个 Series,
可以认为,DataFrame 是由多个 Series 组成的
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
| In [10]: df2 = pd.DataFrame({ ...: 'A': 1., ...: 'B': pd.Timestamp('20170608'), ...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'), ...: 'D': np.array([3]*4, dtype='int32'), ...: 'E': pd.Categorical(['test', 'train', 'test', 'train']), ...: 'F': 'foo'}) In [11]: df2 Out[11]: A B C D E F 0 1.0 2017-06-08 1.0 3 test foo 1 1.0 2017-06-08 1.0 3 train foo 2 1.0 2017-06-08 1.0 3 test foo 3 1.0 2017-06-08 1.0 3 train foo
|
每列数据的格式用 dtypes 查看
1 2 3 4 5 6 7 8 9
| In [12]: df2.dtypes Out[12]: A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object
|
数据查看
查看指定列的数据
1 2 3 4 5 6 7
| In [15]: df2.A Out[15]: 0 1.0 1 1.0 2 1.0 3 1.0 Name: A, dtype: float64
|
用 head 和 tail 查看顶端和底端的几列, head()
和tail()
函数默认值都是5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
| In [16]: df.head() Out[16]: A B C D 2017-06-08 -0.561773 0.881191 -2.697783 -0.034672 2017-06-09 -0.053409 0.814811 0.294231 -1.753744 2017-06-10 -1.699722 0.971518 -2.592852 1.088473 2017-06-11 -2.411028 0.312229 -1.879164 1.388484 2017-06-12 -0.173929 -0.572149 2.044024 -0.101223 In [17]: df.tail(3) Out[17]: A B C D 2017-06-11 -2.411028 0.312229 -1.879164 1.388484 2017-06-12 -0.173929 -0.572149 2.044024 -0.101223 2017-06-13 -0.906777 -0.207889 -0.776134 2.327599 In [18]: df.head(2) Out[18]: A B C D 2017-06-08 -0.561773 0.881191 -2.697783 -0.034672 2017-06-09 -0.053409 0.814811 0.294231 -1.753744 In [19]: df.tail() Out[19]: A B C D 2017-06-09 -0.053409 0.814811 0.294231 -1.753744 2017-06-10 -1.699722 0.971518 -2.592852 1.088473 2017-06-11 -2.411028 0.312229 -1.879164 1.388484 2017-06-12 -0.173929 -0.572149 2.044024 -0.101223 2017-06-13 -0.906777 -0.207889 -0.776134 2.327599
|
单独查看 index 和 columns 和 数据, DataFrame 内部用 numpy 格式存储数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
| In [20]: df.index Out[20]: DatetimeIndex(['2017-06-08', '2017-06-09', '2017-06-10', '2017-06-11', '2017-06-12', '2017-06-13'], dtype='datetime64[ns]', freq='D') In [21]: df.columns Out[21]: Index(['A', 'B', 'C', 'D'], dtype='object') In [22]: df.values Out[22]: array([[-0.56177253, 0.88119144, -2.6977834 , -0.03467225], [-0.05340873, 0.81481114, 0.29423114, -1.75374372], [-1.69972161, 0.97151768, -2.59285248, 1.08847275], [-2.41102846, 0.3122287 , -1.87916396, 1.38848363], [-0.17392908, -0.57214913, 2.04402398, -0.10122313], [-0.90677665, -0.20788934, -0.77613381, 2.32759919]])
|
describe() 显示数据的概要。
1 2 3 4 5 6 7 8 9 10 11
| In [23]: df.describe() Out[23]: A B C D count 6.000000 6.000000 6.000000 6.000000 mean -0.967773 0.366618 -0.934613 0.485819 std 0.922339 0.639667 1.852466 1.428378 min -2.411028 -0.572149 -2.697783 -1.753744 25% -1.501485 -0.077860 -2.414430 -0.084585 50% -0.734275 0.563520 -1.327649 0.526900 75% -0.270890 0.864596 0.026640 1.313481 max -0.053409 0.971518 2.044024 2.327599
|
和 numpy 一样,可以方便的得到转置,就是行列转换
1 2 3 4 5 6 7
| In [24]: df.T Out[24]: 2017-06-08 2017-06-09 2017-06-10 2017-06-11 2017-06-12 2017-06-13 A -0.561773 -0.053409 -1.699722 -2.411028 -0.173929 -0.906777 B 0.881191 0.814811 0.971518 0.312229 -0.572149 -0.207889 C -2.697783 0.294231 -2.592852 -1.879164 2.044024 -0.776134 D -0.034672 -1.753744 1.088473 1.388484 -0.101223 2.327599
|
对 axis 按照 index 排序(axis=1 是指根据列名来排序, axis=0 是根据行名来排序)
1 2 3 4 5 6 7 8 9
| In [26]: df.sort_index(axis=1, ascending=False) Out[26]: D C B A 2017-06-08 -0.034672 -2.697783 0.881191 -0.561773 2017-06-09 -1.753744 0.294231 0.814811 -0.053409 2017-06-10 1.088473 -2.592852 0.971518 -1.699722 2017-06-11 1.388484 -1.879164 0.312229 -2.411028 2017-06-12 -0.101223 2.044024 -0.572149 -0.173929 2017-06-13 2.327599 -0.776134 -0.207889 -0.906777
|
按值排序
1 2 3 4 5 6 7 8 9
| In [30]: df.sort_values(by='B') Out[30]: A B C D 2017-06-12 -0.173929 -0.572149 2.044024 -0.101223 2017-06-13 -0.906777 -0.207889 -0.776134 2.327599 2017-06-11 -2.411028 0.312229 -1.879164 1.388484 2017-06-09 -0.053409 0.814811 0.294231 -1.753744 2017-06-08 -0.561773 0.881191 -2.697783 -0.034672 2017-06-10 -1.699722 0.971518 -2.592852 1.088473
|
选择
注意: 以下这些对交互式环境很友好,但是作为 production code 请用优化过的 .at, .iat, .loc, .iloc, .ix
等.
获取行/列
从 DataFrame 选择一个列,就得到了 Series
1 2 3 4 5 6 7 8 9
| In [31]: df['A'] Out[31]: 2017-06-08 -0.561773 2017-06-09 -0.053409 2017-06-10 -1.699722 2017-06-11 -2.411028 2017-06-12 -0.173929 2017-06-13 -0.906777 Freq: D, Name: A, dtype: float64
|
使用[] 对行切片
1 2 3 4 5 6
| In [35]: df[0:3] Out[35]: A B C D 2017-06-08 -0.561773 0.881191 -2.697783 -0.034672 2017-06-09 -0.053409 0.814811 0.294231 -1.753744 2017-06-10 -1.699722 0.971518 -2.592852 1.088473
|
通过标签选择
通过时间戳的下标(dates[0] = Timestamp(‘20130101’))来访问
1 2 3 4 5 6 7
| In [36]: df.loc[dates[1]] Out[36]: A -0.053409 B 0.814811 C 0.294231 D -1.753744 Name: 2017-06-09 00:00:00, dtype: float64
|
选择多个标签
1 2 3 4 5 6 7 8 9
| In [38]: df.loc[:,['A', 'B']] Out[38]: A B 2017-06-08 -0.561773 0.881191 2017-06-09 -0.053409 0.814811 2017-06-10 -1.699722 0.971518 2017-06-11 -2.411028 0.312229 2017-06-12 -0.173929 -0.572149 2017-06-13 -0.906777 -0.207889
|
注意那个冒号,用法和 MATLAB 或 NumPy 是一样的!所以也可以这样
1 2 3 4 5 6
| In [39]: df.loc[dates[0]:dates[2], ['A', 'B']] Out[39]: A B 2017-06-08 -0.561773 0.881191 2017-06-09 -0.053409 0.814811 2017-06-10 -1.699722 0.971518
|
依旧和 MATLAB 一样,当有一个维度是标量(而不是范围或序列)的时候,选择出的矩阵维度会减少
1 2 3 4 5
| In [40]: df.loc[dates[0], ['A', 'B']] Out[40]: A -0.561773 B 0.881191 Name: 2017-06-08 00:00:00, dtype: float64
|
如果对所有的维度都写了标量,不就是选出一个元素吗?
1 2
| In [41]: df.loc[dates[0], 'A'] Out[41]: -0.56177252662051747
|
这种情况通常用 at ,速度更快
1 2
| In [42]: df.at[dates[0], 'A'] Out[42]: -0.56177252662051747
|
通过位置选择,即整数下标选择, 和 MATLAB 完全一样
这个就和数组类似啦,直接看例子。选出第4行:
1 2 3 4 5 6 7
| In [43]: df.iloc[3] Out[43]: A -2.411028 B 0.312229 C -1.879164 D 1.388484 Name: 2017-06-11 00:00:00, dtype: float64
|
选出3~4行,0~1列:
1 2 3 4 5
| In [45]: df.iloc[3:5, 0:2] Out[45]: A B 2017-06-11 -2.411028 0.312229 2017-06-12 -0.173929 -0.572149
|
也能用 list 选择, 挑出指定行和列
1 2 3 4 5 6
| In [46]: df.iloc[[1, 2, 4], [0, 3]] Out[46]: A D 2017-06-09 -0.053409 -1.753744 2017-06-10 -1.699722 1.088473 2017-06-12 -0.173929 -0.101223
|
也可以用slice
1 2 3 4 5
| In [47]: df.iloc[1:3, 1:3] Out[47]: B C 2017-06-09 0.814811 0.294231 2017-06-10 0.971518 -2.592852
|
选择单个元素
1 2 3 4 5
| In [48]: df.iloc[0,0] Out[48]: -0.56177252662051747 In [49]: df.iat[0,0] Out[49]: -0.56177252662051747
|
布尔值索引
根据单列的值来选择数据
1 2 3 4
| In [7]: df[df.A > 0] Out[7]: A B C D 2017-06-13 0.909448 -0.302722 -2.198783 -0.47542
|
从DataFrame中选择符合条件的值, 其中不符合条件的用NaN填充
1 2 3 4 5 6 7 8 9
| In [8]: df[df > 0] Out[8]: A B C D 2017-06-08 NaN NaN NaN NaN 2017-06-09 NaN 1.297100 NaN 0.231742 2017-06-10 NaN 2.380203 NaN NaN 2017-06-11 NaN 2.262221 1.135382 1.166144 2017-06-12 NaN NaN NaN NaN 2017-06-13 0.909448 NaN NaN NaN
|
isin() 函数:是否在集合中, 用来过滤数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
| In [9]: df2 = df.copy() In [10]: df2['E'] = ['one', 'two', 'three', 'four', 'three', 'one'] In [11]: df2 Out[11]: A B C D E 2017-06-08 -0.993188 -0.098497 -0.898984 -0.060261 one 2017-06-09 -0.262396 1.297100 -0.161798 0.231742 two 2017-06-10 -1.792578 2.380203 -0.116943 -1.486425 three 2017-06-11 -2.126554 2.262221 1.135382 1.166144 four 2017-06-12 -0.421283 -1.725798 -0.588929 -0.191101 three 2017-06-13 0.909448 -0.302722 -2.198783 -0.475420 one In [12]: df2[df2['E'].isin(['two', 'four'])] Out[12]: A B C D E 2017-06-09 -0.262396 1.297100 -0.161798 0.231742 two 2017-06-11 -2.126554 2.262221 1.135382 1.166144 four
|
设置 setting
为 DataFrame 增加新的列,按 index 对应
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| In [13]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20170608', periods=6)) In [14]: s1 Out[14]: 2017-06-08 1 2017-06-09 2 2017-06-10 3 2017-06-11 4 2017-06-12 5 2017-06-13 6 Freq: D, dtype: int64 In [15]: df['F'] = s1 In [16]: df Out[16]: A B C D F 2017-06-08 -0.993188 -0.098497 -0.898984 -0.060261 1 2017-06-09 -0.262396 1.297100 -0.161798 0.231742 2 2017-06-10 -1.792578 2.380203 -0.116943 -1.486425 3 2017-06-11 -2.126554 2.262221 1.135382 1.166144 4 2017-06-12 -0.421283 -1.725798 -0.588929 -0.191101 5 2017-06-13 0.909448 -0.302722 -2.198783 -0.475420 6
|
通过label, 下标和numpy数组设置值
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
| In [17]: df.at[dates[0], 'A'] = 0 In [19]: df.iat[0,1] = 0 In [21]: df.loc[:, 'D'] = np.array([5] * len(df)) In [23]: df Out[23]: A B C D F 2017-06-08 0.000000 0.000000 -0.898984 5 1 2017-06-09 -0.262396 1.297100 -0.161798 5 2 2017-06-10 -1.792578 2.380203 -0.116943 5 3 2017-06-11 -2.126554 2.262221 1.135382 5 4 2017-06-12 -0.421283 -1.725798 -0.588929 5 5 2017-06-13 0.909448 -0.302722 -2.198783 5 6
|
通过布尔值设置值
1 2 3 4 5 6 7 8 9 10 11
| In [28]: df[df>0] = -df In [29]: df Out[29]: A B C D F 2017-06-08 0.000000 0.000000 -0.898984 -5 -1 2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 2017-06-12 -0.421283 -1.725798 -0.588929 -5 -5 2017-06-13 -0.909448 -0.302722 -2.198783 -5 -6
|
缺失值
pandas 用 np.nan 表示缺失值。通常它不会被计算。
Reindexing 允许你改变/增加/删除 指定轴的index, 并返回数据的拷贝
1 2 3 4 5 6 7 8 9 10 11
| In [30]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E']) In [31]: df1.loc[dates[0]:dates[1], 'E'] = 1 In [32]: df1 Out[32]: A B C D F E 2017-06-08 0.000000 0.000000 -0.898984 -5 -1 1.0 2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 1.0 2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 NaN 2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 NaN
|
丢弃有NaN的行
1 2 3 4 5
| In [33]: df1.dropna() Out[33]: A B C D F E 2017-06-08 0.000000 0.0000 -0.898984 -5 -1 1.0 2017-06-09 -0.262396 -1.2971 -0.161798 -5 -2 1.0
|
填充缺失值
1 2 3 4 5 6 7
| In [35]: df1.fillna(value=5) Out[35]: A B C D F E 2017-06-08 0.000000 0.000000 -0.898984 -5 -1 1.0 2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 1.0 2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 5.0 2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 5.0
|
获取布尔值的 mask:哪些值是 NaN, 则为true,否则为false
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
| In [36]: pd.isnull(df1) Out[36]: A B C D F E 2017-06-08 False False False False False False 2017-06-09 False False False False False False 2017-06-10 False False False False False True 2017-06-11 False False False False False True In [37]: df1 Out[37]: A B C D F E 2017-06-08 0.000000 0.000000 -0.898984 -5 -1 1.0 2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 1.0 2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 NaN 2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 NaN
|
注意: 这里的df1数据其实没有改变, 返回的是运算后的数据拷贝
操作
说明: 操作都会把NaN(缺失值)排除在外
统计
平均值
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
| In [38]: df.mean() Out[38]: A -0.918710 B -1.328007 C -0.850137 D -5.000000 F -3.500000 dtype: float64 In [40]: df.mean(1) Out[40]: 2017-06-08 -1.379797 2017-06-09 -1.744259 2017-06-10 -2.457945 2017-06-11 -2.904831 2017-06-12 -2.547202 2017-06-13 -2.882190 Freq: D, dtype: float64
|
操作具有不同维度的对象需要对齐。pandas会沿着指定的维度自动广播
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
| In [42]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2) In [43]: s Out[43]: 2017-06-08 NaN 2017-06-09 NaN 2017-06-10 1.0 2017-06-11 3.0 2017-06-12 5.0 2017-06-13 NaN Freq: D, dtype: float64 In [45]: df.sub(s, axis='index') Out[45]: A B C D F 2017-06-08 NaN NaN NaN NaN NaN 2017-06-09 NaN NaN NaN NaN NaN 2017-06-10 -2.792578 -3.380203 -1.116943 -6.0 -4.0 2017-06-11 -5.126554 -5.262221 -4.135382 -8.0 -7.0 2017-06-12 -5.421283 -6.725798 -5.588929 -10.0 -10.0 2017-06-13 NaN NaN NaN NaN NaN In [46]: df Out[46]: A B C D F 2017-06-08 0.000000 0.000000 -0.898984 -5 -1 2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 2017-06-12 -0.421283 -1.725798 -0.588929 -5 -5 2017-06-13 -0.909448 -0.302722 -2.198783 -5 -6
|
Apply
对数据(行或列) Apply 函数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| In [47]: df.apply(np.cumsum) Out[47]: A B C D F 2017-06-08 0.000000 0.000000 -0.898984 -5 -1 2017-06-09 -0.262396 -1.297100 -1.060782 -10 -3 2017-06-10 -2.054974 -3.677303 -1.177726 -15 -6 2017-06-11 -4.181527 -5.939523 -2.313108 -20 -10 2017-06-12 -4.602811 -7.665321 -2.902037 -25 -15 2017-06-13 -5.512259 -7.968043 -5.100820 -30 -21 In [48]: df.apply(lambda x: x.max() -x.min()) Out[48]: A 2.126554 B 2.380203 C 2.081839 D 0.000000 F 5.000000 dtype: float64
|
直方图
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
| In [49]: s = pd.Series(np.random.randint(0, 7, size=10)) In [50]: s Out[50]: 0 0 1 4 2 6 3 2 4 6 5 6 6 0 7 0 8 3 9 1 dtype: int32 In [52]: s.value_counts() Out[52]: 6 3 0 3 4 1 3 1 2 1 1 1 dtype: int64
|
字符串方法
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| In [53]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) In [54]: s.str.lower() Out[54]: 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object
|
Merge
concat
简单地按行拼接
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
| In [55]: df = pd.DataFrame(np.random.randn(10, 4)) In [56]: df Out[56]: 0 1 2 3 0 1.279835 0.488299 0.122145 1.066159 1 -0.566047 -0.625790 -0.941786 -0.209994 2 0.364785 1.185549 -0.381762 1.752895 3 -0.568627 -0.235160 -1.602423 0.603979 4 0.698751 -1.656823 -0.306150 2.019342 5 -0.423725 1.321606 0.894416 -0.249282 6 -0.125866 -2.315650 0.376551 1.050506 7 -0.189071 -0.933617 -0.051930 -0.375252 8 0.478909 -2.041329 1.217890 -1.020701 9 -1.287622 -0.173968 0.387218 -0.004477 In [57]: pieces = [df[:3], df[3:7], df[7:]] In [58]: pieces Out[58]: [ 0 1 2 3 0 1.279835 0.488299 0.122145 1.066159 1 -0.566047 -0.625790 -0.941786 -0.209994 2 0.364785 1.185549 -0.381762 1.752895, 0 1 2 3 3 -0.568627 -0.235160 -1.602423 0.603979 4 0.698751 -1.656823 -0.306150 2.019342 5 -0.423725 1.321606 0.894416 -0.249282 6 -0.125866 -2.315650 0.376551 1.050506, 0 1 2 3 7 -0.189071 -0.933617 -0.051930 -0.375252 8 0.478909 -2.041329 1.217890 -1.020701 9 -1.287622 -0.173968 0.387218 -0.004477] In [59]: pd.concat(pieces) Out[59]: 0 1 2 3 0 1.279835 0.488299 0.122145 1.066159 1 -0.566047 -0.625790 -0.941786 -0.209994 2 0.364785 1.185549 -0.381762 1.752895 3 -0.568627 -0.235160 -1.602423 0.603979 4 0.698751 -1.656823 -0.306150 2.019342 5 -0.423725 1.321606 0.894416 -0.249282 6 -0.125866 -2.315650 0.376551 1.050506 7 -0.189071 -0.933617 -0.051930 -0.375252 8 0.478909 -2.041329 1.217890 -1.020701 9 -1.287622 -0.173968 0.387218 -0.004477
|
join
和 SQL 的 join 是一个意思
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| In [60]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]}) In [61]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]}) In [62]: left Out[62]: key lval 0 foo 1 1 foo 2 In [63]: right Out[63]: key rval 0 foo 4 1 foo 5 In [64]: pd.merge(left, right, on='key') Out[64]: key lval rval 0 foo 1 4 1 foo 1 5 2 foo 2 4 3 foo 2 5
|
Append
向 DataFrame 增加新的数据行
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| In [65]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D']) In [66]: df Out[66]: A B C D 0 -0.303120 0.419378 0.111816 -0.160104 1 0.924466 2.385372 -1.575671 -1.789283 2 1.106048 -0.087699 1.714677 -0.058831 3 0.765727 0.101523 -1.204472 0.011180 4 -2.084695 -0.517287 -0.497699 0.153516 5 -0.252460 0.044590 -0.481047 0.617817 6 -0.121789 -1.663595 -0.464836 -0.858457 7 -0.246298 0.687569 -2.081519 -1.529134 In [67]: s = df.iloc[3] In [68]: df.append(s, ignore_index=True) Out[68]: A B C D 0 -0.303120 0.419378 0.111816 -0.160104 1 0.924466 2.385372 -1.575671 -1.789283 2 1.106048 -0.087699 1.714677 -0.058831 3 0.765727 0.101523 -1.204472 0.011180 4 -2.084695 -0.517287 -0.497699 0.153516 5 -0.252460 0.044590 -0.481047 0.617817 6 -0.121789 -1.663595 -0.464836 -0.858457 7 -0.246298 0.687569 -2.081519 -1.529134 8 0.765727 0.101523 -1.204472 0.011180
|
Grouping
和 SQL 中的 GROUP BY 类似,包括以下这几步:
- 根据某些规则,把数据分组
- 对每组应用一个函数
- 合并结果到一个数据结构中
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
| In [69]: df = pd.DataFrame({ ...: 'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], ...: 'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], ...: 'C' : np.random.randn(8), ...: 'D' : np.random.randn(8)}) In [70]: df Out[70]: A B C D 0 foo one -1.203566 -1.199038 1 bar one 2.368016 -1.059656 2 foo two -0.704544 -1.047794 3 bar three -0.511092 1.121859 4 foo two -0.723414 -1.225536 5 bar two -0.363631 0.111177 6 foo one 0.271759 0.500798 7 foo three -0.248960 2.035166 In [71]: df.groupby('A').sum() Out[71]: C D A bar 1.493293 0.173380 foo -2.608725 -0.936404 In [72]: df.groupby(['A', 'B']).sum() Out[72]: C D A B bar one 2.368016 -1.059656 three -0.511092 1.121859 two -0.363631 0.111177 foo one -0.931807 -0.698240 three -0.248960 2.035166 two -1.427958 -2.273330
|
Reshape
Stack 叠层
准备数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
| In [73]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', ...: 'foo', 'foo', 'qux', 'qux'], ...: ['one', 'two', 'one', 'two', ...: 'one', 'two', 'one', 'two']])) In [75]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) In [76]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A','B']) In [77]: df2 = df[:4] In [78]: df2 Out[78]: A B first second bar one 0.181096 -0.610853 two -1.326709 0.269153 baz one -0.455180 -0.070055 two -0.953993 -0.235160 ``` stack() 把 DataFrame 的所有列“压缩”到 index 里去 ```python In [79]: stacked = df2.stack() In [80]: stacked Out[80]: first second bar one A 0.181096 B -0.610853 two A -1.326709 B 0.269153 baz one A -0.455180 B -0.070055 two A -0.953993 B -0.235160 dtype: float64
|
反之,只要是 MultiIndex 都可以用 unstack() 恢复出列,默认把最后一个 index 解开
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
| In [81]: stacked.unstack() Out[81]: A B first second bar one 0.181096 -0.610853 two -1.326709 0.269153 baz one -0.455180 -0.070055 two -0.953993 -0.235160 In [82]: stacked.unstack(1) Out[82]: second one two first bar A 0.181096 -1.326709 B -0.610853 0.269153 baz A -0.455180 -0.953993 B -0.070055 -0.235160 In [83]: stacked.unstack(0) Out[83]: first bar baz second one A 0.181096 -0.455180 B -0.610853 -0.070055 two A -1.326709 -0.953993 B 0.269153 -0.235160
|
Pivot Table 旋转
准备数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
| In [84]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3, ...: 'B' : ['A', 'B', 'C'] * 4, ...: 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, ...: 'D' : np.random.randn(12), ...: 'E' : np.random.randn(12)}) In [85]: df Out[85]: A B C D E 0 one A foo -0.379428 -0.050681 1 one B foo 1.650807 0.024530 2 two C foo 0.349554 1.202692 3 three A bar -1.516381 -0.163382 4 one B bar 0.360722 -0.241622 5 one C bar -0.276398 0.581192 6 two A foo 0.304563 -0.663271 7 three B foo 1.328499 0.485223 8 one C foo 1.665213 -0.577843 9 one A bar -0.229248 -0.335329 10 two B bar -0.133112 1.119350 11 three C bar -0.383992 0.787800
|
pivot 是把原来的数据(values)作为新表的行(index)、列(columns)
1 2 3 4 5 6 7 8 9 10 11 12 13
| In [86]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[86]: C bar foo A B one A -0.229248 -0.379428 B 0.360722 1.650807 C -0.276398 1.665213 three A -1.516381 NaN B NaN 1.328499 C -0.383992 NaN two A NaN 0.304563 B -0.133112 NaN C NaN 0.349554
|
时间序列
pandas 的时间序列功能在金融应用中很有用。
resample功能
1 2 3 4 5 6 7 8
| In [88]: rng = pd.date_range('9/6/2017', periods=100, freq='S') In [89]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng) In [90]: ts.resample('5Min').sum() Out[90]: 2017-09-06 26085 Freq: 5T, dtype: int32
|
时区表示
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| In [94]: rng = pd.date_range('9/6/2017 00:00', periods=5, freq='D') In [95]: ts = pd.Series(np.random.randn(len(rng)), rng) In [96]: ts Out[96]: 2017-09-06 -0.715837 2017-09-07 1.021448 2017-09-08 1.186508 2017-09-09 0.606852 2017-09-10 0.566530 Freq: D, dtype: float64 In [97]: ts_utc = ts.tz_localize('UTC') In [98]: ts_utc Out[98]: 2017-09-06 00:00:00+00:00 -0.715837 2017-09-07 00:00:00+00:00 1.021448 2017-09-08 00:00:00+00:00 1.186508 2017-09-09 00:00:00+00:00 0.606852 2017-09-10 00:00:00+00:00 0.566530 Freq: D, dtype: float64
|
时区转换
1 2 3 4 5 6 7 8
| In [100]: ts_utc.tz_convert('US/Eastern') Out[100]: 2017-09-05 20:00:00-04:00 -0.715837 2017-09-06 20:00:00-04:00 1.021448 2017-09-07 20:00:00-04:00 1.186508 2017-09-08 20:00:00-04:00 0.606852 2017-09-09 20:00:00-04:00 0.566530 Freq: D, dtype: float64
|
在时间跨度表示之间进行转换
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
| In [101]: rng = pd.date_range('9/6/2017', periods=5, freq='M') In [102]: ts = pd.Series(np.random.randn(len(rng)), index=rng) In [103]: ts Out[103]: 2017-09-30 -0.337867 2017-10-31 0.546883 2017-11-30 0.063004 2017-12-31 0.419636 2018-01-31 2.562404 Freq: M, dtype: float64 In [104]: ps = ts.to_period() In [105]: ps Out[105]: 2017-09 -0.337867 2017-10 0.546883 2017-11 0.063004 2017-12 0.419636 2018-01 2.562404 Freq: M, dtype: float64 In [106]: ps.to_timestamp() Out[106]: 2017-09-01 -0.337867 2017-10-01 0.546883 2017-11-01 0.063004 2017-12-01 0.419636 2018-01-01 2.562404 Freq: MS, dtype: float64 In [107]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV') In [108]: ts = pd.Series(np.random.randn(len(prng)), prng) In [109]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9 In [110]: ts.head() Out[110]: 1990-03-01 09:00 -0.315663 1990-06-01 09:00 2.092510 1990-09-01 09:00 0.337302 1990-12-01 09:00 0.037893 1991-03-01 09:00 0.463572 Freq: H, dtype: float64
|
类别
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
| In [111]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']}) In [112]: df Out[112]: id raw_grade 0 1 a 1 2 b 2 3 b 3 4 a 4 5 a 5 6 e In [113]: df['grade'] = df['raw_grade'].astype('category') In [114]: df['grade'] Out[114]: 0 a 1 b 2 b 3 a 4 a 5 e Name: grade, dtype: category Categories (3, object): [a, b, e]
|
类别可以 inplace 地赋值:(只是改一下对应的字符串嘛,类别是用 Index 对象存储的)
In [115]: df['grade'].cat.categories = ["very good", "good", "very bad"]
修改类别时,如果有新的类别,会自动加进去
1 2 3 4 5 6 7 8 9 10 11 12
| In [117]: df['grade'] = df['grade'].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) In [118]: df['grade'] Out[118]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]
|
根据类别排序
1 2 3 4 5 6 7 8 9
| In [119]: df.sort_values(by='grade') Out[119]: id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good
|
做 group by 的时候,空的类别也会被呈现出来
1 2 3 4 5 6 7 8 9
| In [120]: df.groupby('grade').size() Out[120]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
|
绘图
1 2 3 4 5 6
| In [121]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000)) In [122]: ts = ts.cumsum() In [123]: ts.plot() Out[123]: <matplotlib.axes._subplots.AxesSubplot at 0x6a0be10>
|

对于DtaFrame,可以直接plot
1 2 3 4 5 6
| In [124]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D']) In [125]: df = df.cumsum() In [126]: plt.figure(); df.plot(); plt.legend(loc='best') Out[126]: <matplotlib.legend.Legend at 0x6c3e8f0>
|

读写数据
CSV
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
| In [127]: df.to_csv('foo.csv') In [128]: pd.read_csv('foo.csv') Out[128]: Unnamed: 0 A B C D 0 2000-01-01 1.239452 1.983851 0.578987 1.155658 1 2000-01-02 0.295568 3.704782 1.962402 0.885559 2 2000-01-03 -0.139800 3.872238 2.793025 0.972079 3 2000-01-04 -0.804336 3.259410 2.159552 1.785479 4 2000-01-05 0.171984 3.027446 2.253854 0.779544 5 2000-01-06 1.781807 1.836517 4.558960 1.999731 6 2000-01-07 0.650927 0.154262 3.650160 0.685396 7 2000-01-08 0.492898 -0.198763 3.646034 -0.073848 8 2000-01-09 -0.100468 -0.623716 3.217366 0.220008 9 2000-01-10 -0.905394 0.282200 3.153474 -0.189584 10 2000-01-11 -0.962155 -1.147225 3.353251 1.283288 11 2000-01-12 -1.853063 -1.539308 5.246351 0.222400 12 2000-01-13 -3.385665 -2.982384 3.952838 0.319281 13 2000-01-14 -4.411986 -2.914887 4.775386 0.597495 14 2000-01-15 -5.938706 -3.437068 6.882886 1.105879 15 2000-01-16 -6.537699 -1.947668 8.142251 0.948407 16 2000-01-17 -6.786304 -2.735740 8.484593 1.034685 17 2000-01-18 -7.145344 -1.666703 9.420653 -0.919941 18 2000-01-19 -8.283291 0.101307 10.654933 -0.936284 19 2000-01-20 -8.330414 0.820054 11.207165 -2.622354 20 2000-01-21 -7.942291 1.559753 9.201008 -2.923220 21 2000-01-22 -4.675543 2.337827 10.364670 -3.112916 22 2000-01-23 -5.774614 0.408297 9.441821 -2.040018 23 2000-01-24 -5.993547 -0.480329 10.311053 0.852373 24 2000-01-25 -4.137167 0.099883 10.224100 0.576367 25 2000-01-26 -2.866354 -0.772702 11.812000 1.671981 26 2000-01-27 -1.764316 -2.534161 11.634361 1.638750 27 2000-01-28 -3.220852 -3.353006 11.399995 -0.562762 28 2000-01-29 -0.889241 -2.794083 10.445614 0.473105 29 2000-01-30 0.428342 -3.857681 10.930199 -0.118981 .. ... ... ... ... ... 970 2002-08-28 -2.853884 -2.041226 28.387424 17.129529 971 2002-08-29 -5.058352 0.041153 28.095666 16.654813 972 2002-08-30 -4.478272 -1.613275 26.989764 17.653338 973 2002-08-31 -2.815382 -2.764649 26.197949 16.442647 974 2002-09-01 -2.621575 -2.785604 28.007243 14.962121 975 2002-09-02 -2.476145 -3.128888 27.541079 12.853070 976 2002-09-03 -0.483923 -3.061629 27.130099 13.401077 977 2002-09-04 -0.885055 -2.059356 26.139260 12.725815 978 2002-09-05 -3.438688 -3.060238 26.267361 13.601928 979 2002-09-06 -3.646110 -2.908451 27.639157 13.749199 980 2002-09-07 -3.742097 -5.843492 27.138204 14.621900 981 2002-09-08 -4.367495 -5.523435 26.429433 14.372776 982 2002-09-09 -6.481100 -7.691100 26.325775 14.060133 983 2002-09-10 -4.406432 -5.925796 25.808105 13.717117 984 2002-09-11 -5.747746 -6.226884 26.334882 13.539911 985 2002-09-12 -5.075507 -5.976265 25.582403 13.955776 986 2002-09-13 -5.957833 -4.514403 26.154568 15.948706 987 2002-09-14 -4.822016 -3.510714 26.770429 14.903106 988 2002-09-15 -5.413908 -3.678965 29.255640 15.903795 989 2002-09-16 -7.305733 -4.970404 30.726133 17.075891 990 2002-09-17 -6.921962 -3.342561 32.499852 15.309124 991 2002-09-18 -7.650811 -3.539989 31.823857 15.100428 992 2002-09-19 -7.680999 -2.289270 31.098525 12.891011 993 2002-09-20 -7.849671 -1.712562 31.364746 13.909086 994 2002-09-21 -8.739466 -2.122690 32.004201 14.095981 995 2002-09-22 -8.740361 -0.656488 29.674406 12.590248 996 2002-09-23 -6.309331 -0.986256 30.370786 13.450941 997 2002-09-24 -6.205371 -0.419119 30.260205 12.028273 998 2002-09-25 -7.573782 -2.492400 31.063298 12.361099 999 2002-09-26 -7.317950 -2.840246 31.560766 12.502660
|
HDF5
1 2 3
| In [129]: df.to_hdf('foo.h5','df') In [130]: pd.read_hdf('foo.h5','df')
|
Excel
1 2 3
| In [131]: df.to_excel('foo.xlsx', sheet_name='Sheet1') In [132]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
|
坑
如下,不能直接把返回值当作布尔值。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| In [131]: if pd.Series([False, True, False]): ...: print('I was true') ...: --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-131-ec15f6b92216> in <module>() ----> 1 if pd.Series([False, True, False]): 2 print('I was true') 3 c:\python34\lib\site-packages\pandas\core\generic.py in __nonzero__(self) 953 raise ValueError("The truth value of a {0} is ambiguous. " 954 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." --> 955 .format(self.__class__.__name__)) 956 957 __bool__ = __nonzero__ ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
|
参考