This is a short introductory pandas tutorial for beginners; for more advanced usage, see the cookbook.
Installing numpy, matplotlib, and pandas is omitted here. Start by importing the following packages:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt

Object creation

Pass a list to create a Series; pandas creates a default integer index automatically:

In [4]: s = pd.Series([1,3,5,np.nan, 6, 8])
In [5]: s
Out[5]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
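If the default 0..N-1 index is not what you want, an explicit index can also be passed; a minimal sketch with illustrative values:

```python
import pandas as pd

# An explicit index replaces the default RangeIndex
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s2['b'])  # label-based access
```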

Pass a NumPy array to create a DataFrame with a datetime index and labeled columns.
A DataFrame is a data table with multiple columns, each carrying a label; a DataFrame has an index as well:

# create a DatetimeIndex
In [6]: dates = pd.date_range('20170608', periods=6)
In [7]: dates
Out[7]:
DatetimeIndex(['2017-06-08', '2017-06-09', '2017-06-10', '2017-06-11',
'2017-06-12', '2017-06-13'],
dtype='datetime64[ns]', freq='D')
In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
In [9]: df
Out[9]:
A B C D
2017-06-08 -0.561773 0.881191 -2.697783 -0.034672
2017-06-09 -0.053409 0.814811 0.294231 -1.753744
2017-06-10 -1.699722 0.971518 -2.592852 1.088473
2017-06-11 -2.411028 0.312229 -1.879164 1.388484
2017-06-12 -0.173929 -0.572149 2.044024 -0.101223
2017-06-13 -0.906777 -0.207889 -0.776134 2.327599

Pass a dict to create a DataFrame; each dict value is converted into a Series.
You can think of a DataFrame as a collection of Series:

In [10]: df2 = pd.DataFrame({
...: 'A': 1.,
...: 'B': pd.Timestamp('20170608'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3]*4, dtype='int32'),
...: 'E': pd.Categorical(['test', 'train', 'test', 'train']),
...: 'F': 'foo'})
In [11]: df2
Out[11]:
A B C D E F
0 1.0 2017-06-08 1.0 3 test foo
1 1.0 2017-06-08 1.0 3 train foo
2 1.0 2017-06-08 1.0 3 test foo
3 1.0 2017-06-08 1.0 3 train foo

Use dtypes to inspect each column's data type:

In [12]: df2.dtypes
Out[12]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
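A column's dtype can be converted with astype; a small sketch (the DataFrame here is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'D': [3, 3, 3, 3]})   # integer column
df2['D'] = df2['D'].astype('float64')     # convert the column to float
```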

Viewing data

View the data of a single column:

In [15]: df2.A
Out[15]:
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64

Use head() and tail() to view the first and last few rows; both default to 5 rows:

In [16]: df.head()
Out[16]:
A B C D
2017-06-08 -0.561773 0.881191 -2.697783 -0.034672
2017-06-09 -0.053409 0.814811 0.294231 -1.753744
2017-06-10 -1.699722 0.971518 -2.592852 1.088473
2017-06-11 -2.411028 0.312229 -1.879164 1.388484
2017-06-12 -0.173929 -0.572149 2.044024 -0.101223
In [17]: df.tail(3)
Out[17]:
A B C D
2017-06-11 -2.411028 0.312229 -1.879164 1.388484
2017-06-12 -0.173929 -0.572149 2.044024 -0.101223
2017-06-13 -0.906777 -0.207889 -0.776134 2.327599
In [18]: df.head(2)
Out[18]:
A B C D
2017-06-08 -0.561773 0.881191 -2.697783 -0.034672
2017-06-09 -0.053409 0.814811 0.294231 -1.753744
In [19]: df.tail()
Out[19]:
A B C D
2017-06-09 -0.053409 0.814811 0.294231 -1.753744
2017-06-10 -1.699722 0.971518 -2.592852 1.088473
2017-06-11 -2.411028 0.312229 -1.879164 1.388484
2017-06-12 -0.173929 -0.572149 2.044024 -0.101223
2017-06-13 -0.906777 -0.207889 -0.776134 2.327599

View the index, columns, and underlying data separately; a DataFrame stores its data internally in NumPy format:

In [20]: df.index
Out[20]:
DatetimeIndex(['2017-06-08', '2017-06-09', '2017-06-10', '2017-06-11',
'2017-06-12', '2017-06-13'],
dtype='datetime64[ns]', freq='D')
In [21]: df.columns
Out[21]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [22]: df.values
Out[22]:
array([[-0.56177253, 0.88119144, -2.6977834 , -0.03467225],
[-0.05340873, 0.81481114, 0.29423114, -1.75374372],
[-1.69972161, 0.97151768, -2.59285248, 1.08847275],
[-2.41102846, 0.3122287 , -1.87916396, 1.38848363],
[-0.17392908, -0.57214913, 2.04402398, -0.10122313],
[-0.90677665, -0.20788934, -0.77613381, 2.32759919]])
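Newer pandas versions recommend to_numpy() over the .values attribute for extracting the underlying array; a sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
arr = df.to_numpy()   # modern equivalent of df.values
```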

describe() shows a quick statistical summary of the data:

In [23]: df.describe()
Out[23]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.967773 0.366618 -0.934613 0.485819
std 0.922339 0.639667 1.852466 1.428378
min -2.411028 -0.572149 -2.697783 -1.753744
25% -1.501485 -0.077860 -2.414430 -0.084585
50% -0.734275 0.563520 -1.327649 0.526900
75% -0.270890 0.864596 0.026640 1.313481
max -0.053409 0.971518 2.044024 2.327599

As in NumPy, the transpose (swapping rows and columns) is easy to get:

In [24]: df.T
Out[24]:
2017-06-08 2017-06-09 2017-06-10 2017-06-11 2017-06-12 2017-06-13
A -0.561773 -0.053409 -1.699722 -2.411028 -0.173929 -0.906777
B 0.881191 0.814811 0.971518 0.312229 -0.572149 -0.207889
C -2.697783 0.294231 -2.592852 -1.879164 2.044024 -0.776134
D -0.034672 -1.753744 1.088473 1.388484 -0.101223 2.327599

Sort by index along an axis (axis=1 sorts by column labels, axis=0 by row labels):

In [26]: df.sort_index(axis=1, ascending=False)
Out[26]:
D C B A
2017-06-08 -0.034672 -2.697783 0.881191 -0.561773
2017-06-09 -1.753744 0.294231 0.814811 -0.053409
2017-06-10 1.088473 -2.592852 0.971518 -1.699722
2017-06-11 1.388484 -1.879164 0.312229 -2.411028
2017-06-12 -0.101223 2.044024 -0.572149 -0.173929
2017-06-13 2.327599 -0.776134 -0.207889 -0.906777

Sort by values:

In [30]: df.sort_values(by='B')
Out[30]:
A B C D
2017-06-12 -0.173929 -0.572149 2.044024 -0.101223
2017-06-13 -0.906777 -0.207889 -0.776134 2.327599
2017-06-11 -2.411028 0.312229 -1.879164 1.388484
2017-06-09 -0.053409 0.814811 0.294231 -1.753744
2017-06-08 -0.561773 0.881191 -2.697783 -0.034672
2017-06-10 -1.699722 0.971518 -2.592852 1.088473
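sort_values also accepts ascending=False for a descending sort; a small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 2], 'C': [9, 8, 7]})
out = df.sort_values(by='B', ascending=False)  # largest B first
```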

Selection

Note: the following is convenient for interactive sessions, but for production code prefer the optimized accessors .at, .iat, .loc, and .iloc (.ix is deprecated in newer pandas).

Getting rows/columns
Selecting a single column from a DataFrame yields a Series:

In [31]: df['A']
Out[31]:
2017-06-08 -0.561773
2017-06-09 -0.053409
2017-06-10 -1.699722
2017-06-11 -2.411028
2017-06-12 -0.173929
2017-06-13 -0.906777
Freq: D, Name: A, dtype: float64

Slicing rows with []:

In [35]: df[0:3]
Out[35]:
A B C D
2017-06-08 -0.561773 0.881191 -2.697783 -0.034672
2017-06-09 -0.053409 0.814811 0.294231 -1.753744
2017-06-10 -1.699722 0.971518 -2.592852 1.088473

Selection by label
Access a row by its timestamp label (here dates[1] is Timestamp('2017-06-09')):

In [36]: df.loc[dates[1]]
Out[36]:
A -0.053409
B 0.814811
C 0.294231
D -1.753744
Name: 2017-06-09 00:00:00, dtype: float64

Selecting multiple labels:

In [38]: df.loc[:,['A', 'B']]
Out[38]:
A B
2017-06-08 -0.561773 0.881191
2017-06-09 -0.053409 0.814811
2017-06-10 -1.699722 0.971518
2017-06-11 -2.411028 0.312229
2017-06-12 -0.173929 -0.572149
2017-06-13 -0.906777 -0.207889

Note the colon: it works just like in MATLAB or NumPy, so you can also write:

In [39]: df.loc[dates[0]:dates[2], ['A', 'B']]
Out[39]:
A B
2017-06-08 -0.561773 0.881191
2017-06-09 -0.053409 0.814811
2017-06-10 -1.699722 0.971518

Again as in MATLAB, when one dimension is a scalar (rather than a range or sequence), the result loses that dimension:

In [40]: df.loc[dates[0], ['A', 'B']]
Out[40]:
A -0.561773
B 0.881191
Name: 2017-06-08 00:00:00, dtype: float64

With scalars for every dimension, a single element is selected:

In [41]: df.loc[dates[0], 'A']
Out[41]: -0.56177252662051747

For this case, at is the usual choice and is faster:

In [42]: df.at[dates[0], 'A']
Out[42]: -0.56177252662051747

Selection by position, i.e. integer indexing, just as in MATLAB.
This works like array indexing; see the examples. Select the 4th row:

In [43]: df.iloc[3]
Out[43]:
A -2.411028
B 0.312229
C -1.879164
D 1.388484
Name: 2017-06-11 00:00:00, dtype: float64

Select rows 3 through 4 and columns 0 through 1:

In [45]: df.iloc[3:5, 0:2]
Out[45]:
A B
2017-06-11 -2.411028 0.312229
2017-06-12 -0.173929 -0.572149

Lists of integers also work, to pick out specific rows and columns:

In [46]: df.iloc[[1, 2, 4], [0, 3]]
Out[46]:
A D
2017-06-09 -0.053409 -1.753744
2017-06-10 -1.699722 1.088473
2017-06-12 -0.173929 -0.101223

Slices work too:

In [47]: df.iloc[1:3, 1:3]
Out[47]:
B C
2017-06-09 0.814811 0.294231
2017-06-10 0.971518 -2.592852

Selecting a single element:

In [48]: df.iloc[0,0]
Out[48]: -0.56177252662051747
In [49]: df.iat[0,0]
Out[49]: -0.56177252662051747

Boolean indexing

Select rows based on the values of a single column:

In [7]: df[df.A > 0]
Out[7]:
A B C D
2017-06-13 0.909448 -0.302722 -2.198783 -0.47542

Select values from the whole DataFrame that meet a condition; values that do not are filled with NaN:

In [8]: df[df > 0]
Out[8]:
A B C D
2017-06-08 NaN NaN NaN NaN
2017-06-09 NaN 1.297100 NaN 0.231742
2017-06-10 NaN 2.380203 NaN NaN
2017-06-11 NaN 2.262221 1.135382 1.166144
2017-06-12 NaN NaN NaN NaN
2017-06-13 0.909448 NaN NaN NaN
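df[df > 0] is equivalent to df.where(df > 0), which makes the NaN-filling explicit; a sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, -2], 'B': [-3, 4]})
masked = df.where(df > 0)   # keeps values > 0, replaces the rest with NaN
```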

isin() tests set membership and can be used to filter data:

In [9]: df2 = df.copy()
In [10]: df2['E'] = ['one', 'two', 'three', 'four', 'three', 'one']
In [11]: df2
Out[11]:
A B C D E
2017-06-08 -0.993188 -0.098497 -0.898984 -0.060261 one
2017-06-09 -0.262396 1.297100 -0.161798 0.231742 two
2017-06-10 -1.792578 2.380203 -0.116943 -1.486425 three
2017-06-11 -2.126554 2.262221 1.135382 1.166144 four
2017-06-12 -0.421283 -1.725798 -0.588929 -0.191101 three
2017-06-13 0.909448 -0.302722 -2.198783 -0.475420 one
In [12]: df2[df2['E'].isin(['two', 'four'])]
Out[12]:
A B C D E
2017-06-09 -0.262396 1.297100 -0.161798 0.231742 two
2017-06-11 -2.126554 2.262221 1.135382 1.166144 four
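The mask from isin() can be inverted with ~ to keep the rows that are not in the set; a sketch:

```python
import pandas as pd

df2 = pd.DataFrame({'E': ['one', 'two', 'four']})
# ~ inverts the boolean mask, keeping rows NOT in the set
kept = df2[~df2['E'].isin(['two', 'four'])]
```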

Setting

Add a new column to a DataFrame, aligned by index:

In [13]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20170608', periods=6))
In [14]: s1
Out[14]:
2017-06-08 1
2017-06-09 2
2017-06-10 3
2017-06-11 4
2017-06-12 5
2017-06-13 6
Freq: D, dtype: int64
In [15]: df['F'] = s1
In [16]: df
Out[16]:
A B C D F
2017-06-08 -0.993188 -0.098497 -0.898984 -0.060261 1
2017-06-09 -0.262396 1.297100 -0.161798 0.231742 2
2017-06-10 -1.792578 2.380203 -0.116943 -1.486425 3
2017-06-11 -2.126554 2.262221 1.135382 1.166144 4
2017-06-12 -0.421283 -1.725798 -0.588929 -0.191101 5
2017-06-13 0.909448 -0.302722 -2.198783 -0.475420 6

Setting values by label, by position, and with a NumPy array:

# set a value by label
In [17]: df.at[dates[0], 'A'] = 0
# set a value by position
In [19]: df.iat[0,1] = 0
# set values with a NumPy array
In [21]: df.loc[:, 'D'] = np.array([5] * len(df))
In [23]: df
Out[23]:
A B C D F
2017-06-08 0.000000 0.000000 -0.898984 5 1
2017-06-09 -0.262396 1.297100 -0.161798 5 2
2017-06-10 -1.792578 2.380203 -0.116943 5 3
2017-06-11 -2.126554 2.262221 1.135382 5 4
2017-06-12 -0.421283 -1.725798 -0.588929 5 5
2017-06-13 0.909448 -0.302722 -2.198783 5 6

Setting values with a boolean condition:

In [28]: df[df>0] = -df
In [29]: df
Out[29]:
A B C D F
2017-06-08 0.000000 0.000000 -0.898984 -5 -1
2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2
2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3
2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4
2017-06-12 -0.421283 -1.725798 -0.588929 -5 -5
2017-06-13 -0.909448 -0.302722 -2.198783 -5 -6

Missing data

pandas uses np.nan to represent missing data. By default it is excluded from computations.
Reindexing lets you change/add/delete the index on a specified axis; it returns a copy of the data:

In [30]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [31]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [32]: df1
Out[32]:
A B C D F E
2017-06-08 0.000000 0.000000 -0.898984 -5 -1 1.0
2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 1.0
2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 NaN
2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 NaN

Drop any rows that contain NaN:

In [33]: df1.dropna() # or df1.dropna(how='any')
Out[33]:
A B C D F E
2017-06-08 0.000000 0.0000 -0.898984 -5 -1 1.0
2017-06-09 -0.262396 -1.2971 -0.161798 -5 -2 1.0

Fill in missing values:

In [35]: df1.fillna(value=5)
Out[35]:
A B C D F E
2017-06-08 0.000000 0.000000 -0.898984 -5 -1 1.0
2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 1.0
2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 5.0
2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 5.0
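Besides a constant, missing values can be filled from their neighbors, e.g. forward fill with ffill(); a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.ffill()   # propagate the last valid observation forward
```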

Get a boolean mask: True where a value is NaN, False otherwise:

In [36]: pd.isnull(df1)
Out[36]:
A B C D F E
2017-06-08 False False False False False False
2017-06-09 False False False False False False
2017-06-10 False False False False False True
2017-06-11 False False False False False True
In [37]: df1
Out[37]:
A B C D F E
2017-06-08 0.000000 0.000000 -0.898984 -5 -1 1.0
2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2 1.0
2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3 NaN
2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4 NaN

Note: df1 itself has not changed here; what is returned is a copy with the computed result.

Operations

Note: operations generally exclude missing data (NaN).

Statistics

Mean:

# mean of each column
In [38]: df.mean() # df.mean(0)
Out[38]:
A -0.918710
B -1.328007
C -0.850137
D -5.000000
F -3.500000
dtype: float64
# mean of each row
In [40]: df.mean(1)
Out[40]:
2017-06-08 -1.379797
2017-06-09 -1.744259
2017-06-10 -2.457945
2017-06-11 -2.904831
2017-06-12 -2.547202
2017-06-13 -2.882190
Freq: D, dtype: float64

Operating between objects of different dimensionality requires alignment; pandas broadcasts automatically along the specified dimension:

In [42]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [43]: s
Out[43]:
2017-06-08 NaN
2017-06-09 NaN
2017-06-10 1.0
2017-06-11 3.0
2017-06-12 5.0
2017-06-13 NaN
Freq: D, dtype: float64
In [45]: df.sub(s, axis='index')
Out[45]:
A B C D F
2017-06-08 NaN NaN NaN NaN NaN
2017-06-09 NaN NaN NaN NaN NaN
2017-06-10 -2.792578 -3.380203 -1.116943 -6.0 -4.0
2017-06-11 -5.126554 -5.262221 -4.135382 -8.0 -7.0
2017-06-12 -5.421283 -6.725798 -5.588929 -10.0 -10.0
2017-06-13 NaN NaN NaN NaN NaN
In [46]: df
Out[46]:
A B C D F
2017-06-08 0.000000 0.000000 -0.898984 -5 -1
2017-06-09 -0.262396 -1.297100 -0.161798 -5 -2
2017-06-10 -1.792578 -2.380203 -0.116943 -5 -3
2017-06-11 -2.126554 -2.262221 -1.135382 -5 -4
2017-06-12 -0.421283 -1.725798 -0.588929 -5 -5
2017-06-13 -0.909448 -0.302722 -2.198783 -5 -6

Apply

Apply a function to the data (per row or per column):

In [47]: df.apply(np.cumsum)
Out[47]:
A B C D F
2017-06-08 0.000000 0.000000 -0.898984 -5 -1
2017-06-09 -0.262396 -1.297100 -1.060782 -10 -3
2017-06-10 -2.054974 -3.677303 -1.177726 -15 -6
2017-06-11 -4.181527 -5.939523 -2.313108 -20 -10
2017-06-12 -4.602811 -7.665321 -2.902037 -25 -15
2017-06-13 -5.512259 -7.968043 -5.100820 -30 -21
In [48]: df.apply(lambda x: x.max() -x.min())
Out[48]:
A 2.126554
B 2.380203
C 2.081839
D 0.000000
F 5.000000
dtype: float64
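Related to apply, the agg method runs several reductions at once; a sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
out = df.agg(['min', 'max'])   # one result row per reduction
```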

Histogramming

In [49]: s = pd.Series(np.random.randint(0, 7, size=10))
In [50]: s
Out[50]:
0 0
1 4
2 6
3 2
4 6
5 6
6 0
7 0
8 3
9 1
dtype: int32
In [52]: s.value_counts()
Out[52]:
6 3
0 3
4 1
3 1
2 1
1 1
dtype: int64

String methods

In [53]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [54]: s.str.lower()
Out[54]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
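Other vectorized string methods work the same way, e.g. str.contains for filtering; a sketch:

```python
import pandas as pd

s = pd.Series(['Aaba', 'Baca', 'dog'])
mask = s.str.contains('a')    # elementwise substring test
```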

Merge

concat

Simply concatenate along rows:

In [55]: df = pd.DataFrame(np.random.randn(10, 4))
In [56]: df
Out[56]:
0 1 2 3
0 1.279835 0.488299 0.122145 1.066159
1 -0.566047 -0.625790 -0.941786 -0.209994
2 0.364785 1.185549 -0.381762 1.752895
3 -0.568627 -0.235160 -1.602423 0.603979
4 0.698751 -1.656823 -0.306150 2.019342
5 -0.423725 1.321606 0.894416 -0.249282
6 -0.125866 -2.315650 0.376551 1.050506
7 -0.189071 -0.933617 -0.051930 -0.375252
8 0.478909 -2.041329 1.217890 -1.020701
9 -1.287622 -0.173968 0.387218 -0.004477
# split into pieces
In [57]: pieces = [df[:3], df[3:7], df[7:]]
In [58]: pieces
Out[58]:
[ 0 1 2 3
0 1.279835 0.488299 0.122145 1.066159
1 -0.566047 -0.625790 -0.941786 -0.209994
2 0.364785 1.185549 -0.381762 1.752895,
0 1 2 3
3 -0.568627 -0.235160 -1.602423 0.603979
4 0.698751 -1.656823 -0.306150 2.019342
5 -0.423725 1.321606 0.894416 -0.249282
6 -0.125866 -2.315650 0.376551 1.050506,
0 1 2 3
7 -0.189071 -0.933617 -0.051930 -0.375252
8 0.478909 -2.041329 1.217890 -1.020701
9 -1.287622 -0.173968 0.387218 -0.004477]
In [59]: pd.concat(pieces)
Out[59]:
0 1 2 3
0 1.279835 0.488299 0.122145 1.066159
1 -0.566047 -0.625790 -0.941786 -0.209994
2 0.364785 1.185549 -0.381762 1.752895
3 -0.568627 -0.235160 -1.602423 0.603979
4 0.698751 -1.656823 -0.306150 2.019342
5 -0.423725 1.321606 0.894416 -0.249282
6 -0.125866 -2.315650 0.376551 1.050506
7 -0.189071 -0.933617 -0.051930 -0.375252
8 0.478909 -2.041329 1.217890 -1.020701
9 -1.287622 -0.173968 0.387218 -0.004477

join

Same idea as a SQL join (pandas uses pd.merge):

In [60]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [61]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [62]: left
Out[62]:
key lval
0 foo 1
1 foo 2
In [63]: right
Out[63]:
key rval
0 foo 4
1 foo 5
In [64]: pd.merge(left, right, on='key')
Out[64]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
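With duplicate keys, the merge above yields the cartesian product of the matches (4 rows); with unique keys, each key matches exactly once. A sketch:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
out = pd.merge(left, right, on='key')   # one output row per matching key
```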

Append

Add new rows of data to a DataFrame. (DataFrame.append was deprecated and removed in pandas 2.0 in favor of pd.concat.)

In [65]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [66]: df
Out[66]:
A B C D
0 -0.303120 0.419378 0.111816 -0.160104
1 0.924466 2.385372 -1.575671 -1.789283
2 1.106048 -0.087699 1.714677 -0.058831
3 0.765727 0.101523 -1.204472 0.011180
4 -2.084695 -0.517287 -0.497699 0.153516
5 -0.252460 0.044590 -0.481047 0.617817
6 -0.121789 -1.663595 -0.464836 -0.858457
7 -0.246298 0.687569 -2.081519 -1.529134
In [67]: s = df.iloc[3]
In [68]: df.append(s, ignore_index=True)
Out[68]:
A B C D
0 -0.303120 0.419378 0.111816 -0.160104
1 0.924466 2.385372 -1.575671 -1.789283
2 1.106048 -0.087699 1.714677 -0.058831
3 0.765727 0.101523 -1.204472 0.011180
4 -2.084695 -0.517287 -0.497699 0.153516
5 -0.252460 0.044590 -0.481047 0.617817
6 -0.121789 -1.663595 -0.464836 -0.858457
7 -0.246298 0.687569 -2.081519 -1.529134
8 0.765727 0.101523 -1.204472 0.011180
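Since DataFrame.append was removed in pandas 2.0, the same row addition can be written with pd.concat; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
row = df.iloc[1]
# a one-row DataFrame concatenated on replaces df.append(row, ignore_index=True)
out = pd.concat([df, row.to_frame().T], ignore_index=True)
```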

Grouping

Similar to GROUP BY in SQL, grouping involves the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group
  • Combining the results into a data structure
In [69]: df = pd.DataFrame({
...: 'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
In [70]: df
Out[70]:
A B C D
0 foo one -1.203566 -1.199038
1 bar one 2.368016 -1.059656
2 foo two -0.704544 -1.047794
3 bar three -0.511092 1.121859
4 foo two -0.723414 -1.225536
5 bar two -0.363631 0.111177
6 foo one 0.271759 0.500798
7 foo three -0.248960 2.035166
# group and sum each group
In [71]: df.groupby('A').sum()
Out[71]:
C D
A
bar 1.493293 0.173380
foo -2.608725 -0.936404
# group by two columns and sum
In [72]: df.groupby(['A', 'B']).sum()
Out[72]:
C D
A B
bar one 2.368016 -1.059656
three -0.511092 1.121859
two -0.363631 0.111177
foo one -0.931807 -0.698240
three -0.248960 2.035166
two -1.427958 -2.273330
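groupby can also apply several aggregations at once via agg; a sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})
out = df.groupby('A')['C'].agg(['sum', 'mean'])  # several stats per group
```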

Reshape

Stack

Prepare the data:

In [73]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
...: 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two',
...: 'one', 'two', 'one', 'two']]))
In [75]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [76]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A','B'])
In [77]: df2 = df[:4]
In [78]: df2
Out[78]:
A B
first second
bar one 0.181096 -0.610853
two -1.326709 0.269153
baz one -0.455180 -0.070055
two -0.953993 -0.235160

stack() "compresses" all of the DataFrame's columns into the index:

In [79]: stacked = df2.stack()
In [80]: stacked
Out[80]:
first second
bar one A 0.181096
B -0.610853
two A -1.326709
B 0.269153
baz one A -0.455180
B -0.070055
two A -0.953993
B -0.235160
dtype: float64

Conversely, any MultiIndex can be restored with unstack(), which by default unstacks the last index level:

In [81]: stacked.unstack()
Out[81]:
A B
first second
bar one 0.181096 -0.610853
two -1.326709 0.269153
baz one -0.455180 -0.070055
two -0.953993 -0.235160
In [82]: stacked.unstack(1)
Out[82]:
second one two
first
bar A 0.181096 -1.326709
B -0.610853 0.269153
baz A -0.455180 -0.953993
B -0.070055 -0.235160
In [83]: stacked.unstack(0)
Out[83]:
first bar baz
second
one A 0.181096 -0.455180
B -0.610853 -0.070055
two A -1.326709 -0.953993
B 0.269153 -0.235160

Pivot tables

Prepare the data:

In [84]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
...: 'B' : ['A', 'B', 'C'] * 4,
...: 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
...: 'D' : np.random.randn(12),
...: 'E' : np.random.randn(12)})
In [85]: df
Out[85]:
A B C D E
0 one A foo -0.379428 -0.050681
1 one B foo 1.650807 0.024530
2 two C foo 0.349554 1.202692
3 three A bar -1.516381 -0.163382
4 one B bar 0.360722 -0.241622
5 one C bar -0.276398 0.581192
6 two A foo 0.304563 -0.663271
7 three B foo 1.328499 0.485223
8 one C foo 1.665213 -0.577843
9 one A bar -0.229248 -0.335329
10 two B bar -0.133112 1.119350
11 three C bar -0.383992 0.787800

A pivot table takes the original data (values) and arranges it into a new table whose rows (index) and columns come from the given keys:

In [86]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[86]:
C bar foo
A B
one A -0.229248 -0.379428
B 0.360722 1.650807
C -0.276398 1.665213
three A -1.516381 NaN
B NaN 1.328499
C -0.383992 NaN
two A NaN 0.304563
B -0.133112 NaN
C NaN 0.349554
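pivot_table averages by default; other aggregations can be chosen with aggfunc. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two'],
                   'C': ['foo', 'foo', 'bar'],
                   'D': [1.0, 3.0, 5.0]})
# sum instead of the default mean; missing combinations become NaN
out = pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum')
```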

Time series

pandas' time-series functionality is particularly useful for financial applications.

The resample feature:

In [88]: rng = pd.date_range('9/6/2017', periods=100, freq='S')
In [89]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [90]: ts.resample('5Min').sum()
Out[90]:
2017-09-06 26085
Freq: 5T, dtype: int32
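Resample buckets can use any frequency and aggregation; a smaller sketch that is easy to verify by hand:

```python
import pandas as pd

rng = pd.date_range('2017-09-06', periods=4, freq='15min')
ts = pd.Series([1, 2, 3, 4], index=rng)
out = ts.resample('30min').sum()   # two 30-minute buckets: 1+2 and 3+4
```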

Time zone representation

In [94]: rng = pd.date_range('9/6/2017 00:00', periods=5, freq='D')
In [95]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [96]: ts
Out[96]:
2017-09-06 -0.715837
2017-09-07 1.021448
2017-09-08 1.186508
2017-09-09 0.606852
2017-09-10 0.566530
Freq: D, dtype: float64
In [97]: ts_utc = ts.tz_localize('UTC')
In [98]: ts_utc
Out[98]:
2017-09-06 00:00:00+00:00 -0.715837
2017-09-07 00:00:00+00:00 1.021448
2017-09-08 00:00:00+00:00 1.186508
2017-09-09 00:00:00+00:00 0.606852
2017-09-10 00:00:00+00:00 0.566530
Freq: D, dtype: float64

Time zone conversion

In [100]: ts_utc.tz_convert('US/Eastern')
Out[100]:
2017-09-05 20:00:00-04:00 -0.715837
2017-09-06 20:00:00-04:00 1.021448
2017-09-07 20:00:00-04:00 1.186508
2017-09-08 20:00:00-04:00 0.606852
2017-09-09 20:00:00-04:00 0.566530
Freq: D, dtype: float64

Converting between time span representations:

In [101]: rng = pd.date_range('9/6/2017', periods=5, freq='M')
In [102]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [103]: ts
Out[103]:
2017-09-30 -0.337867
2017-10-31 0.546883
2017-11-30 0.063004
2017-12-31 0.419636
2018-01-31 2.562404
Freq: M, dtype: float64
In [104]: ps = ts.to_period()
In [105]: ps
Out[105]:
2017-09 -0.337867
2017-10 0.546883
2017-11 0.063004
2017-12 0.419636
2018-01 2.562404
Freq: M, dtype: float64
In [106]: ps.to_timestamp()
Out[106]:
2017-09-01 -0.337867
2017-10-01 0.546883
2017-11-01 0.063004
2017-12-01 0.419636
2018-01-01 2.562404
Freq: MS, dtype: float64
In [107]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [108]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [109]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [110]: ts.head()
Out[110]:
1990-03-01 09:00 -0.315663
1990-06-01 09:00 2.092510
1990-09-01 09:00 0.337302
1990-12-01 09:00 0.037893
1991-03-01 09:00 0.463572
Freq: H, dtype: float64

Categoricals

In [111]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
In [112]: df
Out[112]:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
In [113]: df['grade'] = df['raw_grade'].astype('category')
In [114]: df['grade']
Out[114]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Categories can be renamed in place (this just changes the label strings; categories are stored in an Index object):
In [115]: df['grade'].cat.categories = ["very good", "good", "very bad"]
Reorder the categories with set_categories; categories not present in the data are added as well:

In [117]: df['grade'] = df['grade'].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [118]: df['grade']
Out[118]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
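In newer pandas versions, assigning to .cat.categories directly is no longer allowed; rename_categories is the supported way. A sketch with illustrative data:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a']).astype('category')   # categories: ['a', 'b']
# rename_categories returns a new Series with relabeled categories
s = s.cat.rename_categories(['good', 'bad'])
```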

Sorting is by category order, not lexical order:

In [119]: df.sort_values(by='grade')
Out[119]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

When grouping, empty categories are also shown:

In [120]: df.groupby('grade').size()
Out[120]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64

Plotting

In [121]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [122]: ts = ts.cumsum()
In [123]: ts.plot()
Out[123]: <matplotlib.axes._subplots.AxesSubplot at 0x6a0be10>

For a DataFrame, you can call plot directly:

In [124]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
In [125]: df = df.cumsum()
In [126]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[126]: <matplotlib.legend.Legend at 0x6c3e8f0>

Getting data in and out

CSV

In [127]: df.to_csv('foo.csv')
In [128]: pd.read_csv('foo.csv')
Out[128]:
Unnamed: 0 A B C D
0 2000-01-01 1.239452 1.983851 0.578987 1.155658
1 2000-01-02 0.295568 3.704782 1.962402 0.885559
2 2000-01-03 -0.139800 3.872238 2.793025 0.972079
3 2000-01-04 -0.804336 3.259410 2.159552 1.785479
4 2000-01-05 0.171984 3.027446 2.253854 0.779544
5 2000-01-06 1.781807 1.836517 4.558960 1.999731
6 2000-01-07 0.650927 0.154262 3.650160 0.685396
7 2000-01-08 0.492898 -0.198763 3.646034 -0.073848
8 2000-01-09 -0.100468 -0.623716 3.217366 0.220008
9 2000-01-10 -0.905394 0.282200 3.153474 -0.189584
10 2000-01-11 -0.962155 -1.147225 3.353251 1.283288
11 2000-01-12 -1.853063 -1.539308 5.246351 0.222400
12 2000-01-13 -3.385665 -2.982384 3.952838 0.319281
13 2000-01-14 -4.411986 -2.914887 4.775386 0.597495
14 2000-01-15 -5.938706 -3.437068 6.882886 1.105879
15 2000-01-16 -6.537699 -1.947668 8.142251 0.948407
16 2000-01-17 -6.786304 -2.735740 8.484593 1.034685
17 2000-01-18 -7.145344 -1.666703 9.420653 -0.919941
18 2000-01-19 -8.283291 0.101307 10.654933 -0.936284
19 2000-01-20 -8.330414 0.820054 11.207165 -2.622354
20 2000-01-21 -7.942291 1.559753 9.201008 -2.923220
21 2000-01-22 -4.675543 2.337827 10.364670 -3.112916
22 2000-01-23 -5.774614 0.408297 9.441821 -2.040018
23 2000-01-24 -5.993547 -0.480329 10.311053 0.852373
24 2000-01-25 -4.137167 0.099883 10.224100 0.576367
25 2000-01-26 -2.866354 -0.772702 11.812000 1.671981
26 2000-01-27 -1.764316 -2.534161 11.634361 1.638750
27 2000-01-28 -3.220852 -3.353006 11.399995 -0.562762
28 2000-01-29 -0.889241 -2.794083 10.445614 0.473105
29 2000-01-30 0.428342 -3.857681 10.930199 -0.118981
.. ... ... ... ... ...
970 2002-08-28 -2.853884 -2.041226 28.387424 17.129529
971 2002-08-29 -5.058352 0.041153 28.095666 16.654813
972 2002-08-30 -4.478272 -1.613275 26.989764 17.653338
973 2002-08-31 -2.815382 -2.764649 26.197949 16.442647
974 2002-09-01 -2.621575 -2.785604 28.007243 14.962121
975 2002-09-02 -2.476145 -3.128888 27.541079 12.853070
976 2002-09-03 -0.483923 -3.061629 27.130099 13.401077
977 2002-09-04 -0.885055 -2.059356 26.139260 12.725815
978 2002-09-05 -3.438688 -3.060238 26.267361 13.601928
979 2002-09-06 -3.646110 -2.908451 27.639157 13.749199
980 2002-09-07 -3.742097 -5.843492 27.138204 14.621900
981 2002-09-08 -4.367495 -5.523435 26.429433 14.372776
982 2002-09-09 -6.481100 -7.691100 26.325775 14.060133
983 2002-09-10 -4.406432 -5.925796 25.808105 13.717117
984 2002-09-11 -5.747746 -6.226884 26.334882 13.539911
985 2002-09-12 -5.075507 -5.976265 25.582403 13.955776
986 2002-09-13 -5.957833 -4.514403 26.154568 15.948706
987 2002-09-14 -4.822016 -3.510714 26.770429 14.903106
988 2002-09-15 -5.413908 -3.678965 29.255640 15.903795
989 2002-09-16 -7.305733 -4.970404 30.726133 17.075891
990 2002-09-17 -6.921962 -3.342561 32.499852 15.309124
991 2002-09-18 -7.650811 -3.539989 31.823857 15.100428
992 2002-09-19 -7.680999 -2.289270 31.098525 12.891011
993 2002-09-20 -7.849671 -1.712562 31.364746 13.909086
994 2002-09-21 -8.739466 -2.122690 32.004201 14.095981
995 2002-09-22 -8.740361 -0.656488 29.674406 12.590248
996 2002-09-23 -6.309331 -0.986256 30.370786 13.450941
997 2002-09-24 -6.205371 -0.419119 30.260205 12.028273
998 2002-09-25 -7.573782 -2.492400 31.063298 12.361099
999 2002-09-26 -7.317950 -2.840246 31.560766 12.502660

HDF5

In [129]: df.to_hdf('foo.h5','df')
In [130]: pd.read_hdf('foo.h5','df')

Excel

In [131]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
In [132]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

As shown below, a pandas object cannot be used directly as a boolean value:

In [131]: if pd.Series([False, True, False]):
...: print('I was true')
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-131-ec15f6b92216> in <module>()
----> 1 if pd.Series([False, True, False]):
2 print('I was true')
3
c:\python34\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
953 raise ValueError("The truth value of a {0} is ambiguous. "
954 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 955 .format(self.__class__.__name__))
956
957 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
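The fix is to reduce the Series to a single boolean explicitly with any(), all(), or empty; a sketch:

```python
import pandas as pd

s = pd.Series([False, True, False])
# reduce to one boolean explicitly instead of writing `if s:`
if s.any():          # True if at least one element is True
    result = 'at least one true'
```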

References