Learning pandas
# What is pandas
pandas: a Python module for data analysis.
pandas was created for data analysis tasks; it incorporates a large number of libraries and standard data models, and provides the tools needed to operate efficiently on large datasets.
Data structures in pandas
:
- Series: a one-dimensional array, similar to Python's built-in list, except that a Series only stores values of a single data type. This uses memory more efficiently and speeds up computation. Think of it as a column in a database table.
- DataFrame: a two-dimensional, tabular data structure, similar in many ways to R's data.frame. A DataFrame can be thought of as a container of Series.
- Panel: a three-dimensional array that can be thought of as a container of DataFrames (note: Panel was removed in pandas 0.25; a DataFrame with a MultiIndex is the replacement).
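To illustrate the single-dtype point above, here is a minimal sketch (the values are arbitrary): homogeneous input gets a compact numeric dtype, while mixed input falls back to the generic object dtype.

```python
import pandas as pd

# A Series stores a single dtype; homogeneous input gets a compact
# numeric dtype, while mixed input falls back to generic `object`.
homogeneous = pd.Series([1, 2, 3])
mixed = pd.Series([1, "a", 3.0])
print(homogeneous.dtype)  # int64
print(mixed.dtype)        # object
```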
# Ten Minutes to pandas (translation with notes)
Note
: this article is a translation of 10 Minutes to pandas from the official pandas site.
Import the required packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Note
- numpy is a scientific computing package implemented in Python
- matplotlib is a 2D plotting library for Python
- for more examples, see the Cookbook
# Object Creation
See the Intro to Data Structures section for details.
1. Create a Series by passing a list of values; pandas will create a default integer index:
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0 1
1 3
2 5
3 NaN
4 6
5 8
dtype: float64
2. Create a DataFrame by passing a NumPy array, a datetime index, and column labels:
>>> dates = pd.date_range('20130101',periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0.859619 -0.545903 0.012447 1.257684
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2013-01-03 -0.719234 -0.396174 0.635237 0.216691
2013-01-04 -0.921692 0.876693 -0.670553 1.468060
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740
2013-01-06 -1.903683 0.786785 -0.194179 0.177973
Note
np.random.randn(6,4) creates a 6x4 array of random numbers drawn from the standard normal distribution.
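Because np.random.randn returns fresh random numbers on every call, outputs like the tables in this article are not reproducible as-is. A small sketch of seeding (the seed value 0 is arbitrary):

```python
import numpy as np

np.random.seed(0)          # fix the seed; the value itself is arbitrary
a = np.random.randn(6, 4)
np.random.seed(0)          # the same seed again
b = np.random.randn(6, 4)
print((a == b).all())      # identical draws from identical seeds
```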
3. Create a DataFrame by passing a dict of objects that can be converted to Series-like structures:
>>> df2 = pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
>>> df2
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo
4. View the dtypes of each column:
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
5. If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled, as shown below:
>>> df2.<TAB>
df2.A df2.boxplot
df2.abs df2.C
df2.add df2.clip
df2.add_prefix df2.clip_lower
df2.add_suffix df2.clip_upper
df2.align df2.columns
df2.all df2.combine
df2.any df2.combineAdd
df2.append df2.combine_first
df2.apply df2.combineMult
df2.applymap df2.compound
df2.as_blocks df2.consolidate
df2.asfreq df2.convert_objects
df2.as_matrix df2.copy
df2.astype df2.corr
df2.at df2.corrwith
df2.at_time df2.count
df2.axes df2.cov
df2.B df2.cummax
df2.between_time df2.cummin
df2.bfill df2.cumprod
df2.blocks df2.cumsum
df2.bool df2.D
As you can see, the columns A, B, C, and D are all available via tab completion.
# Viewing Data
1. View the top and bottom rows of the DataFrame:
>>> df.head()
A B C D
2013-01-01 0.859619 -0.545903 0.012447 1.257684
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2013-01-03 -0.719234 -0.396174 0.635237 0.216691
2013-01-04 -0.921692 0.876693 -0.670553 1.468060
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740
>>> df.tail(3)
A B C D
2013-01-04 -0.921692 0.876693 -0.670553 1.468060
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740
2013-01-06 -1.903683 0.786785 -0.194179 0.177973
2. Display the index, the columns, and the underlying NumPy data:
>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values
array([[ 0.85961861, -0.54590304, 0.01244705, 1.25768432],
[ 0.11962178, -0.4840508 , 0.40472795, 0.36088029],
[-0.7192337 , -0.39617432, 0.63523701, 0.21669124],
[-0.92169244, 0.87669275, -0.67055318, 1.46806034],
[-0.30031679, -0.01132035, -1.37644224, 1.69474031],
[-1.90368258, 0.78678454, -0.19417942, 0.17797326]])
3. Show a quick statistical summary of the data:
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.477614 0.037671 -0.198127 0.862672
std 0.945047 0.643196 0.736736 0.685969
min -1.903683 -0.545903 -1.376442 0.177973
25% -0.871078 -0.462082 -0.551460 0.252739
50% -0.509775 -0.203747 -0.090866 0.809282
75% 0.014637 0.587258 0.306658 1.415466
max 0.859619 0.876693 0.635237 1.694740
4. Transpose the data:
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.859619 0.119622 -0.719234 -0.921692 -0.300317 -1.903683
B -0.545903 -0.484051 -0.396174 0.876693 -0.011320 0.786785
C 0.012447 0.404728 0.635237 -0.670553 -1.376442 -0.194179
D 1.257684 0.360880 0.216691 1.468060 1.694740 0.177973
5. Sort along an axis:
>>> df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 1.257684 0.012447 -0.545903 0.859619
2013-01-02 0.360880 0.404728 -0.484051 0.119622
2013-01-03 0.216691 0.635237 -0.396174 -0.719234
2013-01-04 1.468060 -0.670553 0.876693 -0.921692
2013-01-05 1.694740 -1.376442 -0.011320 -0.300317
2013-01-06 0.177973 -0.194179 0.786785 -1.903683
6. Sort by values:
>>> df.sort_values(by='B')
A B C D
2013-01-01 0.859619 -0.545903 0.012447 1.257684
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2013-01-03 -0.719234 -0.396174 0.635237 0.216691
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740
2013-01-06 -1.903683 0.786785 -0.194179 0.177973
2013-01-04 -0.921692 0.876693 -0.670553 1.468060
# Selection
Note
: while standard Python/NumPy expressions for selecting and setting are intuitive and handy, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, .iloc, and .ix (the .ix accessor has since been deprecated and removed in modern pandas)
See Indexing and Selecting Data and MultiIndex / Advanced Indexing for details.
- Getting
1. Select a single column, which yields a Series; equivalent to df.A:
>>> df['A']
2013-01-01 0.859619
2013-01-02 0.119622
2013-01-03 -0.719234
2013-01-04 -0.921692
2013-01-05 -0.300317
2013-01-06 -1.903683
Freq: D, Name: A, dtype: float64
2. Select via [], which slices the rows:
>>> df[0:3]
A B C D
2013-01-01 0.859619 -0.545903 0.012447 1.257684
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2013-01-03 -0.719234 -0.396174 0.635237 0.216691
- Selection by label
1. Get a cross section using a label:
>>> df.loc[dates[0]]
A 0.859619
B -0.545903
C 0.012447
D 1.257684
Name: 2013-01-01 00:00:00, dtype: float64
Note
: this retrieves the data for 2013-01-01.
2. Select on multiple axes by label:
>>> df.loc[:,['A','B']]
A B
2013-01-01 0.859619 -0.545903
2013-01-02 0.119622 -0.484051
2013-01-03 -0.719234 -0.396174
2013-01-04 -0.921692 0.876693
2013-01-05 -0.300317 -0.011320
2013-01-06 -1.903683 0.786785
3. Slice by label (note that, unlike positional slicing, both endpoints are included):
>>> df.loc['20130102':'20130104',['A','B']]
A B
2013-01-02 0.119622 -0.484051
2013-01-03 -0.719234 -0.396174
2013-01-04 -0.921692 0.876693
4. Reduce the dimensions of the returned object:
>>> df.loc['20130102',['A','B']]
A 0.119622
B -0.484051
Name: 2013-01-02 00:00:00, dtype: float64
5. Get a scalar value:
>>> df.loc[dates[0],'A']
0.85961861159875042
6. Fast access to a scalar (same result as method 5):
>>> df.at[dates[0],'A']
0.85961861159875042
Note
: loc selects rows by label, while iloc selects rows by integer position.
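The distinction matters most when the labels are integers that do not coincide with positions; a minimal sketch (index values chosen arbitrarily):

```python
import pandas as pd

# Integer labels deliberately differ from positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])
print(s.loc[0])   # 20 -- selected by label
print(s.iloc[0])  # 10 -- selected by position
```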
- Selection by position
See Selection by Position for details.
1. Select via the position of the passed integer:
>>> df.iloc[3]
A -0.921692
B 0.876693
C -0.670553
D 1.468060
Name: 2013-01-04 00:00:00, dtype: float64
2. Slice by integer position:
>>> df.iloc[3:5,0:2]
A B
2013-01-04 -0.921692 0.876693
2013-01-05 -0.300317 -0.011320
Note
: like Python slices, this range is half-open: the start is included and the end is excluded.
3. Select by lists of integer positions:
>>> df.iloc[[1,2,4],[0,2]]
A C
2013-01-02 0.119622 0.404728
2013-01-03 -0.719234 0.635237
2013-01-05 -0.300317 -1.376442
4. Slice rows explicitly:
>>> df.iloc[1:3,:]
A B C D
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2013-01-03 -0.719234 -0.396174 0.635237 0.216691
5. Slice columns explicitly:
>>> df.iloc[:,1:3]
B C
2013-01-01 -0.545903 0.012447
2013-01-02 -0.484051 0.404728
2013-01-03 -0.396174 0.635237
2013-01-04 0.876693 -0.670553
2013-01-05 -0.011320 -1.376442
2013-01-06 0.786785 -0.194179
6. Get a specific value:
>>> df.iloc[1,1]
-0.48405080229207309
7. Fast access to a scalar (same result as method 6):
>>> df.iat[1,1]
-0.48405080229207309
- Boolean indexing
1. Use a single column's values to select data:
>>> df[df.A > 0]
A B C D
2013-01-01 0.859619 -0.545903 0.012447 1.257684
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2. Select values from the DataFrame where a boolean condition is met (a where operation):
>>> df[df > 0]
A B C D
2013-01-01 0.859619 NaN 0.012447 1.257684
2013-01-02 0.119622 NaN 0.404728 0.360880
2013-01-03 NaN NaN 0.635237 0.216691
2013-01-04 NaN 0.876693 NaN 1.468060
2013-01-05 NaN NaN NaN 1.694740
2013-01-06 NaN 0.786785 NaN 0.177973
3. Use isin() for filtering:
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one','two','three','four','three']
>>> df2
A B C D E
2013-01-01 0.859619 -0.545903 0.012447 1.257684 one
2013-01-02 0.119622 -0.484051 0.404728 0.360880 one
2013-01-03 -0.719234 -0.396174 0.635237 0.216691 two
2013-01-04 -0.921692 0.876693 -0.670553 1.468060 three
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740 four
2013-01-06 -1.903683 0.786785 -0.194179 0.177973 three
>>> df2[df2['E'].isin(['two','four'])]
A B C D E
2013-01-03 -0.719234 -0.396174 0.635237 0.216691 two
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740 four
- Setting
1. Add a new column:
>>> s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1
2. Set a value by label:
>>> df.at[dates[0],'A'] = 0
3. Set a value by position:
>>> df.iat[0,1] = 0
4. Set a column with a NumPy array:
>>> df.loc[:,'D'] = np.array([5] * len(df))
The result of the operations above:
>>> df
A B C D F
2013-01-01 0.000000 0.000000 0.012447 5 NaN
2013-01-02 0.119622 -0.484051 0.404728 5 1
2013-01-03 -0.719234 -0.396174 0.635237 5 2
2013-01-04 -0.921692 0.876693 -0.670553 5 3
2013-01-05 -0.300317 -0.011320 -1.376442 5 4
2013-01-06 -1.903683 0.786785 -0.194179 5 5
5. Update values with a where operation:
>>> df2 = df.copy()
>>> df2[df2 > 0] = -df2
>>> df2
A B C D F
2013-01-01 0.000000 0.000000 -0.012447 -5 NaN
2013-01-02 -0.119622 -0.484051 -0.404728 -5 -1
2013-01-03 -0.719234 -0.396174 -0.635237 -5 -2
2013-01-04 -0.921692 -0.876693 -0.670553 -5 -3
2013-01-05 -0.300317 -0.011320 -1.376442 -5 -4
2013-01-06 -1.903683 -0.786785 -0.194179 -5 -5
# Missing Data
pandas uses np.nan to represent missing data. See the Missing Data section for details.
1. reindex() lets you change/add/delete the index on a specified axis; it returns a copy of the data:
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1.loc[dates[0]:dates[1],'E'] = 1
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 0.012447 5 NaN 1
2013-01-02 0.119622 -0.484051 0.404728 5 1 1
2013-01-03 -0.719234 -0.396174 0.635237 5 2 NaN
2013-01-04 -0.921692 0.876693 -0.670553 5 3 NaN
2. Drop any rows that have missing data:
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 0.119622 -0.484051 0.404728 5 1 1
3. Fill in missing data:
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 0.012447 5 5 1
2013-01-02 0.119622 -0.484051 0.404728 5 1 1
2013-01-03 -0.719234 -0.396174 0.635237 5 2 5
2013-01-04 -0.921692 0.876693 -0.670553 5 3 5
4. Get a boolean mask of where values are missing:
>>> pd.isnull(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
# Operations
See the Basic section on Binary Ops for details.
- Statistics (operations in general exclude missing data)
1. Compute the mean for each column:
>>> df.mean()
A -0.620884
B 0.128655
C -0.198127
D 5.000000
F 3.000000
dtype: float64
2. Compute the mean for each row:
>>> df.mean(1)
2013-01-01 1.253112
2013-01-02 1.208060
2013-01-03 1.303966
2013-01-04 1.456889
2013-01-05 1.462384
2013-01-06 1.737785
Freq: D, dtype: float64
3. Operating with objects of different dimensionality requires alignment first; pandas broadcasts along the specified dimension:
>>> s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
>>> s
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1
2013-01-04 3
2013-01-05 5
2013-01-06 NaN
Freq: D, dtype: float64
>>> df.sub(s, axis='index')
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.719234 -1.396174 -0.364763 4 1
2013-01-04 -3.921692 -2.123307 -3.670553 2 0
2013-01-05 -5.300317 -5.011320 -6.376442 0 -1
2013-01-06 NaN NaN NaN NaN NaN
Notes
:
- alignment here means aligning on the time index
- shift(2) shifts the values down two positions along the time axis
- sub performs subtraction; any operation involving NaN also yields NaN
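If the NaN propagation in df.sub(s, axis='index') is unwanted, one option is to fill the missing values before the operation. A minimal sketch using a constant DataFrame (values chosen to make the result easy to check, not taken from the article's data):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.ones((6, 4)), index=dates, columns=list('ABCD'))
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Filling the shifted NaNs with 0 first keeps those rows from
# becoming all-NaN after the subtraction
result = df.sub(s.fillna(0), axis='index')
print(result.loc['2013-01-01', 'A'])  # 1.0 instead of NaN
```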
- Apply
1. Apply a function to the data:
>>> df.apply(np.cumsum)
A B C D F
2013-01-01 0.000000 0.000000 0.012447 5 NaN
2013-01-02 0.119622 -0.484051 0.417175 10 1
2013-01-03 -0.599612 -0.880225 1.052412 15 3
2013-01-04 -1.521304 -0.003532 0.381859 20 6
2013-01-05 -1.821621 -0.014853 -0.994583 25 10
2013-01-06 -3.725304 0.771932 -1.188763 30 15
>>> df.apply(lambda x: x.max() - x.min())
A 2.023304
B 1.360744
C 2.011679
D 0.000000
F 4.000000
dtype: float64
Note
: cumsum computes the cumulative sum
See Histogramming and Discretization for details.
- Histogramming:
>>> s = pd.Series(np.random.randint(0, 7, size=10))
>>> s
0 1
1 3
2 5
3 1
4 6
5 1
6 3
7 4
8 0
9 3
dtype: int64
>>> s.value_counts()
3 3
1 3
6 1
5 1
4 1
0 1
dtype: int64
pandas comes with a set of vectorized string processing methods that make it easy to operate on each element, as shown below (see Vectorized String Methods for details).
- String methods:
>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
>>> s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
# Merge
- Concat: pandas provides various facilities for easily combining Series, DataFrame, and Panel objects. See the Merging section for details.
Concatenate pandas objects together with concat():
>>> df = pd.DataFrame(np.random.randn(10, 4))
>>> df
0 1 2 3
0 -0.199614 1.914485 0.396383 -0.295306
1 -0.061961 -1.352883 0.266751 -0.874132
2 0.346504 -2.328099 -1.492250 0.095392
3 0.187115 0.562740 -1.677737 -0.224807
4 -1.422599 -1.028044 0.789487 0.806940
5 0.439478 -0.592229 0.736081 1.008404
6 -0.205641 -0.649465 -0.706395 0.578698
7 -2.168725 -2.487189 0.060258 1.965318
8 0.207634 0.512572 0.595373 0.816516
9 0.764893 0.612208 -1.022504 -2.032126
>>> pieces = [df[:3], df[3:7], df[7:]]
>>> pd.concat(pieces)
0 1 2 3
0 -0.199614 1.914485 0.396383 -0.295306
1 -0.061961 -1.352883 0.266751 -0.874132
2 0.346504 -2.328099 -1.492250 0.095392
3 0.187115 0.562740 -1.677737 -0.224807
4 -1.422599 -1.028044 0.789487 0.806940
5 0.439478 -0.592229 0.736081 1.008404
6 -0.205641 -0.649465 -0.706395 0.578698
7 -2.168725 -2.487189 0.060258 1.965318
8 0.207634 0.512572 0.595373 0.816516
9 0.764893 0.612208 -1.022504 -2.032126
- Join: SQL-style merges. See Database style joining for details.
Example:
>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
>>> left
key lval
0 foo 1
1 foo 2
>>> right
key rval
0 foo 4
1 foo 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example:
>>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
>>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
>>> left
key lval
0 foo 1
1 bar 2
>>> right
key rval
0 foo 4
1 bar 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 bar 2 5
- Append rows to a DataFrame (see Appending for details; note that DataFrame.append has since been removed in modern pandas in favor of pd.concat):
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
>>> df
A B C D
0 -1.710447 2.541720 -0.654403 0.132077
1 0.667796 -1.124769 -0.430752 -0.244731
2 1.555865 -0.483805 0.066114 -0.409518
3 1.171798 0.036219 -0.515065 0.860625
4 -0.834051 -2.178128 -0.345627 0.819392
5 -0.354886 0.161204 1.465532 1.879841
6 0.560888 1.208905 1.301983 0.799084
7 -0.770196 0.307691 1.212200 0.909137
>>> s = df.iloc[3]
>>> df.append(s, ignore_index=True)
A B C D
0 -1.710447 2.541720 -0.654403 0.132077
1 0.667796 -1.124769 -0.430752 -0.244731
2 1.555865 -0.483805 0.066114 -0.409518
3 1.171798 0.036219 -0.515065 0.860625
4 -0.834051 -2.178128 -0.345627 0.819392
5 -0.354886 0.161204 1.465532 1.879841
6 0.560888 1.208905 1.301983 0.799084
7 -0.770196 0.307691 1.212200 0.909137
8 1.171798 0.036219 -0.515065 0.860625
# Grouping
group by
: refers to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section for details.
Example:
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
>>> df
A B C D
0 foo one -0.655020 -0.671592
1 bar one 0.846428 1.884603
2 foo two -2.280466 0.725070
3 bar three 1.166448 -0.208171
4 foo two -0.257124 -0.850319
5 bar two -0.654609 1.258091
6 foo one -1.624213 -0.383978
7 foo three -0.523944 0.114338
Group, then apply sum() to the resulting groups:
>>> df.groupby('A').sum()
C D
A
bar 1.358267 2.934523
foo -5.340766 -1.066481
Group by multiple columns, then apply sum:
>>> df.groupby(['A','B']).sum()
C D
A B
bar one 0.846428 1.884603
three 1.166448 -0.208171
two -0.654609 1.258091
foo one -2.279233 -1.055570
three -0.523944 0.114338
two -2.537589 -0.125249
# Reshaping
See Hierarchical Indexing and Reshaping for details.
stack
:
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
>>> tuples
[('bar', 'one'), ('bar', 'two'),
('baz', 'one'), ('baz', 'two'),
('foo', 'one'), ('foo', 'two'),
('qux', 'one'), ('qux', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
>>> df
A B
first second
bar one -0.922059 -0.918091
two -0.825565 -0.880527
baz one 0.241927 1.130320
two -0.261823 2.463877
foo one -0.220328 -0.519477
two -1.028038 -0.543191
qux one 0.315674 0.558686
two 0.422296 0.241212
>>> df2 = df[:4]
>>> df2
A B
first second
bar one -0.922059 -0.918091
two -0.825565 -0.880527
baz one 0.241927 1.130320
two -0.261823 2.463877
Note
: pd.MultiIndex.from_tuples converts a list of tuples into a hierarchical index.
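When the index is a full cross product of the level values, as it is here, pd.MultiIndex.from_product builds the same index more compactly (a sketch):

```python
import pandas as pd

# Equivalent to the zip/from_tuples construction above, since every
# combination of first x second appears exactly once
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])
print(index[0])    # ('bar', 'one')
print(len(index))  # 8
```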
The stack() method compresses a level of the DataFrame's columns into the index:
>>> stacked = df2.stack()
>>> stacked
first second
bar one A -0.922059
B -0.918091
two A -0.825565
B -0.880527
baz one A 0.241927
B 1.130320
two A -0.261823
B 2.463877
dtype: float64
unstack() is the inverse of stack(); by default it unstacks the last level, but you can specify any level:
>>> stacked.unstack()
A B
first second
bar one -0.922059 -0.918091
two -0.825565 -0.880527
baz one 0.241927 1.130320
two -0.261823 2.463877
>>> stacked.unstack(1)
second one two
first
bar A -0.922059 -0.825565
B -0.918091 -0.880527
baz A 0.241927 -0.261823
B 1.130320 2.463877
>>> stacked.unstack(0)
first bar baz
second
one A -0.922059 0.241927
B -0.918091 1.130320
two A -0.825565 -0.261823
B -0.880527 2.463877
Pivot tables
See Pivot Tables for details.
Example:
>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})
Note
: think of a pivot table as freely recombining the table's rows and columns, similar to a cross-tab report.
We can produce pivot tables from this data very easily:
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A -1.250611 -1.047274
B 1.532134 -0.455948
C 0.125989 -0.500260
three A 0.623716 NaN
B NaN 0.095117
C -0.348707 NaN
two A NaN 0.390363
B -0.743466 NaN
C NaN 0.792279
# Time Series
pandas has simple, efficient functionality for resampling during frequency conversion (e.g., converting per-second data into 5-minute data). This is common in financial applications, but not limited to them. See the Time Series section for details.
Example:
>>> rng = pd.date_range('1/1/2012', periods=100, freq='S')
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts.resample('5Min').sum()
2012-01-01 24390
Freq: 5T, dtype: int64
Note
: this aggregates randomly generated per-second data into 5-minute sums.
Time zone representation:
>>> rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts
2012-03-06 0.972202
2012-03-07 -0.839969
2012-03-08 -0.979993
2012-03-09 -0.052460
2012-03-10 -0.487963
Freq: D, dtype: float64
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc
2012-03-06 00:00:00+00:00 0.972202
2012-03-07 00:00:00+00:00 -0.839969
2012-03-08 00:00:00+00:00 -0.979993
2012-03-09 00:00:00+00:00 -0.052460
2012-03-10 00:00:00+00:00 -0.487963
Freq: D, dtype: float64
Converting to another time zone:
>>> ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00 0.972202
2012-03-06 19:00:00-05:00 -0.839969
2012-03-07 19:00:00-05:00 -0.979993
2012-03-08 19:00:00-05:00 -0.052460
2012-03-09 19:00:00-05:00 -0.487963
Freq: D, dtype: float64
Converting between time span representations:
>>> rng = pd.date_range('1/1/2012', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2012-01-31 -0.681068
2012-02-29 -0.263571
2012-03-31 1.268001
2012-04-30 0.331786
2012-05-31 0.663572
Freq: M, dtype: float64
>>> ps = ts.to_period()
>>> ps
2012-01 -0.681068
2012-02 -0.263571
2012-03 1.268001
2012-04 0.331786
2012-05 0.663572
Freq: M, dtype: float64
>>> ps.to_timestamp()
2012-01-01 -0.681068
2012-02-01 -0.263571
2012-03-01 1.268001
2012-04-01 0.331786
2012-05-01 0.663572
Freq: MS, dtype: float64
Note
: to_period() here defaults to monthly frequency; to_period and to_timestamp convert back and forth.
Converting between period and timestamp: the following example converts a quarterly frequency (with the year ending in November) to 9am at the start of the month following each quarter's end:
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='int64', freq='Q-NOV')
>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
>>> ts.head()
1990-03-01 09:00 -0.927090
1990-06-01 09:00 -1.045881
1990-09-01 09:00 -0.837705
1990-12-01 09:00 -0.529390
1991-03-01 09:00 -0.423405
Freq: H, dtype: float64
# Categoricals
Since version 0.15, pandas DataFrames can contain categorical data. See the Categorical Introduction and the API documentation for details:
>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
1. Convert the raw grades to a categorical data type:
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
2. Rename the categories to more meaningful names (note: in newer pandas versions, assigning to .cat.categories is no longer supported; Series.cat.rename_categories is the replacement):
>>> df["grade"].cat.categories = ["very good", "good", "very bad"]
3. Reorder the categories and simultaneously add the missing categories:
>>> df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
>>> df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
4. Sorting is per the category order, not lexical order:
>>> df.sort_values(by="grade")
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
5. Grouping by a categorical column also shows empty categories:
>>> df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
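In newer pandas versions, whether empty categories appear in a groupby result is controlled by the observed parameter (its default has changed across releases, so passing it explicitly is safest); a sketch with a small hypothetical dataset:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "raw_grade": ["a", "b", "a"]})
df["grade"] = df["raw_grade"].astype("category").cat.set_categories(["a", "b", "c"])

# observed=False keeps the empty category "c" (with size 0);
# observed=True would drop it from the result
sizes = df.groupby("grade", observed=False).size()
print(sizes)
```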
# Plotting
See Plotting for details:
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> ts = ts.cumsum()
>>> ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7ff2ab2af550>
On a DataFrame, plot() is a convenience to plot all of the columns with labels:
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
>>> df = df.cumsum()
>>> plt.figure(); df.plot(); plt.legend(loc='best')
<matplotlib.legend.Legend at 0x7ff29c8163d0>
# Getting Data In/Out
- CSV
1. Writing to a csv file:
>>> df.to_csv('foo.csv')
>>> pd.read_csv('foo.csv')
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409
6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
.. ... ... ... ... ...
993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
- HDF5 (HDFStores)
1. Writing to an HDF5 Store:
>>> df.to_hdf('foo.h5','df')
2. Reading from an HDF5 Store:
>>> pd.read_hdf('foo.h5','df')
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
- Excel (MS Excel)
1. Writing to an excel file:
>>> df.to_excel('foo.xlsx', sheet_name='Sheet1')
2. Reading from an excel file:
>>> pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
# Gotchas
If you are trying an operation and see an exception like the following:
>>> if pd.Series([False, True, False]):
... print("I was true")
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/core/generic.py", line 730, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
See Comparisons for an explanation and how to handle this; see also Gotchas.
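As the error message itself suggests, the fix is to state explicitly which truth value you mean, e.g. with .any() or .all():

```python
import pandas as pd

s = pd.Series([False, True, False])

# Reduce the Series to a single boolean explicitly:
print(s.any())  # True  -- at least one element is True
print(s.all())  # False -- not all elements are True

if s.any():
    print("at least one element is True")
```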
# pandas in Practice
# Cheatsheet
For an explanation, see the article 天秀!一张图就能彻底搞定 Pandas!.
# pandas Tutorials
pandas tutorial - 盖若