10分鐘入門Pandas

10 Minutes to pandas

http://pandas.pydata.org/pandas-docs/stable/10min.html

安裝

支持的python版本: 2.7, 3.5, 3.6

<code>$ pip install pandas
/<code>

檢查本地的pandas運行環境是否完整,可以運行pandas的單元測試用例

<code>$ pip install pytest

>>> import pandas as pd
>>> pd.test()
/<code>

獲取當前使用pandas的版本信息

<code>>>> import pandas as pd
>>> pd.__version__
'0.21.1'
/<code>

概覽

pandas的基本數據結構:

  • Series: 一維數據
  • DataFrame: 二維數據
  • Panel: 三維數據(從0.20.0版本開始,已經不再推薦使用)
  • Panel4D, PanelND(不再推薦使用)

DataFrame是由Series構成的

創建Series

創建Series最簡單的方法

<code>>>> s = pd.Series(data, index=index)
/<code>

data可以是不同的類型:

  • python字典
  • ndarray
  • 標量(比如: 5)

使用ndarray創建(From ndarray)

如果data是ndarray,那麼index的長度必須和data的長度相同,當沒有明確index參數時,默認使用[0, ... len(data) - 1]作為index。

<code>>>> import pandas as pd

>>> import numpy as np

>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

>>> s
a 0.654385
b 0.055691
c 0.856054
d 0.621810
e 1.802872
dtype: float64

>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

>>> pd.Series(np.random.randn(5))
0 -0.467183
1 -1.333323
2 -0.493813
3 -0.067705
4 -1.310332
dtype: float64
/<code>

需要注意的是: pandas裡的索引並不要求唯一性,如果一個操作不支持重複的索引,會自動拋出異常。這麼做的原因是很多操作不會用到索引,比如GroupBy。

<code>>>> s = pd.Series(np.random.randn(5), index=['a', 'a', 'a', 'a', 'a'])

>>> s
a 0.847331
a -2.138021
a -0.364763
a -0.603172
a 0.363691
dtype: float64
/<code>

使用dict創建(From dict)

當data是dict類型時,如果指定了index參數,那麼就使用index參數作為索引。否者,就使用排序後的data的key作為index。

<code>>>> d = {'b': 0., 'a': 1., 'c': 2.}

# 索引的值是排序後的
>>> pd.Series(d)
a 1.0
b 0.0
c 2.0
dtype: float64

# 字典中不存在的key, 直接賦值為NaN(Not a number)
>>> pd.Series(d, index=['b', 'c', 'd', 'a'])
b 0.0
c 2.0
d NaN
a 1.0
dtype: float64
/<code>

使用標量創建(From scalar value)

當data是標量時,必須提供index, 值會被重複到index的長度

<code>>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
/<code>

創建DataFrame

DataFrame是一個二維的數據結構,可以看做是一個excel表格或一張SQL表,或者值為Series的字典。 跟Series一樣,DataFrame也可以通過多種類型的數據結構來創建

  • 字典(包含一維ndarray數組,列表,字典或Series)
  • 二維的ndarray數組
  • 結構化的ndarray
  • Series
  • 另一個DataFrame

除了data之外,還接受index和columns參數來分佈指定行和列的標籤

從Series字典或嵌套的字典創建(From dict of Series or dicts)

結果的索引是多個Series索引的合集,如果沒有指定columns,就用排序後的字典的key作為列標籤。

<code>>>> d = {'one': pd.Series([1,2,3], index=['a', 'b', 'c']),
... 'two': pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])}
...

>>> df = pd.DataFrame(d)

>>> df
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4

>>> pd.DataFrame(d, index=['d', 'b', 'a'])
one two
d NaN 4
b 2.0 2
a 1.0 1

>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
two three
d 4 NaN
b 2 NaN
a 1 NaN

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')

>>> df.columns
Index(['one', 'two'], dtype='object')
/<code>

從ndarray類型/列表類型的字典(From dict of ndarrays / lists)

<code>>>> d = {'one': [1,2,3,4], 'two': [4,3,2,1]}

>>> pd.DataFrame(d)
one two
0 1 4
1 2 3
2 3 2
3 4 1

>>> pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
one two
a 1 4
b 2 3
c 3 2

d 4 1
/<code>

從結構化ndarray創建(From structured or record array)

<code>>>> data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])

>>> data
array([(0, 0., b''), (0, 0., b'')],
dtype=[('A', '
>>> data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]

>>> pd.DataFrame(data)
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'

>>> pd.DataFrame(data, index=['first', 'second'])
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'

>>> pd.DataFrame(data, index=['first', 'second'], columns=['C', 'A', 'B'])
C A B
first b'Hello' 1 2.0
second b'World' 2 3.0
/<code>

從字典列表裡創建(a list of dicts)

<code>>>> data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

>>> pd.DataFrame(data2)
a b c
0 1 2 NaN
1 5 10 20.0

>>> pd.DataFrame(data2, index=["first", "second"])
a b c
first 1 2 NaN
second 5 10 20.0

>>> pd.DataFrame(data2, columns=["a", "b"])
a b
0 1 2
1 5 10
/<code>

從元祖字典創建(From a dict of tuples)

通過元祖字典,可以創建多索引的DataFrame

<code>>>> pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
... ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
... ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
... ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
... ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
...
a b
a b c a b
A B 4.0 1.0 5.0 8.0 10.0
C 3.0 2.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
/<code>

通過Series創建(From a Series)

<code>>>> pd.DataFrame(pd.Series([1,2,3]))
0
0 1
1 2
2 3
/<code>

查看數據

<code>>>> dates = pd.date_range('20130101', periods=6)

>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

>>> df
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 -1.831020 0.813526 0.403101 -1.251946

# 獲取前幾行(默認前5行)
>>> df.head()
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087

# 獲取後3行
>>> df.tail(3)
A B C D
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 -1.831020 0.813526 0.403101 -1.251946

# 獲取索引
>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

# 獲取列信息
>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

# 獲取數據信息
>>> df.values
array([[ 1.23189704, -0.16983942, 1.3332949 , 0.36714191],
[-0.12744988, -1.71667129, 0.91034961, 0.15118638],
[-0.24165226, -0.98464711, 0.78865554, -0.20363944],
[ 0.04498958, -0.25515787, -1.21384804, 1.07671506],
[ 0.41821265, 0.10740007, 0.61944799, 1.49408712],
[-1.8310196 , 0.81352564, 0.40310115, -1.25194611]])

# 獲取簡單的統計信息
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.084170 -0.367565 0.473500 0.272257
std 1.007895 0.880134 0.883494 0.970912
min -1.831020 -1.716671 -1.213848 -1.251946
25% -0.213102 -0.802275 0.457188 -0.114933
50% -0.041230 -0.212499 0.704052 0.259164
75% 0.324907 0.038090 0.879926 0.899322
max 1.231897 0.813526 1.333295 1.494087

# 轉置矩陣
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 1.231897 -0.127450 -0.241652 0.044990 0.418213 -1.831020
B -0.169839 -1.716671 -0.984647 -0.255158 0.107400 0.813526
C 1.333295 0.910350 0.788656 -1.213848 0.619448 0.403101
D 0.367142 0.151186 -0.203639 1.076715 1.494087 -1.251946

# 按照列排序
>>> df.sort_values(by='B')
A B C D
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 -1.831020 0.813526 0.403101 -1.251946
/<code>

選擇數據

獲取

選擇列, 返回的是Series

<code>>>> df['A']
2013-01-01 1.231897
2013-01-02 -0.127450
2013-01-03 -0.241652
2013-01-04 0.044990
2013-01-05 0.418213
2013-01-06 -1.831020
Freq: D, Name: A, dtype: float64

>>> df.A
2013-01-01 1.231897
2013-01-02 -0.127450
2013-01-03 -0.241652
2013-01-04 0.044990
2013-01-05 0.418213
2013-01-06 -1.831020
Freq: D, Name: A, dtype: float64
/<code>

選擇行

<code>>>> df[0:3]
A B C D

2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639

>>> df["20130102":"20130104"]
A B C D
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
/<code>

通過Label選擇

<code># 返回的Series
>>> df.loc[dates[0]]
A 1.231897
B -0.169839
C 1.333295
D 0.367142
Name: 2013-01-01 00:00:00, dtype: float64

# 返回的DateFrame
>>> df.loc[:, ['A', 'B']]
A B
2013-01-01 1.231897 -0.169839
2013-01-02 -0.127450 -1.716671
2013-01-03 -0.241652 -0.984647
2013-01-04 0.044990 -0.255158
2013-01-05 0.418213 0.107400
2013-01-06 -1.831020 0.813526

>>> df.loc['20130102':'20130104',['A','B']]
A B
2013-01-02 -0.127450 -1.716671
2013-01-03 -0.241652 -0.984647
2013-01-04 0.044990 -0.255158

# 降維返回
>>> df.loc['20130102',['A','B']]
A -0.127450
B -1.716671
Name: 2013-01-02 00:00:00, dtype: float64
/<code>

通過Position選擇

<code># 返回第4行
>>> df.iloc[3]

A 0.044990
B -0.255158
C -1.213848
D 1.076715
Name: 2013-01-04 00:00:00, dtype: float64


>>> df.iloc[3:5,0:2]
A B
2013-01-04 0.044990 -0.255158
2013-01-05 0.418213 0.107400

>>> df.iloc[1:3, :]
A B C D
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639

# 獲得指定位置的元素
>>> df.iloc[1,1]
-1.7166712884342545

>>> df.iat[1,1]
-1.7166712884342545
/<code>

布爾索引

<code>>>> df[df.A > 0]
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087


>>> df[df > 0]
A B C D
2013-01-01 1.231897 NaN 1.333295 0.367142
2013-01-02 NaN NaN 0.910350 0.151186
2013-01-03 NaN NaN 0.788656 NaN
2013-01-04 0.044990 NaN NaN 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 NaN 0.813526 0.403101 NaN


>>> df2=df.copy()

>>> df2['E'] = ['one','one','two','three','four','three']

>>> df2

A B C D E
2013-01-01 1.231897 -0.169839 1.333295 0.367142 one
2013-01-02 -0.127450 -1.716671 0.910350 0.151186 one
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 two
2013-01-04 0.044990 -0.255158 -1.213848 1.076715 three
2013-01-05 0.418213 0.107400 0.619448 1.494087 four
2013-01-06 -1.831020 0.813526 0.403101 -1.251946 three

# 使用isin()來過濾
>>> df2[df2['E'].isin(['two', 'four'])]
A B C D E
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 two
2013-01-05 0.418213 0.107400 0.619448 1.494087 four
/<code>

賦值

根據日期新增加一列

<code>>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64

>>> df['F'] = s1

>>> df
A B C D F
2013-01-01 1.231897 -0.169839 1.333295 0.367142 NaN
2013-01-02 -0.127450 -1.716671 0.910350 0.151186 1.0
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 2.0
2013-01-04 0.044990 -0.255158 -1.213848 1.076715 3.0
2013-01-05 0.418213 0.107400 0.619448 1.494087 4.0
2013-01-06 -1.831020 0.813526 0.403101 -1.251946 5.0

# 通過label賦值
>>> df.at[dates[0], 'A'] = 0

# 通過position賦值
>>> df.iat[0,1] = 0

# 通過ndarray賦值
>>> df.loc[:, 'D'] = np.array([5] * len(df))

>>> df
A B C D F
2013-01-01 0.000000 0.000000 1.333295 5 NaN
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0
2013-01-05 0.418213 0.107400 0.619448 5 4.0
2013-01-06 -1.831020 0.813526 0.403101 5 5.0

# 通過where操作
>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

>>> df
A B C D
2013-01-01 -1.231777 -0.068987 -0.105402 1.512076
2013-01-02 -1.120426 -0.240417 0.223964 -0.559793
2013-01-03 0.697097 0.758780 -1.191408 -0.793882
2013-01-04 0.332519 0.784564 0.805932 -1.169186
2013-01-05 0.010235 0.156115 0.419567 -2.279214
2013-01-06 0.294819 -0.691370 0.294119 -0.208475

>>> df2 = df.copy()

>>> df2[df > 0] = -df2

>>> df2
A B C D
2013-01-01 -1.231777 -0.068987 -0.105402 -1.512076
2013-01-02 -1.120426 -0.240417 -0.223964 -0.559793
2013-01-03 -0.697097 -0.758780 -1.191408 -0.793882
2013-01-04 -0.332519 -0.784564 -0.805932 -1.169186
2013-01-05 -0.010235 -0.156115 -0.419567 -2.279214
2013-01-06 -0.294819 -0.691370 -0.294119 -0.208475
/<code>

數據缺失

pandas使用np.nan來表示缺失的數據,它默認不參與任何運算

<code>>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 1.333295 5 NaN NaN
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 NaN
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 NaN
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 NaN

>>> df1.loc[dates[0]:dates[1], 'E'] = 1

>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 1.333295 5 NaN 1.0
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 1.0
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 NaN
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 NaN

# 丟棄所有包含NaN的行
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 -0.12745 -1.716671 0.91035 5 1.0 1.0

# 填充所有包含NaN的元素
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 1.333295 5 5.0 1.0
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 1.0
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 5.0
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 5.0

# 獲取元素值為nan的布爾掩碼
>>> pd.isna(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
/<code>

運算操作

Stats統計

運算操作都會排除NaN元素

<code>>>> dates = pd.date_range('20130101', periods=6)

>>> df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=list('ABCD'))

>>> df
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11

2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

# 計算列的平均值
>>> df.mean()
A 10.0
B 11.0
C 12.0
D 13.0
dtype: float64

計算行的平均值
>>> df.mean(1)
2013-01-01 1.5
2013-01-02 5.5
2013-01-03 9.5
2013-01-04 13.5
2013-01-05 17.5
2013-01-06 21.5
Freq: D, dtype: float64

# shift(n),按照列的方向,從上往下移動n個位置
>>> s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

>>> s
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64

# sub函數,DataFrame相減操作, 等於 df-s
>>> df.sub(s, axis='index')
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 7.0 8.0 9.0 10.0
2013-01-04 9.0 10.0 11.0 12.0
2013-01-05 11.0 12.0 13.0 14.0
2013-01-06 NaN NaN NaN NaN
/<code>

Apply

<code>>>> df
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

# 在列方向累加
>>> df.apply(np.cumsum)
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 6 8 10
2013-01-03 12 15 18 21
2013-01-04 24 28 32 36
2013-01-05 40 45 50 55
2013-01-06 60 66 72 78

# 列方向的最大值-最小值, 得到的是一個Series
>>> df.apply(lambda x: x.max() - x.min())
A 20
B 20
C 20
D 20
dtype: int64
/<code>

直方圖 Histogramming

<code>>>> s = pd.Series(np.random.randint(0, 7, size=10))

>>> s
0 6
1 5
2 0
3 2
4 5
5 1
6 3
7 3
8 3
9 1
dtype: int64

# 索引是出現的數字,值是次數
>>> s.value_counts()

3 3
5 2
1 2
6 1
2 1
0 1
dtype: int64
/<code>

字符串方法

<code>>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

>>> s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
/<code>

合併

Concat

<code>>>> df = pd.DataFrame(np.random.randn(10, 4))

>>> df
0 1 2 3
0 -1.710767 -2.107488 1.441790 0.959924
1 0.509422 0.099733 0.845039 0.232462
2 -0.609247 0.533162 -0.387640 0.668803
3 0.946219 -0.326805 1.245303 1.336090
4 -1.069114 0.755313 -1.003991 -0.327009
5 1.169418 -1.225637 -2.137500 1.766341
6 -1.751095 0.279439 0.018053 1.800435
7 -0.328828 -1.513893 1.879333 0.945217
8 2.440123 -0.260918 -0.232951 -1.337775
9 -0.876878 -1.153583 -1.487573 -1.509871

# 分成小塊
>>> pieces = [df[:3], df[3:7], df[7:]]

# 合併

>>> pd.concat(pieces)
0 1 2 3
0 -1.710767 -2.107488 1.441790 0.959924
1 0.509422 0.099733 0.845039 0.232462
2 -0.609247 0.533162 -0.387640 0.668803
3 0.946219 -0.326805 1.245303 1.336090
4 -1.069114 0.755313 -1.003991 -0.327009
5 1.169418 -1.225637 -2.137500 1.766341
6 -1.751095 0.279439 0.018053 1.800435
7 -0.328828 -1.513893 1.879333 0.945217
8 2.440123 -0.260918 -0.232951 -1.337775
9 -0.876878 -1.153583 -1.487573 -1.509871
/<code>

Join

跟數據庫的Join操作一樣

<code>>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

>>> left
key lval
0 foo 1
1 foo 2

>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

>>> right
key rval
0 foo 4
1 foo 5

>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
/<code>

另一個例子

<code>>>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

>>> left
key lval
0 foo 1
1 bar 2

>>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

>>> right
key rval
0 foo 4
1 bar 5

>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 bar 2 5
/<code>

Append

<code>>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

>>> df
A B C D
0 -1.521762 -0.850721 1.322354 -0.226562
1 -2.773304 -0.663303 0.895075 -0.171524
2 0.322975 -0.796484 0.379920 0.028333
3 -0.350795 1.839747 -0.359241 -0.027921
4 -0.945340 1.062598 -2.208670 0.769027
5 -0.329458 -0.145658 1.580258 -1.414820
6 -0.261757 -1.435025 -0.512306 -0.222287
7 -0.994207 -1.219057 0.781283 -1.795741

>>> s = df.iloc[3]

>>> df.append(s, ignore_index=True)
A B C D
0 -1.521762 -0.850721 1.322354 -0.226562
1 -2.773304 -0.663303 0.895075 -0.171524
2 0.322975 -0.796484 0.379920 0.028333
3 -0.350795 1.839747 -0.359241 -0.027921
4 -0.945340 1.062598 -2.208670 0.769027
5 -0.329458 -0.145658 1.580258 -1.414820
6 -0.261757 -1.435025 -0.512306 -0.222287
7 -0.994207 -1.219057 0.781283 -1.795741
8 -0.350795 1.839747 -0.359241 -0.027921
/<code>

Grouping

group by的操作需要經過以下1個或多個步驟

  • 根據條件分組數據(
    Spliting)
  • 在各個分組上執行函數(Applying)
  • 合併結果(Combining) >>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ... 'foo', 'bar', 'foo', 'foo'], ... 'B' : ['one', 'one', 'two', 'three', ... 'two', 'two', 'one', 'three'], ... 'C' : np.arange(1, 9), ... 'D' : np.arange(2, 10)}) ... ... >>> df A B C D 0 foo one 1 2 1 bar one 2 3 2 foo two 3 4 3 bar three 4 5 4 foo two 5 6 5 bar two 6 7 6 foo one 7 8 7 foo three 8 9 # 分組求和 >>> df.groupby('A').sum() C D A bar 12 15 foo 24 29 # 多列分組 >>> df.groupby(['A','B']).sum() C D A B bar one 2 3 three 4 5 two 6 7 foo one 8 10 three 8 9 two 8 10 >>> b = df.groupby(['A','B']).sum() # 多索引 >>> b.index MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']], labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], names=['A', 'B']) >>> b.columns Index(['C', 'D'], dtype='object')

Reshaping

Stack

<code>>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
... 'foo', 'foo', 'qux', 'qux'],
... ['one', 'two', 'one', 'two',
... 'one', 'two', 'one', 'two']]))
...

>>> tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]

>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

>>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

>>> df
A B
first second
bar one 0.096893 0.479194
two -0.771606 0.331693
baz one -0.022540 0.531284
two -0.039843 1.876942
foo one 0.250473 1.163931
two -1.127163 1.447566
qux one -0.410361 -0.734333
two -0.461247 0.018531

>>> df2 = df[:4]

>>> df2
A B
first second
bar one 0.096893 0.479194
two -0.771606 0.331693
baz one -0.022540 0.531284
two -0.039843 1.876942

>>> stacked = df2.stack()

>>> stacked
first second
bar one A 0.096893
B 0.479194
two A -0.771606
B 0.331693
baz one A -0.022540
B 0.531284
two A -0.039843
B 1.876942
dtype: float64

>>> type(stacked)
pandas.core.series.Series

>>> stacked.index
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two'], ['A', 'B']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second', None])

>>> stacked.values
array([ 0.09689327, 0.47919417, -0.77160574, 0.3316934 , -0.02253955,
0.53128436, -0.03984337, 1.8769416 ])


>>> stacked.unstack()
A B
first second
bar one 0.096893 0.479194
two -0.771606 0.331693
baz one -0.022540 0.531284
two -0.039843 1.876942

>>> stacked.unstack(1)
second one two
first
bar A 0.096893 -0.771606
B 0.479194 0.331693

baz A -0.022540 -0.039843
B 0.531284 1.876942

>>> stacked.unstack(0)
first bar baz
second
one A 0.096893 -0.022540
B 0.479194 0.531284
two A -0.771606 -0.039843
B 0.331693 1.876942
/<code>

數據透視表(Pivot Tables)

時間序列

pandas在時間序列上,提供了很方便的按照頻率重新採樣的功能,在財務分析上非常有用

<code># 把每秒的數據按5分鐘聚合
>>> rng = pd.date_range('1/1/2012', periods=100, freq='S')
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts.resample('5Min').sum()
2012-01-01 22073
Freq: 5T, dtype: int64
/<code>

加上時區信息

<code>>>> rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

>>> ts = pd.Series(np.random.randn(len(rng)), rng)

>>> ts
2012-03-06 -0.386974
2012-03-07 0.657785
2012-03-08 1.390234
2012-03-09 0.412904
2012-03-10 -1.189340
Freq: D, dtype: float64

>>> ts_utc = ts.tz_localize('UTC')

>>> ts_utc
2012-03-06 00:00:00+00:00 -0.386974

2012-03-07 00:00:00+00:00 0.657785
2012-03-08 00:00:00+00:00 1.390234
2012-03-09 00:00:00+00:00 0.412904
2012-03-10 00:00:00+00:00 -1.189340
Freq: D, dtype: float64
/<code>

轉換成另一個時區

<code>>>> ts_utc.tz_convert('Asia/Shanghai')
2012-03-06 08:00:00+08:00 -0.386974
2012-03-07 08:00:00+08:00 0.657785
2012-03-08 08:00:00+08:00 1.390234
2012-03-09 08:00:00+08:00 0.412904
2012-03-10 08:00:00+08:00 -1.189340
Freq: D, dtype: float64
/<code>

時間跨度轉換

<code>>>> rng = pd.date_range('1/1/2012', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)

>>> ts
2012-01-31 0.825174
2012-02-29 -2.190258
2012-03-31 -0.073171
2012-04-30 -0.404208
2012-05-31 0.245025
Freq: M, dtype: float64

>>> ps = ts.to_period()

>>> ps
2012-01 0.825174
2012-02 -2.190258
2012-03 -0.073171
2012-04 -0.404208
2012-05 0.245025
Freq: M, dtype: float64

>>> ps.to_timestamp()
2012-01-01 0.825174
2012-02-01 -2.190258
2012-03-01 -0.073171
2012-04-01 -0.404208
2012-05-01 0.245025
Freq: MS, dtype: float64
/<code>

轉換季度時間

<code>>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

>>> ts = pd.Series(np.random.randn(len(prng)), prng)

>>> ts.head()
1990Q1 -0.590040
1990Q2 -0.750392
1990Q3 -0.385517
1990Q4 -0.380806
1991Q1 -1.252727
Freq: Q-NOV, dtype: float64

>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

>>> ts.head()
1990-03-01 09:00 -0.590040
1990-06-01 09:00 -0.750392
1990-09-01 09:00 -0.385517
1990-12-01 09:00 -0.380806
1991-03-01 09:00 -1.252727
Freq: H, dtype: float64
/<code>

Categoricals分類

<code>>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

>>> df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
/<code>

轉換原始類別為分類數據類型

<code>>>> df["grade"] = df["raw_grade"].astype("category")

>>> df
id raw_grade grade
0 1 a a
1 2 b b

2 3 b b
3 4 a a
4 5 a a
5 6 e e

>>> df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
/<code>

重命名分類為更有意義的名稱

<code>>>> df["grade"].cat.categories = ["very good", "good", "very bad"]

>>> df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
/<code>

重新安排順分類,同時添加缺少的分類(序列 .cat方法下返回新默認序列)

<code>>>> df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

>>> df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad

>>> df["grade"]
0 very good
1 good

2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
/<code>

按照分類排序

<code>>>> df.sort_values(by="grade")
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
/<code>

按照分類分組,同時也會顯示空的分類

<code>>>> df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
/<code>

Plotting

<code>>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> ts = ts.cumsum()

>>> ts.plot()
<matplotlib.axes.>

>>> plt.show()
/<matplotlib.axes.>/<code>
10分鐘入門Pandas

畫圖帶圖例的圖

<code>>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A','B'
... ,'C', 'D'])

>>> df.cumsum()

>>> plt.figure();df.plot();plt.legend(loc='best')
<matplotlib.legend.legend>

>>> plt.show()
/<matplotlib.legend.legend>/<code>
10分鐘入門Pandas

數據In/Out

CSV

保存到csv文件

<code>>>> df.to_csv('foo.csv')
/<code>

從csv文件讀取數據

<code>>>> pd.read_csv('foo.csv')
/<code>

HDF5

保存到HDF5倉庫

<code>>>> df.to_hdf('foo.h5','df')
/<code>

從倉庫讀取

<code>>>> pd.read_hdf('foo.h5','df')
/<code>

Excel

保存到excel

<code>>>> df.to_excel('foo.xlsx', sheet_name='Sheet1')
/<code>

從excel文件讀取

<code>>>> pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
/<code>


擴展閱讀

https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html


分享到:


相關文章: