Python中pandas的使用(更新中)

By | 2017年7月19日

目录:

  1. 读取数据
  2. 访问values
  3. 读取时间年月日
  4. group by、时间rolling、apply lambda函数
  5. diff 和 shift
  6. 行列转换 stack 和 unstack,MultiIndex
  7. Panel的使用
  8. append、concat、join
  9. MultiIndex
  10. 初始化dataframe及填充/修改/选取数据
  11. 空值
  12. 其他类型转为dataframe
  13. 遍历dataframe每一行

数据:

微信截图_20170719194349

1.读取数据

def convert_percent_string(string):
    if string.count('%') != 0:
        return np.float(string.replace('%', ''))*1.0/100
    else:
        return np.float(string)

calendar = pd.read_csv('economy_metrics.csv', encoding='gbk', index_col=0)   # 第一列作为index
calendar.index = pd.to_datetime(calendar.index)    # 将index转换为datetime格式
calendar = calendar.sort_index()            # 按index排序
calendar = calendar.replace('--', np.nan)   # 替换值
calendar = calendar.dropna()                # 去掉有空值的行
calendar['Actual'] = calendar['Actual'].apply(convert_percent_string)    # 应用apply函数
calendar['Survey'] = calendar['Survey'].apply(convert_percent_string)
calendar['OutofExpectation'] = calendar['Actual']-calendar['Survey']

 2.访问values

each_index = '2007/1/2 17:30'
print calendar[calendar.index == each_index]['Ticker'].values

此时结果是:[‘MPMIGBMA Index’ ‘hahaha’]

如果这个Time是唯一的,那么就可以这样写

calendar[calendar.index == each_index]['Ticker'].values[0]

此时结果是:MPMIGBMA Index

3.提取时间年月日

trx = pd.read_csv('10y_data_1min_20170714_2nd_sorted.csv', index_col=0)
trx.index = pd.to_datetime(trx.index)
trx = trx.sort_index()
temp = pd.DatetimeIndex(trx.index)        # DatetimeIndex方法
trx['Date'] = temp.date    # .date就是提取出年月日,.year是年,.month是月等等
print trx.head(10)

数据:

行业名称	交易日期	开盘价	收盘价	最高价	最低价	成交量	成交金额	换手率
农、林、牧、渔业	20140102	10.27	10.4108	10.4935	10.1886	204079136	2018686523	1.3047
采矿业	20140102	9.9227	9.8916	10.0363	9.753	600631854	4814913146	0.1637
农副食品加工业	20140102	11.3371	11.4331	11.5431	11.2083	171372475	2220440630	1.0073
食品制造业	20140102	21.063	20.9	21.296	20.5935	112019707	2421578859	1.0092
酒、饮料和精制茶制造业	20140102	17.3588	17.2603	17.485	17.0294	189769548	3066718181	0.8456
纺织业	20140102	7.7688	7.864	7.9595	7.6679	203875025	1547492044	1.217

任务:

  1. 按日期统计总的成交金额(group by)
  2. 并按时间排序(sort)
  3. 新增一列计算总成交金额的15天SMA(rolling)
  4. 将所有NaN取代为0(replace)
  5. 新增一列对成交金额取对数(apply lambda)
  6. 保存在doc1.csv中
  7. 选出test中“化学纤维制造业”从2014年6月2日到2014年7月31日的开盘价,保存在doc2.csv中
  8. 选出test中“专用设备制造业”换手率大于1的记录,保存在doc3.csv中
  9. 选出test中第100-150行,1000-1200行,1450-1550行,合并在同一个dataframe中保存在doc4.csv中

 4.group by、时间rolling、apply lambda函数

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

trading = pd.read_csv('test.csv', encoding='gbk', index_col=1)      # the trade date as index
# print type(trading.index), type(trading[u'行业名称'])

# 将交易日期字符转换为日期对象
trading.index = pd.to_datetime(trading.index.astype(str), format='%Y%m%d')              # change trade date to datetime

# 按时间排序
trading = trading.sort_index()

# 取出日期和成交金额两列,按日期统计总的成交金额
total_amount_group_by_date = trading[u'成交金额'].groupby(trading.index).sum()
total_amount_group_by_date = pd.DataFrame(total_amount_group_by_date)
# print total_amount_group_by_date.head()

# 新增一列计算总成交金额的15天SMA
total_amount_group_by_date['SMA-15d'] = total_amount_group_by_date.rolling(15).mean()

# 将所有NaN取代为0
total_amount_group_by_date = total_amount_group_by_date.replace(np.nan, 0)

# 新增一列对成交金额取对数
total_amount_group_by_date['log'] = total_amount_group_by_date[u'成交金额'].apply(lambda x: np.log(x))

# 保存在doc1.csv中
total_amount_group_by_date.to_csv('doc1.csv', encoding='gbk')

# 选出test中“化学纤维制造业”从2014年6月2日到2014年7月31日的开盘价,保存在doc2.csv中
chemistry_open = trading[trading[u'行业名称'] == u'化学纤维制造业']['2014-06-02':'2014-07-31'][u'开盘价']
chemistry_open.to_csv('doc2.csv', encoding='gbk')

# 选出test中“专用设备制造业”换手率大于1的记录,保存在doc3.csv中
equipment_turnover_rate = trading[(trading[u'行业名称'] == u'专用设备制造业') & (trading[u'换手率'].astype(float) > 1)]
equipment_turnover_rate.to_csv('doc3.csv', encoding='gbk')

# 选出test中第100-150行,1000-1200行,1450-1550行,合并在同一个dataframe中保存在doc4.csv中
new_trading = trading[100:150].append(trading[1000:1200]).append(trading[1450:1550])        # append数据合并、拼接
new_trading.to_csv('doc4.csv', encoding='gbk')

4.1 apply lambda函数

4.1.1 根据已有列是否满足一定条件新建列并赋值

# 使用apply函数, 如果city字段包含'ing'关键词,则'判断'这一列赋值为1,否则为0  
frame['panduan'] = frame.city.apply(lambda x: 1 if 'ing' in x else 0)
positive_list = [u'利多黄金', u'利多黄金原油', u'利空黄金', u'利空黄金原油', u'黄金', u'黄金晨报', u'白银']  # 正样本特征值列表
# 如果tags字段在positive_list中,则cell字段赋值为1,否则为0
shortnews[u'cell'] = shortnews[u'tags'].apply(lambda x: 1 if x in positive_list else 0)

5.diff 和 shift

diff:下一行的值减去前一行的值,结果留在了下一行;

shift:索引不动,将下一行的值挪到上一行(-1);

# 收益率diff = t+1期收益率 - t期收益率
diff_keydate = price_dataframe.diff().shift(periods=-1, axis=0).dropna()
df = pd.DataFrame(np.arange(24).reshape(6,4),index=range(0,12,2),columns=['A','B','C','D'])
print df
print df.diff()
print df.diff().shift(periods=-1, axis=0).dropna()

# 输出
     A   B   C   D
0    0   1   2   3
2    4   5   6   7
4    8   9  10  11
6   12  13  14  15
8   16  17  18  19
10  20  21  22  23
      A    B    C    D
0   NaN  NaN  NaN  NaN
2   4.0  4.0  4.0  4.0
4   4.0  4.0  4.0  4.0
6   4.0  4.0  4.0  4.0
8   4.0  4.0  4.0  4.0
10  4.0  4.0  4.0  4.0
     A    B    C    D
0  4.0  4.0  4.0  4.0
2  4.0  4.0  4.0  4.0
4  4.0  4.0  4.0  4.0
6  4.0  4.0  4.0  4.0
8  4.0  4.0  4.0  4.0

6.行列转换 stack 和 unstack

In [16]: df
Out[16]:
    three  two  one  
AA      0    1    2  
BB      3    4    5

In [17]: df.stack()  
Out[17]:  
AA  three    0  
    two      1  
    one      2  
BB  three    3  
    two      4  
    one      5  

In [21]: df.unstack()  
Out[21]:  
three  AA    0  
       BB    3  
two    AA    1  
       BB    4  
one    AA    2  
       BB    5  

In [27]: df.stack().unstack(0)
Out[27]:  
       AA  BB  
three   0   3  
two     1   4  
one     2   5

7.Panel的使用

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print p
print p.axes                 # 取所有坐标轴的信息
print p.axes[0]              # 坐标轴顺序:items,major_axis,minor_axis
print p.axes[0][0]
print p.major_xs(0)          # 注意这里的 0 不是索引号,而是索引值

# 输出
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

[Index([u'Item1', u'Item2'], dtype='object'), RangeIndex(start=0, stop=4, step=1), RangeIndex(start=0, stop=3, step=1)]

Index([u'Item1', u'Item2'], dtype='object')

Item1

 Item1 Item2
0 0.225719 -1.621769
1 1.271825 2.764002
2 -0.702630 NaN

8.append、concat、join

Item1 = pd.DataFrame(np.random.randn(2, 3))
Item2 = pd.DataFrame(np.random.randn(2, 2))
print Item1
print Item2
print Item1.append(Item2)
print pd.concat([Item1, Item2])
print pd.concat([Item1, Item2], axis=1)

          0         1         2
0 -0.036668 -0.438318  1.040368
1 -1.353837 -1.095082  0.938351
          0         1
0 -0.211336  0.368949
1 -0.070910 -0.544717
          0         1         2
0 -0.036668 -0.438318  1.040368
1 -1.353837 -1.095082  0.938351
0 -0.211336  0.368949       NaN
1 -0.070910 -0.544717       NaN
          0         1         2
0 -0.036668 -0.438318  1.040368
1 -1.353837 -1.095082  0.938351
0 -0.211336  0.368949       NaN
1 -0.070910 -0.544717       NaN
          0         1         2         0         1
0 -0.036668 -0.438318  1.040368 -0.211336  0.368949
1 -1.353837 -1.095082  0.938351 -0.070910 -0.544717
print(df1)
print(df2)
df_new = df1.set_index('指数wind代码').join(df2.set_index('指数wind代码'))       #这里要使用set_index设置两个df的index,然后用index去进行连接
print(df_new)

    指数wind代码   指数名称
0  000300.SH  沪深300
1  000010.SH  上证180
2  000903.SH  中证100
3  399905.SZ  中证500
4  399903.SZ  中证100
5  399300.SZ  沪深300
6  000905.SH  中证500

       指数wind代码  成分股wind代码      纳入日期      剔除日期
0     000300.SH  601618.SH  20100104      None
1     000010.SH  601099.SH  20100104      None
2     000905.SH  600267.SH  20150615      None
3     399300.SZ  600216.SH  20090701  20141212
4     399300.SZ  600098.SH  20050408  20090630
5     000905.SH  000040.SZ  20070115  20101231

指数wind代码  指数名称  成分股wind代码      纳入日期      剔除日期
000010.SH  上证180  601099.SH  20100104      None
000010.SH  上证180  600160.SH  20110701  20140613
000010.SH  上证180  600684.SH  20130701  20141212
000010.SH  上证180  600467.SH  20070104  20090630
000010.SH  上证180  600797.SH  20050701  20091231

 9.MultiIndex

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
df = p.to_frame()
print df
print df.index
print df.index.levels[0]
print list(df.index.levels[0])
print df.xs(2)                 # 以major为index,但xs不支持同一级索引多选
print df.ix[[1, 2, 3]]         # 以minor为index
print df.loc[pd.IndexSlice[[1, 3], :], :]
print df.sort_index(level='major', ascending=False)

# 输出
                Item1     Item2
major minor                    
0     0      0.520701  0.424011
      1      0.193741  0.830029
1     0      0.115968 -1.078769
      1      1.477486  0.394678
2     0     -1.394105 -0.291154
      1      1.796337 -0.443023
3     0     -2.006795  0.630577
      1     -1.286686  0.454017

MultiIndex(levels=[[0, 1, 2, 3], [0, 1, 2]],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'major', u'minor'])

RangeIndex(start=0, stop=4, step=1, name=u'major')

[0, 1, 2, 3]

          Item1     Item2
minor                    
0     -1.394105 -0.291154
1      1.796337 -0.443023

                Item1     Item2
major minor                    
0     1      0.193741  0.830029
1     0      0.115968 -1.078769
      1      1.477486  0.394678

                Item1     Item2
major minor                    
1     0      0.115968 -1.078769
      1      1.477486  0.394678
3     0     -2.006795  0.630577
      1     -1.286686  0.454017

                Item1     Item2
major minor                    
3     1     -1.286686  0.454017
      0     -2.006795  0.630577
2     1      1.796337 -0.443023
      0     -1.394105 -0.291154
1     1      1.477486  0.394678
      0      0.115968 -1.078769
0     1      0.193741  0.830029
      0      0.520701  0.424011

10.初始化dataframe及填充/修改/选取数据

#初始化
is_index_df = pd.DataFrame(columns=['日期','成分股wind代码','中证100','上证180','沪深300','中证500'])
is_index_df['成分股wind代码']=stock_list
is_index_df['日期']=str(each_date)
is_index_df['中证100']='0'
is_index_df['上证180']='0'
is_index_df['沪深300']='0'
is_index_df['中证500']='0'

#填充数据
for each_index in index_list:
    stock_index_list = list(df_new[df_new['指数名称']==each_index]['成分股wind代码'])
    for each_stock in stock_index_list:
        is_index_df.loc[(is_index_df['成分股wind代码']==each_stock), each_index] = '1'       #如果'成分股wind代码'这一列的值为each_stock,那么对应的 each_index 这一列下的值就为‘1’
is_index = is_index.append(is_index_df)

# MultiIndex下修改某一行某一列的值
import pandas as pd
df = pd.read_csv('股票连续持仓.csv', encoding='gbk')
df = df.set_index(['date', 'order_book_id'])
df['close']='0'
print(df)
idx = pd.IndexSlice
df.loc[idx['2017/4/24', '601288.XSHG'], 'close'] = '12.64'
print(df.xs(('2017/4/24', '601288.XSHG')))
print(df)
import pandas as pd
df = pd.read_csv('股票连续持仓.csv', encoding='gbk')
#print(df)
print(df[df.order_book_id.isin(['601398.XSHG','601628.XSHG'])])
print(df[df.quantity>10000])
print(df[df.order_book_id.isin(['601398.XSHG','601628.XSHG']) & (df.quantity>10000)])
#下面这种写法是并集关系,不推荐这种写法
print(df[df.isin({'date':['2017/4/24','2017/4/26'], 'order_book_id':['601398.XSHG','601628.XSHG']})])

#下面的~写法为isin的反操作
mask = df.index.isin([0,1,2])
df.loc[~mask]

 11.空值

df_new = df_new[(df_new['纳入日期']<=str(each_date)) & ((df_new['剔除日期']>str(each_date)) | (df_new['剔除日期'].isnull()))]      #这里的‘剔除日期’处的值为 None或NULL,都用isnull()来判断

#补充,数据库中的None判断方法则不同,直接用 is null
cursorWind.execute("select S_INFO_WINDCODE,S_CON_WINDCODE,S_CON_INDATE,S_CON_OUTDATE from wind.AINDEXMEMBERS where S_INFO_WINDCODE in ('000300.SH', '000010.SH', '000903.SH', '399905.SZ', '399903.SZ', '399300.SZ', '000905.SH') and (S_CON_OUTDATE>=20080101 or S_CON_OUTDATE is null)")

12.其他类型转为dataframe

12.1字典

import pandas as pd
test_dict = {'id':[1,2,3,4,5,6],'name':['Alice','Bob','Cindy','Eric','Helen','Grace '],'math':[90,89,99,78,97,93],'english':[89,94,80,94,94,90]}
#不定义列名
a = pd.DataFrame(test_dict)
b = pd.DataFrame.from_dict(test_dict)
c = pd.DataFrame.from_dict(test_dict, orient='index')
#定义列名
d = pd.DataFrame.from_dict(test_dict, orient='index', columns=['A', 'B', 'C', 'D', 'E', 'F'])


   id    name  math  english
0   1   Alice    90       89
1   2     Bob    89       94
2   3   Cindy    99       80
3   4    Eric    78       94
4   5   Helen    97       94
5   6  Grace     93       90

   id    name  math  english
0   1   Alice    90       89
1   2     Bob    89       94
2   3   Cindy    99       80
3   4    Eric    78       94
4   5   Helen    97       94
5   6  Grace     93       90

             0    1      2     3      4       5
id           1    2      3     4      5       6
name     Alice  Bob  Cindy  Eric  Helen  Grace 
math        90   89     99    78     97      93
english     89   94     80    94     94      90

             A    B      C     D      E       F
id           1    2      3     4      5       6
name     Alice  Bob  Cindy  Eric  Helen  Grace 
math        90   89     99    78     97      93
english     89   94     80    94     94      90

13.遍历dataframe每一行

方法一:

index_price_df = get_price('000905.XSHG', '2019-04-15', '2019-05-02', fields=['open', 'close', 'high', 'low'])
for index, row in index_price_df.iterrows():
    # 然后就可以对row进行操作了

方法二:

index_price_df = get_price('000905.XSHG', '2019-04-15', '2019-05-02', fields=['open', 'close', 'high', 'low'])
for row in index_price_df.itertuples():
    # 对每一行进行操作

print(row)
Pandas(Index=Timestamp('2019-04-15 00:00:00'), open=5795.5011, close=5655.9825, high=5815.3916, low=5654.3441)
Pandas(Index=Timestamp('2019-04-16 00:00:00'), open=5625.1221, close=5777.0111, high=5777.0111, low=5583.6142)
Pandas(Index=Timestamp('2019-04-17 00:00:00'), open=5776.6352, close=5807.7288, high=5840.6249, low=5757.9952)
Pandas(Index=Timestamp('2019-04-18 00:00:00'), open=5804.3014, close=5775.1709999999985, high=5826.6736, low=5762.5678)
Pandas(Index=Timestamp('2019-04-19 00:00:00'), open=5762.5009, close=5810.1798, high=5810.1798, low=5710.7953)
Pandas(Index=Timestamp('2019-04-22 00:00:00'), open=5821.0229, close=5723.0417, high=5824.5207, low=5704.9769)
Pandas(Index=Timestamp('2019-04-23 00:00:00'), open=5710.7334, close=5628.0624, high=5710.7334, low=5610.7353)
Pandas(Index=Timestamp('2019-04-24 00:00:00'), open=5636.4016, close=5679.7256, high=5682.4788, low=5568.408)
Pandas(Index=Timestamp('2019-04-25 00:00:00'), open=5652.0056, close=5458.4194, high=5662.7898, low=5457.8254)
Pandas(Index=Timestamp('2019-04-26 00:00:00'), open=5427.6228, close=5408.0306, high=5496.0353, low=5407.1942)

print(row.open)
5795.5011
5625.1221
5776.6352
5804.3014
5762.5009
5821.0229
5710.7334
5636.4016
5652.0056
5427.6228

参考文献:

  1. Pandas API Reference
  2. 【Python实战】Pandas:让你像写SQL一样做数据分析(一)
  3. pandas聚合和分组运算——GroupBy技术(1)
  4. 浅谈pandas中shift和diff函数关系
  5. pandas Dataframe中 iloc 取某行
  6. pandas中行列转换
  7. python/pandas/numpy数据分析(七)-MultiIndex
  8. Pandas面板(Panel)
  9. PANDAS 数据合并与重塑(concat篇)
  10. python/pandas dataframe中multiindex的操作

发表评论

电子邮件地址不会被公开。 必填项已用*标注