...

[Web Scraping] Taiwan Stock Exchange Historical Data Tutorial (2)

Advanced Python Tutorials


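The previous post, [Web Scraping] Taiwan Stock Exchange Historical Data Tutorial (1), showed how to work out the request URL from the Taiwan Stock Exchange (TWSE) website and build a basic scraper for historical stock prices. The awkward part is that the TWSE site returns only one month of data per request, so we have to step back through past months one at a time and then stitch the results together.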


Creating the date sequence

 

import pandas as pd

Dates = pd.date_range(start = '2000-01-01', end = '2020-09-01', freq = 'MS')

 

Output:

DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01',
               '2000-05-01', '2000-06-01', '2000-07-01', '2000-08-01',
               '2000-09-01', '2000-10-01',
               ...
               '2019-12-01', '2020-01-01', '2020-02-01', '2020-03-01',
               '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01',
               '2020-08-01', '2020-09-01'],
              dtype='datetime64[ns]', length=249, freq='MS')

 

 

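The date has to be passed into the URL as a string, so we first convert the date sequence to text. The URL expects the format yyyymmdd, whereas astype(str) produces yyyy-mm-dd, so the hyphens still have to be removed in a later step.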

Dates = Dates.astype(str)

 

Output:

Index(['2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01', '2000-05-01',
       '2000-06-01', '2000-07-01', '2000-08-01', '2000-09-01', '2000-10-01',
       ...
       '2019-12-01', '2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
       '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01'],
      dtype='object', length=249)
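
As an alternative to converting with astype(str) and stripping the hyphens afterwards, the dates can be formatted as yyyymmdd in one step. This is a minimal sketch using the pandas DatetimeIndex.strftime method; the variable names months and date_strings are introduced only for this illustration.

```python
import pandas as pd

# First day of every month in the range, same as the sequence built above
months = pd.date_range(start='2000-01-01', end='2020-09-01', freq='MS')

# Format each timestamp directly as yyyymmdd (no hyphens to remove later)
date_strings = months.strftime('%Y%m%d')

print(date_strings[:3].tolist())  # ['20000101', '20000201', '20000301']
```

Either approach gives the same strings to pass into the URL; this one just skips the separate replace step.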

 

Passing the dates into the Get_StockPrice() function from the previous post

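After generating the date strings, the "-" in the middle of each one has to be removed, so we call replace('-', '') to replace every hyphen with an empty string. We then use a for loop to request each month from the TWSE site, and add a time.sleep() call at the end of each iteration so the program pauses for about one second, which makes it less likely that the site will treat our scraper as an attack.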

import time

Symbol = '2330'
Dates = pd.date_range(start = '2000-01-01', end = '2020-09-01', freq = 'MS').astype(str)

for Date in Dates:
    print(Get_StockPrice(Symbol, Date.replace('-','')))
    time.sleep(1)  # pause one second between requests
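
The loop above only prints each month. As a minimal sketch of the stitching step, the version below collects each month's DataFrame in a list and concatenates once at the end. The names fetch_all and fake_fetch are hypothetical and introduced only for this illustration; fake_fetch stands in for Get_StockPrice() so the example runs without touching the network.

```python
import time
import pandas as pd

def fetch_all(symbol, dates, fetch, pause=1.0):
    """Call fetch(symbol, yyyymmdd) once per month and stack the results."""
    frames = []
    for date in dates:
        frames.append(fetch(symbol, date.replace('-', '')))
        time.sleep(pause)  # pause between requests
    # One concat at the end, then sort rows by date
    return pd.concat(frames, axis=0).sort_index()

def fake_fetch(symbol, yyyymmdd):
    # Hypothetical stand-in for Get_StockPrice: one row per month, no network
    idx = pd.to_datetime([yyyymmdd], format='%Y%m%d')
    return pd.DataFrame({'Close': [float(yyyymmdd[4:6])]}, index=idx)

dates = pd.date_range(start='2020-01-01', end='2020-03-01', freq='MS').astype(str)
prices = fetch_all('2330', dates, fake_fetch, pause=0)
print(prices)
```

To use it against the real site, one would pass Get_StockPrice as the fetch argument and keep a pause of at least a second.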

 

Full code

import pandas as pd
import numpy as np
import json
import requests
import datetime
import time


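def Get_StockPrice(Symbol, Date):
    # Fetch one month of daily prices; Date is a yyyymmdd string within that month

    url = f'https://www.twse.com.tw/exchangeReport/STOCK_DAY?response=json&date={Date}&stockNo={Symbol}'
    print(url)
    data = requests.get(url, timeout = 10).text
    json_data = json.loads(data)

    # Raises KeyError when the response carries no 'data' field
    Stock_data = json_data['data']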

    StockPrice = pd.DataFrame(Stock_data, columns = ['Date','Volume','Volume_Cash','Open','High','Low','Close','Change','Order'])

    # Dates come back in the ROC calendar (e.g. 109/09/01); adding 19110000 to yyymmdd gives Gregorian yyyymmdd
    StockPrice['Date'] = StockPrice['Date'].str.replace('/','').astype(int) + 19110000
    StockPrice['Date'] = pd.to_datetime(StockPrice['Date'].astype(str))
    # Strip thousands separators before converting to float; Volume is expressed in thousands
    StockPrice['Volume'] = StockPrice['Volume'].str.replace(',','').astype(float)/1000
    StockPrice['Volume_Cash'] = StockPrice['Volume_Cash'].str.replace(',','').astype(float)
    StockPrice['Order'] = StockPrice['Order'].str.replace(',','').astype(float)

    StockPrice['Open'] = StockPrice['Open'].str.replace(',','').astype(float)
    StockPrice['High'] = StockPrice['High'].str.replace(',','').astype(float)
    StockPrice['Low'] = StockPrice['Low'].str.replace(',','').astype(float)
    StockPrice['Close'] = StockPrice['Close'].str.replace(',','').astype(float)

    StockPrice = StockPrice.set_index('Date', drop = True)


    StockPrice = StockPrice[['Open','High','Low','Close','Volume']]
    print(StockPrice)
    return StockPrice

if __name__ == '__main__':   
    Symbol = '2330'
    Dates = pd.date_range(start = '2010-01-01', end = '2020-09-01', freq = 'MS').astype(str)

    data = Get_StockPrice(Symbol, Dates[0].replace('-',''))

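    for Date in Dates[1:]:
        print(Date)
        try:
            data = pd.concat([data,Get_StockPrice(Symbol, Date.replace('-',''))], axis = 0)
        except Exception as e:
            # Report the failed month instead of silently skipping it
            print(f'Failed to fetch {Date}: {e}')
        time.sleep(5)  # pause after every request, including failed ones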
    print(data)
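
The date handling inside Get_StockPrice() is easy to check in isolation. The sketch below assumes the dates arrive as ROC-calendar strings such as '109/09/01', which is what the + 19110000 offset in the code implies. The helper name roc_to_gregorian and the sample values are hypothetical, made up for this illustration.

```python
import pandas as pd

def roc_to_gregorian(roc_dates):
    """Convert ROC-calendar strings such as '109/09/01' to Timestamps."""
    # '109/09/01' -> 1090901 -> 1090901 + 19110000 = 20200901
    as_int = roc_dates.str.replace('/', '', regex=False).astype(int) + 19110000
    return pd.to_datetime(as_int.astype(str), format='%Y%m%d')

# Made-up sample values, one three-digit and one two-digit ROC year
sample = pd.Series(['109/09/01', '99/01/04'])
converted = roc_to_gregorian(sample)
print(converted)
```

The offset works because ROC year 1 corresponds to 1912, so adding 1911 to the year part gives the Gregorian year.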