[Python] LSTM FinanceDataReader tensorflow keras sklearn - 1. Predicting the S&P500 with an LSTM Model

1. Loading the S&P500 dataset with FinanceDataReader

# Import packages
import pandas as pd
import numpy as np
import FinanceDataReader as fdr
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
# S&P500 index (NYSE)
sp = fdr.DataReader('US500', '2020-01-01', '2022-01-01')
sp
Output
              Close     Open     High      Low  Volume   Change
Date
2020-01-02  3257.85  3244.67  3258.14  3235.53     0.0   0.0084
2020-01-03  3234.85  3226.36  3246.15  3222.34     0.0  -0.0071
2020-01-06  3246.28  3217.55  3246.84  3214.64     0.0   0.0035
2020-01-07  3237.18  3241.86  3244.91  3232.43     0.0  -0.0028
2020-01-08  3253.05  3238.59  3267.07  3236.67     0.0   0.0049
...             ...      ...      ...      ...     ...      ...
2021-12-27  4791.19  4733.99  4791.49  4733.99     0.0   0.0138
2021-12-28  4786.36  4795.49  4807.02  4780.04     0.0  -0.0010
2021-12-29  4793.06  4788.64  4804.06  4778.08     0.0   0.0014
2021-12-30  4778.73  4794.23  4808.93  4775.33     0.0  -0.0030
2021-12-31  4766.18  4775.21  4786.83  4765.75     0.0  -0.0026


2. Data Scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Define the columns to scale.
scale_cols = ['Close']
# Scale the selected columns to the [0, 1] range
scaled = scaler.fit_transform(sp[scale_cols])
# Wrap the scaled values in a DataFrame
df_scaled = pd.DataFrame(scaled, columns=scale_cols)
df_scaled
Output
        Close
0    0.399290
1    0.390291
2    0.394763
3    0.391202
4    0.397412
..        ...
500  0.999268
501  0.997378
502  1.000000
503  0.994393
504  0.989482
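With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each value to (x - min) / (max - min). The sketch below reproduces that arithmetic with NumPy on a made-up price series and takes min and max from the training portion only, which is one common way to keep test-period information out of the scaler. The helper names (`fit_min_max`, `scale`, `unscale`) and the numbers are illustrative assumptions, not part of the original code.

```python
import numpy as np

def fit_min_max(values):
    """Return (min, max) of a 1-D array; these stand in for a fitted scaler."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()

def scale(values, lo, hi):
    """Apply (x - min) / (max - min) using previously fitted lo/hi."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

def unscale(scaled, lo, hi):
    """Invert scale(): map scaled values back to the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

# Hypothetical closing prices; the last two play the role of the test period.
prices = np.array([100.0, 120.0, 110.0, 150.0, 160.0, 180.0])
train_prices, test_prices = prices[:4], prices[4:]

lo, hi = fit_min_max(train_prices)           # fitted on the training part only
train_scaled = scale(train_prices, lo, hi)   # stays inside [0, 1]
test_scaled = scale(test_prices, lo, hi)     # may exceed 1 if prices keep rising
restored = unscale(test_scaled, lo, hi)      # back to price units
```

Values above 1 in `test_scaled` are expected under this choice: the test prices lie outside the range seen during training.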


3. Splitting into train and test sets
from sklearn.model_selection import train_test_split

train = df_scaled[:-28]
test = df_scaled[-28:]
print(train.shape,test.shape)
Output

 (477, 1) (28, 1)

X_train=train[:-7]
y_train=train[7:]
print(X_train.shape,y_train.shape)
Output

 (470, 1) (470, 1)

X_test=test[:-7]
y_test=test[7:]
print(X_test.shape,y_test.shape)
Output

 (21, 1) (21, 1)

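def make_dataset(data, label, window_size=7):
    """Build sliding windows of `window_size` rows and the label that follows each window."""
    feature_list = []
    label_list = []
    for i in range(len(data) - window_size):
        # Window: rows i .. i+window_size-1 of data
        feature_list.append(np.array(data[i:i+window_size]))
        # Label: row i of label, i.e. the value right after the window (label was offset by 7 above)
        label_list.append(np.array(label.iloc[i]))
    return np.array(feature_list), np.array(label_list)
# train dataset
X_train, y_train = make_dataset(X_train, y_train, 7)

# Create the train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)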

# test dataset
X_test,y_test=make_dataset(X_test,y_test,7)
X_test.shape,y_test.shape
Output

 ((14, 7, 1), (14, 1))
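To make the window and label alignment easier to see, here is a compact sketch on a toy integer series. It is a variant written for illustration, not the function used in this post: `make_windows` and `chronological_split` are hypothetical names, and the window count differs slightly from `make_dataset` above. The second helper shows one possible alternative to `train_test_split`, which shuffles by default, so overlapping windows from neighbouring days can land on both sides of the split; a time-ordered cut avoids that.

```python
import numpy as np

def make_windows(series, window_size):
    """Return (X, y): X[i] holds series[i : i+window_size], y[i] is the next value."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window_size
    X = np.stack([series[i:i + window_size] for i in range(n)])
    y = series[window_size:]
    # Add a trailing feature axis so X has the (samples, timesteps, features) layout.
    return X[..., np.newaxis], y.reshape(-1, 1)

def chronological_split(X, y, val_fraction=0.2):
    """Keep the earliest windows for training and the latest ones for validation."""
    cut = int(len(X) * (1 - val_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]

series = np.arange(10)                      # toy data: 0, 1, ..., 9
X, y = make_windows(series, window_size=3)
X_tr, X_va, y_tr, y_va = chronological_split(X, y, val_fraction=0.2)

print(X.shape, y.shape)                     # window tensor and label shapes
print(X[0].ravel(), y[0])                   # first window and the value that follows it
```

With this toy series the first window is 0, 1, 2 and its label is 3, which is the same one-step-ahead relationship the post sets up with the 7-row offset.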


4. Modeling - LSTM Model

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop when val_loss has not improved for 10 epochs, and save the best model seen so far.
earlystopping = EarlyStopping(patience=10, verbose=1)
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# Two stacked LSTM layers followed by two Dense layers; each input sample is (window_size, 1).
model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(102, return_sequences=True, input_shape=(X_train.shape[1], 1)),
    tf.keras.layers.LSTM(56, return_sequences=False),
    tf.keras.layers.Dense(28),
    tf.keras.layers.Dense(1)
])
model.compile(loss='mean_squared_error', optimizer='adam')
# The checkpoint callback was defined but never passed to fit(); it is included here.
hist = model.fit(X_train, y_train, epochs=100, batch_size=5, validation_data=(X_val, y_val), callbacks=[earlystopping, checkpoint])
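The `patience=10` argument means training stops once the monitored validation loss has gone 10 consecutive epochs without improving. The sketch below is a simplified pure-Python illustration of that rule, not Keras's actual implementation (which also has options such as `min_delta` and `restore_best_weights`). The function name `stopping_epoch` and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which a patience rule would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the loss kept improving often enough; no early stop

# Hypothetical validation losses, one per epoch.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.35, 0.36, 0.38]
stopped_at = stopping_epoch(losses, patience=3)
print(stopped_at)
```

A smaller patience stops sooner on the same curve, at the cost of possibly missing a later improvement such as the one at epoch 5 here.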


5. Plotting the validation loss

str_plt_style = 'bmh'
plt.style.use([str_plt_style])
plt.rcParams["figure.figsize"] = (8,6) 
plt.rcParams["font.size"]=11

plt.title('S&P500_Loss')
plt.plot(hist.history['loss'],label='train loss')
plt.plot(hist.history['val_loss'],label='valid loss')
plt.legend()
plt.show()

Figure: 2nd project - Valid Loss plot


6. Plotting the S&P500 index prediction

y_pred=model.predict(X_test)

# Convert back to the original price scale
y_pred = scaler.inverse_transform(y_pred)
y_test = scaler.inverse_transform(y_test)
str_plt_style='bmh'
plt.style.use([str_plt_style])
plt.rcParams["figure.figsize"]=(16,9) 
plt.rcParams["font.size"]=11

plt.title('S&P500')
plt.plot(y_test,label='actual')
plt.plot(y_pred,label='prediction')
plt.legend()
plt.show()

Figure: 2nd project - S&P500 plot


7. Plotting the shifted prediction (Shiftin' plot)

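# The 14 test labels are rows 7..20 of the test set, so use those index positions.
result = pd.DataFrame(index=test.index[7:21])
result.reset_index(inplace=True)

# Flatten the (14, 1) prediction array into a single column.
result['y_pred'] = y_pred.ravel()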
str_plt_style = 'bmh'
plt.style.use([str_plt_style])
plt.rcParams["figure.figsize"] = (16,9) 
plt.rcParams["font.size"]=11

plt.title('S&P500')
plt.plot(y_test,label='actual')

# Shift the predictions one step earlier so they can be compared with the actual values.
shift = result.y_pred.shift(-1).values
plt.plot(shift,label="shiftin'",color='orchid')
plt.legend()
plt.show()

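Figure: 2nd project - S&P500 Shiftin' plot

After training, the predicted values tend to be shifted (lagged) relative to the actual values.

We suspect the network is mimicking the test data, i.e. largely echoing the most recent values it is given instead of anticipating the next one.

(The similarity of the two trend lines can be seen in the Shiftin' plot.)

➟ We plan to address this with a Seq2Seq model using multi-step forecasting.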
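One way to put a number on the shifting effect, sketched here under assumptions, is to slide the predictions back by k steps and see which k gives the smallest error against the actual series. If the minimum sits at k = 1, the forecast is behaving like a one-day-late copy of the input. The function `best_lag` and the synthetic series are hypothetical; the "prediction" is built by hand to echo the previous value and does not come from the model in this post.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two equal-length arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def best_lag(actual, predicted, max_lag=3):
    """Return (k, errors) where k in 0..max_lag minimises RMSE of predicted[k:] vs actual[:len-k]."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = {}
    for k in range(max_lag + 1):
        n = len(actual) - k
        # Moving predictions k steps earlier is the same idea as Series.shift(-k).
        errors[k] = rmse(actual[:n], predicted[k:])
    return min(errors, key=errors.get), errors

# Synthetic example: the "forecast" just repeats the previous actual value.
actual = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 21.0])
predicted = np.concatenate(([9.0], actual[:-1]))

lag, errors = best_lag(actual, predicted)
print(lag, errors)
```

Applied to real predictions the error at the best lag will not be exactly zero, but a clear minimum at a nonzero lag would support the mimicking hypothesis above.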


8. Measuring accuracy

from sklearn.metrics import mean_squared_error 

# RMSE
MSE = mean_squared_error(y_test, y_pred) 
RMSE = np.sqrt(MSE)

# MAPE
def MAPE(y_test, y_pred):
    return np.mean(np.abs((y_test - y_pred) / y_test)) * 100 
print('RMSE =',round(RMSE,2))
print('MAPE =',round(MAPE(y_test, y_pred),2),'%')
Output

 RMSE = 107.38
 MAPE = 1.98 %
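These numbers are easier to interpret next to a naive reference. A common one is the persistence baseline, which predicts that tomorrow's close equals today's. The sketch below computes RMSE and MAPE for that baseline on a small made-up series; the values and the names `closes`, `y_naive`, `rmse`, and `mape` are assumptions for illustration, and no baseline result for the actual S&P500 data is claimed here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the data."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must not contain zeros)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Hypothetical closing prices.
closes = np.array([100.0, 110.0, 99.0, 108.0, 108.0])
y_true = closes[1:]     # values to be predicted
y_naive = closes[:-1]   # persistence forecast: repeat the previous close

baseline_rmse = rmse(y_true, y_naive)
baseline_mape = mape(y_true, y_naive)
print('baseline RMSE =', round(baseline_rmse, 2))
print('baseline MAPE =', round(baseline_mape, 2), '%')
```

If a model's RMSE and MAPE on the test window are not clearly below the persistence baseline computed on the same window, the model has learned little beyond copying the last observed value.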


