Time Series

Predict the future eh?

Sample Time Series

Time series work builds on regression modeling, but with the focus on time. The idea is that Y depends on where it sits in time, so past values carry information about Y tomorrow, next week, or next month.

The most common time series models are ARIMA, SARIMA, SARIMAX, and VAR. ARIMA and SARIMA are univariate: they model Y from its own history alone. SARIMAX adds exogenous regressors, and VAR models several series jointly. Models that use more than the target's own history are most likely what is used in industry, because other factors affect Y. Holidays, weekends, climate, and weather can be used to augment the model.
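As a rough sketch of that kind of augmentation, calendar flags can be derived from a DatetimeIndex with pandas alone. The `add_calendar_features` helper, the column names, and the hard-coded holiday date are illustrative assumptions, not part of any library. Weather or climate data would have to be joined in from an external source.

```python
import pandas as pd

def add_calendar_features(df, holidays):
    """Return a copy of df (DatetimeIndex required) with calendar columns added."""
    out = df.copy()
    out["day_of_week"] = out.index.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    out["month"] = out.index.month
    # Flag rows whose date matches one of the supplied holiday dates
    holiday_dates = pd.to_datetime(list(holidays))
    out["is_holiday"] = out.index.normalize().isin(holiday_dates).astype(int)
    return out

# Hypothetical daily series, for illustration only
idx = pd.date_range("2023-12-29", periods=5, freq="D")
demo = pd.DataFrame({"y": [10.0, 12.0, 11.0, 15.0, 9.0]}, index=idx)
features = add_calendar_features(demo, holidays=["2024-01-01"])
print(features)
```

These columns can then be listed alongside the other feature columns passed to a model.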

As always, find a baseline model and iterate on top of it.
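One possible baseline, sketched here under assumptions, is a naive forecast (repeat the last value) or a seasonal naive forecast (repeat the last season). The function names, the toy series, and the season length of 4 are made up for illustration; any model built later should at least beat these scores on the same holdout.

```python
import numpy as np
import pandas as pd

def naive_forecast(train, horizon):
    """Repeat the last observed value for every future step."""
    return np.repeat(train.iloc[-1], horizon)

def seasonal_naive_forecast(train, horizon, season_length):
    """Repeat the last full season of observations across the horizon."""
    last_season = train.iloc[-season_length:].to_numpy()
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]

# Hypothetical series with a repeating 4-step pattern
y = pd.Series([1.0, 2.0, 3.0, 4.0] * 5)
train, test = y.iloc[:-4], y.iloc[-4:]  # hold out the final 4 points

naive_mse = np.mean((test.to_numpy() - naive_forecast(train, 4)) ** 2)
seasonal_mse = np.mean((test.to_numpy() - seasonal_naive_forecast(train, 4, 4)) ** 2)
print(f"naive MSE: {naive_mse:.2f}, seasonal naive MSE: {seasonal_mse:.2f}")
```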

Other feature engineering can be done on top of the model, for example with these pandas methods:

  • shift() for lagging
  • rolling() for moving average
  • diff() for differencing
  • pct_change() for percent change
  • resample() for resampling
  • interpolate() for interpolation
  • expanding() for expanding window
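The methods listed above can be tried out on a small series. This is a minimal sketch with a made-up six-day series and made-up column names. Applying `shift(1)` before `rolling()` and `expanding()` is a design choice here, so that each feature only sees values from before the row it sits on.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series, for illustration only
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], index=idx, name="y")

feats = pd.DataFrame({"y": s})
feats["lag_1"] = s.shift(1)                              # previous day's value
feats["roll_mean_3"] = s.shift(1).rolling(3).mean()      # mean of the 3 prior days
feats["expanding_mean"] = s.shift(1).expanding().mean()  # mean of all prior days
# diff() and pct_change() include the current value y_t, so shift them
# by one step before using them as predictors of y_t.
feats["diff_1"] = s.diff(1)
feats["pct_change_1"] = s.pct_change(1)

# resample() changes the frequency; here daily values become 2-day means
two_day_mean = s.resample("2D").mean()

# interpolate() fills gaps; linear interpolation is the default
gappy = s.copy()
gappy.iloc[2] = np.nan
filled = gappy.interpolate()

print(feats)
```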

A LightGBM regression model evaluated with time-ordered cross-validation:

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Load time series data into a pandas DataFrame
df = pd.read_csv('path/to/data.csv', parse_dates=['date_column'])

# Set the date column as the index
df.set_index('date_column', inplace=True)

# Define the feature columns and target column
feature_cols = ['feature_1', 'feature_2', ...]
target_col = 'target'

# Split the data into training and testing sets using TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(df):
    train_data = df.iloc[train_index]
    test_data = df.iloc[test_index]
    X_train = train_data[feature_cols]
    y_train = train_data[target_col]
    X_test = test_data[feature_cols]
    y_test = test_data[target_col]

    # Create and train the LGBM model
    params = {
        'objective': 'regression',
        'metric': 'mse',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'feature_fraction': 0.8,
    }
    train_set = lgb.Dataset(X_train, y_train)
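    val_set = lgb.Dataset(X_test, y_test, reference=train_set)
    # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments
    # of lgb.train(); callbacks handle early stopping and logging instead.
    # Note: early stopping on the same fold that is scored below makes the
    # reported error somewhat optimistic.
    model = lgb.train(
        params,
        train_set,
        num_boost_round=1000,
        valid_sets=[train_set, val_set],
        callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=100)],
    )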

    # Make predictions on the test data and calculate mean squared error
    y_pred = model.predict(X_test, num_iteration=model.best_iteration)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error: {mse:.4f}')
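
To see what TimeSeriesSplit does with the rows, the fold indices can be printed for a small dummy array. This is only a sketch: the 12 observations and `n_splits=3` are arbitrary choices. Each fold trains on an expanding window of earlier rows and tests on the rows that follow, so the model never sees data from after its test period.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered dummy observations
tscv = TimeSeriesSplit(n_splits=3)

folds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```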

Reference material

  • Time series book forecasting and horizon