Predict house prices - deep learning, keras

Stefan Fouche
Apr 4, 2018 in R
backtesting cross-validation deep_learning keras regression time-series

Overview
Naive model (no time index)
Time series models using LSTM together with an inference network
Future additions

Overview

In this notebook I fool around with some fast prototype models for predicting house prices using deep neural nets!

First I will do some LSTM modelling for the house price movement using tidy backtested crossvalidation.

Next I use tidy crossvalidated dense feed forward networks to predict the price of a property based on the properties descriptions!

Coming soon; I will use the LSTM model to produce a 1 month forecast and feed this as a dimension to the inference model during predictions of houses for which we believe we have the descriptive data now but we will only sell the property in a months time. I will compare the errors by holding out the last few months of data to test the combined accuracy accounting for price movement.

Naive model (no time index)

As a little introduction we can cover the example in the book by François Chollet with J. J. Allaire

For our first model we will be using the boston housing data from 1970. This data collected features of the sale but had no time index - probably because the data was only collected over 1970; possibly with some cencoring.

With no time index in the data we can think of the tensors as being 1D tensors since the samples have n features on only 1 axis.

With this model we cannot really predict into the future so our predictions are only valid for predicting house prices right now (1970).

Load the data

dataset <- dataset_boston_housing()
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset

The training data we load here is already in a matrix format for us:

train_data %>% head

##         [,1] [,2]  [,3] [,4]  [,5]  [,6]  [,7]   [,8] [,9] [,10] [,11]
## [1,] 1.23247  0.0  8.14    0 0.538 6.142  91.7 3.9769    4   307  21.0
## [2,] 0.02177 82.5  2.03    0 0.415 7.610  15.7 6.2700    2   348  14.7
## [3,] 4.89822  0.0 18.10    0 0.631 4.970 100.0 1.3325   24   666  20.2
## [4,] 0.03961  0.0  5.19    0 0.515 6.037  34.5 5.9853    5   224  20.2
## [5,] 3.69311  0.0 18.10    0 0.713 6.376  88.4 2.5671   24   666  20.2
## [6,] 0.28392  0.0  7.38    0 0.493 5.708  74.3 4.7211    5   287  19.6
##       [,12] [,13]
## [1,] 396.90 18.72
## [2,] 395.38  3.11
## [3,] 375.52  3.26
## [4,] 396.90  8.01
## [5,] 391.43 14.65
## [6,] 391.13 11.74

We have 13 features/dimensions of collected data for each sample that we can use to predict the house price:

train_targets %>% head

## [1] 15.2 42.3 50.0 21.1 17.7 18.5

Scale the variables

mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)

Define the model

We can start by defining the model. Since we will be using cross-validation we define a function that constructs the model. We do this because we need to construct and train a model on each fold so we can gauge the performance accross the different folds as we fine tune our model:

Function that constructs a model

Note that I am assuming the data input is a resample object… That’s because I will be creating a tidy crossvalidation and so the objects in my table will be resamples…

I am also assuming the outcome is called ‘target’, but this can be generalised using enquo

define_compile_train_predict <- function(train,test,epochs = 100,shuffle = TRUE,...){
    
train_X <- train %>% as_tibble() %>% select(-target) %>% as.matrix()
test_X <- test %>% as_tibble() %>% select(-target) %>% as.matrix()
train_y <- train %>% as_tibble() %>%  select(target) %>% as.matrix()
test_y <- test %>% as_tibble() %>%  select(target) %>% as.matrix()
    
#define
base_model <- 
keras_model_sequential() %>% 
layer_dense(units = 64, activation = 'relu', input_shape = c(13)) %>% # 1 axis and nr features = number columns
layer_batch_normalization() %>%
# layer_dropout(rate = 0.2) %>%
layer_dense(units = 64, activation ='relu') %>% 
# layer_dense(units = 25, activation ='relu') %>%
layer_dense(units = 1) 
    
#compile
base_model %>% compile(
  loss='mse',
  optimizer='rmsprop',
    # optimizer='adam',
  metrics = c('mae')
)

#train
history <- base_model %>% 
  keras::fit(train_X,
             train_y,
             epochs=epochs,
             shuffle=shuffle,
             validation_data= list(test_X, test_y),
                verbose = 0
             )  

#predict
predictions <- 
    base_model %>% 
    keras::predict_on_batch(x = train_X)

results <- 
base_model %>% 
keras::evaluate(test_X, test_y, verbose = 0)

plot_outp <- 
    history %>% plot

return(list(predictions = predictions, loss = results$loss, mae = results$mean_absolute_error, plot_outp = plot_outp,base_model=base_model,history = history))
    
}

In the example in the book they use a for loop to set up the different folds and then apply the training and validation inside these loops. Since R has a lot of this already solved in tidy packages I am just going to leverage them and stay inside of a tidy format so we can keep all the intermediate results and visualize the outputs all in one go

Construct tidy cross-validation

To do this we first setup a nice tibble using the modelr package:

tidy_cross_val <- 
    train_data %>% 
    as_tibble() %>% 
    bind_cols(target = train_targets) %>% 
    modelr::crossv_kfold(k = 5) %>% 
    rename(fold = .id) %>% 
    select(fold,everything())

tidy_cross_val

## # A tibble: 5 x 3
##   fold  train          test          
##   <chr> <list>         <list>        
## 1 1     <S3: resample> <S3: resample>
## 2 2     <S3: resample> <S3: resample>
## 3 3     <S3: resample> <S3: resample>
## 4 4     <S3: resample> <S3: resample>
## 5 5     <S3: resample> <S3: resample>

Now we can build the models and also compile them, after that we just pluck out the stuff we stored in our function:

# tidy_cross_val$model_output %>% map_dbl(pluck,"mae")
tidy_cross_val <- 
tidy_cross_val %>% 
mutate(model_output = pmap(list(train = train,test = test), define_compile_train_predict,epochs = 200)) %>% 
mutate(mae = model_output %>% map_dbl(pluck,"mae")) %>% 
mutate(plot_outp = model_output %>% map(pluck,"plot_outp")) %>% 
mutate(loss = model_output %>% map_dbl(pluck,"loss")) %>% 
mutate(predictions = model_output %>% map(pluck,"predictions")) %>% 
select(everything(),model_output)

tidy_cross_val

## # A tibble: 5 x 8
##   fold  train     test     model_output   mae plot_outp  loss predictions 
##   <chr> <list>    <list>   <list>       <dbl> <list>    <dbl> <list>      
## 1 1     <S3: res~ <S3: re~ <list [6]>    2.41 <S3: gg>  10.5  <dbl [323 x~
## 2 2     <S3: res~ <S3: re~ <list [6]>    2.17 <S3: gg>  12.6  <dbl [323 x~
## 3 3     <S3: res~ <S3: re~ <list [6]>    2.24 <S3: gg>   8.56 <dbl [323 x~
## 4 4     <S3: res~ <S3: re~ <list [6]>    2.74 <S3: gg>  13.9  <dbl [323 x~
## 5 5     <S3: res~ <S3: re~ <list [6]>    2.55 <S3: gg>  10.9  <dbl [324 x~

By mapping the deeplearning model over all these folds of data we store our results directly into this table. That means we can do interesting things like plot them altogether:

tidy_cross_val %>% 
  ggplot(aes(x=fold,y=mae, fill = fold))+
  geom_bar(stat="identity")+
  ggtitle("The k-fold accuracy of the deep neural net",subtitle = "Not yet scaled back...")

This gives us a good idea of our certainty in the model.

From our tidy cross-validation output we may notice some extreme values for folds in terms of high mae or loss… These should be treated appropriately

Let’s take a closer look at the max mae fold:

tidy_cross_val %>% 
filter(.$mae == max(.$mae)) %>% 
pull(plot_outp)

## [[1]]

We can look at some extreme folds to see if the validation and training loss diverges which should indicate over-fitting. In this case the model seems to fit great

Measuring over-fit using k-folds crossvalidation

We can visualize this over-fitting more generally by looking in the history object we saved from the keras::fit function. By taking the mae at each epoch for all the folds we can get the average mae over the folds at each epoch and visualize how long we should train before we start making some spurious predictions:

all_mae_histories <- 
tidy_cross_val$model_output %>% map_df(pluck,"history","metrics","val_mean_absolute_error") %>% t()

average_mae_history <- data.frame(
epoch = seq(1:ncol(all_mae_histories)),
validation_mae = apply(all_mae_histories, 2, mean)
)

average_mae_history %>% 
ggplot(aes(x = epoch, y = validation_mae, color = epoch))+
    geom_line()

average_mae_history %>% 
ggplot(aes(x = epoch, y = validation_mae, color = epoch))+
    geom_smooth()

## `geom_smooth()` using method = 'loess'

The benefit of this is that you are plotting the mean behaviour of the training at each epoch over all folds

Get results

Once we are happy with our model we can pull out all the results like this:

tidy_cross_val %>% 
select(train,predictions) %>% 
mutate(train = train %>% map(as_tibble)) %>% 
unnest(.drop = TRUE) %>% 
head()

## # A tibble: 6 x 15
##   predictions       V1     V2     V3     V4     V5     V6     V7     V8
##         <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1        16.7 -0.272   -0.483 -0.435 -0.257 -0.165 -0.176  0.812  0.117
## 2        54.0  0.125   -0.483  1.03  -0.257  0.628 -1.83   1.11  -1.19 
## 3        21.4 -0.401   -0.483 -0.868 -0.257 -0.361 -0.324 -1.24   1.11 
## 4        18.6 -0.00563 -0.483  1.03  -0.257  1.33   0.153  0.694 -0.578
## 5        18.8 -0.375   -0.483 -0.547 -0.257 -0.549 -0.788  0.189  0.483
## 6        10.8  0.589   -0.483  1.03  -0.257  1.22  -1.03   1.11  -1.06 
## # ... with 6 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>,
## #   V13 <dbl>, target <dbl>

But we have only predicted on the training data in our function and we used the model trained only on that fold… Instead it would make a lot more sense to train the model on all the data now that we have tested and tuned the model (possibly censor bad folds?)

Using all the data

Run model:

train <- 
    train_data %>% 
    as_tibble() %>% 
    bind_cols(target = train_targets) 

test <- 
    test_data %>% 
    as_tibble() %>% 
bind_cols(target = test_targets) 

results <- 
    define_compile_train_predict(train,test,epochs = 100)

print('mae:')

## [1] "mae:"

results$mae

## [1] 2.429138

print('loss:')

## [1] "loss:"

results$loss

## [1] 14.56804

results$plot_outp

Get predictions:

output_data <-
bind_cols(predictions = results$predictions,train["target"],train)

output_data %>% head

## # A tibble: 6 x 16
##   predictions target       V1     V2     V3     V4     V5     V6     V7
##         <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1        17.3   15.2 -0.272   -0.483 -0.435 -0.257 -0.165 -0.176  0.812
## 2        43.4   42.3 -0.403    2.99  -1.33  -0.257 -1.21   1.89  -1.91 
## 3        51.4   50.0  0.125   -0.483  1.03  -0.257  0.628 -1.83   1.11 
## 4        22.3   21.1 -0.401   -0.483 -0.868 -0.257 -0.361 -0.324 -1.24 
## 5        18.4   17.7 -0.00563 -0.483  1.03  -0.257  1.33   0.153  0.694
## 6        20.8   18.5 -0.375   -0.483 -0.547 -0.257 -0.549 -0.788  0.189
## # ... with 7 more variables: V8 <dbl>, V9 <dbl>, V10 <dbl>, V11 <dbl>,
## #   V12 <dbl>, V13 <dbl>, target1 <dbl>

Let’s visualize the accuracy of this model:

output_data %>% 
  arrange(target) %>% 
  mutate(perfect = seq(5,50,length.out = nrow(.))) %>% 
  mutate(estimate = ifelse(target > predictions,"underestimated","overestimated")) %>% 
  select(predictions,target,estimate,perfect) %>% 
  ggplot()+
  geom_point(aes(x = predictions, y = target, color = estimate))+
  geom_line(aes(x=perfect, y = perfect))+
  ggtitle("Model accuracy")

What can we say about the bias of the model:

output_data %>% 
  arrange(target) %>% 
  mutate(perfect = seq(5,50,length.out = nrow(.))) %>% 
  mutate(estimate = ifelse(target > predictions,"underestimated","overestimated")) %>% 
  select(predictions,target,estimate,perfect) %>% 
  group_by(estimate) %>% 
  tally %>% 
  ggplot()+
  geom_bar(aes(x= estimate, y=n, fill = estimate),stat="identity")+
  ggtitle("Low model error but upward bias")

Benchmark vs Gradient boosting machines

train <- 
    train_data %>% 
    as_tibble() %>% 
    bind_cols(target = train_targets %>% as.numeric() )

test <- 
    test_data %>% 
    as_tibble() %>% 
bind_cols(target = test_targets %>% as.numeric() ) 

tr_contrl <- caret::trainControl(method = 'cv', number = 5)

results <- 
caret::train(form = target~., data = train,method = 'gbm',trControl = tr_contrl)

results %>% summary

##     var     rel.inf
## V13 V13 45.00698622
## V6   V6 29.45790750
## V8   V8  6.97044377
## V1   V1  5.42153107
## V5   V5  4.70147030
## V7   V7  2.23913360
## V11 V11  1.66965671
## V12 V12  1.46723050
## V10 V10  1.28689723
## V4   V4  0.87476817
## V9   V9  0.42687149
## V3   V3  0.37745176
## V2   V2  0.09965169

results

## Stochastic Gradient Boosting 
## 
## 404 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 323, 323, 323, 323, 324 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      4.397289  0.7634246  2.947976
##   1                  100      3.964773  0.8032464  2.700262
##   1                  150      3.862273  0.8106288  2.670015
##   2                   50      3.868630  0.8136770  2.621053
##   2                  100      3.638292  0.8332715  2.537585
##   2                  150      3.562024  0.8394754  2.497162
##   3                   50      3.611696  0.8389383  2.463967
##   3                  100      3.350380  0.8589171  2.345241
##   3                  150      3.214416  0.8688836  2.286692
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

So the gradient boosted machine using k-fold can get around 2.4-2.9 MAE and the GBM model is historically the best tabular learning paradigm in the average kaggle contest the last couple of years (around 2016). Our model learned the same ammount of accuracy also using cross-validation.

Let’s visualize the accuracy of this model:

output_data <- 
  train %>% 
  as_tibble() %>% 
  mutate(predictions = results %>% predict(newData=train %>% select(-target))) %>% 
  select(target,predictions)

output_data %>% 
  arrange(target) %>% 
  mutate(perfect = seq(5,50,length.out = nrow(.))) %>% 
  mutate(estimate = ifelse(target > predictions,"underestimated","overestimated")) %>% 
  select(predictions,target,estimate,perfect) %>% 
  ggplot()+
  geom_point(aes(x = predictions, y = target, color = estimate))+
  geom_line(aes(x=perfect, y = perfect))+
  ggtitle("Model accuracy GBM")

What can we say about the bias of the model:

output_data %>% 
  arrange(target) %>% 
  mutate(perfect = seq(5,50,length.out = nrow(.))) %>% 
  mutate(estimate = ifelse(target > predictions,"underestimated","overestimated")) %>% 
  select(predictions,target,estimate,perfect) %>% 
  group_by(estimate) %>% 
  tally %>% 
  ggplot()+
  geom_bar(aes(x= estimate, y=n, fill = estimate),stat="identity")+
  ggtitle("GBM has low model error and less bias")

Time series models using LSTM together with an inference network

Besides a dense feedforward network we can also predict time series movements of house prices using a LSTM network. This will allow us to predict the average move in house prices over the next couple of months

Read in the data

All residential home sales in Ames, Iowa between 2006 and 2010

# https://www.openintro.org/stat/data/ames.csv

if(!file.exists("../../static/data/cencus_data.csv")){
    download.file(url = 'https://www.openintro.org/stat/data/ames.csv',destfile = '../../static/data/cencus_data.csv')
}

So we can tell that in this data we have primarily just a time index and a value. This is enough for us but we need to pre-process and label these 2 fields in a tidy dataframe before we start ironing out the model process.

Read in the data

housing_time_series <- 
    data.table::fread(input = "../../static/data/cencus_data.csv")

housing_time_series %>% head

##    Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## 1:     1 526301100          20        RL          141    31770   Pave
## 2:     2 526350040          20        RH           80    11622   Pave
## 3:     3 526351010          20        RL           81    14267   Pave
## 4:     4 526353030          20        RL           93    11160   Pave
## 5:     5 527105010          60        RL           74    13830   Pave
## 6:     6 527105030          60        RL           78     9978   Pave
##    Alley Lot.Shape Land.Contour Utilities Lot.Config Land.Slope
## 1:    NA       IR1          Lvl    AllPub     Corner        Gtl
## 2:    NA       Reg          Lvl    AllPub     Inside        Gtl
## 3:    NA       IR1          Lvl    AllPub     Corner        Gtl
## 4:    NA       Reg          Lvl    AllPub     Corner        Gtl
## 5:    NA       IR1          Lvl    AllPub     Inside        Gtl
## 6:    NA       IR1          Lvl    AllPub     Inside        Gtl
##    Neighborhood Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual
## 1:        NAmes        Norm        Norm      1Fam      1Story            6
## 2:        NAmes       Feedr        Norm      1Fam      1Story            5
## 3:        NAmes        Norm        Norm      1Fam      1Story            6
## 4:        NAmes        Norm        Norm      1Fam      1Story            7
## 5:      Gilbert        Norm        Norm      1Fam      2Story            5
## 6:      Gilbert        Norm        Norm      1Fam      2Story            6
##    Overall.Cond Year.Built Year.Remod.Add Roof.Style Roof.Matl
## 1:            5       1960           1960        Hip   CompShg
## 2:            6       1961           1961      Gable   CompShg
## 3:            6       1958           1958        Hip   CompShg
## 4:            5       1968           1968        Hip   CompShg
## 5:            5       1997           1998      Gable   CompShg
## 6:            6       1998           1998      Gable   CompShg
##    Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual
## 1:      BrkFace      Plywood        Stone          112         TA
## 2:      VinylSd      VinylSd         None            0         TA
## 3:      Wd Sdng      Wd Sdng      BrkFace          108         TA
## 4:      BrkFace      BrkFace         None            0         Gd
## 5:      VinylSd      VinylSd         None            0         TA
## 6:      VinylSd      VinylSd      BrkFace           20         TA
##    Exter.Cond Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 1:         TA     CBlock        TA        Gd            Gd            BLQ
## 2:         TA     CBlock        TA        TA            No            Rec
## 3:         TA     CBlock        TA        TA            No            ALQ
## 4:         TA     CBlock        TA        TA            No            ALQ
## 5:         TA      PConc        Gd        TA            No            GLQ
## 6:         TA      PConc        TA        TA            No            GLQ
##    BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF
## 1:          639            Unf            0         441          1080
## 2:          468            LwQ          144         270           882
## 3:          923            Unf            0         406          1329
## 4:         1065            Unf            0        1045          2110
## 5:          791            Unf            0         137           928
## 6:          602            Unf            0         324           926
##    Heating Heating.QC Central.Air Electrical X1st.Flr.SF X2nd.Flr.SF
## 1:    GasA         Fa           Y      SBrkr        1656           0
## 2:    GasA         TA           Y      SBrkr         896           0
## 3:    GasA         TA           Y      SBrkr        1329           0
## 4:    GasA         Ex           Y      SBrkr        2110           0
## 5:    GasA         Gd           Y      SBrkr         928         701
## 6:    GasA         Ex           Y      SBrkr         926         678
##    Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath
## 1:               0        1656              1              0         1
## 2:               0         896              0              0         1
## 3:               0        1329              0              0         1
## 4:               0        2110              1              0         2
## 5:               0        1629              0              0         2
## 6:               0        1604              0              0         2
##    Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## 1:         0             3             1           TA             7
## 2:         0             2             1           TA             5
## 3:         1             3             1           Gd             6
## 4:         1             3             1           Ex             8
## 5:         1             3             1           TA             6
## 6:         1             3             1           Gd             7
##    Functional Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt
## 1:        Typ          2           Gd      Attchd          1960
## 2:        Typ          0           NA      Attchd          1961
## 3:        Typ          0           NA      Attchd          1958
## 4:        Typ          2           TA      Attchd          1968
## 5:        Typ          1           TA      Attchd          1997
## 6:        Typ          1           Gd      Attchd          1998
##    Garage.Finish Garage.Cars Garage.Area Garage.Qual Garage.Cond
## 1:           Fin           2         528          TA          TA
## 2:           Unf           1         730          TA          TA
## 3:           Unf           1         312          TA          TA
## 4:           Fin           2         522          TA          TA
## 5:           Fin           2         482          TA          TA
## 6:           Fin           2         470          TA          TA
##    Paved.Drive Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
## 1:           P          210            62              0           0
## 2:           Y          140             0              0           0
## 3:           Y          393            36              0           0
## 4:           Y            0             0              0           0
## 5:           Y          212            34              0           0
## 6:           Y          360            36              0           0
##    Screen.Porch Pool.Area Pool.QC Fence Misc.Feature Misc.Val Mo.Sold
## 1:            0         0      NA    NA           NA        0       5
## 2:          120         0      NA MnPrv           NA        0       6
## 3:            0         0      NA    NA         Gar2    12500       6
## 4:            0         0      NA    NA           NA        0       4
## 5:            0         0      NA MnPrv           NA        0       3
## 6:            0         0      NA    NA           NA        0       6
##    Yr.Sold Sale.Type Sale.Condition SalePrice
## 1:    2010       WD          Normal    215000
## 2:    2010       WD          Normal    105000
## 3:    2010       WD          Normal    172000
## 4:    2010       WD          Normal    244000
## 5:    2010       WD          Normal    189900
## 6:    2010       WD          Normal    195500

Process data

We need to:
- Deal with NA’s
- Create vectorized tensors
- Scale tensors into small values
- Make sure all tensors have homogeneous scale
- Split data into time series and inference sets

We need to create a date formatted feature and we need to parse our categorical features to vectorized tensors. The reason we do this is because we want to fit 2 different models at the same time. One model will predict the price of a property based on the autocorrelations in the data, much like an arima or ets model would (we can use LSTM networks for this), and the other model will predict the price of the home given the 1D tensor data describing the property - street, size, zoning etc.

Let’s count NA’s

NA_info <- 
housing_time_series %>% 
  as_tibble() %>% 
map_df(~.x %>% is.na %>% sum) %>% 
gather(key = "feature", value = "count_NA") %>% 
arrange(-count_NA)

So out of the 82 columns in the data the colums that have more than 50 NA’s are

remove_features_str <- 
NA_info %>% 
filter(count_NA>50) %>% 
pull(feature)

remove_features_str

##  [1] "Pool.QC"        "Misc.Feature"   "Alley"          "Fence"         
##  [5] "Fireplace.Qu"   "Lot.Frontage"   "Garage.Yr.Blt"  "Garage.Qual"   
##  [9] "Garage.Cond"    "Garage.Type"    "Garage.Finish"  "Bsmt.Qual"     
## [13] "Bsmt.Cond"      "Bsmt.Exposure"  "BsmtFin.Type.1" "BsmtFin.Type.2"

remove_features_str %>% length()

## [1] 16

For the sake of this tutorial we will just throw these away but it may be useful to impute these using nearest neighbors or mean imputations. You can leverage something like the mice package.

Remove the columns and remove leftover NA observations

housing_time_series_cleaned <- 
housing_time_series %>% 
select(-which(names(.) %in% remove_features_str)) %>% 
na.omit() %>% 
select(SalePrice,Yr.Sold,Mo.Sold,everything()) %>% 
select(-PID,-Order
# ,-Year.Built,-Year.Remod.Add
)
    
housing_time_series_cleaned %>% dim()

## [1] 2904   64

So what we are left with is a table with 66 features and 2904 samples…

Split the data for 2 models

We want to train a model to predict house prices using the descriptive features and also train an lstm model to predict future price movements.

First we need to split test train sets:

split_data <- 
caret::createDataPartition(housing_time_series_cleaned$SalePrice, p = 0.8, list = FALSE)

train <- 
housing_time_series_cleaned[split_data,]

test <- 
housing_time_series_cleaned[-split_data,]

It’s important that our time series does not have missing months since we will lag time vectors later on… If there were missing months the auto correlation of say n month lags will get mixed up with n+1 month lags. Let’s check this:

interaction(test$Yr.Sold,test$Mo.Sold) %>% unique()

##  [1] 2010.5  2010.2  2010.1  2010.6  2010.4  2010.3  2010.7  2009.10
##  [9] 2009.6  2009.5  2009.12 2009.7  2009.8  2009.3  2009.2  2009.9 
## [17] 2009.11 2009.4  2009.1  2008.5  2008.8  2008.10 2008.7  2008.6 
## [25] 2008.3  2008.11 2008.9  2008.2  2008.1  2008.4  2008.12 2007.5 
## [33] 2007.8  2007.3  2007.7  2007.6  2007.10 2007.1  2007.9  2007.4 
## [41] 2007.11 2007.12 2007.2  2006.8  2006.12 2006.9  2006.7  2006.1 
## [49] 2006.3  2006.2  2006.6  2006.4  2006.11 2006.5  2006.10
## 60 Levels: 2006.1 2007.1 2008.1 2009.1 2010.1 2006.2 2007.2 ... 2010.12

interaction(train$Yr.Sold,train$Mo.Sold) %>% unique()

##  [1] 2010.5  2010.6  2010.4  2010.3  2010.1  2010.2  2010.7  2009.7 
##  [9] 2009.8  2009.6  2009.10 2009.11 2009.9  2009.2  2009.12 2009.5 
## [17] 2009.4  2009.3  2009.1  2008.5  2008.6  2008.4  2008.10 2008.7 
## [25] 2008.11 2008.8  2008.9  2008.3  2008.12 2008.1  2008.2  2007.3 
## [33] 2007.4  2007.2  2007.10 2007.7  2007.8  2007.6  2007.5  2007.9 
## [41] 2007.11 2007.1  2007.12 2006.5  2006.6  2006.10 2006.3  2006.11
## [49] 2006.2  2006.7  2006.1  2006.9  2006.8  2006.12 2006.4 
## 60 Levels: 2006.1 2007.1 2008.1 2009.1 2010.1 2006.2 2007.2 ... 2010.12

Since both sets have all 60 months we don’t have any problems here…

For the LSTM model we need only the time index and the value:

For the time series model we are going to train on the moving average monthly house price so the complete model can use this in predictions

LSTM_train <- 
train %>% 
select(SalePrice,Yr.Sold,Mo.Sold) %>% 
tidyr::unite(col = index,Yr.Sold,Mo.Sold,sep="-") %>% 
mutate(index = index %>% zoo::as.yearmon() %>% as_date()) %>% 
rename(value = SalePrice) %>% 
group_by(index) %>% 
summarise(value = mean(value,na.rm = TRUE)) %>% 
as_tbl_time(index = index) 

LSTM_test <- 
test %>% 
select(SalePrice,Yr.Sold,Mo.Sold) %>% 
tidyr::unite(col = index,Yr.Sold,Mo.Sold,sep="-") %>% 
mutate(index = index %>% zoo::as.yearmon() %>% as_date()) %>% 
rename(value = SalePrice) %>% 
group_by(index) %>% 
summarise(value = mean(value,na.rm = TRUE)) %>% 
as_tbl_time(index = index) 

LSTM_train %>% head

## # A time tibble: 6 x 2
## # Index: index
##   index        value
##   <date>       <dbl>
## 1 2006-01-01 212711.
## 2 2006-02-01 182493.
## 3 2006-03-01 180162.
## 4 2006-04-01 170443.
## 5 2006-05-01 170079.
## 6 2006-06-01 178162.

Function that one_hot encodes classes for prediction

one_hot_categorical_inputs_fn <- function(tibble_df) {
    
one_hot_list <- 
tibble_df %>% 
map_if(is.character,as_factor) %>% 
map_if(is.factor,keras::to_categorical)

bind_binary_classes <- 
one_hot_list %>% 
purrr::keep(is.matrix) %>% 
reduce(cbind)

bind_numeric_features <- 
one_hot_list %>% 
purrr::discard(is.matrix) %>% 
reduce(cbind) %>% 
as.matrix()
    
return(cbind(bind_numeric_features,bind_binary_classes))
    
}

Before we cast everything to numeric we should scale these numeric variables since the new sparse matrices will have 1’s and 0’s:

Function that will scale all the numeric variables

scale_numerics_fn <- function(df_in, mean = NULL, std = NULL,...) {
    
numeric_cols <- 
df_in %>% map_lgl(is.numeric)

scaled_numeric <- 
df_in %>% 
select_if(numeric_cols)

if(is.null(mean) | is.null(std)) {
mean <- apply(scaled_numeric, 2, mean)
std <- apply(scaled_numeric, 2, sd)
}

recombined <- 
cbind(
scale(scaled_numeric, center = mean, scale = std),
df_in %>% select_if(!numeric_cols)
)

return(list(recombined = recombined,mean = mean, std = std))

}

Scale numerics… (Remember that the test set must be scaled using the training scales otherwise you wouldn’t be predicting using the same data scale as trained)

Here we will start out with the original cleaned data again because we want to spread the same number of classes in the train and test sets

data_scaled <- 
    housing_time_series_cleaned %>% 
    select(-Yr.Sold,-Mo.Sold,-SalePrice)

data_scaled_outp <- 
data_scaled %>% 
scale_numerics_fn

data_scaled <- 
    data_scaled_outp$recombined

inference_X <- 
data_scaled %>% 
one_hot_categorical_inputs_fn()

inference_target <- 
housing_time_series_cleaned %>% 
select(SalePrice) %>% 
as.matrix()

inference_X %>% head

##              out                                                         
## [1,] -0.87715441  2.74965990 -0.06154026 -0.5089121 -0.3685560 -1.1560238
## [2,] -0.87715441  0.18952801 -0.77071847  0.3889498 -0.3354663 -1.1080859
## [3,] -0.87715441  0.52561839 -0.06154026  0.3889498 -0.4347354 -1.2518997
## [4,] -0.87715441  0.13082338  0.64763795 -0.5089121 -0.1038383 -0.7725204
## [5,]  0.05903575  0.47009041 -0.77071847 -0.5089121  0.8557633  0.6656175
## [6,]  0.05903575 -0.01936899 -0.06154026  0.3889498  0.8888530  0.6656175
##                                                                         
## [1,]  0.05579978 0.43144278 -0.2947952 -0.2670621  0.06612207  1.2816766
## [2,] -0.56928686 0.05583144  0.5533971 -0.6565320 -0.38356992 -0.6750520
## [3,]  0.03347525 1.05526512 -0.2947952 -0.3467782  0.63164381  0.4397683
## [4,] -0.56928686 1.36717629 -0.2947952  1.1086092  2.40542888  2.4505644
## [5,] -0.56928686 0.76531953 -0.2947952 -0.9594530 -0.27909603 -0.5926635
## [6,] -0.45766424 0.35017015 -0.2947952 -0.5335415 -0.28363837 -0.5978128
##                                                                      
## [1,] -0.7820247 -0.1014432  0.3137316  1.0827555 -0.2503599 -1.019769
## [2,] -0.7820247 -0.1014432 -1.1925373 -0.8217399 -0.2503599 -1.019769
## [3,] -0.7820247 -0.1014432 -0.3343604 -0.8217399 -0.2503599 -1.019769
## [4,] -0.7820247 -0.1014432  1.2135291  1.0827555 -0.2503599  0.793709
## [5,]  0.8554028 -0.1014432  0.2602194 -0.8217399 -0.2503599  0.793709
## [6,]  0.8016783 -0.1014432  0.2106711 -0.8217399 -0.2503599  0.793709
##                                                                       
## [1,] -0.7518758  0.1759575 -0.2073247  0.3546094  2.1638512  0.3093533
## [2,] -0.7518758 -1.0320345 -0.2073247 -0.9167310 -0.9240229 -1.0040415
## [3,]  1.2421400  0.1759575 -0.2073247 -0.2810608 -0.9240229 -1.0040415
## [4,]  1.2421400  0.1759575 -0.2073247  0.9902796  2.1638512  0.3093533
## [5,]  1.2421400  0.1759575 -0.2073247 -0.2810608  0.6199141  0.3093533
## [6,]  1.2421400  0.1759575 -0.2073247  0.3546094  0.6199141  0.3093533
##                                                                       
## [1,]  0.2588252  0.9160874  0.2215652 -0.3587571 -0.1035821 -0.2866842
## [2,]  1.1972919  0.3632432 -0.7032610 -0.3587571 -0.1035821  1.8440946
## [3,] -0.7446839  2.3613800 -0.1662651 -0.3587571 -0.1035821 -0.2866842
## [4,]  0.2309499 -0.7424451 -0.7032610 -0.3587571 -0.1035821 -0.2866842
## [5,]  0.0451149  0.9318829 -0.1960982 -0.3587571 -0.1035821 -0.2866842
## [6,] -0.0106356  2.1007535 -0.1662651 -0.3587571 -0.1035821 -0.2866842
##                                                                           
## [1,] -0.06330281 -0.08980944 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1
## [2,] -0.06330281 -0.08980944 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1
## [3,] -0.06330281 21.88417774 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1
## [4,] -0.06330281 -0.08980944 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1
## [5,] -0.06330281 -0.08980944 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1
## [6,] -0.06330281 -0.08980944 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1
##                                                                           
## [1,] 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##                                                                           
## [1,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
## [2,] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
## [3,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
## [4,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
## [5,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
## [6,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
##                                                                           
## [1,] 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
##                                                                           
## [1,] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
## [2,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
## [3,] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
## [4,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0
## [5,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
## [6,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
##                                                                           
## [1,] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
## [2,] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
## [3,] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1
## [6,] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0
##                                                                           
## [1,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [2,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [3,] 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [4,] 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [5,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [6,] 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

inference_target %>% head

##      SalePrice
## [1,]    215000
## [2,]    105000
## [3,]    172000
## [4,]    244000
## [5,]    189900
## [6,]    195500

inference_train_X <- 
inference_X[split_data,]

inference_test_X <- 
inference_X[-split_data,]
    
inference_train_target <- 
inference_target[split_data,]
    
inference_test_target <- 
inference_target[-split_data,]

So we have 265 dimesnions on the tensor axis

So at this point we have nicely scaled and homogeneous inputs in a nice vectorized format for our tensors to learn with. The only data we have not vectorized is the time series data we will use in the LSTM model

Design inference model

Now that our data is processed we can try to build the skeleton of our inference model that we will train together with the LSTM model

We can use the same structure we used in the naive model:

define_compile_train_predict <- function(train_X,test_X,train_y,test_y,epochs = 100,shuffle = TRUE,...){
    
#define
base_model <- 
keras_model_sequential() %>% 
layer_dense(units = 128, activation = 'relu', input_shape = dim(train_X)[2]) %>% # 1 axis and nr features = number columns
layer_batch_normalization() %>%
# layer_dropout(rate = 0.2) %>%
layer_dense(units = 64, activation ='relu') %>% 
layer_dense(units = 32, activation ='relu') %>% 
# layer_dense(units = 25, activation ='relu') %>%
layer_dense(units = 1) 
    
#compile
base_model %>% compile(
  loss='mse',
  optimizer='rmsprop',
    # optimizer='adam',
  metrics = c('mae')
)

#train
history <- base_model %>% 
  keras::fit(train_X,
             train_y,
             epochs=epochs,
             shuffle=shuffle,
             validation_data= list(test_X, test_y),
                verbose = 0
             )  

#predict
predictions <- 
    base_model %>% 
    keras::predict_on_batch(x = train_X)

results <- 
base_model %>% 
keras::evaluate(test_X, test_y, verbose = 0)

plot_outp <- 
    history %>% plot

return(list(predictions = predictions, loss = results$loss, mae = results$mean_absolute_error, plot_outp = plot_outp,base_model=base_model,history = history))
    
}

Let’s see out of box performance of our inference network:

model_outp <- 
define_compile_train_predict(train_X = inference_train_X, train_y = inference_train_target,test_X = inference_test_X, test_y = inference_test_target)

model_outp$plot_outp

model_outp$history$metrics$val_mean_absolute_error %>% last()

## [1] 14664.32

This looks promising… If we calculate the MAE / average price: 8.1628129 %

This is the expected % error before we optimize the model and combine the LSTM network. I have not yet tested and tuned the network here so with some revision to the network we could potentially improve this by a lot.

Design LSTM model

Let’s start by visualizing the time series data…

Now remember we made the training time tibble:

LSTM_train %>% head

## # A time tibble: 6 x 2
## # Index: index
##   index        value
##   <date>       <dbl>
## 1 2006-01-01 212711.
## 2 2006-02-01 182493.
## 3 2006-03-01 180162.
## 4 2006-04-01 170443.
## 5 2006-05-01 170079.
## 6 2006-06-01 178162.

Let’s visualize the average price

LSTM_train %>%
    # filter_time("2008" ~ "end") %>%
    ggplot(aes(index, value)) +
    geom_line(color = palette_light()[[1]], alpha = 0.5) +
    geom_point(color = palette_light()[[1]]) +
    geom_smooth(method = "loess", span = 0.2, se = FALSE) +
    theme_tq()+
labs(title = 'All residential home sales in Ames, Iowa between 2006 and 2010')

So this timeseries actually looks surprisingly stationary with expected volatility around the 2009 market crash. We will have to see if a LSTM network can make some sense out of this.

We define the handy tidy_acf function

tidy_acf <- function(data, value, lags = 0:20) {
    
    value_expr <- enquo(value)
    
    acf_values <- data %>%
        pull(value) %>%
        acf(lag.max = tail(lags, 1), plot = FALSE) %>%
        .$acf %>%
        .[,,1]
    
    ret <- tibble(acf = acf_values) %>%
        rowid_to_column(var = "lag") %>%
        mutate(lag = lag - 1) %>%
        filter(lag %in% lags)
    
    return(ret)
}

tidy_pacf <- function(data, value, lags = 0:20) {
    
    value_expr <- enquo(value)
    
    pacf_values <- data %>%
        pull(value) %>%
        pacf(lag.max = tail(lags, 1), plot = FALSE) %>%
        .$acf %>%
        .[,,1]
    
    ret <- tibble(pacf = pacf_values) %>%
        rowid_to_column(var = "lag") %>%
        mutate(lag = lag - 1) %>%
        filter(lag %in% lags)
    
    return(ret)
}

Let’s view the significant lags in the ACF and PACF

max_lag <- 100 # these are months, since our samples are sales in a month

LSTM_train %>%
    tidy_acf(value, lags = 0:max_lag)

## # A tibble: 55 x 2
##      lag       acf
##    <dbl>     <dbl>
##  1    0.  1.00    
##  2    1.  0.205   
##  3    2.  0.148   
##  4    3. -0.000328
##  5    4. -0.0990  
##  6    5.  0.0441  
##  7    6. -0.0976  
##  8    7.  0.0221  
##  9    8. -0.0316  
## 10    9. -0.124   
## # ... with 45 more rows

LSTM_train %>%
    tidy_pacf(value, lags = 0:max_lag)

## # A tibble: 54 x 2
##      lag    pacf
##    <dbl>   <dbl>
##  1    0.  0.205 
##  2    1.  0.111 
##  3    2. -0.0529
##  4    3. -0.112 
##  5    4.  0.0965
##  6    5. -0.101 
##  7    6.  0.0382
##  8    7. -0.0279
##  9    8. -0.119 
## 10    9.  0.0262
## # ... with 44 more rows

Visualize it:

LSTM_train %>%
    tidy_acf(value, lags = 0:max_lag) %>%
    ggplot(aes(lag, acf)) +
    geom_segment(aes(xend = lag, yend = 0), color = palette_light()[[1]]) +
    geom_vline(xintercept = 2, size = 1, color = palette_light()[[2]],alpha = 0.3) +
    # geom_vline(xintercept = 25, size = 2, color = palette_light()[[2]]) +
    # annotate("text", label = "10 Year Mark", x = 130, y = 0.8, 
             # color = palette_light()[[2]], size = 6, hjust = 0) +
    theme_tq() +
    labs(title = "ACF: Ames, Iowa home sales",subtitle = 'Mostly white noise, possible 2 lag')

LSTM_train %>%
    tidy_pacf(value, lags = 0:max_lag) %>%
    ggplot(aes(lag, pacf)) +
    geom_segment(aes(xend = lag, yend = 0), color = palette_light()[[1]]) +
    # geom_vline(xintercept = 3, size = 1, color = palette_light()[[2]],alpha = 0.3) +
    # annotate("text", label = "10 Year Mark", x = 130, y = 0.8, 
             # color = palette_light()[[2]], size = 6, hjust = 0) +
    theme_tq() +
    labs(title = "PACF: Ames, Iowa home sales",subtitle = 'Mostly white noise, possible 2 lag')

To remain completely objective here we can consult the forecast package:

forecast::auto.arima(LSTM_train$value,max.q = 0)

## Series: LSTM_train$value 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1        mean
##       0.2717  179891.444
## s.e.  0.1504    3282.577
## 
## sigma^2 estimated as 3.3e+08:  log likelihood=-616.46
## AIC=1238.92   AICc=1239.39   BIC=1244.94

For the sake of exploration we can try the ar 2 framework

LSTM_train %>%
    tidy_acf(value, lags = 0:15) %>%
    ggplot(aes(lag, acf)) +
    geom_vline(xintercept = 2, size = 3, color = palette_light()[[2]],alpha=0.3) +
    geom_segment(aes(xend = lag, yend = 0), color = palette_light()[[1]]) +
    geom_point(color = palette_light()[[1]], size = 2) +
    geom_label(aes(label = acf %>% round(2)), vjust = -1,
              color = palette_light()[[1]]) +
    annotate("text", label = "2 Month Mark", x = 3, y = 0.8, 
             color = palette_light()[[2]], size = 5, hjust = 0) +
    theme_tq() +
    labs(title = "ACF: Iowa",
         subtitle = "Zoomed in on Lags 1 to 15")

optimal_lag_setting <- 2

For us to validate the model we will use a method called backtesting. Backtesting is basically splitting the data into different splits of data and for each split leaving out a future horizon that you can predict and measure accuracy. By testing all the different folds we can feel more certain of model performance since it wasn’t just a one time lucky score.

# periods_train <- 12 * 50
# periods_test  <- 12 * 10
# skip_span     <- 12 * 20

periods_train <- 36
periods_test  <- 3
skip_span     <- 6

rolling_origin_resamples <- rolling_origin(
    LSTM_train,
    initial    = periods_train,
    assess     = periods_test,
    cumulative = FALSE,
    skip       = skip_span
)

rolling_origin_resamples

## # Rolling origin forecast resampling 
## # A tibble: 3 x 2
##   splits       id    
##   <list>       <chr> 
## 1 <S3: rsplit> Slice1
## 2 <S3: rsplit> Slice2
## 3 <S3: rsplit> Slice3

A function that visualizes the different times series splits:

# Plotting function for a single split
plot_split <- function(split, expand_y_axis = TRUE, alpha = 1, size = 1, base_size = 14) {
    
    # Manipulate data
    train_tbl <- training(split) %>%
        add_column(key = "training") 
    
    test_tbl  <- testing(split) %>%
        add_column(key = "testing") 
    
    data_manipulated <- bind_rows(train_tbl, test_tbl) %>%
        as_tbl_time(index = index) %>%
        mutate(key = fct_relevel(key, "training", "testing"))
        
    # Collect attributes
    train_time_summary <- train_tbl %>%
        tk_index() %>%
        tk_get_timeseries_summary()
    
    test_time_summary <- test_tbl %>%
        tk_index() %>%
        tk_get_timeseries_summary()
    
    # Visualize
    g <- data_manipulated %>%
        ggplot(aes(x = index, y = value, color = key)) +
        geom_line(size = size, alpha = alpha) +
        theme_tq(base_size = base_size) +
        scale_color_tq() +
        labs(
            title    = glue("Split: {split$id}"),
            subtitle = glue("{train_time_summary$start} to {test_time_summary$end}"),
            y = "", x = ""
        ) +
        theme(legend.position = "none") 
    
    if (expand_y_axis) {
        
        sun_spots_time_summary <- sun_spots %>% 
            tk_index() %>% 
            tk_get_timeseries_summary()
        
        g <- g +
            scale_x_date(limits = c(sun_spots_time_summary$start, 
                                    sun_spots_time_summary$end))
    }
    
    return(g)
}

Plotting function that we can map over all these splits

# Plotting function that scales to all splits 
plot_sampling_plan <- function(sampling_tbl, expand_y_axis = TRUE, 
                               ncol = 3, alpha = 1, size = 1, base_size = 14, 
                               title = "Sampling Plan") {
    
    # Map plot_split() to sampling_tbl
    sampling_tbl_with_plots <- sampling_tbl %>%
        mutate(gg_plots = map(splits, plot_split, 
                              expand_y_axis = expand_y_axis,
                              alpha = alpha, base_size = base_size))
    
    # Make plots with cowplot
    plot_list <- sampling_tbl_with_plots$gg_plots 
    
    p_temp <- plot_list[[1]] + theme(legend.position = "bottom")
    legend <- get_legend(p_temp)
    
    p_body  <- plot_grid(plotlist = plot_list, ncol = ncol)
    
    p_title <- ggdraw() + 
        draw_label(title, size = 18, fontface = "bold", colour = palette_light()[[1]])
    
    g <- plot_grid(p_title, p_body, legend, ncol = 1, rel_heights = c(0.05, 1, 0.05))
    
    return(g)
    
}

Let’s visualize our backtesting folds:

rolling_origin_resamples %>%
    plot_sampling_plan(expand_y_axis = FALSE, ncol = 3, alpha = 1, size = 1, base_size = 10, 
                       title = "Backtesting Strategy: Rolling Origin Sampling Plan")

Test a LSTM model

OK, so now that we have a backtesting strategy set we should try to fit a model on one of the folds:

split    <- rolling_origin_resamples$splits[[1]]
split_id <- rolling_origin_resamples$id[[1]]

plot_split(split, expand_y_axis = FALSE, size = 0.5) +
    theme(legend.position = "bottom") +
    ggtitle(glue("Split: {split_id}"))

Split test/train

df_trn <- training(split)
df_tst <- testing(split)

df <- bind_rows(
    df_trn %>% add_column(key = "training"),
    df_tst %>% add_column(key = "testing")
) %>% 
    as_tbl_time(index = index)

df

## # A time tibble: 39 x 3
## # Index: index
##    index        value key     
##    <date>       <dbl> <chr>   
##  1 2006-01-01 212711. training
##  2 2006-02-01 182493. training
##  3 2006-03-01 180162. training
##  4 2006-04-01 170443. training
##  5 2006-05-01 170079. training
##  6 2006-06-01 178162. training
##  7 2006-07-01 169873. training
##  8 2006-08-01 210880. training
##  9 2006-09-01 211548. training
## 10 2006-10-01 171406. training
## # ... with 29 more rows

Scale and center the data

We can process this data using the recipes package

rec_obj <- recipe(value ~ ., df) %>%
    step_sqrt(value) %>%
    step_center(value) %>%
    step_scale(value) %>%
    prep()

df_processed_tbl <- bake(rec_obj, df)

df_processed_tbl

## # A tibble: 39 x 3
##    index        value key     
##    <date>       <dbl> <fct>   
##  1 2006-01-01  1.62   training
##  2 2006-02-01 -0.0515 training
##  3 2006-03-01 -0.186  training
##  4 2006-04-01 -0.757  training
##  5 2006-05-01 -0.778  training
##  6 2006-06-01 -0.302  training
##  7 2006-07-01 -0.791  training
##  8 2006-08-01  1.52   training
##  9 2006-09-01  1.56   training
## 10 2006-10-01 -0.699  training
## # ... with 29 more rows

We will need to save the scaling factors so that we can restore the scale at the end

center_history <- rec_obj$steps[[2]]$means["value"]
scale_history  <- rec_obj$steps[[3]]$sds["value"]

c("center" = center_history, "scale" = scale_history)

## center.value  scale.value 
##    428.24106     20.34574

Let’s define our model inputs:

# Model inputs
lag_setting  <- 3 # = nrow(df_tst)
batch_size   <- 3
train_length <- 24 # nrow(df_trn)
tsteps       <- 1
epochs       <- 300

Set up training and test sets

# Training Set
lag_train_tbl <- df_processed_tbl %>%
    mutate(value_lag = lag(value, n = lag_setting)) %>%
    filter(!is.na(value_lag)) %>%
    filter(key == "training") %>%
    tail(train_length)

x_train_vec <- lag_train_tbl$value_lag
x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))

y_train_vec <- lag_train_tbl$value
y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))

# Testing Set
lag_test_tbl <- df_processed_tbl %>%
    mutate(
        value_lag = lag(value, n = lag_setting)
    ) %>%
    filter(!is.na(value_lag)) %>%
    filter(key == "testing")

x_test_vec <- lag_test_tbl$value_lag
x_test_arr <- array(data = x_test_vec, dim = c(length(x_test_vec), 1, 1))

y_test_vec <- lag_test_tbl$value
y_test_arr <- array(data = y_test_vec, dim = c(length(y_test_vec), 1))

Test a LSTM network to get a feel for the model:

model <- keras_model_sequential()

model %>%
    layer_lstm(units            = 10, 
               input_shape      = c(tsteps, 1), 
               batch_size       = batch_size,
               return_sequences = TRUE, 
               stateful         = TRUE) %>% 
    layer_lstm(units            = 10, 
               return_sequences = FALSE, 
               stateful         = TRUE) %>% 
    layer_dense(units = 1)

model %>% 
    compile(loss = 'mae', optimizer = 'adam')

model

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## lstm_1 (LSTM)                    (3, 1, 10)                    480         
## ___________________________________________________________________________
## lstm_2 (LSTM)                    (3, 10)                       840         
## ___________________________________________________________________________
## dense_23 (Dense)                 (3, 1)                        11          
## ===========================================================================
## Total params: 1,331
## Trainable params: 1,331
## Non-trainable params: 0
## ___________________________________________________________________________

for (i in 1:epochs) {
    model %>% keras::fit(x   = x_train_arr, 
                  y          = y_train_arr, 
                  batch_size = batch_size,
                  epochs     = 1, 
                  verbose    = 0, 
                  shuffle    = FALSE)
    
    model %>% reset_states()
    # cat("Epoch: ", i, '\n')
    
}

Now let’s predict the next 3 months so we can test the performance of the model

pred_out <- model %>% 
    predict(x_test_arr, batch_size = batch_size) %>%
    .[,1] 

# Retransform values
pred_tbl <- tibble(
    index   = lag_test_tbl$index,
    value   = (pred_out * scale_history + center_history)^2
) 

# Combine actual data with predictions
tbl_1 <- df_trn %>%
    add_column(key = "actual")

tbl_2 <- df_tst %>%
    add_column(key = "actual")

tbl_3 <- pred_tbl %>%
    add_column(key = "predict")

# Create time_bind_rows() to solve dplyr issue
time_bind_rows <- function(data_1, data_2, index) {
    index_expr <- enquo(index)
    bind_rows(data_1, data_2) %>%
        as_tbl_time(index = !! index_expr)
}

ret <- list(tbl_1, tbl_2, tbl_3) %>%
    reduce(time_bind_rows, index = index) %>%
    arrange(key, index) %>%
    mutate(key = as_factor(key))

ret

## # A time tibble: 42 x 3
## # Index: index
##    index        value key   
##    <date>       <dbl> <fct> 
##  1 2006-01-01 212711. actual
##  2 2006-02-01 182493. actual
##  3 2006-03-01 180162. actual
##  4 2006-04-01 170443. actual
##  5 2006-05-01 170079. actual
##  6 2006-06-01 178162. actual
##  7 2006-07-01 169873. actual
##  8 2006-08-01 210880. actual
##  9 2006-09-01 211548. actual
## 10 2006-10-01 171406. actual
## # ... with 32 more rows

Function to calculate rmse for us

calc_rmse <- function(prediction_tbl) {
    
    rmse_calculation <- function(data) {
        data %>%
            spread(key = key, value = value) %>%
            select(-index) %>%
            filter(!is.na(predict)) %>%
            rename(
                truth    = actual,
                estimate = predict
            ) %>%
            rmse(truth, estimate)
    }
    
    safe_rmse <- possibly(rmse_calculation, otherwise = NA)
    
    safe_rmse(prediction_tbl)
        
}

which is:

calc_rmse(ret)

## [1] 32301.01

Function that plots the errors…

# Setup single plot function
plot_prediction <- function(data, id, alpha = 1, size = 2, base_size = 14) {
    
    rmse_val <- calc_rmse(data)
    
    g <- data %>%
        ggplot(aes(index, value, color = key)) +
        geom_point(alpha = alpha, size = size) + 
        # geom_line(alpha = alpha, size = size)+
        theme_tq(base_size = base_size) +
        scale_color_tq() +
        theme(legend.position = "none") +
        labs(
            title = glue("{id}, RMSE: {round(rmse_val, digits = 1)}"),
            x = "", y = ""
        )
    
    return(g)
}

Let’s see the performance:

ret %>% 
    plot_prediction(id = split_id, alpha = 0.65) +
    theme(legend.position = "bottom")

Awesome the LSTM model is ready to start making us some predictions!

Now we need to back test the model on all the time slices now since we now have a method we can map over all of them simultaneously

Everything set… time to get started!

Back test LSTM model

Let’s create a function that will compile, fit and predict the LSTM network.

Basically this function will do everything we just did for testing the single fold so we can map it accross all the folds

predict_keras_lstm <- function(split, epochs = 300, lag_setting, batch_size, train_length,  tsteps, optimizer = 'adam',...) {
    
    lstm_prediction <- function(split, epochs, lag_setting, batch_size, train_length,  tsteps, optimizer, ...) {
        
        # 5.1.2 Data Setup
        df_trn <- training(split)
        df_tst <- testing(split)
        
        df <- bind_rows(
            df_trn %>% add_column(key = "training"),
            df_tst %>% add_column(key = "testing")
        ) %>% 
            as_tbl_time(index = index)
        
        # 5.1.3 Preprocessing
        rec_obj <- recipe(value ~ ., df) %>%
            step_sqrt(value) %>%
            step_center(value) %>%
            step_scale(value) %>%
            prep()
        
        df_processed_tbl <- bake(rec_obj, df)
        
        center_history <- rec_obj$steps[[2]]$means["value"]
        scale_history  <- rec_obj$steps[[3]]$sds["value"]
        
        # # 5.1.4 LSTM Plan
        # lag_setting  <- 120 # = nrow(df_tst)
        # batch_size   <- 40
        # train_length <- 440
        # tsteps       <- 1
        # epochs       <- epochs
        
        # 5.1.5 Train/Test Setup
        lag_train_tbl <- df_processed_tbl %>%
            mutate(value_lag = lag(value, n = lag_setting)) %>%
            filter(!is.na(value_lag)) %>%
            filter(key == "training") %>%
            tail(train_length)
        
        x_train_vec <- lag_train_tbl$value_lag
        x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))
        
        y_train_vec <- lag_train_tbl$value
        y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))
        
        lag_test_tbl <- df_processed_tbl %>%
            mutate(
                value_lag = lag(value, n = lag_setting)
            ) %>%
            filter(!is.na(value_lag)) %>%
            filter(key == "testing")
        
        x_test_vec <- lag_test_tbl$value_lag
        x_test_arr <- array(data = x_test_vec, dim = c(length(x_test_vec), 1, 1))
        
        y_test_vec <- lag_test_tbl$value
        y_test_arr <- array(data = y_test_vec, dim = c(length(y_test_vec), 1))
                
        # 5.1.6 LSTM Model
        model <- keras_model_sequential()

        model %>%
            layer_lstm(units            = 10, 
                       input_shape      = c(tsteps, 1), 
                       batch_size       = batch_size,
                       return_sequences = TRUE, 
                       stateful         = TRUE) %>% 
            layer_lstm(units            = 10, 
                       return_sequences = FALSE, 
                       stateful         = TRUE) %>% 
            layer_dense(units = 1)
        
        model %>% 
            compile(loss = 'mae', optimizer = optimizer)
        
        # 5.1.7 Fitting LSTM
        for (i in 1:epochs) {
            model %>% fit(x          = x_train_arr, 
                          y          = y_train_arr, 
                          batch_size = batch_size,
                          epochs     = 1, 
                          verbose    = 0, 
                          shuffle    = FALSE)
            
            model %>% reset_states()
            # cat("Epoch: ", i, '\n')
            
        }
        
        # 5.1.8 Predict and Return Tidy Data
        # Make Predictions
        pred_out <- model %>% 
            predict(x_test_arr, batch_size = batch_size) %>%
            .[,1] 
        
        # Retransform values
        pred_tbl <- tibble(
            index   = lag_test_tbl$index,
            value   = (pred_out * scale_history + center_history)^2
        ) 
        
        # Combine actual data with predictions
        tbl_1 <- df_trn %>%
            add_column(key = "actual")
        
        tbl_2 <- df_tst %>%
            add_column(key = "actual")
        
        tbl_3 <- pred_tbl %>%
            add_column(key = "predict")
        
        # Create time_bind_rows() to solve dplyr issue
        time_bind_rows <- function(data_1, data_2, index) {
            index_expr <- enquo(index)
            bind_rows(data_1, data_2) %>%
                as_tbl_time(index = !! index_expr)
        }
        
        ret <- list(tbl_1, tbl_2, tbl_3) %>%
            reduce(time_bind_rows, index = index) %>%
            arrange(key, index) %>%
            mutate(key = as_factor(key))

        return(ret)
        
    }
    
    safe_lstm <- possibly(lstm_prediction, otherwise = NA)
    
    safe_lstm(split, epochs, lag_setting, batch_size, train_length,  tsteps, optimizer, ...)
    
}

Set our test parameters (remember that the training samples must be divisable by the batch size)

lag_setting  <- 3 # = nrow(df_tst)
batch_size   <- 3
train_length <- 24 # nrow(df_trn)
tsteps       <- 1
epochs       <- 300

Now we map the LSTM model over all our times series folds to back test:

sample_predictions_lstm_tbl <- 
rolling_origin_resamples %>%
    mutate(predict = map(
        splits,
        predict_keras_lstm,
        epochs = 300,
        lag_setting = lag_setting,
        batch_size = batch_size,
        train_length = train_length ,
        tsteps = tsteps ,
        optimizer = 'adam'
        ))

sample_predictions_lstm_tbl %>% head

The same way we created a funtion to plot the test and training sets we can make one to plot the actuals and predicted values:

plot_predictions <- function(sampling_tbl, predictions_col, 
                             ncol = 3, alpha = 1, size = 2, base_size = 14,
                             title = "Backtested Predictions") {
    
    predictions_col_expr <- enquo(predictions_col)
    
    # Map plot_split() to sampling_tbl
    sampling_tbl_with_plots <- sampling_tbl %>%
        mutate(gg_plots = map2(!! predictions_col_expr, id, 
                               .f        = plot_prediction, 
                               alpha     = alpha, 
                               size      = size, 
                               base_size = base_size)) 
    
    # Make plots with cowplot
    plot_list <- sampling_tbl_with_plots$gg_plots 
    
    p_temp <- plot_list[[1]] + theme(legend.position = "bottom")
    legend <- get_legend(p_temp)
    
    p_body  <- plot_grid(plotlist = plot_list, ncol = ncol)
    
    
    
    p_title <- ggdraw() + 
        draw_label(title, size = 18, fontface = "bold", colour = palette_light()[[1]])
    
    g <- plot_grid(p_title, p_body, legend, ncol = 1, rel_heights = c(0.05, 1, 0.05))
    
    return(g)
    
}

Let’s see how our LSTM model did:

sample_predictions_lstm_tbl %>%
    plot_predictions(predictions_col = predict, alpha = 0.5, size = 1, base_size = 10,
                     title = "Keras Stateful LSTM: Backtested Predictions")

Seems to be working! Even though we had small correlations

We can reuse our rmse calculator to see our accuracy over each time slice:

sample_rmse_tbl <- sample_predictions_lstm_tbl %>%
    mutate(rmse = map_dbl(predict, calc_rmse)) %>%
    select(id, rmse)

sample_rmse_tbl

## # Rolling origin forecast resampling 
## # A tibble: 3 x 2
##   id       rmse
## * <chr>   <dbl>
## 1 Slice1 37492.
## 2 Slice2 20870.
## 3 Slice3  8719.

We can plot the RMSE over the various folds:

sample_rmse_tbl %>%
    ggplot(aes(rmse)) +
    geom_histogram(aes(y = ..density..), fill = palette_light()[[1]], bins = 5) +
    geom_density(fill = palette_light()[[1]], alpha = 0.5) +
    theme_tq() +
    ggtitle("Histogram of RMSE")

And the distribution is:

sample_rmse_tbl %>%
    summarize(
        mean_rmse = mean(rmse),
        sd_rmse   = sd(rmse)
    )

## # Rolling origin forecast resampling 
## # A tibble: 1 x 2
##   mean_rmse sd_rmse
##       <dbl>   <dbl>
## 1    22360.  14444.

Test final model on full dataset

Now we create our function that will test the LSTM model on all the data:

predict_keras_lstm_future <- function(data, epochs = 300,lag_setting, batch_size, train_length,  tsteps, optimizer = 'adam',...) {
    
    lstm_prediction <- function(data, epochs,lag_setting, batch_size, train_length,  tsteps, optimizer, ...) {
        
        # 5.1.2 Data Setup (MODIFIED)
        df <- data
        
        # 5.1.3 Preprocessing
        rec_obj <- recipe(value ~ ., df) %>%
            step_sqrt(value) %>%
            step_center(value) %>%
            step_scale(value) %>%
            prep()
        
        df_processed_tbl <- bake(rec_obj, df)
        
        center_history <- rec_obj$steps[[2]]$means["value"]
        scale_history  <- rec_obj$steps[[3]]$sds["value"]
        
        # 5.1.5 Train Setup (MODIFIED)
        lag_train_tbl <- df_processed_tbl %>%
            mutate(value_lag = lag(value, n = lag_setting)) %>%
            filter(!is.na(value_lag)) %>%
            tail(train_length)
        
        x_train_vec <- lag_train_tbl$value_lag
        x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))
        
        y_train_vec <- lag_train_tbl$value
        y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))
        
        x_test_vec <- y_train_vec %>% tail(lag_setting)
        x_test_arr <- array(data = x_test_vec, dim = c(length(x_test_vec), 1, 1))
                
        # 5.1.6 LSTM Model
        model <- keras_model_sequential()

        model %>%
            layer_lstm(units            = 10, 
                       input_shape      = c(tsteps, 1), 
                       batch_size       = batch_size,
                       return_sequences = TRUE, 
                       stateful         = TRUE) %>% 
            layer_lstm(units            = 10, 
                       return_sequences = FALSE, 
                       stateful         = TRUE) %>% 
            layer_dense(units = 1)
        
        model %>% 
            compile(loss = 'mae', optimizer = optimizer)
        
        # 5.1.7 Fitting LSTM
        for (i in 1:epochs) {
            model %>% fit(x          = x_train_arr, 
                          y          = y_train_arr, 
                          batch_size = batch_size,
                          epochs     = 1, 
                          verbose    = 0, 
                          shuffle    = FALSE)
            
            model %>% reset_states()
            # cat("Epoch: ", i, '\n')
            
        }
        
        # 5.1.8 Predict and Return Tidy Data (MODIFIED)
        # Make Predictions
        pred_out <- model %>% 
            predict(x_test_arr, batch_size = batch_size) %>%
            .[,1] 
        
        # Make future index using tk_make_future_timeseries()
        idx <- data %>%
            tk_index() %>%
            tk_make_future_timeseries(n_future = lag_setting)
        
        # Retransform values
        pred_tbl <- tibble(
            index   = idx,
            value   = (pred_out * scale_history + center_history)^2
        )
        
        # Combine actual data with predictions
        tbl_1 <- df %>%
            add_column(key = "actual")

        tbl_3 <- pred_tbl %>%
            add_column(key = "predict")

        # Create time_bind_rows() to solve dplyr issue
        time_bind_rows <- function(data_1, data_2, index) {
            index_expr <- enquo(index)
            bind_rows(data_1, data_2) %>%
                as_tbl_time(index = !! index_expr)
        }

        ret <- list(tbl_1, tbl_3) %>%
            reduce(time_bind_rows, index = index) %>%
            arrange(key, index) %>%
            mutate(key = as_factor(key))

        return(ret)
        
    }
    
    safe_lstm <- possibly(lstm_prediction, otherwise = NA)
    
    safe_lstm(data, epochs, lag_setting, batch_size, train_length,  tsteps, optimizer, ...)
    
}

future_LSTM_tbl <- 
predict_keras_lstm_future(LSTM_train,
        epochs = 300,
        lag_setting = lag_setting,
        batch_size = batch_size,
        train_length = train_length ,
        tsteps = tsteps ,
        optimizer = 'adam'
)

future_LSTM_tbl %>%
    # filter_time("2008" ~ "end") %>%
    plot_prediction(id = NULL, alpha = 0.4, size = 1.5) +
    theme(legend.position = "bottom") +
    ggtitle("House prices: 3 Month Forecast using LSTM deep neural net", subtitle = "Using the full dataset")

Combine LSTM and Inference networks into 1 deep neural network??

I was tempted to combine these 2 networks together so that I could backpropegate them simultaneously. This would allow us to use the inference network using descriptive features together with the macro economic trends in the market of housing prices.

To stack different deep neural networks together we can use the keras API like this:

Define complete model

# data.frame(factor = 1:100,res = 2324/1:100) %>% filter(res == trunc(res))
# batch_size <- 28
batch_size <- 83

inference_input <- 
layer_input( name = "inference_input",batch_shape = c(batch_size,265))

encoded_inference_network <- 
inference_input %>%
layer_dense(units = 128, activation = 'relu') %>% # 1 axis and nr features = number columns
layer_batch_normalization() %>%
# layer_dropout(rate = 0.2) %>%
layer_dense(units = 64, activation ='relu') %>% 
layer_dense(units = 32, activation ='relu') 
# layer_dense(units = 25, activation ='relu') %>%

LSTM_input <- 
layer_input(name = "LSTM_input",batch_shape = c(batch_size,1,1))

encoded_LSTM_network <- 
LSTM_input %>%
            layer_lstm(units            = 100, 
                       batch_size       = batch_size,
                       return_sequences = TRUE, 
                       stateful         = TRUE) %>% 
            layer_lstm(units            = 100, 
                       return_sequences = FALSE, 
                       stateful         = TRUE) 

concatenated <- 
layer_concatenate(list(encoded_LSTM_network, encoded_inference_network))

house_price <- concatenated %>%
layer_dense(units = 5, activation = "relu") %>% 
layer_dense(units = 1)

model <- keras_model(list(LSTM_input, inference_input), house_price)
model %>% compile(
optimizer = "adam",
loss='mse',
metrics = c('mae')
)

Create our LSTM input:

I realised very quickly that we have a small problem here… To backpropogate both networks simultaneously they would need to feed the same batch of inputs into both networks! But our LSTM model was trained on monthly data - this means I had a 1 to many relationship between the 2 models inputs.

I tried to naively circumvent this by infering the lag of prices on a per house level. We could do this by simply using the average shift and applying it as a predicted lag index:

prepare_LSTM_input <- function(monthly_ts,sample_ts,lag_setting,train_length) {

# 5.1.5 Train Setup (MODIFIED)
lag_train_tbl <- monthly_ts %>%
    mutate(value_lag_month = lag(value, n = lag_setting)) %>%
    filter(!is.na(value_lag_month)) %>%
    tail(train_length) %>% 
rename(value_month = value)

join_df <- 
sample_ts %>% 
left_join(lag_train_tbl, by = 'index') %>% 
mutate(value_lag = value * value_lag_month/value_month) %>% 
select(index,value,value_lag)

rec_obj <- recipe(index ~ ., join_df) %>%
    step_sqrt(all_predictors()) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors()) %>%
    prep()

df_processed_tbl <- bake(rec_obj, join_df)

center_history <- rec_obj$steps[[2]]$means["value"]
scale_history  <- rec_obj$steps[[3]]$sds["value"]

x_train_vec <- df_processed_tbl$value_lag
x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))

y_train_vec <- df_processed_tbl$value
y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))

return(list(LSTM_input = x_train_arr, LSTM_target = y_train_arr, center_history = center_history, scale_history = scale_history))
}

return_original_scale <- function(pred_out, scale_history,center_history) {
return((pred_out * scale_history + center_history)^2)
}

sample_ts <- train %>%
  select(SalePrice,Yr.Sold,Mo.Sold) %>%
  tidyr::unite(col = index,Yr.Sold,Mo.Sold,sep="-") %>%
  mutate(index = index %>% zoo::as.yearmon() %>% as_date()) %>%
  rename(value = SalePrice) %>% as_tbl_time(index = index)
    
LSTM_input_full_model <- 
prepare_LSTM_input(monthly_ts = LSTM_train,sample_ts = sample_ts,lag_setting = 3,train_length = 24)

So let’s try to train the 2 deep neural networks simulteneously:

history <- model %>% fit(
list(inference_input = inference_train_X,
LSTM_input = LSTM_input_full_model$LSTM_input),
LSTM_input_full_model$LSTM_target,
epochs = 300, 
batch_size = batch_size,
shuffle = FALSE
,verbose = 0
)

The network can’t back propogate properly! This is because the LSTM model needs the inputs to move consistently through time. Even though our inputs are correctly lagged we have inconsistent numbers of rows for each time index which ruins the ‘memory’ part of the model…

What we can do however is use the prediction of price movement as a feature in the inference model. We can train the inference model by adding the future average price of properties by month by region truncating that data that has no future observations.

Then we train the LSTM and inference networks seperately - the LSTM then predicts that future movement for us so we can test the accuracy of our combined model feeding the LSTM into the inference network into a result

Future additions

I will build this feed forward from LSTM to inference network soon and then update this post!! Watch this space!!

StatsLab

Predict house prices - deep learning, keras

Overview

Naive model (no time index)

Load the data

Scale the variables

Define the model

Function that constructs a model

Construct tidy cross-validation

Measuring over-fit using k-folds crossvalidation

Get results

Using all the data

Benchmark vs Gradient boosting machines

Time series models using LSTM together with an inference network

Read in the data

Read in the data

Process data

Split the data for 2 models

Design inference model

Design LSTM model

Test a LSTM model

Split test/train

Scale and center the data

Everything set… time to get started!

Back test LSTM model

Test final model on full dataset

Combine LSTM and Inference networks into 1 deep neural network??

Define complete model

Future additions

Comments