Deep learning architectures have had a significant impact on computer vision, natural language processing, and other areas of machine learning. The picture is less clear in forecasting, where statistical and traditional machine learning models have generally outperformed other types of models. In recent years, Transformer architectures (e.g., Google’s Temporal Fusion Transformer) have sometimes outperformed conventional models, particularly in long-term forecasting (horizons of 12 weeks and beyond).
In April 2023, Google researchers proposed a Multi-Layer Perceptron (MLP)-based architecture called TiDE (Time-series Dense Encoder) for long-term time-series forecasting. Their empirical results showed that TiDE not only matched (and in some cases outperformed) complex deep learning architectures, but was also 5x to 10x faster than Transformer models. TiDE adopters, including the author of this paper, generally agree with these assessments. It is also interesting to note that TiDE often outperforms TSMixer, another MLP-based architecture developed by Google around the same time.
Summarizing the TiDE Architecture
TiDE is an MLP-based encoder-decoder architecture. The encoder consumes historical time-series data (along with associated covariate information) to create condensed, hidden representations. The decoder then uses these hidden representations to generate the forecasts. Both the encoder and the decoder are built from the Residual Block: a simple MLP comprising a hidden layer with ReLU activation, a dropout layer, a linear skip connection from input to output, and a final layer normalization.
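The Residual Block described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `ResidualBlock`, the weight initialization, and the dimensions are all hypothetical, the weights are untrained, and the layer normalization omits the learnable scale and shift that a real implementation would typically include.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector along the last axis (no learnable scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class ResidualBlock:
    """Hypothetical sketch: hidden ReLU layer, dropout, linear skip, layer norm."""

    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.0, seed=0):
        self.rng = np.random.default_rng(seed)
        # Randomly initialized (untrained) weights, for illustration only.
        self.w_hidden = self.rng.normal(0, 0.1, (in_dim, hidden_dim))
        self.b_hidden = np.zeros(hidden_dim)
        self.w_out = self.rng.normal(0, 0.1, (hidden_dim, out_dim))
        self.b_out = np.zeros(out_dim)
        self.w_skip = self.rng.normal(0, 0.1, (in_dim, out_dim))
        self.b_skip = np.zeros(out_dim)
        self.dropout = dropout

    def __call__(self, x, training=False):
        # Hidden layer with ReLU activation.
        hidden = np.maximum(x @ self.w_hidden + self.b_hidden, 0.0)
        # Linear map from the hidden layer to the output size.
        out = hidden @ self.w_out + self.b_out
        # Inverted dropout, applied only while training.
        if training and self.dropout > 0:
            keep = self.rng.random(out.shape) >= self.dropout
            out = out * keep / (1.0 - self.dropout)
        # Linear skip connection from the input, then layer normalization.
        skip = x @ self.w_skip + self.b_skip
        return layer_norm(out + skip)

block = ResidualBlock(in_dim=8, hidden_dim=16, out_dim=4, dropout=0.1)
x = np.random.default_rng(1).normal(size=(3, 8))  # batch of 3 input vectors
y = block(x)                                      # shape (3, 4)
```

Because the skip path is a plain linear map, the block can change dimensionality (8 to 4 here) while still giving gradients a direct route from output to input.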
See this figure from the original paper.
Source: Long-term Forecasting with TiDE: Time-series Dense Encoder, April 2023
To explain further, the encoder performs two main operations:
- Feature Projection through dimensionality reduction of the dynamic covariates.
- Dense Encoding in which the embedding is created based on the feature projection outputs, the static covariates, and the historical time-series data.
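The two encoder operations can be traced as array shapes. The sketch below is an assumption-laden illustration: the helper names (`init_block`, `residual_block`), all sizes, and the random inputs are hypothetical, the weights are untrained, biases and dropout are omitted for brevity, and a single block stands in for what may be a stack of blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_block(in_dim, hidden_dim, out_dim):
    # Untrained random weights for one simplified residual block (no biases).
    return {
        "w_hidden": rng.normal(0, 0.1, (in_dim, hidden_dim)),
        "w_out": rng.normal(0, 0.1, (hidden_dim, out_dim)),
        "w_skip": rng.normal(0, 0.1, (in_dim, out_dim)),
    }

def residual_block(x, p, eps=1e-5):
    # ReLU hidden layer plus linear skip connection, then layer normalization.
    out = np.maximum(x @ p["w_hidden"], 0.0) @ p["w_out"] + x @ p["w_skip"]
    mean = out.mean(axis=-1, keepdims=True)
    var = out.var(axis=-1, keepdims=True)
    return (out - mean) / np.sqrt(var + eps)

# Hypothetical sizes chosen only for illustration.
lookback, horizon = 24, 12
n_dynamic, n_projected, n_static, embed_dim = 6, 4, 3, 32

past_target = rng.normal(size=lookback)                              # historical series
dynamic_covariates = rng.normal(size=(lookback + horizon, n_dynamic))
static_covariates = rng.normal(size=n_static)

# 1. Feature projection: reduce each time step's dynamic covariates
#    from n_dynamic to n_projected values.
projection = init_block(n_dynamic, 16, n_projected)
projected = residual_block(dynamic_covariates, projection)           # (36, 4)

# 2. Dense encoding: concatenate history, flattened projected covariates,
#    and static covariates, then map them to one embedding vector.
encoder_input = np.concatenate(
    [past_target, projected.ravel(), static_covariates]
)                                                                    # (171,)
encoder = init_block(encoder_input.size, 64, embed_dim)
embedding = residual_block(encoder_input, encoder)                   # (32,)
```

The shapes show why the projection step matters: flattening the raw covariates would contribute 36 × 6 = 216 inputs to the dense encoder, whereas the projected version contributes 36 × 4 = 144.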
The decoder then takes the embedding and generates the forecasts through two main operations:
- Dense Decoding by mapping the embedding into a vector per time-step.
- Temporal Decoding by combining the Dense Decoding output with the projected features of that time step to generate the final predictions.
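The two decoder operations can be sketched the same way. Again, this is only an illustration under assumptions: the helper names, sizes, and random stand-in inputs are hypothetical, the weights are untrained, and biases and dropout are omitted. The sketch also skips layer normalization in the temporal decoder, since normalizing a single output value would collapse it to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_block(in_dim, hidden_dim, out_dim):
    # Untrained random weights for one simplified residual block (no biases).
    return {
        "w_hidden": rng.normal(0, 0.1, (in_dim, hidden_dim)),
        "w_out": rng.normal(0, 0.1, (hidden_dim, out_dim)),
        "w_skip": rng.normal(0, 0.1, (in_dim, out_dim)),
    }

def residual_block(x, p, normalize=True, eps=1e-5):
    # ReLU hidden layer plus linear skip connection, optional layer norm.
    out = np.maximum(x @ p["w_hidden"], 0.0) @ p["w_out"] + x @ p["w_skip"]
    if not normalize:
        return out
    mean = out.mean(axis=-1, keepdims=True)
    var = out.var(axis=-1, keepdims=True)
    return (out - mean) / np.sqrt(var + eps)

# Hypothetical sizes chosen only for illustration.
horizon, embed_dim, step_dim, n_projected = 12, 32, 4, 4

embedding = rng.normal(size=embed_dim)                       # stand-in for the encoder output
future_projected = rng.normal(size=(horizon, n_projected))   # stand-in projected future covariates

# 1. Dense decoding: map the embedding to one vector of size step_dim
#    per forecast time step.
dense_decoder = init_block(embed_dim, 64, horizon * step_dim)
decoded = residual_block(embedding, dense_decoder).reshape(horizon, step_dim)  # (12, 4)

# 2. Temporal decoding: for each step, combine its decoded vector with that
#    step's projected covariates and map the pair to a single forecast value.
temporal_decoder = init_block(step_dim + n_projected, 16, 1)
step_inputs = np.concatenate([decoded, future_projected], axis=1)              # (12, 8)
forecast = residual_block(step_inputs, temporal_decoder, normalize=False).squeeze(-1)  # (12,)
```

Note that the temporal decoder uses the same weights at every time step, so each forecast value depends on the shared embedding only through its own decoded vector, plus the covariates of that specific step.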
Why does TiDE outperform complex models?
Many researchers, including the author of this paper, have observed better long-term forecasting with TiDE than with more complex models. For instance, auto-regressive models (e.g., Amazon’s DeepAR) sometimes produce unreasonably high forecasts at longer horizons. TiDE does not ordinarily appear to have this limitation.
Ablation studies in the original paper identify the residual connections and the temporal decoder as the key differentiators. The residual connections enable smoother gradient flow, reducing the risk of the exploding or vanishing gradients that are sometimes observed in other types of models. This contributes to faster and more stable training, and may also reduce the risk of over-fitting.
The temporal decoder refines the initial forecasts produced by the dense decoder by combining them with the future covariates for each time step, so that the impact of external events captured in those covariates carries directly into the forecast. This helps avoid the accuracy losses often observed in other types of models when real-world disruptions take place in the longer term.
Furthermore, TiDE consumes much less compute and memory, so a greater number of training iterations can be run within a reasonable training time. This offers a practical advantage over more complex models.
Closing Comments
Conventional linear models outperform complex architectures in many forecasting use cases, but their utility is often limited over longer horizons or when the data is highly non-linear. Many deep learning architectures (e.g., Transformers) may offer improved results in such cases. However, these models are generally compute-intensive, offer limited ‘out-of-the-box’ interpretability, and may need a high degree of AI engineering skill for operationalization tasks (e.g., hyperparameter optimization). At a high level, traditional neural network architectures like TiDE have the potential to act as a bridge between the two approaches, combining the operational ease and simplicity of linear models with the non-linear forecasting power of neural networks.