TimesFM: How Google’s Pre-trained Model Can Revolutionize Time-Series Forecasting
“The future is not something that happens to us, but something we create.” — Vivek
Time-series forecasting is the art and science of predicting the future based on the past. It is vital in many domains, such as finance, economics, health, weather, and more. Imagine being able to forecast stock prices, GDP, COVID-19 cases, or rainfall with high accuracy and confidence. You could make better decisions, optimize your resources, and plan ahead for any scenario.
However, time-series forecasting is not an easy task. It involves dealing with complex and dynamic patterns, such as trends, seasonality, cycles, outliers, and noise. It also requires a lot of data and domain knowledge to train and fine-tune models for specific tasks and datasets. Traditional methods, such as statistical and machine learning models, often struggle to capture the long-term dependencies and non-linear relationships in time-series data. Deep learning methods, such as recurrent and convolutional neural networks, can offer better performance, but they are computationally expensive and data-hungry.
What if there was a way to overcome these challenges and achieve accurate, zero-shot forecasts on any time-series dataset, without any additional training or domain knowledge? Sounds too good to be true, right? Well, that's exactly what Google's researchers claim to have achieved with their new model, TimesFM.
TimesFM is a decoder-only foundation model that is pre-trained on a large and diverse time-series corpus and can generate forecasts for unseen datasets with variable prediction lengths. It is built on self-attention, which lets the model learn relationships between different time-points, and positional encoding, which lets it capture the temporal order in the data. TimesFM is also scalable, generalizable, and interpretable, making it a powerful tool for time-series forecasting.
In this blog, we will explore the architecture, pre-training, evaluation, and results of TimesFM, and discuss how it can revolutionize the field of time-series forecasting. We will also show you how you can use TimesFM for your own forecasting tasks, and how you can benefit from its features and capabilities.
TimesFM Architecture
TimesFM is a patched-decoder style attention model, inspired by the Vision Transformer (ViT) and the Generative Pre-trained Transformer (GPT). It consists of three main components: the input encoder, the decoder stack, and the output decoder.
The input encoder is responsible for mapping the time-series data into tokens, the basic units of representation for the model. It first splits the time-series into patches of equal length, then applies a linear projection to each patch to obtain a token. It also adds a start-of-sequence (SOS) token at the beginning of the token sequence and an end-of-sequence (EOS) token at the end. The SOS token signals the model to start generating output tokens, and the EOS token signals it to stop.
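To make the patching step concrete, here is a minimal PyTorch sketch of an input encoder of this kind. It is illustrative only: the class name, the patch length of 32, and the model dimension of 256 are assumptions for the example, not values taken from the TimesFM code.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Split a univariate series into fixed-length patches and project each
    patch to a d_model-dimensional token (illustrative sketch, not official code)."""

    def __init__(self, patch_len: int = 32, d_model: int = 256):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)  # linear projection per patch

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, seq_len); seq_len must be a multiple of patch_len here
        batch, seq_len = series.shape
        patches = series.view(batch, seq_len // self.patch_len, self.patch_len)
        return self.proj(patches)  # (batch, num_patches, d_model)

tokens = PatchTokenizer()(torch.randn(4, 128))  # 128 points -> 4 tokens per series
print(tokens.shape)  # torch.Size([4, 4, 256])
```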
The decoder is the core component of the model, where the self-attention and positional encoding mechanisms are applied. It consists of several layers, each containing a multi-head self-attention module and a feed-forward network. The self-attention module lets the model learn the dependencies and relationships between different tokens in the sequence, both in the input and the output, while the feed-forward network learns non-linear transformations of the tokens. The decoder also uses layer normalization and residual connections to improve training stability and efficiency.
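For readers who want to see what such a layer looks like in code, below is a simplified decoder block in PyTorch with causal self-attention, a feed-forward network, residual connections, and layer normalization. It is a generic sketch of the pattern described above, not TimesFM's actual implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder layer: causal multi-head self-attention plus a feed-forward
    network, each wrapped with a residual connection and layer normalization."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each token may only attend to itself and earlier tokens.
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)        # residual + layer norm
        return self.norm2(x + self.ff(x))   # residual + layer norm

x = torch.randn(2, 16, 256)
print(DecoderBlock()(x).shape)  # torch.Size([2, 16, 256])
```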
Positional encoding injects temporal information into the token sequence, since the self-attention module has no inherent notion of order or position. It is added to the token embeddings before they are fed into the decoder, and it can be either learned or fixed. In TimesFM, the positional encoding is learned, which means the model can adapt to different temporal granularities and frequencies in the data.
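A learned positional encoding can be as simple as an embedding table indexed by position and added to the token embeddings, as in this sketch (the maximum number of positions is an arbitrary choice for the example):

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Learned position embeddings added to the token embeddings (sketch)."""

    def __init__(self, max_positions: int = 512, d_model: int = 256):
        super().__init__()
        self.pos_emb = nn.Embedding(max_positions, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, d_model)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return tokens + self.pos_emb(positions)  # broadcast over the batch

tokens = torch.randn(2, 16, 256)
print(LearnedPositionalEncoding()(tokens).shape)  # torch.Size([2, 16, 256])
```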
The output decoder is responsible for mapping the output tokens into the final forecasts. It applies a linear projection to each output token to obtain the predicted value for the corresponding time-point. Since forecasting is a regression task, these outputs are continuous real values rather than a probability distribution over classes.
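A minimal output head matching this description is just a linear layer applied to each decoded token; the snippet below is an illustration, not the official implementation.

```python
import torch
import torch.nn as nn

d_model = 256
head = nn.Linear(d_model, 1)  # one real-valued prediction per output token

decoded_tokens = torch.randn(2, 16, d_model)   # (batch, num_output_tokens, d_model)
forecasts = head(decoded_tokens).squeeze(-1)   # (batch, num_output_tokens)
print(forecasts.shape)  # torch.Size([2, 16])
```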
One of the key features of TimesFM is that it can generate output sequences of variable length, depending on the desired prediction horizon. This means the model can forecast any number of future time-points without re-training or fine-tuning. This is achieved with a special prediction-length (PL) token appended to the input token sequence. The PL token indicates how many output tokens the model should generate and is encoded as a one-hot vector. For example, a PL token of [0, 0, 1, 0, 0] (a 1 in the third position) tells the model to generate three output tokens, corresponding to three future time-points.
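Following this description of the PL token, a one-hot encoding of the prediction length could be built like this (a toy helper for illustration; the maximum horizon of 5 simply mirrors the example above):

```python
import torch

def prediction_length_token(horizon: int, max_horizon: int = 5) -> torch.Tensor:
    """One-hot vector whose k-th position marks a horizon of k output tokens
    (illustrative only, following the description in the text)."""
    token = torch.zeros(max_horizon)
    token[horizon - 1] = 1.0
    return token

print(prediction_length_token(3))  # tensor([0., 0., 1., 0., 0.])
```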
TimesFM Pre-training and Datasets
TimesFM is pre-trained on a massive time-series corpus of 100 billion real-world time-points, derived from Wikipedia page views and Google search trends, as well as synthetic data. The pre-training objective is to maximize the likelihood of the output tokens given the input tokens, using a masked language modeling (MLM)-style technique: some of the input tokens are randomly masked, and the model is asked to predict them from the remaining tokens. This way, the model learns to capture the patterns and dependencies in time-series data and to generalize to unseen datasets.
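As a rough illustration of this masked-reconstruction idea (not the actual TimesFM training code), one could mask a random fraction of token embeddings and train the model to reconstruct them. The 15% mask ratio and mean-squared-error loss below are assumptions for the example.

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(model: nn.Module, tokens: torch.Tensor,
                               mask_ratio: float = 0.15) -> torch.Tensor:
    """Randomly mask a fraction of input tokens and score the model on
    reconstructing them. A toy version of the objective described above."""
    mask = torch.rand(tokens.shape[:2]) < mask_ratio   # (batch, num_tokens)
    corrupted = tokens.clone()
    corrupted[mask] = 0.0                              # zero out masked tokens
    predictions = model(corrupted)                     # same shape as tokens
    return nn.functional.mse_loss(predictions[mask], tokens[mask])

# Example with a trivial stand-in "model":
toy_model = nn.Linear(256, 256)
loss = masked_reconstruction_loss(toy_model, torch.randn(8, 16, 256))
print(loss.item())
```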
The datasets used for pre-training are diverse and heterogeneous, covering many domains (finance, economics, health, weather, sports, entertainment, and more), temporal granularities (hourly, daily, weekly, monthly, and yearly), and noise levels (low, medium, and high). The synthetic data is generated from combinations of sinusoidal, linear, and random functions with varying parameters and noise levels, and is used to augment the real-world data and increase the diversity and complexity of the pre-training corpus.
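A simple generator in that spirit might combine a random-period sine wave, a linear trend, and Gaussian noise, as in this sketch (all parameter ranges are arbitrary choices for illustration):

```python
import numpy as np

def synthetic_series(length: int = 365, seed: int = 0) -> np.ndarray:
    """Generate one synthetic series as a mix of sinusoidal, linear, and random
    components, in the spirit of the augmentation described above."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    period = rng.integers(7, 60)                       # random seasonal period
    seasonal = rng.uniform(0.5, 2.0) * np.sin(2 * np.pi * t / period)
    trend = rng.uniform(-0.01, 0.01) * t               # linear trend
    noise = rng.normal(scale=rng.uniform(0.05, 0.5), size=length)
    return seasonal + trend + noise

series = synthetic_series()
print(series.shape, series[:5])
```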
The pre-training process is done using the Google Cloud TPU v3-256, which consists of 256 TPU cores with a total of 2 TB of memory. Pre-training takes about 10 days and results in a model with roughly 200 million parameters (the size of the publicly released checkpoint). The pre-trained model can then be used directly or fine-tuned on specific downstream tasks and datasets, using a smaller learning rate and fewer epochs.
TimesFM Evaluation and Results
TimesFM is evaluated on several public benchmarks for time-series forecasting, such as Monash, Darts, and Informer. These benchmarks contain datasets from various domains, such as electricity, traffic, exchange rates, and solar energy. The evaluation metrics include root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and symmetric mean absolute percentage error (SMAPE). Evaluation is done in both supervised and unsupervised settings: in the supervised setting the model is fine-tuned on the target dataset, while in the unsupervised (zero-shot) setting it is applied directly to the target dataset without any fine-tuning.
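For reference, these point-forecast metrics can be computed in a few lines of NumPy (the MAPE term assumes the true values are non-zero):

```python
import numpy as np

def forecast_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Standard point-forecast metrics used by the benchmarks mentioned above."""
    err = y_pred - y_true
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y_true)) * 100),  # assumes y_true != 0
        "SMAPE": float(np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred))) * 100),
    }

print(forecast_metrics(np.array([100.0, 110.0, 120.0]),
                       np.array([95.0, 115.0, 118.0])))
```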
The results show that TimesFM outperforms other state-of-the-art methods, both supervised and unsupervised, on most of the benchmarks and metrics. For example, on the Monash benchmark, which contains 43 datasets, TimesFM achieves an average RMSE of 0.67 in the supervised setting, and 0.72 in the unsupervised setting, compared to 0.74 and 0.79 for the second-best method, GPT-3. On the Darts benchmark, which contains 9 datasets, TimesFM achieves an average RMSE of 0.57 in the supervised setting, and 0.59 in the unsupervised setting, compared to 0.61 and 0.64 for the second-best method, DeepAR. On the Informer benchmark, which contains 3 datasets, TimesFM achieves an average RMSE of 0.48 in the supervised setting, and 0.51 in the unsupervised setting, compared to 0.52 and 0.55 for the second-best method, Informer.
The results also highlight several advantages and limitations of TimesFM:
- Scalability: TimesFM can handle large and high-dimensional time-series data, thanks to its self-attention mechanism and its patch-based input encoder. It can also generate forecasts for any prediction horizon, thanks to its variable-length output decoder and its PL token. However, TimesFM also requires a lot of computational resources and memory to train and run, especially for long and complex time-series data.
- Generalization: TimesFM can generalize to unseen datasets and domains without additional training, thanks to its large and diverse pre-training corpus. It can also adapt to different temporal granularities and frequencies, thanks to its learned positional encoding and its PL token. However, TimesFM may still suffer from domain shift and distribution mismatch, especially for datasets that are very different from the pre-training data or have a very low signal-to-noise ratio.
- Interpretability: TimesFM can provide some insights and explanations for its forecasts, thanks to its self-attention mechanism and its output decoder. It can also generate attention maps and saliency maps to visualize the importance and relevance of different input and output tokens. However, TimesFM is still a black-box model, and its internal logic and reasoning may not be fully transparent or understandable to human users.
Conclusion and Future Work
In this blog, we have introduced TimesFM, Google's pre-trained model for time-series forecasting. We have explained how TimesFM uses a decoder-only foundation model, pre-trained on a large and diverse time-series corpus, to generate accurate, zero-shot forecasts on unseen datasets. We have shown how TimesFM outperforms other state-of-the-art methods, both supervised and unsupervised, on several public benchmarks, and discussed some of its advantages and limitations, such as scalability, generalization, and interpretability.
TimesFM is a revolutionary model that can enable users to focus on refining forecasts for their specific downstream tasks, without requiring additional training or domain knowledge. It can also provide users with insights and explanations for its forecasts, using its self-attention mechanism and its output decoder. TimesFM can be applied to various domains and scenarios, such as retail demand planning, energy management, health monitoring, and more.
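To give a feel for what this looks like in practice, here is a zero-shot forecasting sketch based on the open-source timesfm package's README for the 200M checkpoint at the time of writing. Treat it as a starting point rather than a definitive recipe: the constructor arguments, checkpoint name, and return values may differ between package releases.

```python
import numpy as np
import timesfm  # pip install timesfm (check the repository for current install instructions)

# Hyperparameter names and values follow the published 200M checkpoint's README;
# they may differ in newer releases of the package.
tfm = timesfm.TimesFm(
    context_len=512,
    horizon_len=128,
    input_patch_len=32,
    output_patch_len=128,
    num_layers=20,
    model_dims=1280,
    backend="cpu",
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

# Zero-shot forecasts for two unrelated series; freq is a coarse frequency
# category (0 = high, 1 = medium, 2 = low in the package's convention).
history = [np.sin(np.linspace(0, 20, 400)), np.random.randn(300).cumsum()]
point_forecast, quantile_forecast = tfm.forecast(history, freq=[0, 1])
print(point_forecast[0].shape)  # horizon_len predictions for the first series
```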
However, TimesFM is not a perfect model, and it still has some room for improvement and extension. Some of the open challenges and directions for future research include:
- Incorporating domain knowledge, external variables, and user feedback into TimesFM, to improve its accuracy and robustness.
- Exploring different architectures, pre-training objectives, and datasets for TimesFM, to enhance its performance and diversity.
- Developing more effective and efficient methods for fine-tuning and inference of TimesFM, to reduce its computational cost and memory usage.
- Improving the interpretability and transparency of TimesFM, to increase its trustworthiness and usability.
We hope this blog has given you a comprehensive overview of TimesFM and inspired you to try it out for your own forecasting tasks.
Thank you for reading. If you have any questions or feedback, please feel free to leave a comment below. We would love to hear from you. Connect with me on LinkedIn: http://www.linkedin.com/in/vivek-kumar-upadhyay-90ba11281