How Lag-Llama Transforms Time Series Forecasting with a Pretrained Transformer Model

37 min read · Feb 9, 2024

“The future is not something that happens to us, but something we create.” — Vivek

Time series forecasting is the art and science of predicting the future based on the past. It is a crucial skill for many domains and applications, such as finance, healthcare, energy, transportation, and more. Imagine if you could forecast the stock market, the weather, the traffic, or the spread of a disease with high accuracy and reliability. How much value would that create for you and your business?

However, time series forecasting is not an easy task. It involves dealing with complex and dynamic patterns, such as trends, seasonality, cycles, outliers, and noise. It also requires handling various types of data, such as univariate or multivariate, continuous or discrete, regular or irregular, and so on. Moreover, it often faces the challenge of data scarcity, where there is not enough historical data to train a reliable model.

To address these challenges, researchers have developed various methods and models for time series forecasting, such as ARIMA, exponential smoothing, neural networks, and more. However, most of these methods have some limitations, such as:

They require a lot of domain knowledge and manual tuning to select the best model and parameters for each dataset.
They are not very flexible and scalable, as they cannot handle variable-length inputs and outputs, or adapt to different data characteristics and domains.
They are not very robust and generalizable, as they tend to overfit to the training data and fail to transfer to new and unseen data.

To overcome these limitations, a new paradigm has emerged in the field of time series forecasting: foundation models. Foundation models are large-scale, pretrained, and self-supervised models that can learn from a massive and diverse corpus of data, and then fine-tune or adapt to specific downstream tasks and domains. Foundation models have shown remarkable results in natural language processing (NLP) and computer vision (CV), such as GPT-3 and BERT for NLP, and ResNet and ViT for CV.

However, foundation models for time series forecasting are still in their infancy, and there are many open questions and challenges to be solved. For instance, how to design an effective and efficient architecture for time series data? How to leverage the temporal and causal structure of time series data? How to generate probabilistic forecasts that capture the uncertainty and variability of the future? How to pretrain a model on a large and diverse corpus of time series data? And how to evaluate and benchmark the performance and generalization of foundation models for time series forecasting?

In this blog, we will introduce you to Lag-Llama, a novel foundation model for univariate probabilistic time series forecasting, developed by researchers from Morgan Stanley, ServiceNow Research, Université de Montréal, Mila-Quebec AI Institute, and McGill University. Lag-Llama is described by its authors as the first open-source foundation model for time series forecasting, and it has reported state-of-the-art results on several forecasting benchmarks. We will explain how Lag-Llama works, what makes it unique and powerful, and how it can transform the field of time series forecasting. We will also show you some examples and visualizations of the forecasts generated by Lag-Llama, and how you can use it for your own forecasting problems.

Here are the main points that we will cover in this blog:

Background: We will provide some necessary background information on the concepts and methods used by Lag-Llama, such as transformers, lags, probabilistic forecasting, and foundation models. We will also review some related work and compare Lag-Llama with existing approaches for time series forecasting.
Methodology: We will describe the architecture, training, and inference of Lag-Llama in detail. We will explain how Lag-Llama uses lags as covariates, how it handles variable-length inputs and outputs, how it generates probabilistic forecasts, and how it leverages pretraining on a large and diverse corpus of time series data.
Experiments: We will present the experimental setup and results of Lag-Llama on various downstream datasets across domains. We will demonstrate the strong zero-shot and few-shot generalization capabilities of Lag-Llama, as well as its state-of-the-art performance when fine-tuned on small fractions of unseen data. We will also provide some qualitative examples and visualizations of the forecasts generated by Lag-Llama.
Discussion: We will summarize the main findings and implications of Lag-Llama, as well as its limitations and future directions. We will highlight the potential of Lag-Llama as a foundation model for time series forecasting, and how it can benefit various domains and applications that rely on accurate and reliable forecasts. We will also acknowledge the challenges and open questions that remain to be addressed by future research.

By the end of this blog, you will have a clear understanding of how Lag-Llama transforms time series forecasting with a pretrained transformer model, and how you can use it for your own forecasting problems. You will also learn how foundation models are changing the landscape of time series forecasting, and what opportunities and challenges lie ahead. So, let’s get started! 🚀

Background

Before we dive into the details of Lag-Llama, let’s first review some of the background concepts and methods that are relevant for time series forecasting, and how Lag-Llama builds upon them. In this section, we will cover the following topics:

Transformers: Transformers are a type of neural network architecture that use attention mechanisms to learn the dependencies and relationships between the input and output sequences. Transformers have been widely used for natural language processing and computer vision tasks, such as machine translation, text generation, image classification, and more. Lag-Llama uses a transformer-based architecture to model the temporal and causal structure of time series data, and to generate forecasts based on the input history.
Lags: Lags are a common feature engineering technique for time series forecasting, where the past values of the target variable are used as covariates or inputs for the forecasting model. Lags can capture the autocorrelation and seasonality patterns in time series data, and improve the forecasting accuracy and stability. Lag-Llama uses lags as covariates, and learns how to select and weight the relevant lags for each time series and forecasting horizon.
Probabilistic forecasting: Probabilistic forecasting is a type of forecasting that aims to generate not only a point estimate, but also a probability distribution over the possible future outcomes. Probabilistic forecasting can provide more information and insights about the uncertainty and variability of the future, and enable better decision making and risk management. Lag-Llama generates probabilistic forecasts, and uses a mixture density network to model the output distribution as a mixture of Gaussian components.
Foundation models: Foundation models are large-scale, pretrained, and self-supervised models that can learn from a massive and diverse corpus of data, and then fine-tune or adapt to specific downstream tasks and domains. Foundation models can leverage the commonalities and generalities across different data sources and modalities, and achieve remarkable results on various tasks with minimal supervision and domain knowledge. Lag-Llama is a foundation model for time series forecasting, and it is pretrained on a large and diverse corpus of time series data, covering various domains, frequencies, lengths, and characteristics.

We will now explain each of these topics in more detail, and how they are related to Lag-Llama.

Transformers

Transformers are a type of neural network architecture that use attention mechanisms to learn the dependencies and relationships between the input and output sequences. Transformers were first introduced by Vaswani et al. (2017) for the task of machine translation, and they have since been widely used for natural language processing and computer vision tasks, such as text generation, image classification, and more.

The main advantage of transformers over other sequence models, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), is that they can capture long-range and global dependencies without relying on sequential processing or fixed-length windows. This makes them more flexible and efficient, as they can handle variable-length inputs and outputs, and parallelize the computation across the sequence elements.

The basic building block of a transformer is the attention layer, which computes a weighted sum of the input sequence elements, where the weights are determined by the similarity or relevance of each element to the query. The attention layer can be either self-attention, where the queries come from the same sequence that is being attended to, or cross-attention, where the queries come from a different sequence. The attention layer can also be either single-head or multi-head, where the latter projects each element's representation into multiple lower-dimensional subspaces, computes the attention independently in each subspace, and then combines the results.
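To make the weighted-sum idea concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. It is only an illustration of the general mechanism: the function names and the toy vectors are made up for this example, and real transformers also apply learned projections to produce the queries, keys, and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of `values` and the attention weights.
    """
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights

# Toy self-attention: the query is itself one of the sequence elements.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(sequence[2], sequence, sequence)
```

The third element is the most similar to the query (it is the query), so it receives the largest weight, while the other two share the rest equally.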

A transformer typically consists of an encoder and a decoder, where the encoder processes the input sequence and generates a latent representation, and the decoder generates the output sequence based on the latent representation and the previous output elements. The encoder and the decoder are composed of multiple stacked layers, each consisting of an attention layer, a feed-forward layer, and a residual connection with layer normalization. The encoder uses self-attention, while the decoder uses both self-attention and cross-attention, where the latter attends to the encoder output.

Lag-Llama uses a transformer-based architecture to model the temporal and causal structure of time series data, and to generate forecasts based on the input history. Lag-Llama uses a unidirectional encoder-decoder architecture, where the encoder processes the input history and the decoder generates the output forecast. The encoder and the decoder have the same number of layers, and each layer has the same number of attention heads and hidden units. Lag-Llama uses masked self-attention in the decoder, where the future output elements are masked to prevent the model from peeking ahead. Lag-Llama also uses positional encoding to inject the temporal information into the input and output sequences, and layer dropout to regularize the model and prevent overfitting.

Lags

Lags are a common feature engineering technique for time series forecasting, where the past values of the target variable are used as covariates or inputs for the forecasting model. Lags can capture the autocorrelation and seasonality patterns in time series data, and improve the forecasting accuracy and stability.

For example, suppose we want to forecast the monthly sales of a product based on the past 12 months of data. We can use the sales values of the previous 12 months as lags, and feed them to the forecasting model along with the current month. The model can then learn how the sales vary over time, and how they depend on the previous months.
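The monthly sales example can be sketched in a few lines of Python. The helper name `make_lag_features` and the sales numbers are illustrative assumptions, not taken from Lag-Llama; the point is only to show how lagged values become model inputs.

```python
def make_lag_features(series, lags):
    """Build (features, targets): each feature row holds series[t - lag] for every lag."""
    max_lag = max(lags)
    features, targets = [], []
    for t in range(max_lag, len(series)):
        features.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return features, targets

# Illustrative monthly sales values (18 months).
monthly_sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
                 115, 126, 141, 135, 125, 149]

# Use the previous 12 months as lags for each month we want to predict.
X, y = make_lag_features(monthly_sales, lags=list(range(1, 13)))
```

With 18 months of data and 12 lags, only the last 6 months have a full set of lagged inputs, which is one reason long lag sets are costly when data is scarce.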

However, not all lags are equally important or relevant for forecasting. Some lags may have more influence or correlation with the future than others, depending on the data characteristics and the forecasting horizon. For instance, if we want to forecast the daily temperature of a city, the temperature of the previous day may be more relevant than the temperature of the previous week or month. Similarly, if we want to forecast the annual GDP of a country, the GDP of the previous year may be more relevant than the GDP of the previous quarter or month.

Therefore, it is desirable to have a method that can automatically select and weight the relevant lags for each time series and forecasting horizon, without requiring manual tuning or domain knowledge. This is what Lag-Llama does, using a novel technique called lag attention.

Lag attention is a mechanism that allows Lag-Llama to learn how to select and weight the relevant lags for each time series and forecasting horizon, based on the input history and the output forecast. Lag attention is similar to the attention mechanism used by transformers, but instead of attending to the input sequence elements, it attends to the lagged values of the target variable.

Lag attention works as follows: For each input history and output forecast, Lag-Llama computes a set of lag embeddings, which are vectors that represent the lagged values of the target variable. The lag embeddings are computed by applying a linear transformation to the target variable, followed by a positional encoding to inject the temporal information. The lag embeddings are then fed to the attention layer, where they are compared with the query, which is either the encoder output or the decoder output. The attention layer computes a set of lag weights, which are scalars that indicate the importance or relevance of each lag for the query. The lag weights are then used to compute a weighted sum of the lag embeddings, which is the output of the lag attention layer.
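The steps above can be sketched loosely in plain Python. This is only an illustration of the mechanism as described in this post, not code from the Lag-Llama repository: the function names, the fixed `weight` vector standing in for a learned linear transformation, the sinusoidal encoding of the lag index, and the toy numbers are all assumptions.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lag_embedding(value, lag, weight):
    """Linear map of a scalar lagged value plus a sinusoidal encoding of the lag index."""
    dim = len(weight)
    emb = []
    for j in range(dim):
        angle = lag / (10000 ** (2 * (j // 2) / dim))
        pos = math.sin(angle) if j % 2 == 0 else math.cos(angle)
        emb.append(weight[j] * value + pos)
    return emb

def lag_attention(series, lags, query, weight):
    """Score each lag embedding against the query and return their weighted sum."""
    dim = len(query)
    # series[-lag] is the value observed `lag` steps before the end of the history.
    embs = [lag_embedding(series[-lag], lag, weight) for lag in lags]
    scores = [sum(q * e for q, e in zip(query, emb)) / math.sqrt(dim) for emb in embs]
    lag_weights = softmax(scores)
    output = [sum(w * emb[j] for w, emb in zip(lag_weights, embs)) for j in range(dim)]
    return output, lag_weights

history = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
query = [0.3, -0.1, 0.5, 0.2]      # stand-in for an encoder or decoder output
weight = [0.5, -0.2, 0.1, 0.4]     # stand-in for a learned linear transformation
output, lag_weights = lag_attention(history, [1, 2, 3], query, weight)
```

Each entry of `lag_weights` is the scalar importance assigned to one lag, and `output` is the weighted combination of the lag embeddings.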

Lag-Llama uses lag attention in both the encoder and the decoder, where the encoder uses lag attention to encode the input history, and the decoder uses lag attention to generate the output forecast. Lag-Llama also uses multi-head lag attention, where the lag embeddings are split into multiple subspaces and the attention is computed independently for each subspace. This allows Lag-Llama to capture different aspects or features of the lags, such as trend, seasonality, cycle, and noise.

By using lag attention, Lag-Llama can effectively leverage the past values of the target variable as covariates, and learn how to select and weight the relevant lags for each time series and forecasting horizon. This can improve the forecasting accuracy and stability, especially when the data is scarce or noisy. Lag attention also makes Lag-Llama more flexible and scalable, as it can handle variable-length inputs and outputs, and adapt to different data characteristics and domains.

Probabilistic forecasting

Probabilistic forecasting is a type of forecasting that aims to generate not only a point estimate, but also a probability distribution over the possible future outcomes. Probabilistic forecasting can provide more information and insights about the uncertainty and variability of the future, and enable better decision making and risk management.

For example, suppose we want to forecast the daily demand of a product based on the past 30 days of data. A point estimate would give us a single value for each day, such as 100 units, 120 units, 90 units, and so on. However, a point estimate does not tell us how confident or uncertain we are about the forecast, or how much the demand can vary from the expected value. A probability distribution, on the other hand, would give us a range of values for each day, along with the likelihood of each value, such as 80–120 units with 95% probability, 60–140 units with 99% probability, and so on. A probability distribution can also capture the shape and characteristics of the output distribution, such as whether it is symmetric or skewed, unimodal or multimodal, normal or heavy-tailed, and so on.
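As a small illustration of the difference, the sketch below turns a set of forecast samples for one day into a point estimate and two prediction intervals. The samples are simulated from a normal distribution purely for demonstration; the helper name `interval` and all numbers are assumptions.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical forecast samples for one day's demand, in units.
samples = [rng.gauss(100, 10) for _ in range(10_000)]

def interval(samples, coverage):
    """Central prediction interval that contains roughly `coverage` of the samples."""
    ordered = sorted(samples)
    tail = (1 - coverage) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper

point_forecast = statistics.mean(samples)   # a single number
low95, high95 = interval(samples, 0.95)     # a range with about 95% coverage
low99, high99 = interval(samples, 0.99)     # a wider range with about 99% coverage
```

The point forecast collapses everything into one value, whereas the intervals show how wide the plausible range is and how it grows with the required coverage.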

Probabilistic forecasting can have several advantages over point forecasting, such as:

It can quantify and communicate the uncertainty and risk associated with the forecast, and help users to make informed decisions based on their preferences and objectives.
It can provide more reliable estimates of the expected value, and point error metrics such as the mean absolute error (MAE) or the root mean squared error (RMSE) can be complemented by measures that take the variability of the output distribution into account.
It can enable more sophisticated and robust analysis and optimization, such as scenario planning, sensitivity analysis, robust optimization, and stochastic optimization, by using the entire output distribution rather than a single value.

However, probabilistic forecasting is also more challenging and complex than point forecasting, as it requires modeling and generating the output distribution rather than a single value. This involves choosing an appropriate distribution family, estimating the distribution parameters, and sampling from the distribution. Moreover, probabilistic forecasting requires evaluating and comparing the performance and quality of the output distribution, rather than a single value, which can be done using various criteria, such as the likelihood, the calibration, the sharpness, and the scoring rules.
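One widely used scoring rule for quantile forecasts is the pinball (quantile) loss. The sketch below is a generic illustration and is not claimed to be the metric used to evaluate Lag-Llama; the numbers are made up.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for a predicted q-quantile (0 < q < 1); lower is better."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff

# For the 0.9 quantile, predicting too low is penalized more than predicting too high.
under = pinball_loss(y_true=120.0, y_pred=100.0, q=0.9)   # about 18
over = pinball_loss(y_true=80.0, y_pred=100.0, q=0.9)     # about 2
```

Averaging this loss over many quantile levels and time steps gives a single number that rewards forecasts that are both calibrated and sharp.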

Lag-Llama generates probabilistic forecasts, and uses a mixture density network to model the output distribution as a mixture of Gaussian components. A mixture density network is a neural network that outputs the parameters of a mixture distribution, such as the number of components, the means, the variances, and the weights. A mixture distribution is a weighted combination of multiple component distributions, such as Gaussian, Poisson, or Bernoulli. With enough components, a Gaussian mixture can closely approximate a wide range of continuous distributions, which lets it capture features of the output distribution such as multimodality, skewness, and heavier tails than a single Gaussian.

The forecast is produced as follows: For each input history, the mixture density network takes the decoder output as input and computes the parameters of the Gaussian mixture distribution, namely the number of components, the means, the variances, and the weights. Forecast values are then obtained by sampling from this Gaussian mixture distribution.

By using a mixture density network, Lag-Llama can effectively model and generate the output distribution as a mixture of Gaussian components, and provide probabilistic forecasts that capture the uncertainty and variability of the future. This can provide more information and insights for the users, and enable better decision making and risk management. Lag-Llama also uses negative log-likelihood as the loss function, which measures how well the output distribution fits the true data distribution, and encourages the model to produce accurate and sharp forecasts.
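The negative log-likelihood of a Gaussian mixture can be sketched as follows. This is a generic illustration with made-up parameters, using the log-sum-exp trick for numerical stability; it is not taken from the Lag-Llama code.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def mixture_nll(x, weights, means, variances):
    """Negative log-likelihood of x under a Gaussian mixture (log-sum-exp form)."""
    log_terms = [math.log(w) + gaussian_log_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

# Hypothetical two-component forecast distribution for one time step.
weights, means, variances = [0.7, 0.3], [100.0, 140.0], [25.0, 100.0]
nll_near = mixture_nll(102.0, weights, means, variances)   # observation near a mode
nll_far = mixture_nll(200.0, weights, means, variances)    # observation far from both modes
```

An observation close to one of the modes yields a small loss, while one far from both is penalized heavily, which is what pushes the model toward distributions that cover the data without being needlessly wide.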

Foundation models

Foundation models are large-scale, pretrained, and self-supervised models that can learn from a massive and diverse corpus of data, and then fine-tune or adapt to specific downstream tasks and domains. Foundation models can leverage the commonalities and generalities across different data sources and modalities, and achieve remarkable results on various tasks with minimal supervision and domain knowledge.

Foundation models have been a game-changer in the fields of natural language processing and computer vision, where they have shown unprecedented performance and versatility on a wide range of tasks and domains, such as text generation, image classification, speech recognition, and more. Some of the most famous examples of foundation models are GPT-3 and BERT for natural language processing, and ResNet and ViT for computer vision.

Lag-Llama is a foundation model for time series forecasting, and it is pretrained on a large and diverse corpus of time series data, covering various domains, frequencies, lengths, and characteristics. The corpus of time series data is collected from various public sources, such as Kaggle, UCI, M4, and more. The corpus contains over 200,000 time series, spanning over 20 domains, such as finance, healthcare, energy, transportation, and more. The corpus also covers different frequencies, such as hourly, daily, weekly, monthly, and more, and different lengths, ranging from 10 to 10,000 observations.

Lag-Llama is pretrained using a self-supervised objective, which is to reconstruct the input history and predict the output forecast, given a masked or corrupted input history. The input history is masked or corrupted by randomly replacing, deleting, or inserting some of the input elements, in the spirit of the masked-token corruption used to pretrain BERT. The output forecast is generated by sampling from the output distribution, as described in the probabilistic forecasting section above. The model is then trained to minimize the reconstruction and prediction errors, using the negative log-likelihood loss.

By pretraining on a large and diverse corpus of time series data, Lag-Llama can learn the common and general patterns and features of time series data, such as trend, seasonality, cycle, noise, and more. It can also learn how to handle different types of data, such as univariate or multivariate, continuous or discrete, regular or irregular, and more. Moreover, it can learn how to generate probabilistic forecasts that capture the uncertainty and variability of the future, and how to select and weight the relevant lags for each time series and forecasting horizon.

Pretraining also enables Lag-Llama to achieve strong zero-shot and few-shot generalization capabilities, meaning that it can perform well on new and unseen data without any or with minimal fine-tuning or adaptation. This can save a lot of time and resources, as it does not require collecting and labeling a large amount of domain-specific data, or tuning and selecting the best model and parameters for each dataset. Lag-Llama can also benefit from fine-tuning or adaptation, where it can further improve its performance by learning from a small fraction of the unseen data, and adjusting its parameters accordingly. This can make Lag-Llama more robust and adaptable, as it can handle the domain shift and data drift that may occur in real-world scenarios.

This concludes the background section, where we have covered transformers, lags, probabilistic forecasting, and foundation models, and explained how each of them relates to Lag-Llama. In the next section, we will describe the methodology of Lag-Llama in detail. 😊

Methodology

In this section, we will describe the methodology of Lag-Llama in detail. We will cover the following topics:

Architecture: We will describe the architecture of Lag-Llama, which is a unidirectional encoder-decoder transformer with lag attention and mixture density network.
Training: We will describe the training procedure of Lag-Llama, which involves pretraining on a large and diverse corpus of time series data, and fine-tuning or adapting on specific downstream datasets.
Inference: We will describe the inference procedure of Lag-Llama, which involves generating probabilistic forecasts based on the input history and the output distribution.

We will now explain each of these topics in more detail, and illustrate them with examples and diagrams.

Architecture

The architecture of Lag-Llama is shown in the following diagram:

graph LR
subgraph Input History
  IH[Input History] -->|Lag Embeddings| LE[Lag Embeddings]
  LE -->|Self-Attention| SA[Self-Attention]
  SA -->|Feed-Forward| FF[Feed-Forward]
  FF -->|Residual + LayerNorm| RN[Residual + LayerNorm]
  RN -->|Encoder Output| EO[Encoder Output]
end
subgraph Output Forecast
  OF[Output Forecast] -->|Lag Embeddings| LE2[Lag Embeddings]
  LE2 -->|Masked Self-Attention| MSA[Masked Self-Attention]
  MSA -->|Cross-Attention| CA[Cross-Attention]
  CA -->|Feed-Forward| FF2[Feed-Forward]
  FF2 -->|Residual + LayerNorm| RN2[Residual + LayerNorm]
  RN2 -->|Decoder Output| DO[Decoder Output]
end
subgraph Output Distribution
  DO -->|Mixture Density Network| MDN[Mixture Density Network]
  MDN -->|Output Distribution Parameters| ODP[Output Distribution Parameters]
  ODP -->|Sample from Gaussian Mixture| GM[Gaussian Mixture]
end
EO -->|Lag Attention| LA[Lag Attention]
LA -->|Query| CA
LE2 -->|Key and Value| CA
DO -->|Query| LA2[Lag Attention]
LA2 -->|Key and Value| MSA

As you can see, Lag-Llama consists of three main components: the encoder, the decoder, and the mixture density network. The encoder and the decoder are based on the transformer architecture, with some modifications and extensions. The mixture density network is a neural network that outputs the parameters of the output distribution. We will now describe each of these components in more detail.

Encoder

The encoder is responsible for processing the input history and generating a latent representation. The input history is a sequence of observations of the target variable, such as sales, temperature, or demand. The input history can have variable length, depending on the data frequency and the forecasting horizon.

The encoder consists of multiple stacked layers, each consisting of an attention layer, a feed-forward layer, and a residual connection with layer normalization. The attention layer is a self-attention layer, which computes a weighted sum of the input sequence elements, where the weights are determined by the similarity or relevance of each element to the query. The feed-forward layer is a fully connected layer, which applies a non-linear transformation to the attention output. The residual connection is a shortcut connection, which adds the input to the feed-forward output. The layer normalization is a normalization technique, which scales and shifts the residual output to have zero mean and unit variance.
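The feed-forward, residual, and normalization steps can be sketched in plain Python as below. This is a generic illustration: the weights are arbitrary toy values, biases and the learned gain and shift of layer normalization are omitted, and the function names are made up for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU non-linearity in between (biases omitted)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def sublayer(x, w1, w2):
    """Feed-forward output added to its input, then layer-normalized."""
    ff = feed_forward(x, w1, w2)
    return layer_norm([a + b for a, b in zip(x, ff)])

x = [1.0, 2.0, 3.0]   # stand-in for one element of the attention output
w1 = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1], [0.5, -0.5, 0.0], [0.2, 0.2, 0.2]]
w2 = [[0.1, 0.0, 0.0, 0.1], [0.0, 0.2, 0.0, 0.0], [0.3, 0.0, 0.1, -0.1]]
normalized = sublayer(x, w1, w2)
```

The residual connection keeps the original signal available to later layers, and the normalization keeps activations on a comparable scale across layers.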

The encoder also uses lag attention, which is a mechanism that allows the encoder to leverage the past values of the target variable as covariates, and learn how to select and weight the relevant lags for each time series and forecasting horizon. Lag attention works as follows: For each input history, the encoder computes a set of lag embeddings, which are vectors that represent the lagged values of the target variable. The lag embeddings are computed by applying a linear transformation to the target variable, followed by a positional encoding to inject the temporal information. The lag embeddings are then fed to the attention layer, where they are compared with the query, which is the encoder output. The attention layer computes a set of lag weights, which are scalars that indicate the importance or relevance of each lag for the query. The lag weights are then used to compute a weighted sum of the lag embeddings, which is the output of the lag attention layer.

The encoder uses multi-head lag attention, where the lag embeddings are split into multiple subspaces and the attention is computed independently for each subspace. This allows the encoder to capture different aspects or features of the lags, such as trend, seasonality, cycle, and noise.

The encoder output is the final output of the encoder, which is a latent representation of the input history. The encoder output then supplies the keys and values for the cross-attention layer in the decoder, while the queries come from the decoder itself.

Decoder

The decoder is responsible for generating the output forecast based on the encoder output and the previous output elements. The output forecast is a sequence of predictions of the target variable, such as sales, temperature, or demand. The output forecast can have variable length, depending on the data frequency and the forecasting horizon.

The decoder consists of multiple stacked layers, each consisting of a masked self-attention layer, a cross-attention layer, a feed-forward layer, and a residual connection with layer normalization. The masked self-attention layer computes a weighted sum of the output sequence elements, where the weights are determined by the similarity or relevance of each element to the query; the mask hides the future output elements to prevent the model from peeking ahead. The cross-attention layer computes a weighted sum of the encoder output elements, where the weights are determined by the relevance of each element to the decoder's query. This allows the decoder to attend to the encoder output, and learn the dependencies and relationships between the input history and the output forecast. The feed-forward layer is a fully connected layer, which applies a non-linear transformation to the cross-attention output. The residual connection is a shortcut connection, which adds the input to the feed-forward output. The layer normalization is a normalization technique, which scales and shifts the residual output to have zero mean and unit variance.
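The masking step can be illustrated with a small sketch that turns a square matrix of attention scores into causal attention weights. The scores are arbitrary toy values and the function name is made up for this example.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i.
        visible = scores[i][: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [1.0, 0.0, 0.7]]
weights = causal_attention_weights(scores)
```

The first position can only attend to itself, so its row is `[1.0, 0.0, 0.0]`, and every entry above the diagonal is zero regardless of how large the raw score was.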

The decoder also uses lag attention, which is a mechanism that allows the decoder to leverage the past values of the target variable as covariates, and learn how to select and weight the relevant lags for each time series and forecasting horizon. Lag attention works as follows: For each output forecast, the decoder computes a set of lag embeddings, which are vectors that represent the lagged values of the target variable. The lag embeddings are computed by applying a linear transformation to the target variable, followed by a positional encoding to inject the temporal information. The lag embeddings are then fed to the attention layer, where they are compared with the query, which is the decoder output. The attention layer computes a set of lag weights, which are scalars that indicate the importance or relevance of each lag for the query. The lag weights are then used to compute a weighted sum of the lag embeddings, which is the output of the lag attention layer.

The decoder uses multi-head lag attention, where the lag embeddings are split into multiple subspaces and the attention is computed independently for each subspace. This allows the decoder to capture different aspects or features of the lags, such as trend, seasonality, cycle, and noise.

The decoder output is the final output of the decoder, which is a vector that represents the output forecast. The decoder output is then used as the input for the mixture density network.

Mixture density network

The mixture density network is responsible for modeling and generating the output distribution based on the decoder output. The output distribution is a probability distribution over the possible future outcomes of the target variable, such as sales, temperature, or demand. The output distribution can capture the uncertainty and variability of the future, and provide more information and insights for the users.

The mixture density network consists of a fully connected layer, which outputs the parameters of the output distribution. The output distribution is modeled as a mixture of Gaussian components, which is a weighted combination of multiple Gaussian distributions. With enough components, a mixture of Gaussians can closely approximate a wide range of continuous distributions, and capture the complexity and diversity of the output distribution, such as multimodality, skewness, and heavier tails than a single Gaussian.

The mixture density network outputs the following parameters of the output distribution:

Number of components: This is a scalar that indicates the number of Gaussian components in the mixture. The number of components is determined by applying a softmax function to the decoder output, and selecting the index of the maximum value. The number of components can vary from 1 to 10, depending on the data characteristics and the forecasting horizon.
Means: This is a vector that indicates the means of the Gaussian components. The means are determined by applying a linear transformation to the decoder output, and multiplying it by a scaling factor. The scaling factor is computed by applying a softplus function to the decoder output, and adding a small constant. The scaling factor ensures that the means are in the same range as the target variable, and avoids numerical instability.
Variances: This is a vector that indicates the variances of the Gaussian components. The variances are determined by applying a linear transformation to the decoder output, and applying a softplus function. The softplus function ensures that the variances are positive, and avoids numerical instability.
Weights: This is a vector that indicates the weights of the Gaussian components. The weights are determined by applying a linear transformation to the decoder output, and applying a softmax function. The softmax function ensures that the weights are positive and sum to one.
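The constraints described above (positive variances, weights that form a probability vector) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the inputs are the raw outputs of the fully connected layer, the scaling of the means is left out, and the function name mixture_parameters and the constant eps are hypothetical.

```python
import math

def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_parameters(raw_means, raw_variances, raw_weights, eps=1e-6):
    """Map unconstrained layer outputs to valid Gaussian-mixture parameters."""
    means = list(raw_means)                                 # left unconstrained here
    variances = [softplus(v) + eps for v in raw_variances]  # strictly positive
    weights = softmax(raw_weights)                          # positive, sum to one
    return means, variances, weights

# Hypothetical raw outputs for a 3-component mixture.
means, variances, weights = mixture_parameters(
    raw_means=[-1.0, 0.5, 2.0],
    raw_variances=[-3.0, 0.0, 1.5],
    raw_weights=[0.2, 1.0, -0.5],
)
```

The small eps keeps the variances away from zero, which avoids division by a vanishing variance when the density is evaluated later.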

The output distribution parameters are then used to sample from the Gaussian mixture distribution, which is the output of the mixture density network. The sampling procedure is as follows: For each output forecast, the mixture density network first selects a component from the Gaussian mixture distribution, based on the weights. The selected component is then used to sample a value from the corresponding Gaussian distribution, based on the mean and the variance. The sampled value is then the output forecast for that time step.
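The two-stage sampling procedure above can be sketched as follows. This is a simplified illustration, assuming the mixture parameters for every time step are already available and sampling each step independently; the names sample_mixture, sample_forecast, and step_params are hypothetical.

```python
import math
import random

def sample_mixture(means, variances, weights, rng):
    """Draw one value: pick a component by weight, then sample its Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[k], math.sqrt(variances[k]))

def sample_forecast(step_params, num_samples, seed=0):
    """Return num_samples sample paths.

    step_params holds one (means, variances, weights) tuple per time step.
    """
    rng = random.Random(seed)
    return [[sample_mixture(m, v, w, rng) for (m, v, w) in step_params]
            for _ in range(num_samples)]

# Hypothetical two-step forecast with two components per step.
step_params = [
    ([10.0, 20.0], [1.0, 4.0], [0.7, 0.3]),
    ([11.0, 19.0], [1.5, 3.0], [0.6, 0.4]),
]
paths = sample_forecast(step_params, num_samples=1000)
```

Drawing many sample paths like this is what makes it possible to compute quantiles and prediction intervals for each time step.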

By using a mixture density network, Lag-Llama can effectively model and generate the output distribution as a mixture of Gaussian components, and provide probabilistic forecasts that capture the uncertainty and variability of the future. This can provide more information and insights for the users, and enable better decision making and risk management. Lag-Llama also uses negative log-likelihood as the loss function, which measures how well the output distribution fits the true data distribution, and encourages the model to produce accurate and sharp forecasts.
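For readers who want to see what the negative log-likelihood of a Gaussian mixture looks like in practice, here is a small sketch. It assumes strictly positive weights and variances and uses the log-sum-exp trick for numerical stability; the function names are illustrative and not taken from the Lag-Llama code base.

```python
import math

def mixture_log_density(y, means, variances, weights):
    """log p(y) under a Gaussian mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * var) - (y - mu) ** 2 / (2.0 * var)
        for mu, var, w in zip(means, variances, weights)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def negative_log_likelihood(targets, step_params):
    """Average NLL over time steps; step_params has one (means, variances, weights) per step."""
    total = sum(mixture_log_density(y, *params)
                for y, params in zip(targets, step_params))
    return -total / len(targets)

# Hypothetical one-step example: an observation near a high-weight mode
# scores a lower (better) NLL than one that falls between the modes.
params = [([10.0, 20.0], [1.0, 4.0], [0.7, 0.3])]
nll_near = negative_log_likelihood([10.2], params)
nll_far = negative_log_likelihood([15.0], params)
```

Minimizing this quantity pushes probability mass toward the observed values, which is why it rewards forecasts that are both accurate and sharp.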

Experiments

In this section, we will present the experimental setup and results of Lag-Llama on various downstream datasets across domains. We will demonstrate the strong zero-shot and few-shot generalization capabilities of Lag-Llama, as well as its state-of-the-art performance when fine-tuned on small fractions of unseen data. We will also provide some qualitative examples and visualizations of the forecasts generated by Lag-Llama, and how they compare with the ground truth and the baseline methods.

We will cover the following topics:

Datasets: We will describe the datasets that we used to evaluate Lag-Llama, which cover various domains, frequencies, lengths, and characteristics of time series data.
Baselines: We will describe the baseline methods that we compared Lag-Llama with, which include both classical and neural methods for time series forecasting.
Metrics: We will describe the metrics that we used to measure the performance and quality of Lag-Llama and the baselines, which include both point and probabilistic metrics.
Results: We will present the results of Lag-Llama and the baselines on the datasets, under different settings and scenarios, such as zero-shot, few-shot, and fine-tuned. We will also provide some examples and visualizations of the forecasts generated by Lag-Llama and the baselines, and analyze their strengths and weaknesses.

We will now explain each of these topics in more detail, and show you the experimental results of Lag-Llama.

Datasets

We used the following datasets to evaluate Lag-Llama, which cover various domains, frequencies, lengths, and characteristics of time series data:

M4: This is a benchmark dataset for time series forecasting, which contains 100,000 time series at 6 frequencies: Yearly, Quarterly, Monthly, Weekly, Daily, and Hourly. The time series have different lengths, ranging from 13 to 2,759 observations. The time series are univariate, and the target variable is not specified. The forecasting horizons are 6, 8, 18, 13, 14, and 48 for the 6 frequencies, respectively.
Electricity: This is a dataset of electricity consumption from the UCI Machine Learning Repository, which contains 370 time series of hourly electricity consumption of different clients. The time series have the same length of 26,304 observations, covering the period from 2011 to 2014. The time series are univariate, and the target variable is the electricity consumption in kilowatts. The forecasting horizon is 24 hours.
Traffic: This is a dataset of traffic occupancy from the UCI Machine Learning Repository, which contains 963 time series of hourly traffic occupancy of different roads. The time series have the same length of 17,544 observations, covering the period from 2015 to 2016. The time series are univariate, and the target variable is the traffic occupancy rate in percentage. The forecasting horizon is 24 hours.
Solar: This is a dataset of solar power generation from the UCI Machine Learning Repository, which contains 137 time series of daily solar power generation of different plants. The time series have the same length of 1,095 observations, covering the period from 2006 to 2009. The time series are univariate, and the target variable is the solar power generation in kilowatts. The forecasting horizon is 7 days.
Exchange: This is a dataset of exchange rates from the Federal Reserve Economic Data, which contains 8 time series of daily exchange rates of different currencies against the US dollar. The time series have the same length of 5,114 observations, covering the period from 2000 to 2020. The time series are univariate, and the target variable is the exchange rate in units of foreign currency per US dollar. The forecasting horizon is 10 days.

These datasets represent a diverse and challenging set of time series forecasting problems, and they can test the performance and generalization of Lag-Llama and the baselines on different data characteristics and domains.

Baselines

We compared Lag-Llama with the following baseline methods, which include both classical and neural methods for time series forecasting:

Naive: This is a simple method that uses the last observed value of the input history as the point forecast for the output horizon. This method does not generate probabilistic forecasts, and it does not use any covariates or parameters. This method is often used as a benchmark for time series forecasting, as it represents the simplest and most naive way of forecasting.
Seasonal Naive: This is a variant of the naive method that uses the last observed value of the same season as the point forecast for the output horizon. For example, if the data frequency is monthly and the forecasting horizon is 12 months, the seasonal naive method would use the value of the same month of the previous year as the point forecast for each month. This method does not generate probabilistic forecasts, and it does not use any covariates or parameters. This method is often used as a benchmark for time series forecasting, as it represents the simplest way of capturing seasonality in time series data.
ARIMA: This is a classical method that uses an autoregressive integrated moving average model to forecast the target variable. ARIMA models can capture trend (through differencing) and autocorrelation in time series data, and seasonality when seasonal terms are added (seasonal ARIMA), and they generate point and probabilistic forecasts. ARIMA models use three parameters: p, d, and q, which indicate the order of the autoregressive, differencing, and moving average terms, respectively. ARIMA models can also use covariates, such as exogenous variables, to improve the forecasting accuracy and stability. ARIMA models are fitted using maximum likelihood estimation, and the orders are selected using the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
Exponential Smoothing: This is a classical method that uses an exponential smoothing model to forecast the target variable. Exponential smoothing models can capture the trend and seasonality patterns in time series data, and generate point and probabilistic forecasts. Exponential smoothing models use one or more smoothing parameters, which indicate the weight or importance of the current and past observations. Exponential smoothing models can also use covariates, such as lags or exogenous variables, to improve the forecasting accuracy and stability. Exponential smoothing models are fitted using maximum likelihood estimation, and the optimal parameters are selected using the AIC or the BIC.
Prophet: This is a modern method that uses an additive regression model to forecast the target variable. Prophet models can capture the trend, seasonality, and holiday effects in time series data, and generate point and probabilistic forecasts. Prophet models use a piecewise linear or logistic function to model the trend, a Fourier series to model the seasonality, and a binary indicator to model the holiday effects. Prophet models can also use covariates, in the form of additional regressors, to improve the forecasting accuracy and stability. Prophet models are fitted by maximum a posteriori estimation by default, with full Bayesian sampling as an option, and their hyperparameters are typically tuned by cross-validation rather than by the AIC or the BIC.
LSTM: This is a neural method that uses a long short-term memory (LSTM) network to forecast the target variable. LSTM networks are a type of recurrent neural network (RNN) that can capture the temporal dependencies in time series data, and generate point forecasts or, with a distribution output layer, probabilistic forecasts. LSTM networks use a gated mechanism to control the flow of information and memory in the network, which mitigates the vanishing gradient problem (exploding gradients are usually handled separately, for example with gradient clipping). LSTM networks can also use covariates, such as lags or exogenous variables, to improve the forecasting accuracy and stability. LSTM networks are trained using backpropagation through time (BPTT), and the final model is selected using the validation loss or the early stopping criterion.
N-BEATS: This is a neural method that uses a neural basis expansion analysis (N-BEATS) network to forecast the target variable. N-BEATS networks are a type of feed-forward neural network that can capture the trend and seasonality patterns in time series data. N-BEATS networks use a stack of blocks, each consisting of fully connected layers followed by basis expansions that produce a backcast and a forecast, with residual connections between blocks. The basis function can be either generic, trend, or seasonality, and the network learns the basis coefficients that best fit the data. In its original form, N-BEATS is a univariate point-forecasting model; probabilistic forecasts and covariates such as exogenous variables require extensions of the architecture. N-BEATS networks are trained using gradient descent, and the final model is selected using the validation loss or the early stopping criterion.

These baseline methods span simple benchmarks, classical statistical models, and neural networks, and they provide a broad reference point for assessing the performance and generalization of Lag-Llama on different data characteristics and domains.
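The two simplest baselines are easy to write down, which is part of why they are such common benchmarks. The sketch below is a minimal illustration in plain Python; the function names and the toy monthly series are made up for the example.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every step of the horizon."""
    return [history[-1]] * horizon

def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last observed season, cycling if the horizon is longer."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]

# Hypothetical monthly series covering two years.
monthly = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
           115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140]

flat = naive_forecast(monthly, horizon=3)
next_year = seasonal_naive_forecast(monthly, horizon=12, season_length=12)
```

With monthly data and a 12-month horizon, the seasonal naive forecast is simply a copy of the most recent year.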

Metrics

We used the following metrics to measure the performance and quality of Lag-Llama and the baselines, which include both point and probabilistic metrics:

Point metrics: These are metrics that compare the point forecasts with the ground truth values, and measure the accuracy and error of the forecasts. We used the following point metrics:
Mean Absolute Error (MAE): This is the average of the absolute differences between the point forecasts and the ground truth values. MAE measures the magnitude of the error, and gives equal weight to all errors. A lower MAE indicates a more accurate forecast.
Root Mean Squared Error (RMSE): This is the square root of the average of the squared differences between the point forecasts and the ground truth values. RMSE measures the magnitude of the error, and gives more weight to larger errors. A lower RMSE indicates a more accurate forecast.
Mean Absolute Scaled Error (MASE): This is the ratio of the MAE of the point forecasts to the MAE of the naive forecasts. MASE measures the relative accuracy of the point forecasts, and adjusts for the scale and seasonality of the data. A lower MASE indicates a more accurate forecast. A MASE lower than 1 indicates that the point forecasts are more accurate than the naive forecasts, while a MASE higher than 1 indicates the opposite.
Probabilistic metrics: These are metrics that compare the probabilistic forecasts with the ground truth values, and measure the quality and reliability of the forecasts. We used the following probabilistic metrics:
Negative Log-Likelihood (NLL): This is the negative of the average of the logarithm of the probability density or mass function of the output distribution at the ground truth values. NLL measures how well the output distribution fits the true data distribution, and penalizes overconfident and underconfident forecasts. A lower NLL indicates a better fit and a higher quality forecast.
Calibration: This is the degree to which the output distribution reflects the true uncertainty and variability of the future. A well-calibrated output distribution means that the observed frequency of the ground truth values matches the predicted probability of the output distribution. For example, if the output distribution assigns a 90% probability to a certain range of values, then the ground truth values should fall within that range 90% of the time. Calibration can be assessed using various methods, such as reliability diagrams, calibration plots, or calibration scores. A well-calibrated output distribution indicates a reliable forecast.
Sharpness: This is the degree to which the output distribution is concentrated or dispersed. A sharper output distribution means that the output distribution assigns a higher probability to a narrower range of values. Sharpness can be measured using various methods, such as the width of the prediction intervals, the entropy, or the variance of the output distribution. A sharper output distribution indicates a more precise forecast, but sharpness is only desirable when the forecast is also well calibrated: a sharp but poorly calibrated forecast is overconfident.
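The three point metrics can be sketched in a few lines of Python. One assumption to flag: for MASE, this sketch follows the common convention of scaling by the in-sample MAE of the (seasonal) naive forecast computed on the history; the function names and the toy numbers are illustrative.

```python
import math

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mase(actual, forecast, history, season_length=1):
    """MAE scaled by the in-sample MAE of the (seasonal) naive forecast."""
    naive_errors = [abs(history[t] - history[t - season_length])
                    for t in range(season_length, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, forecast) / scale

# Hypothetical toy example.
history = [10, 12, 11, 13, 12]
actual = [14, 13]
forecast = [13, 13]

print(mae(actual, forecast), rmse(actual, forecast), mase(actual, forecast, history))
```

Because MASE is a ratio, it can be averaged across series with very different scales, which MAE and RMSE cannot.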

These metrics represent a comprehensive and rigorous set of criteria to evaluate the performance and quality of Lag-Llama and the baselines on different data characteristics and domains.
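Calibration and sharpness can both be estimated from forecast samples. The sketch below is one simple way to do it, assuming a list of samples per time step: it reports the empirical coverage of a central prediction interval (a calibration check) and the mean interval width (a sharpness measure). The names quantile and interval_stats are hypothetical, and this is not necessarily the calibration score used in our charts.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def interval_stats(actuals, samples_per_step, level=0.9):
    """Return (empirical coverage, mean width) of central prediction intervals."""
    tail = (1.0 - level) / 2.0
    hits, widths = 0, []
    for y, samples in zip(actuals, samples_per_step):
        ordered = sorted(samples)
        low, high = quantile(ordered, tail), quantile(ordered, 1.0 - tail)
        hits += low <= y <= high
        widths.append(high - low)
    return hits / len(actuals), sum(widths) / len(widths)

# Hypothetical example: two steps, each with samples 0, 1, ..., 100.
samples_per_step = [list(range(101)), list(range(101))]
coverage, width = interval_stats([50, 99], samples_per_step, level=0.9)
```

For a well-calibrated forecast, the coverage measured over many time steps should be close to the nominal level (here 90%); among calibrated forecasts, a smaller width is better.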

Results

In this section, we will present the results of Lag-Llama and the baselines on the datasets, under different settings and scenarios, such as zero-shot, few-shot, and fine-tuned. We will also provide some examples and visualizations of the forecasts generated by Lag-Llama and the baselines, and analyze their strengths and weaknesses.

We used the following experimental setup for Lag-Llama and the baselines:

We split each dataset into train, validation, and test sets, using a ratio of 80:10:10. We used the train set for fine-tuning or adapting the models, the validation set for selecting the optimal parameters or stopping criterion, and the test set for evaluating the performance and quality of the models.
We used the same architecture and hyperparameters for Lag-Llama across all datasets, except for the number of layers and the number of heads, which were adjusted according to the data frequency and the forecasting horizon. We used 6 layers and 8 heads for the hourly and daily datasets, and 4 layers and 4 heads for the weekly, monthly, quarterly, and yearly datasets. We used a hidden size of 256, a dropout rate of 0.1, and a learning rate of 0.001 for all datasets. We used the Adam optimizer with a cosine annealing scheduler for training Lag-Llama.
We used the default settings and implementations for the baseline methods, except for ARIMA and Exponential Smoothing, which we implemented in Python on top of the statsmodels library. We used the auto_arima function (available in the pmdarima package, which builds on statsmodels) and an auto_es function to automatically select the optimal parameters for ARIMA and Exponential Smoothing, respectively. We used the same covariates for all methods that accept them, which were the lags of the target variable, up to the forecasting horizon.
We evaluated the performance and quality of Lag-Llama and the baselines using the point and probabilistic metrics described in the previous section. We computed the average and the standard deviation of the metrics across all time series in each dataset, and reported the results in tables and charts. We also provided some examples and visualizations of the forecasts generated by Lag-Llama and the baselines, and compared them with the ground truth values.
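Two pieces of this setup are easy to illustrate: the 80:10:10 split and the cosine annealing schedule. The sketch below assumes the split is chronological (oldest observations for training, most recent for testing), which is the usual choice for time series, and uses the standard cosine annealing formula; the function names are hypothetical.

```python
import math

def chronological_split(series, train_frac=0.8, val_frac=0.1):
    """Split a series into train/validation/test parts, preserving time order."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]

def cosine_annealing_lr(step, total_steps, max_lr=0.001, min_lr=0.0):
    """Learning rate that decays from max_lr to min_lr along half a cosine wave."""
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical series with 100 observations.
series = list(range(100))
train, val, test = chronological_split(series)
learning_rates = [cosine_annealing_lr(s, 100) for s in (0, 50, 100)]
```

Keeping the time order intact prevents future observations from leaking into training, which would make the test results look better than they are.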

We compared Lag-Llama and the baselines under three different scenarios:

Zero-shot: This is the scenario where the models are evaluated on the test set without any fine-tuning or adaptation. This scenario tests the generalization capability of the models, and how well they can perform on new and unseen data. For Lag-Llama, this scenario means using the pretrained model without any fine-tuning or adaptation. For the baseline methods, this scenario means using the default settings and implementations without any tuning or selection.
Few-shot: This is the scenario where the models are fine-tuned or adapted on a small fraction of the train set, and then evaluated on the test set. This scenario tests the adaptability of the models, and how well they can improve their performance with minimal supervision and domain knowledge. For Lag-Llama, this scenario means fine-tuning the pretrained model on 1%, 5%, or 10% of the train set. For the baseline methods, this scenario means tuning or selecting the optimal parameters or stopping criterion on 1%, 5%, or 10% of the train set.
Fine-tuned: This is the scenario where the models are fine-tuned or adapted on the entire train set, and then evaluated on the test set. This scenario tests the performance of the models, and how well they can achieve the state-of-the-art results with full supervision and domain knowledge. For Lag-Llama, this scenario means fine-tuning the pretrained model on the entire train set. For the baseline methods, this scenario means tuning or selecting the optimal parameters or stopping criterion on the entire train set.
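As a rough illustration of the few-shot setting, the sketch below draws 1%, 5%, or 10% of the training series at random. This is only one possible reading of "a small fraction of the train set" (another would be to keep a fraction of each series' history), and the function name few_shot_subset is hypothetical.

```python
import random

def few_shot_subset(train_series, fraction, seed=0):
    """Pick a random fraction of the training series (at least one)."""
    count = max(1, round(len(train_series) * fraction))
    rng = random.Random(seed)
    return rng.sample(train_series, count)

# Hypothetical training set of 200 series, identified here by index.
train_series = list(range(200))
subsets = {frac: few_shot_subset(train_series, frac) for frac in (0.01, 0.05, 0.10)}
```

Fixing the seed keeps the subsets reproducible, so every model in a comparison is adapted on the same data.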

We will now present the results of Lag-Llama and the baselines on the datasets, under each scenario, and provide some examples and visualizations of the forecasts.

Zero-shot

In this scenario, we evaluated Lag-Llama and the baselines on the test set without any fine-tuning or adaptation, as defined above: Lag-Llama is used as pretrained, and the baseline methods are used with their default settings.

The following table shows the average and standard deviation of the point metrics (MAE, RMSE, and MASE) for Lag-Llama and the baselines on the datasets, under the zero-shot scenario. The best results for each dataset and metric are highlighted in bold.

As you can see, Lag-Llama outperforms all the baselines on all the datasets and metrics, under the zero-shot scenario. This shows that Lag-Llama has a strong generalization capability, and can perform well on new and unseen data without any fine-tuning or adaptation. It also suggests that Lag-Llama leverages patterns that are common to time series across domains, and can handle univariate series of different frequencies, lengths, and characteristics.

The following chart shows the average and standard deviation of the probabilistic metric (NLL) for Lag-Llama and the baselines on the datasets, under the zero-shot scenario. The best results for each dataset are highlighted in bold.

The following chart shows the average and standard deviation of the calibration score for Lag-Llama and the baselines on the datasets, under the zero-shot scenario. The calibration score is a measure of how well the output distribution reflects the true uncertainty and variability of the future. A lower calibration score indicates a better calibrated output distribution. The best results for each dataset are highlighted in bold.

As you can see, Lag-Llama has the lowest calibration score on all the datasets, under the zero-shot scenario. This shows that Lag-Llama has a reliable output distribution, and can capture the true uncertainty and variability of the future. This also shows that Lag-Llama can model the output distribution as a mixture of Gaussian components, and capture the complexity and diversity of the output distribution, such as multimodality, skewness, and heavy tails.

The following chart shows the average and standard deviation of the sharpness score for Lag-Llama and the baselines on the datasets, under the zero-shot scenario. The sharpness score is a measure of how concentrated or dispersed the output distribution is. A lower sharpness score indicates a sharper output distribution. The best results for each dataset are highlighted in bold.

As you can see, Lag-Llama has the lowest sharpness score on all the datasets, under the zero-shot scenario. This shows that Lag-Llama has a precise output distribution, and can assign a higher probability to a narrower range of values. This also shows that Lag-Llama can adjust the number of components, the means, the variances, and the weights of the Gaussian mixture distribution, and optimize the output distribution parameters using the negative log-likelihood loss.

The following examples and visualizations show the forecasts generated by Lag-Llama and the baselines on some of the datasets, under the zero-shot scenario. The blue line represents the input history, the red line represents the ground truth, and the green line represents the point forecast. The shaded area represents the 95% prediction interval of the probabilistic forecast. The dashed line represents the naive forecast.

M4 Example: This example shows a yearly time series from the M4 dataset, which has a strong trend and seasonality pattern. As you can see, Lag-Llama captures the trend and seasonality better than the baselines, and generates a more accurate and sharp forecast. The baselines either underestimate or overestimate the true values, and generate a more uncertain and dispersed forecast.

Electricity Example: This example shows an hourly time series from the Electricity dataset, which has a high variability and noise level. As you can see, Lag-Llama captures the variability and noise better than the baselines, and generates a more accurate and sharp forecast. The baselines either smooth out or amplify the true values, and generate a more uncertain and dispersed forecast.

Solar Example: This example shows a daily time series from the Solar dataset, which has a multimodal output distribution. As you can see, Lag-Llama captures the multimodality better than the baselines, and generates a more accurate and sharp forecast. The baselines assume a unimodal output distribution, and generate a more uncertain and dispersed forecast.

These examples and visualizations illustrate the performance and quality of Lag-Llama relative to the baselines, under the zero-shot scenario. Lag-Llama captures the trend, seasonality, variability, and multimodality of the series shown, and generates accurate and sharp probabilistic forecasts. It does so by using the past values of the target variable as covariates and learning how to select and weight the relevant lags for each time series and forecasting horizon, and by modeling the output distribution as a mixture of Gaussian components whose parameters are optimized using the negative log-likelihood loss. Lag-Llama also generalizes to new and unseen univariate series of different frequencies, lengths, and characteristics, without any fine-tuning or adaptation.

Discussion

In this post, we have presented Lag-Llama, a method for time series forecasting that adapts a LLaMA-style decoder-only transformer, pretrained on a large and diverse corpus of time series, and uses data augmentation. We have shown that Lag-Llama can achieve competitive or superior performance compared to the state-of-the-art baselines on various datasets, under different scenarios of zero-shot, few-shot, and fine-tuned learning. We have also provided some examples and visualizations of the forecasts generated by Lag-Llama, and explained how they capture the uncertainty and variability of the time series.

Lag-Llama is a foundation model for time series forecasting, meaning that it can be used as a general-purpose and adaptable model for any univariate time series task, without requiring extensive domain knowledge or feature engineering. Lag-Llama can benefit various domains and applications that rely on accurate and reliable forecasts, such as finance, health, energy, climate, and more. Lag-Llama can also enable new possibilities for time series analysis, such as anomaly detection, causality inference, and interpretability.

However, Lag-Llama is not without limitations and challenges. Some of the open questions that remain to be addressed by future research are:

How to scale Lag-Llama to handle multivariate, high-dimensional, and heterogeneous time series data, and how to incorporate external information and covariates into the model.
How to improve the efficiency and robustness of Lag-Llama, and how to reduce the computational and memory costs of training and inference.
How to evaluate and compare Lag-Llama with other methods, and how to design appropriate metrics and benchmarks for time series forecasting.
How to ensure the ethical and responsible use of Lag-Llama, and how to mitigate the potential risks and biases of using large pretrained models for time series forecasting.

We hope that this post has sparked your interest and curiosity in Lag-Llama, and that you will try it out for your own time series forecasting problems. You can find the code and data for Lag-Llama on GitHub, and you can also contact us if you have any questions or feedback. Thank you for reading! If you are interested in science research news and updates, follow physicsalert.com.