FORECAST OF MULTIVARIATE TIME SERIES SAMPLED FROM INDUSTRIAL MACHINERY SENSORS

Goal: To evaluate the performance of a set of forecasting methods in the prediction of future values on a dataset of time series collected from sensors installed in an industrial gas turbine. Design / Methodology / Approach: Forecasting methods tested include the use of multivariate and univariate neural networks (FNN and LSTM), exponential smoothing and ARIMA models. Results: Results show that the use of ARIMA models to forecast on the dataset is the best default method to apply, and is the only forecasting method that consistently beats a simple naïve no-change model. Limitation of the investigation: There was a focus on evaluating neural networks. This limited resources available to evaluate other forecasting methods. There is no guarantee that it would not be possible to find neural networks capable of yielding better forecasts than the ones achieved by the best performing methods in this research. Practical implications: The broadest possible implications of the results are that the best default method to forecast industrial machinery time series is the use of ARIMA models. Additionally, neural networks are not capable of beating methods well stablished within the forecasting community, namely ARIMA models. Originality / Value: To the best of the authors’ knowledge, there is a scarce amount of published evaluations of multiple forecasting methods on data from real machines. This knowledge is useful for the understanding of the best forecasting methods available for the estimation of machine’s RUL using sensor time series.


INTRODUCTION
Condition Based Maintenance (CBM) is a maintenance policy that aims to take maintenance action before a failure happens. CBM times the maintenance action by assessing product condition, and predicting failure based on data gathered from the product. While the technologies and technical methods for CBM are still in their infancy, advancements in information technology have accelerated growth in CBM technology by enabling network bandwidth, data collection and retrieval, data analysis, and decision support capabilities for large datasets of time series. Process data, collected in the form of time series, is often compressed and archived for record keeping and only retrieved for emergency analysis after a fault has occurred. This data could be of tremendous advantage when combined with effective analytics and superior computing power capable of generating knowledge from the data (Qin, 2014;Shin and Jun, 2015). The topic of leveraging embedded sensors, industrial networks and data mining techniques in order to attend high complexity industrial demands also relates to the current research trend on the Internet of Things (IoT). The four basic aspects of IoTs are reliable and accurate data collection; capacity to collect a huge quantity of data; rapid data transmission and automated processes (Lopes Miranda Junior et al., 2017;Alarcón et al., 2016).
Diagnostics and prognostics are two parts of CBM. Diagnostics is a reactive process. It takes place after a fault has already occurred and aims to determine the root cause of the failure. It cannot prevent machine downtime and the corresponding expenses. Prognostics is a proactive process. It assesses and predicts future machine health, which includes detecting incipient failures and predicting remaining useful life (Lee et al., 2014).
There are three classes to the current approaches to prognostics: model based, data driven and hybrid. Model based approaches presume that it is possible to build a mathematical model from the understanding of the physical mechanisms involved in the failure modes of the machine that bases the model. While these approaches have the advantage of providing the ability to incorporate physical understanding of the system, if the understanding of the system degradation is poor, it may be difficult to model the system behavior. Data driven approaches use data gathered from sensors or by the machine operators to track features that indicate the degradation of the system. Data driven approaches can leverage computer intelligence techniques like neural networks and decision trees, or statistical techniques like auto-regressive models (Dragomir et al., 2009).
Historically, for time series in diverse domains, empirical evaluations showed that statistically sophisticated or complex methods do not necessarily produce more accurate forecasts than simpler ones. However, recent evaluations have concluded that complex methods based on computational intelligence and neural networks have caught up, and that simple methods can no longer claim to outperform computer intelligence methods without a proper empirical evaluation (Crone et al., 2011).
Several studies into prognostics have treated it from a time series forecasting perspective. Pham et al. (2012) used an Auto Regressive Moving Average (ARMA) model on baseline data. Pham et al. used deviations from the ARMA model on future values as a degradation index. After the degradation index reaches a threshold, Pham et al. used Cox's PHM (Proportional Hazards Model) to create a survival probability curve as a function of time followed by Support Vector Regression (SVR) to predict remaining useful life. Heng et al. (2009) used an artificial neural network with the most recent values of a condition index (bandpass vibration) as inputs to predict probability of failure in fixed time intervals ahead of the last condition index measure. Heng et al. (2009) benchmarked the proposed model against an Elman Recurrent Neural Network (RNN). Datong et al. (2011) developed a SVR based strategy for on-line prediction of industrial sensor data. The authors tested the strategy on a benchmark dataset. Datong et al. compare their proposed strategy against a different SVR based strategy. Niu and Yang (2010) used a neural network to fuse a set of features into a single value used for condition monitoring. After the condition index reaches a threshold value two non-linear techniques, Dempster-Shafer regression and least-squares support vector machines, predict the future behavior of the monitored index. A weighted average combines the predictions from both methods. Cho et al. (2016) developed a hybrid approach to predict the next failure time of a centrifugal compressor using vibration data. Bellow a threshold value, Cho et al. applied a Markov model to predict next failure time. Above the threshold value, Cho et al. apply a mix of moving average filter and simple linear regression.
Neural networks have showed good performance when applied to time series forecasting in domains other than CBM. Khashei and Bijari (2010) introduced an approach based on using an Auto Regressive Integrated Moving Average (ARIMA) model to extract features from a time series. The features serve as training input to a single hidden layer feedforward network. Khashei and Bijari (2010) test the proposed network on three different time series. Ma et al. (2015) used a long short-term memory (LSTM) neural network to predict traffic speed. The application of time series forecasting methods based on neural networks, combined with huge available amounts of historical data, 3/12 may lead to better prognostics of industrial machines. Ultimately, better prognostics lead to reduced maintenance costs and increased production availability.
The main goal of this study is to evaluate the performance of a set of methods in the prediction of future values of monitored parameters in industrial machines. The study is an empirical evaluation, done by generating forecasts using a dataset collected from an industrial gas turbine. It evaluates neural network architectures that take as input the entire multivariate time series and output forecasts for all monitored parameters at once. Forecasts for each individual parameter generated by neural networks, ARIMA models, exponential smoothing and naïve models, serve as benchmarks against the multivariate architecture.
The contribution of this paper is in the generation of knowledge directed specifically to the improvement of prognostics, when treated as a time series forecasting problem. To the best of the authors' knowledge, there is a scarce amount of published evaluations of multiple forecasting methods on data from real machines. This knowledge is useful for the understanding of the best forecasting methods available for those who want to estimate remaining useful life of machines.
The structure of the remainder of this study is as follows. Section 2 describes the dataset used in the study and the preprocessing applied to the dataset prior to any model building. Section 2 also describes the methods used to generate forecasts for the time series and the metric for the evaluation of the forecasts. Section 3 presents the results obtained by applying the proposed methods to the dataset. Section 4 provides a conclusion for the study.

Data
This study uses data collected from an oil platform's data historian system. The system stores data from multiple sensors installed through the offshore facility. The focus of the current research is on data collected from the sensors in one of the platform's gas turbines. The turbine operates in power generation role.
The dataset includes values collected from 32 sensors. The dataset contains data from four pressure sensors: P1 is gas fuel pressure, P2 is pressure at the lubrication oil header, P3 is the differential pressure at the inlet air filter and P4 is the pressure at the turbine axial compressor discharge. Two sensors measure rotation: R1 is the Gas Producer (GP) rotor rotation speed and R2 is the Power Turbine (PT) rotor rotation speed. The dataset also contains data from 14 temperature sensors: T1 is temperature at the lubrication oil header, T2 is temperature at the cold junction of the thermocouples installed in the turbine and T3 is the temperature inside the turbine hood. T4 and T8 are oil temperature at inlet of GP and PT rotors bearings respectively, T5, T6 and T9 are the temperatures at the turbine oil sumps, T7 is flow temperature at axial compressor discharge and T10 is the average of the readings of 4 thermocouples installed after the first GP turbine wheel (T11 thru T14). Twelve vibration sensors complete the dataset: V1 thru V10 are radial vibration sensors installed at the turbine's five bearings, V11 is power turbine axial vibration and V12 is auxiliary gearbox casing vibration. Figure 1 shows a schematic of the turbine instrumentation used in this research.

4/12
The historian system does not log values for each sensor at a fixed sampling rate. Different reasons may trigger a data logging event for each sensor. In order to allow the research to proceed with the use of standard techniques for evenly spaced time series, preprocessing of the dataset aggregates the time series into evenly spaced data. Daily bins divide the dataset. For each time series, the aggregated values equal the mean of the values inside the daily bins. In the event there are no values that fall into a daily bin for one of the series, a linearly interpolated value substitutes the missing value. In order to guarantee secrecy of the real operating parameters, normalization of the dataset using a standard scaler follows. The resulting series have mean zero, standard deviation one and are dimensionless. The resulting dataset contains 1461 daily values for the 32 sensors.
The study splits the dataset in two. The first 90% of the data serves as training data and the last 10% is the test set, used to evaluate the accuracy of the models in proper out of training sample data. The model training phase uses both the original dataset and a second dataset with extreme outliers removed. We define extreme outliers as any value that falls outside of a range defined by a distance of three interquartile ranges from each edge of a boxplot made with all the samples of a time series (Montgomery and Runger, 2010). The use of the dataset without extreme outliers is exclusively for model training. The evaluation of the models trained on the dataset without extreme outliers happens on the regular test set. Figure 2 shows the preprocessed dataset without extreme outliers.

Forecasting Methods
The following subsections describe the forecasting methods applied to the dataset. All methods use the first 90% of data for model training, the definition of model parameters. For all models, the training phase defines the model parameters using one-step ahead forecasts. In the evaluation phase, that uses the remaining 10% of data, there are no changes to the model parameters.

Neural Networks
Neural network is a term that encompass a large class of models and learning methods. Neural networks are nonlinear statistical models that model the outputs as

5/12
nonlinear functions of linear combinations of the inputs. One builds a neural network by connecting simple computing cells called neurons or processing units. This study uses neural networks implemented in python using the Keras library (Chollet, 2015).
There are three basic elements to a neuron's model. First, a set of connecting links to other neurons, each characterized by a weight of its own. Second, an adder, often called a propagation function, used to sum all the input signals to the neuron. Third is the activation function. The activation function limits the output of the neuron and is responsible for the nonlinearities in the network. Equation 1 and Equation 2 give the output x K of a neuron k that uses weighted sum as its adder. Haykin (2009) give further details on the mathematics of neural networks. One type of neural network used is this study is Feedforward Neural Networks (FNN). There are no loops in an FNN. A layer only uses as input the output from the previous layer. The study tries two approaches to forecasting with FNNs. The first approach is to create a FNN that uses all series as input and outputs the forecasts for the next time step for all of the time series in the dataset at once. We expect that this approach, from now on called multivariate FNN, will be able to capture the interactions between the different time series, resulting in better forecasts. Figure 3 shows an example multivariate FNN. The second approach, called univariate FNN, is to create independent FNNs for each series. The independent FNN only uses as input lagged values from the time series it forecasts. Figure 4 shows an example univariate FNN.
It is necessary to define the architecture of the FNNs before proceeding with the final training. All FNNs used in this study use the same procedure for architecture selection. The procedure starts with a split of the training set: 2/3 for training and 1/3 for validation. Several architectures with one or two hidden layers, different number of units in the hidden layers and different number of lagged values as input are considered. The procedure continues with training of  The longer the network runs, the more unstable are the gradients on inputs further back in time.
Long Short Term Memory (LSTM) units, are a special type of processing unit used to build the hidden layers in a LSTM network. LSTM units address the problem of gradient instability by creating paths through time that have derivatives that will not vanish or explode (Goodfellow et al., 2016). The LSTM units have an adaptive forget gate designed to reset the unit state when its contents are no longer relevant. The forget gate controls the weight of the state self-loop, and in that way, how much of the information in the state is preserved or discarded between time steps. Gers et al. (1999) give further detail on LSTM units.
The study tries two approaches to forecasting with LSTM networks. All LSTM networks tried use only one lagged value as input, since the LSTM units should have the capability to accumulate all relevant past information in their states. The first approach is to create a LSTM that uses all series as input and outputs the forecasts for the next time step for all of the time series in the dataset at once. We expect that this approach, from now on called multivariate LSTM, will be able to capture the interactions between the different time series, resulting in better forecasts. The second approach, called univariate LSTM, is to create independent LSTM networks for each series. The only input of the independent LTSM is the value at t 1 − of the time series it forecasts.

7/12
It is necessary to define the number of LSTM units in the hidden layer before proceeding with the final training. All LSTM networks used in this study use the same procedure to select the size of the hidden layer. The procedure starts with a split of the training set: 2/3 for training and 1/3 for validation. Several architectures with different number of units in the hidden layers are evaluated. The procedure continues with training of the networks on the reduced training set for a maximum of 1000 epochs, with early stopping if the validation loss (MSE) does not improve after 10 consecutive epochs. The final LSTM networks, trained in the entire training set, use the number of hidden units that presented the smallest average validation loss after 10 training runs.

Exponential Smoothing
Exponential smoothing is a forecasting approach that uses all historical values as predictors, giving more weight to more recent values, as in Equation 4 for As long as 0 1 α < < , the weight given to each observation decreases exponentially as each observation comes from further in the past, hence the name exponential smoothing. The Holt-Winters procedure, given in the additive form by Equations 6, 7, 8 and 9, generalizes simple exponential smoothing, allowing a trend and a seasonal term. Selecting the best values for the parameters in a Holt-Winters model is a non-linear optimization problem, and the task requires an optimization tool.
Where: b: Trend term; s: Seasonal term.  x B x − = (11) The study uses the R function auto.arima() for model selection and parameter estimation. It conducts a search over possible models and selects the best one based on the smallest Akaike Information Criterion (AIC). Hyndman and Khandakar (2008) give details on the function implementation.

Naïve Forecast
A naïve model is a model that presumes things will remain the same as they have in the past. For time series data, the naïve (no change) model simply forecasts the next observation to be equal as the latest observation. The naïve model serves as a benchmark model for other models. If a model cannot produce better forecasts than a simple alternative like naïve no-change, it is of no use (Armstrong, 2001).

Forecast Evaluation
Forecast accuracy assessment occurs after model training. The metric used for this model evaluation phase is the same used in model training: MSE. The accuracy assessment uses the test set consisting of the last 10% of the full dataset. The model training phase only uses one-step ahead forecast. The model evaluation phase tests model accuracy not only using one-step ahead forecasts, but also using multistep ahead forecasts. Forecasting windows tested are 1, 2, 5, 7, 10 and 14 days ahead. Since the models are configured to produce one step ahead forecasts, a recursive strategy is applied to generate the multistep ahead forecasts. The forecast for time t 1 + serves as input for the model to forecast the values at time t 2 + . This procedure repeats until the end of the multistep ahead forecast.
Retraining the neural networks after the observation of every new sample in the test set would require significant computational resources. In order to avoid the computational costs, there are no updates to network weights during the model evaluation stage. With the objective of testing the different forecasting methods in the same conditions, there are also no changes to the parameters of the ARIMA and ETS models, even though the computational costs would be significant smaller for these methods.
The calculated MSE are the average for all of the time series. The results do not show what the best method for each univariate time series would be. They show what method would deliver the best results, on average, for a random univariate time series drawn from the dataset, which in turn consists of a diverse collection of time series collected from the same industrial turbine.

RESULTS
The model that uses a one hidden layer feedforward neural network taking as input all of the time series and trying to predict the next values for all series requires selection of hyperparameters before training. The hyperparameters selection phase uses a split of the training set: 2/3 for training and 1/3 for validation. Table 1 summarizes all the hyperparameters considered. Table 2 shows the average validation loss for each of the architectures after 10 training runs. Based on these results, the final model uses 180 units in the hidden layer and a single time lag for the inputs.
The model with two hidden layers that predicts all values simultaneously uses the same 2/3 for training and 1/3 for validation split for selection of hyperparameters, as in the one hidden layer model. This step considered the same possible time lags and units in the hidden layers considered for the one hidden layer model. Table 3 shows the average  2,5,10,20,33,50,66,100,133,150,180,200 Source: Authors.   Table 4 shows the average validation loss for each of the architectures after 10 training runs. Based on these results, the final model uses 50 units in the hidden layer.

Time Lags
The prediction methods that use independent networks for each time series use a hyperparameter selection procedure similar to the one described for the architectures that predict all of the time series simultaneously. For the individual time series, we do not consider two hidden layers FNN architectures. Table 5 shows the architectures considered for each independent FNN. The independent LSTM networks tested use one time lag as input and the same number of units in the hidden layer as the considered FNNs. Table 6 shows the selected FNN architectures and Table 7 shows the selected LSTM architectures for each univariate time series, based on the average validation loss after 10 training runs. Table 8 and Table 9 summarize the results obtained by applying all the proposed forecasting methods to the test set. Table 8 shows the MSE for the scenario with models trained on the normal training set (with extreme outliers). Table 9 shows the MSE for the scenario with models trained on the training set without extreme outliers. The comparison between the two tables show that in general (26 out of 36 cases) removing the extreme outliers improves the forecast. However, all instances where the removal of the extreme outliers worsened the forecast happened on the methods of better performance (univariate LSTM, ARIMA and ES).
The forecasting method that showed the worst performance was the multivariate FNN. To the best of the authors' efforts, it was not possible to design and train a multivariate FNN capable of matching or surpassing the performance of the other methods, included the use of univariate FNNs. The multivariate FNNs performs worse than a simple naïve method in all forecast windows when trained on the normal training set. When trained on the training set without extreme outliers, the two hidden layers FNN is capable of beating the naïve model in three of the six forecast windows tested. The univariate FNN beats the naïve forecast in forecast windows equal or bigger than two.

10/12
The use of LSTM units generally improves the performance of the neural networks. Similar to the FNN case, using univariate LSTM networks yields better results than using a multivariate LSTM. In this study, attempting to capture the interactions between the different time series did not produce better forecasts. For forecast windows bigger than one, the univariate LSTM is capable of delivering better results than the no-change naïve method.
Exponential smoothing applied to each individual time series showed the third best performance on the study. ES only produces worse forecasts than the naïve method in the one-step ahead forecast scenario, but loses to univariate LSTM networks in forecasting windows bigger than 2. The winning method in the evaluation is the use of ARIMA models, selected for each individual time series, trained on the original training set (with outliers), using Hyndman's approach for model order selection. The use of ARIMA models consistently beats the naïve forecasts.

11/12
Not only ARIMA models produced the best forecasts in this empirical evaluation, but also the computational resources required to select the model parameters are much smaller.

CONCLUSION
This work empirically evaluated the forecasting performance of a set of different forecasting methods in a dataset of industrial sensors time series. The dataset comes from an oil platforms' data historian system. The results show that using ARIMA models to forecast these time series is the best default methodology to apply, and is the only methodology that consistently beats a simple naïve no-change model.
This study presents limitations. There was a focus on evaluating neural networks. This limited the time available to evaluate other forecasting methods developed by the computer science (e.g., Support Vector Regression) and statistics communities. Moreover, there is no guarantee that it would not be possible to find neural networks capable of yielding better forecasts than the ones achieved by the best performing methods in this research. Other limitation is that while the dataset consisted of 32 time series, these were drawn from only 4 types of machine sensors.
Future studies should focus on improving the variety of the time series in the dataset, assessing a greater variety of forecasting methods and finding better performance of models based on neural networks.