\RequirePackage{fix-cm}
%
%\documentclass{svjour3}                     % onecolumn (standard format)
%\documentclass[smallcondensed]{svjour3}     % onecolumn (ditto)
%\documentclass[smallextended]{svjour3}       % onecolumn (second format)
\documentclass[twocolumn]{svjour3}          % twocolumn
%
%\smartqed  % flush right qed marks, e.g. at end of proof
%
\usepackage{graphicx}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{textcomp,multirow}
\begin{document}
	\sloppy

\title{Estimation of Compressor Discharge Pressure of Dry-Low Emission Gas Turbine using Rowen's Parameters and Regression Methods\thanks{This work was supported in part by Ministry of Higher Education (MOHE) Malaysia through the Fundamental Research Grant (FRGS). Grant FRGS\_0153AB-L31.}
}

\author{Madiah Omar$^1$
	\and Rosdiazli Ibrahim$^1$
	\and M. Faris Abdullah$^1$        \and
        M. Haizad M. Tarik$^1$ \and
        Fiqkri Nayan$^2$
}

\institute{$^1$ \at
              Department of Electrical and Electronic Engineering, Universiti Teknologi PETRONAS (UTP),
              32610 Bandar Seri Iskandar, Perak, Malaysia. \\
              \email{madiah.omar\_g03268@utp.edu.my}           
           \and
           $^2$ \at
              Gas District Cooling (UTP)
}

\date{Received: date / Accepted: date}
% The correct dates will be entered by the editor

\maketitle

\begin{abstract}
A Dry-Low Emission (DLE) gas turbine offers cleaner power generation than the conventional type. However, the technology is susceptible to frequent trips during operation, and the trip mechanism is best studied with a simulation model of the DLE gas turbine. The widely used Rowen's model for conventional gas turbines excludes the Compressor Discharge Pressure (CDP) variable, which is highly important in DLE trip prediction. In this paper, a non-parametric regression method, Gaussian Process Regression (GPR), and four parametric regression methods, namely Linear Regression, Quadratic Support Vector Machine, Cubic Support Vector Machine and Gaussian Support Vector Machine, are applied to estimate the CDP output. First, Rowen's model variables and variables highly correlated with CDP are selected from the actual data as inputs. Second, the models are trained using the training data set and tested with the remaining testing samples. The estimated CDP is compared with the actual value and the error is evaluated. The results confirm the high accuracy of the non-parametric method, which can be exploited to modify a state-of-the-art model for DLE representation.
\keywords{Compressor discharge pressure \and Dry-low emission \and Rowen model \and Gaussian Process Regression \and Gas turbine \and Non-parametric \and Operational data \and Parametric \and Turbine parameter}
\end{abstract}

\section{Introduction}
\label{intro}
A Dry Low Emission (DLE) gas turbine is a clean operation technology that implements Lean Premixed (LPM) combustion. LPM combustion is designed to achieve low Nitrogen Oxide (NOx) emission through the diffusion of a high content of atmospheric nitrogen, which acts as a diluent, into the fuel before delivery to the combustor. This targeted diffusion prevents local hot spots within the combustor volume and thereby reduces NOx formation. Thus, this type of turbine has gained considerable interest as a route to a greener environment and to compliance with stringent government emission requirements. Although the DLE gas turbine addresses the emission problem, it still encounters several challenges, as highlighted in \cite{peck2009challenges,serbin2016investigations}. The challenges are poor adaptability to power changes, power system instability and unplanned release of flare and exhaust during a fault, all due to the rigorous DLE operational target illustrated in Fig.~\ref{fig:LeanPremixed}. As the diagram shows, the DLE turbine operation is always maintained at the LPM stage, between the lean extinction stage and the conventional diffusion flame stage, before reaching the top of the combustion profile \cite{hazel2014industrial,ayed2017cfd}. This stringent control is therefore susceptible to sudden disturbances, which lead to a Lean Blowout (LBO) or loss-of-flame error that triggers a gas turbine trip.
 
 \begin{figure}
 	\centering
 	\includegraphics[width=0.5\textwidth]{LPM}
 	\caption{Lean Premixed combustion for Dry Low Emission (DLE) Gas Turbine}
 	\label{fig:LeanPremixed}       % Give a unique label
 \end{figure}

A dynamic DLE gas turbine model is important for describing the operational performance, and it can be further utilized in LBO fault identification and prediction studies. Typical simulation models of conventional gas turbines are physical models, Rowen's model, the IEEE model, the aero-derivative model, the GAST model, the WECC/GGOV1 model, the CIGRE model and the frequency-dependent model \cite{yee2008overview}. Among these, Rowen's model is widely utilized because it captures real gas turbine operation through functions derived from the operating curves of the turbine \cite{yee2008overview}. The model's ability to represent a DLE gas turbine has been tested in \cite{omar2017suitability}. However, the model does not include CDP, which is important in LBO prediction studies for DLE gas turbines.

The CDP estimation is proposed by integrating Rowen's model with regression methods. Learning in regression can be either parametric or non-parametric. A parametric method predetermines a finite number of parameters $x$ under an assumed functional form, such as linear, quadratic, cubic or polynomial. The parametric approach is applied in \cite{botelho2017parametric} for DC motors and in \cite{magalhaes2017parametric,reis2017increasing} for generator applications. In a non-parametric method, the parameters are not predetermined, their number is not finite and no assumption is made about the data distribution. Some regression methods, such as the Support Vector Machine (SVM), can be either parametric or non-parametric. When the kernel is fixed for the estimation under any form of parametric assumption, the method is parametric. Examples are Quadratic SVM, Cubic SVM and Gaussian SVM. In the non-parametric case, the SVM kernel is mapped according to the distribution of the data, as in SVM with an RBF kernel \cite{amari1999improving}. Other non-parametric methods are the non-parametric Hidden Markov Model as in \cite{wang2017hierarchical}, the Bayesian non-parametric method as in \cite{lheritier2018sequential,kokalj2017non}, k-Nearest Neighbors, Decision Trees and Gaussian Process Regression (GPR). The most commonly used method is GPR due to its high accuracy and efficient parameter estimation, as highlighted in \cite{camps2018physics}, with improved flexibility when it is joined with other methods. GPR does not focus on finding the parameters but adapts them to represent the underlying function. It is commonly applied to predict remaining useful life, as presented in \cite{aye2017integrated,kwon2015remaining}, and the state-of-health of Lithium-ion batteries, as in \cite{yu2018state,yang2018novel,wang2017state,huang2017gaussian}.
GPR estimation is also suitable for noisy and erroneous input data, as in image dehazing \cite{fan2017two}, telephone speech \cite{jokinen2017intelligibility}, fatigue degradation \cite{keprate2017adaptive}, carbon dioxide emission forecasting \cite{fang2018novel}, concrete corrosion \cite{liu2017prediction} and short-term solar forecasting \cite{sheng2018short}. Besides these advantages, GPR has been widely modified to reduce complexity, as in \cite{bui2017fast} using direct sequence code division multiple access (DS-CDMA), and combined in hybrid methods with a truncated Normal distribution \cite{lee2016prediction}, a combination of kernels \cite{heikkinen2018spectral}, a stochastic volatility model \cite{han2016gaussian} and filters such as the Kalman filter \cite{li2017gaussian}.

The main principle of GPR is that each observation $y$ is related to an underlying function $f(x)$ through a Gaussian noise model, as shown in \eqref{GPRformula}, where the variance $\sigma^{2}$ represents the uncertainty of the model.

\begin{equation}\label{GPRformula}
y=f(x)+\mathcal{N}(0,\sigma^{2}).
\end{equation}

In a Gaussian process, the mean is assumed to be the same across the whole distribution, and each observation is related to the others through the covariance function, $k(x,x')$. The covariance function selected for this study is the squared exponential shown in \eqref{squaredexponential}, with the length scale $l$ and the signal variance $\sigma_f^2$ as hyperparameters.

\begin{equation}\label{squaredexponential}
k(x,x')=\sigma_f^2 \exp\bigg[\frac{-(x-x')^2}{2l^2}\bigg]
\end{equation}
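
As a rough numerical illustration of \eqref{squaredexponential}, the values below assume $\sigma_f=1$ and $l=1$. These hyperparameter values are chosen only for illustration and are not the values fitted in this study.

\begin{equation*}
\begin{aligned}
|x-x'|=0 &:\quad k=\exp(0)=1,\\
|x-x'|=1 &:\quad k=\exp(-0.5)\approx 0.607,\\
|x-x'|=2 &:\quad k=\exp(-2)\approx 0.135.
\end{aligned}
\end{equation*}

Under these assumptions, the covariance decays smoothly as the inputs move apart, so nearby observations influence each other more strongly than distant ones.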

GPR is prepared by calculating the covariance function for all possible combinations of the training inputs $x_1,\dots,x_n$, which gives the covariance matrix $K$ in \eqref{covariancefunction}.

\begin{equation}\label{covariancefunction}
K=\begin{bmatrix}
k(x_{1},x_{1}) & k(x_{1},x_{2}) & \dots  & k(x_{1},x_{n}) \\
k(x_{2},x_{1}) & k(x_{2},x_{2}) & \dots  & k(x_{2},x_{n}) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_{n},x_{1}) & k(x_{n},x_{2}) & \dots  & k(x_{n},x_{n})
\end{bmatrix}
\end{equation}

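Next, new data are estimated using \eqref{multivariateGaussianDistribution}, where $y$ is the vector of training observations and $y^*$ is the predicted response. Here $K$ is the covariance matrix in \eqref{covariancefunction}, $K_*=[k(x^*,x_1)\ \dots\ k(x^*,x_n)]$ collects the covariances between the new input $x^*$ and the training inputs, and $K_{**}=k(x^*,x^*)$. The assumption for this process is that the data can be represented by a multivariate Gaussian distribution. The superscript $T$ indicates the matrix transpose,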

\begin{equation}\label{multivariateGaussianDistribution}
\begin{bmatrix}
y\\y^*
\end{bmatrix}
\sim\mathcal{N}
\bigg(
0,\begin{bmatrix}
K&K_*^T\\K_*&K_{**}
\end{bmatrix}
\bigg),
\end{equation}

and the conditional distribution of $y^*$ given $y$ is presented in \eqref{Gaussiandistribution}.

\begin{equation}\label{Gaussiandistribution}
y^*|{y}\sim\mathcal{N}(K_*K^{-1}y,\,K_{**}-K_*K^{-1}K_*^T).
\end{equation}

Thus, the mean of the distribution in \eqref{mean} is the best estimate of $y^*$,
\begin{equation}\label{mean}
\bar{y}^*=K_*K^{-1}y,
\end{equation}

and the variance in \eqref{variance} captures the uncertainty of the estimate.
\begin{equation}\label{variance}
\mathrm{var}(y^*)=K_{**}-K_*K^{-1}K_*^T.
\end{equation}
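
A minimal worked example of \eqref{mean} and \eqref{variance} may help. It assumes a single training observation $y_1=1$ at $x_1=0$, a new input $x^*=1$ and illustrative hyperparameters $\sigma_f=1$ and $l=1$ in \eqref{squaredexponential}. None of these values come from the plant data.

\begin{equation*}
\begin{aligned}
K&=k(x_1,x_1)=1,\quad K_{**}=k(x^*,x^*)=1,\\
K_*&=k(x^*,x_1)=\exp(-0.5)\approx 0.607,\\
\bar{y}^*&=K_*K^{-1}y_1\approx 0.607,\\
\mathrm{var}(y^*)&=1-\exp(-1)\approx 0.632.
\end{aligned}
\end{equation*}

In this sketch the estimate is pulled toward the training observation, and the variance stays below the prior value of 1 because the new input is correlated with the training input.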

GPR is chosen as the non-parametric regression method and tested in this case study to compare the effect of parametric and non-parametric regression on CDP estimation. Therefore, this paper aims to estimate the CDP of a DLE gas turbine by integrating Rowen's model variables into parametric and non-parametric regression methods.
The rest of this paper is organized as follows. The experiment methodology is described in Section~\ref{methodology}. The performance of the parametric and non-parametric regression methods is then verified with a different dataset in Section~\ref{ResultDiscussion}, and Section~\ref{conclusion} concludes the paper.

\section{CDP Estimation}
\label{methodology}
This study utilizes data from a 4.2~MW, single-shaft DLE gas turbine with a 12-stage axial compressor at the Gas District Cooling plant of Universiti Teknologi PETRONAS, Malaysia. The power plant supplies electric power and chilled water for university and plant usage from two gas turbine generator (GTG) sets. Both gas turbines are operated during peak hours (6:00 AM to 10:00 PM) and one of the turbines is shut down during the night. The operating conditions of the DLE gas turbine are tabulated in Table~\ref{Operating}. The study is carried out in three stages: first, the actual data are collected; second, the data are preprocessed to remove unwanted data; and third, the models are trained and the error between the estimated and the actual CDP is evaluated.

	\begin{table}[]
	\caption{Operating Condition of DLE Gas Turbine Study}
	\label{Operating}
	\begin{tabular}{lll}
		\hline\noalign{\smallskip}
		\multicolumn{1}{c}{Parameter}        & Unit  & Value \\ \noalign{\smallskip}\hline\noalign{\smallskip}
		Output power, $P_{G}$                  & MW    & 46.4 \\
		Turbine inlet temperature, $T_{3(OC)}$ & $^\circ$C & 1100  \\
		Exhaust gas temperature, $T_{R}$       & $^\circ$C & 532   \\
		Ambient temperature, $T_{1(OC)}$       & $^\circ$C & 27.3  \\
		Exhaust mass flow, $m_{n(OC)}$         & kg/s  & 438.1 \\
		Fuel                                   &       & Gas   \\
		Fuel flow, $m_{f(OC)}$                 & kg/s  & 8.34  \\
		Lower heating value of fuel, $H$       & kJ/kg & 43094 \\ \noalign{\smallskip}\hline
	\end{tabular}
\end{table}

\subsection{Rowen's Actual Data Preparation}\label{DA}
The DLE gas turbine data were collected over one year of operation in 2016. Since the day-to-day operation of the turbine is almost identical, the month of August is selected for the study because no trip occurred throughout that month. The collection contains 400 variables for GTG-A and GTG-B, with 200 variables each. Only one gas turbine is selected for the study to reduce complexity and to give a better understanding of the operation. GTG-B is chosen because it is the turbine that is in operation most of the time. Thus, the variables are reduced to 200 and preprocessed as described in Section~\ref{DP}.

\subsection{Data Preprocessing}\label{DP}
The correlation of all 200 variables with CDP was computed using the Pearson method. A coefficient of more than 0.9 is considered to indicate a strong relationship. The second condition is that the parameter must be available in Rowen's model. The correlation result is explained in Section~\ref{cleaning}.
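
For reference, the Pearson coefficient between a candidate variable $x$ and CDP, denoted $y$ here, is commonly written in the sample form below. The symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ and $n$ are notation assumed for this sketch, since the exact implementation used in the study is not stated.

\begin{equation*}
r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
\end{equation*}

The coefficient lies between $-1$ and $1$, so a threshold of 0.9 retains only variables with a nearly linear positive relationship with CDP.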

\subsubsection{Data Cleaning}\label{cleaning}
Seven variables were selected out of 200 from the correlation study, as illustrated in Fig.~\ref{fig:combinedcorrelation}. Six parameters have a correlation coefficient of more than 0.9 with CDP: the main gas fuel valve opening in (a) with 0.98, the pilot gas fuel valve opening in (b) with 0.93, the inlet guide vane opening in (d) with 0.96, the average thermocouple temperature in (e) with 0.94, the load change in (f) with 0.93 and the gas turbine speed in (g) with 0.99. The inlet air temperature in (c), with a correlation of $-0.13$, is also considered due to its important role in Rowen's model. All the selected parameters are available in Rowen's model except (b), which is retained due to its high correlation with CDP and can be added to Rowen's model in future work. The scatter plots show that CDP increases as the x-axis variable increases and that most of the data are clustered in two regions, the zero region (during shutdown) and the operating region. Thus, the parameters indicate a strong relationship with CDP in both conditions.

 \begin{figure*}
	\centering
	\includegraphics[width=1\textwidth]{combinedcorrelation}
	\caption{Correlation of seven variables with CDP}
	\label{fig:combinedcorrelation}       % Give a unique label
\end{figure*}

\subsubsection{Data Sampling}
The final seven inputs for CDP estimation, obtained from the correlation study and associated with Rowen's model variables, are the Main Gas Fuel Valve, Pilot Gas Fuel Valve, Ambient Temperature, Inlet Guide Vane, Load Changes, Average Thermocouple Temperature and Turbine Speed, as listed in Table~\ref{CDPTable}.

\begin{table}[]
	\centering
	\caption{Inputs and output component of CDP estimation}
	\label{CDPTable}
	\begin{tabular}{llll}
			\hline\noalign{\smallskip}
		\multicolumn{1}{c}{Type} & Component                                                                   & Variable    & Unit \\ \noalign{\smallskip}\hline\noalign{\smallskip}
		Input 1                & Main Gas Fuel Valve                                                         & Opening     & \%   \\
		Input 2                & Pilot Gas Fuel Valve                                                        & Opening     & \%   \\
		Input 3                & Ambient Temperature                                                         & Temperature & $^\circ$C   \\
		Input 4                & Air Inlet Guide Vane                                                        & Opening     & \%   \\
		Input 5                & \begin{tabular}[c]{@{}l@{}}Average Thermocouple \\ Temperature\end{tabular} & Temperature & $^\circ$C   \\
		Input 6                & Load                                                                        & Power       & p.u.   \\
		Input 7                & Speed                                                                       & -           & p.u.  \\
		Output                 & \begin{tabular}[c]{@{}l@{}}Compressor Discharge \\ Pressure\end{tabular}   & Pressure    & kPag \\ \noalign{\smallskip}\hline
	\end{tabular}
\end{table}

In data sampling, the 28,800 data points are divided into training data (70\% of the sample) and testing data (the remaining 30\%). The input behaviors are explained in this section using one day of data, from 12:00 AM to 12:00 AM of the next day, and all the input trends are presented in Fig.~\ref{fig:inputcombined}.
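
As a quick check of the split sizes, assuming the 70/30 division is applied directly to the 28,800 points (the symbols $N_{\mathrm{train}}$ and $N_{\mathrm{test}}$ are introduced here only for illustration):

\begin{equation*}
\begin{aligned}
N_{\mathrm{train}}&=0.7\times 28800=20160,\\
N_{\mathrm{test}}&=0.3\times 28800=8640.
\end{aligned}
\end{equation*}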

 \begin{figure*}
	\centering
	\includegraphics[width=1\textwidth]{combinedinputpdfmac}
	\caption{Input variables of DLE gas turbine for CDP estimation.}
	\label{fig:inputcombined}       % Give a unique label
\end{figure*}

Six of the actual inputs are illustrated in the figure. The turbine speed is omitted because it has a constant value of 1~p.u. when the turbine is in operation and 0~p.u. when it is in shutdown mode. The opening of the main gas fuel valve is displayed in (a), the opening of the pilot gas fuel valve in (b), the inlet temperature trend in (c), the Inlet Guide Vane (IGV) opening in (d), the average thermocouple temperature of the turbine in (e) and the load change of the DLE gas turbine in (f). For (e), six thermocouples are installed next to the combustor, and the actual combustion temperature is obtained by calculation because of the extreme temperature inside the combustor. Six thermocouples are used so that a temperature reading remains available in case of an instrument failure during operation.

The input in (f) deserves particular attention. The DLE turbine is normally shut down during the night, which produces a measurement of 0, and the load starts to rise at 06:00 AM for start-up. In industry practice, the DLE mode is activated when the load reaches 50\% of its maximum capacity. In this study, the activation took place at 07:00 AM, which corresponds to 0.5~p.u. on the y-axis. From that point, all the related parameters act according to the low emission mode setting. The load goes up to 0.87~p.u. due to the high demand until 02:00 PM, with a peak at 11:00 AM. This is the period when most of the research and academic activities take place at the university. A slight drop from 01:00 PM to 03:00 PM reflects the lunch hour of the students and staff, and the load demand rises again after 03:30 PM. The load demand goes down to almost 0.5~p.u. after 06:00 PM because of the low consumption at the university, and the turbine is transferred back to the non-DLE mode at 10:00 PM.

The first input to be discussed is the behavior of the main gas fuel valve in (a). The spike at 06:00 AM indicates the start-up of the turbine, when the valve moves from the idle position to 30\%. The opening is increased to 40\% at 07:00 AM, when the DLE mode is activated. The opening of the valve is modulated between 35\% and 40\% during DLE mode, from 07:00 AM to 10:00 PM. This behavior is consistent with the industry practice in which a DLE gas turbine is mainly controlled by the pilot gas fuel valve once DLE mode is activated. In (b), the pilot valve opens from 0\% to 95\% at 06:00 AM and stays at that non-DLE position until the load reaches 50\% at 07:00 AM. The valve position is then abruptly reduced to a 55\% opening to meet the DLE requirement for low emission. The IGV in (d) is also activated during the start-up around 06:00 AM, opening to 100\% in non-DLE mode. It is then decreased to 30\% at 07:00 AM after the DLE mode is activated. In DLE mode, its behavior is similar to the load demand trend. Thus, it can be concluded that the IGV opening during DLE mode is also affected by the load demand. The parameter in (e) is likewise affected by the DLE mode setting. The temperature reading increases to 450$^\circ$C after the start-up at 06:00 AM and increases further to around 680$^\circ$C in DLE mode. The increase results from the main gas fuel valve opening. The combustion temperature is maintained at 680$^\circ$C in DLE mode and reduced again to 450$^\circ$C in non-DLE mode at 10:00 PM.

The inlet temperature trend in (c) exhibits a different profile from the others, with a temperature rise from 27$^\circ$C to 31$^\circ$C at 12:00 AM. It decreases suddenly during the gas turbine start-up at 06:00 AM. Before start-up, the sensor reads the inlet without airflow and therefore gives a wrong reading. After the turbine is ramped up, the actual air inlet temperature is around 27$^\circ$C.
All five methods are trained using the training data of the seven inputs and tested using a new sample of data. The predicted output is compared with the actual CDP value.


\subsection{Data Evaluation}\label{DE}

The CDP outputs from the models were compared with the actual data and evaluated using a goodness-of-fit approach. The evaluation methods can be grouped into two types: graphical and numerical. A graphical method views the entire data set and displays a wide range of relationships, while a numerical method focuses more narrowly on a particular aspect of the data and compresses the information into a single number.

The graphical methods used to test the fit are the prediction plot and residual analysis. The numerical methods are the Mean Absolute Error (MAE), calculated from \eqref{mae}, the Root Mean Square Error (RMSE), as in \eqref{rmse}, and R-squared, as in \eqref{R2}.

MAE represents the average absolute difference between the actual observations and the predictions, with equal weight for every sample and without considering the direction of the error. It measures the accuracy of a continuous variable.

\begin{equation}\label{mae}
\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|\mathrm{Actual}_t-\mathrm{Predicted}_t\right|
\end{equation}

RMSE estimates the standard deviation of the random component in the data. Both MAE and RMSE are compared in this section to investigate the variation in the errors of the estimation.

\begin{equation}\label{rmse}
\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}{(\mathrm{Actual}_t-\mathrm{Predicted}_t)}^2}
\end{equation}
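
As a small worked example of \eqref{mae} and \eqref{rmse}, consider three hypothetical errors (actual minus predicted) of $1$, $-1$ and $4$. These numbers are invented for illustration and are not taken from the plant data.

\begin{equation*}
\begin{aligned}
\mathrm{MAE}&=\tfrac{1}{3}\left(|1|+|-1|+|4|\right)=2,\\
\mathrm{RMSE}&=\sqrt{\tfrac{1}{3}\left(1^2+(-1)^2+4^2\right)}=\sqrt{6}\approx 2.449.
\end{aligned}
\end{equation*}

The single large error raises RMSE above MAE, which is why comparing the two measures gives an indication of how unevenly the errors are distributed.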

Another evaluation measure is R-squared, which statistically measures how close the data are to the fitted line, with a best-fit value equal to 1. It also indicates the variation of the data and is defined as the ratio of the Sum of Squares of the Regression (SSR), calculated from \eqref{SSR}, to the total sum of squares (SST), as in \eqref{SST}.

In \eqref{SSR}, $\hat{y}_i$ is the predicted value for a particular $y_i$ and $\bar{y}$ is the mean of $y$.

\begin{equation}\label{SSR}
\mathrm{SSR}=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2
\end{equation}

The SST is the same for every model, where $y_i$ is the actual value of $y$ and $\bar{y}$ is the mean of the actual values.

\begin{equation}\label{SST}
\mathrm{SST}=\sum_{i=1}^{n}(y_i-\bar{y})^2
\end{equation}

The calculation of R-squared is presented in \eqref{R2},

\begin{equation}\label{R2}
R^2=\frac{\mathrm{SSR}}{\mathrm{SST}}
\end{equation}
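
The ratio in \eqref{R2} is closely related to another common form of R-squared. Assuming a least-squares linear fit with an intercept, and writing the sum of squared errors as SSE (a symbol introduced here for illustration), the total sum of squares decomposes as

\begin{equation*}
\mathrm{SST}=\mathrm{SSR}+\mathrm{SSE},\qquad \mathrm{SSE}=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2,
\end{equation*}

so that

\begin{equation*}
\frac{\mathrm{SSR}}{\mathrm{SST}}=1-\frac{\mathrm{SSE}}{\mathrm{SST}}.
\end{equation*}

For models where this decomposition does not hold exactly, such as SVM or GPR, the two expressions can give slightly different values.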

\section{Results and discussion}\label{ResultDiscussion}
In this section, the CDP estimation outputs of all five methods are compared and discussed.

\subsection{Graphical Method}

Fig.~\ref{fig:responseplot}(a) shows the CDP plot for all models and (b) shows the magnified results during DLE mode. The black line is the actual data from the plant. All five methods capture the trend of the actual data, with GPR the closest to the actual data line, followed by Linear Regression and Quadratic SVM. Cubic SVM and Gaussian SVM show a deviation that increases throughout the day, starting from 10:30 AM and continuing through the afternoon, with the highest deviation from 09:00 PM to 12:00 AM.

 \begin{figure*}
	\centering
	\includegraphics[width=0.7\textwidth]{combinedresponseplotpdf}
	\caption{DLE section of CDP Response for five methods.}
	\label{fig:responseplot}       % Give a unique label
\end{figure*}

Judging from the trend, the best method for estimating CDP is GPR, followed by Linear Regression, Quadratic SVM, Gaussian SVM and Cubic SVM. The CDP results indicate the ability of GPR to predict noisy data and to outperform deterministic models \cite{williams1996gaussian}. Linear Regression also produces a small error due to the strong relationship between the inputs and CDP, as illustrated in the correlation graph by coefficients of more than 0.9. SVM requires a kernel-specific parameter setting, such as quadratic, cubic or Gaussian, to reduce the complexity \cite{pal2011support}. However, when the data are inseparable, the SVM model tries to maximize the margin and minimize the misclassification error around the predetermined hyperplane \cite{cortes1995support}. With that behavior, accuracy is lost in predicting new data samples. The underlying principle of the estimation is discussed using other graphical methods in this section.

First, the result is evaluated using a prediction plot. The red line $y=x$ marks the desired response, where the predicted response equals the actual response.
Fig.~\ref{fig:combinedpredictionplot} shows the scatter plot of the predicted response against the actual response for Linear Regression in (a), Quadratic SVM in (b), Cubic SVM in (c), Gaussian SVM in (d) and GPR in (e). Most points in all graphs are scattered between 600~kPag and 900~kPag, which is the operating region of the turbine. No operation took place between 400~kPag and 600~kPag, because this range is the transition of CDP from shutdown to start-up. A few points are observed between 0~kPag and 400~kPag, and multiple points are concentrated at 0~kPag, where the gas turbine is in shutdown mode.

 \begin{figure*}
	\centering
	\includegraphics[width=1\textwidth]{combinedprediction}
	\caption{Prediction plot of all estimation techniques of CDP.}
	\label{fig:combinedpredictionplot}       % Give a unique label
\end{figure*}
Out of all five methods, GPR gives the best estimation. Most of the points are concentrated on the desired response line and exhibit only a very minor deviation from it. This shows that the method can estimate the shutdown period as well as the operation of the DLE gas turbine. When the DLE gas turbine is in operation at 600~kPag to 900~kPag, the majority of the data fall on the line with a very minor deviation, which confirms the high precision and high accuracy of the method. For the Linear Regression (LR) method, most of the points lie on the line with a minor deviation between points. However, more than 10 points are still far from the desired response line. The low variance of the data along the desired response line, together with these few deviating points, indicates that the method has high accuracy but lower precision. For Quadratic SVM, most of the data fall on the line with a larger variation than LR, but fewer than 10 points deviate from the desired response line. This indicates that the method has high precision but lower accuracy in estimating CDP. The variation between data points in the Cubic SVM prediction is very small compared to Linear Regression and Quadratic SVM. Some data between 0~kPag and 400~kPag are estimated with a higher accuracy than with Quadratic SVM. However, the accuracy of the method is lower than that of Linear Regression, because the distribution falls below the red line at 600~kPag to 700~kPag and slightly above the desired response at 850~kPag to 900~kPag. Nevertheless, the method shows a very high precision compared to the previous methods. The Gaussian SVM method is unable to capture the shutdown period, as the data do not fall on the desired response line from 0~kPag to 400~kPag. From the distribution at 600~kPag to 900~kPag, it can be concluded that the Gaussian SVM estimation is slightly more precise than LR and Quadratic SVM but less precise than Cubic SVM.
In terms of accuracy, it is almost the same as the Cubic SVM estimation, which is lower than LR and Quadratic SVM.

In conclusion, Gaussian Process Regression (GPR) has high accuracy and high precision both in the shutdown period and in operation. This is due to the model principle, which applies the mean and covariance functions to search for the minimum prediction error without any predetermined parameter. For LR, the fitted line is greatly affected by an ``influential observation'' \cite{montgomery2012introduction}. As the data points are clustered in two groups along the horizontal direction, they have a significant impact on the slope during the training of the model. Thus, the reduced accuracy of the LR estimation relative to GPR is expected. For SVM, the hyperplane with its decision boundary makes the estimation inflexible, especially when it must cover both the shutdown and the operating conditions. It may be accurate in one condition but inaccurate in the other. To verify these findings, a second analysis, the residual plot analysis, is discussed next.

In regression analysis, a residual plot is very important for validating the randomness and unpredictability of the model error \cite{fernandez1992residual}. Without this analysis, a model cannot be validated. The residual is defined in \eqref{residual}, where $e$ is the difference between the observed value, $y$, and the predicted value, $\hat{y}$. The residual plot makes it possible to assess whether the observed error (the residuals) is consistent with a stochastic error. The aim of this analysis is to make sure that the error is random and unpredictable. Thus, the residuals of every regression model are computed and plotted around the zero line. The best fit is the one whose residuals are centered at zero and are not systematically high or low.

\begin{equation}\label{residual}
e=y-\hat{y}
\end{equation}

 \begin{figure*}
	\centering
	\includegraphics[width=1\textwidth]{combinedresidual}
	\caption{Residual plot of all estimation techniques for CDP estimation.}
	\label{fig:combinedresidual}       % Give a unique label
\end{figure*}

The residual plots for the CDP estimation of the DLE gas turbine are illustrated in Fig.~\ref{fig:combinedresidual}, with Linear Regression in (a), Quadratic SVM in (b), Cubic SVM in (c), Gaussian SVM in (d) and GPR in (e). Out of all five methods, GPR exhibits the most randomly distributed points between the negative and positive sides of the axis, with a very small observed error. The second most stochastic model is LR, as no specific curve along the line is observed. Most of its error lies on the zero line, except at around 2000 on the horizontal axis, which corresponds to the start-up of the turbine, and at around 2200, during the change from non-DLE mode to DLE mode. Thus, LR produces a stochastic error except during the transition periods of the estimation.

The three SVM methods show residuals that are either consistently positive or consistently negative throughout the operation. The Quadratic and Cubic SVM exhibit consistently positive residuals both in shutdown mode and in operation. The error is larger during the transients from shutdown to start-up and from normal operation to DLE mode, where a negative trend is observed. With this pattern, the residual values are predictable rather than random, which results in a poor fit for the regression. The last model, Gaussian SVM, has negative residuals during shutdown and positive residuals in operation. Only a few points lie in the negative zone from 2000~s to 8500~s. This indicates that the residual distribution is predictable and that the model is not the best fit for the estimation.

In conclusion, the residual analysis indicates that Gaussian Process Regression provides the best fit for the estimation. It is a nonparametric probabilistic model that is capable of handling stochastic observations \cite{wang2018efficient}. The numerical evaluation in Section~\ref{datafit} is performed to further assess the estimation results.

\subsection{Numerical Evaluation Method}\label{datafit}

The summary of the numerical analysis is presented in Table~\ref{erroranalysistable}. The R-square values show a good fit for all methods, each exceeding 0.99. Linear Regression and GPR share the highest value of 0.999. For MAE, GPR has the lowest error of 2.29, followed by Linear Regression with 5.45, while the other methods exhibit an MAE of more than 10. The third metric is RMSE, for which GPR again has the lowest error of 3.41. Linear Regression is second with an RMSE of 10.36, a considerably larger deviation than that of GPR. The other methods show a higher RMSE of more than 20.
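
For reference, the three metrics are assumed here to follow their conventional definitions, which may differ in notation from those used elsewhere in this paper. With $n$ samples, observed values $y_i$, predicted values $\hat{y}_i$, and the mean of the observations $\bar{y}$ (all symbols introduced only for this sketch):
\begin{align*}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.
\end{align*}
Because the errors are squared before averaging, RMSE is never smaller than MAE and is more sensitive to occasional large errors, while $R^2$ can remain close to 1 whenever the errors are small relative to the overall spread of the observed data.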

\begin{table}[]
	\centering
	\caption{Error Analysis Table}
	\label{erroranalysistable}
	\begin{tabular}{llll}
	\hline\noalign{\smallskip}
		\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Method}}} & \multicolumn{3}{l}{\textbf{Results of Evaluation}}                                                             \\ \cline{2-4} 
		\multicolumn{1}{c}{}                                 & \multicolumn{1}{l}{\textbf{R-square}} & \multicolumn{1}{l}{\textbf{MAE}} & \multicolumn{1}{l}{\textbf{RMSE}} \\ \noalign{\smallskip}\hline\noalign{\smallskip}
		Linear Regression                                      & 0.999                                  & 5.45                              & 10.36                              \\ 
		Quadratic SVM                       & 0.996                                  & 16.64                             & 21.17                              \\ 
		Cubic SVM                           & 0.993                                  & 16.64                             & 27.8                               \\ 
		Gaussian SVM                       & 0.991                                  & 27.31                             & 30.8                               \\ 
		GPR                            & 0.999                                  & 2.29                              & 3.41                               \\ \noalign{\smallskip}\hline
	\end{tabular}
\end{table}

In addition to the graphical analysis, the methods were evaluated numerically. GPR exhibits the lowest error, with an RMSE approximately 7 lower than that of Linear Regression. Based on this analysis, the GPR method is selected for CDP estimation of the DLE gas turbine because it has the lowest RMSE and MAE among the methods and a good distribution of fit in both the prediction plot and the residual plot.
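
The RMSE gap between GPR and Linear Regression can be checked directly from the values in Table~\ref{erroranalysistable}. The relative reductions below are computed here for illustration and are not part of the reported analysis:
\begin{align*}
\Delta_{\mathrm{RMSE}} &= 10.36 - 3.41 = 6.95 \approx 7, \\
\Delta_{\mathrm{MAE}} &= 5.45 - 2.29 = 3.16, \\
\frac{\Delta_{\mathrm{RMSE}}}{10.36} &\approx 0.67, \qquad \frac{\Delta_{\mathrm{MAE}}}{5.45} \approx 0.58 .
\end{align*}
Relative to Linear Regression, GPR therefore lowers the RMSE by about 67\% and the MAE by about 58\% on this data set.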

\section{Conclusion}\label{conclusion}
In this paper, parametric and non-parametric regression methods are applied to estimate the CDP output of a DLE gas turbine. The distinguishing feature of this paper is the evaluation of regression methods for CDP estimation that can be integrated into Rowen's model. The integration is beneficial for DLE gas turbine stability studies and fault prediction. It was found that the non-parametric method, Gaussian Process Regression, provides the best fit and the highest accuracy for CDP estimation. This supports the advantage of GPR for noisy and transient data such as CDP, which behaves differently in shutdown, DLE, and non-DLE modes. With the covariance matrix, all possible combinations of data were calculated, and the accuracy of the estimation was improved compared to the parametric method. The parametric method works best on a specific data distribution, where its pre-determined parameters allow an optimum prediction to be achieved. In future work, the GPR method will be integrated into Rowen's model for DLE gas turbine stability study and fault analysis.

\bibliographystyle{spmpsci}      % mathematics and physical sciences
%\bibliographystyle{spphys}       % APS-like style for physics   % name your BibTeX data base
\bibliography{ICDM2018ref}

\end{document}
% end of file template.tex