by Christine Apostolopoulou

Multiple Linear Regression with R: Basic Knowledge

Author Notes

Data collection, design of analysis and methodology were found in a 2012 published paper:

2012 paper
Ali, M.Y., Adesta E.Y.T., Rahman, N.A.B.A., & Aris, E.B.M. (2012). Powder Mixed Micro Electro Discharge Milling of Titanium Alloy: Analysis of Surface Roughness.

Advanced Materials Research, Trans Tech Publication Switzerland, 341-342, 142-146. https://doi.org/10.4028/www.scientific.net/AMR.341-342.142

Description of the experiment. Terminology

The micro-electrical discharge machining (micro-EDM) is a non-conventional machining process which utilizes electro-thermal, non-contact effects to remove material from the workpiece by means of electric spark erosion. The metal removal process is performed by applying a pulsating (pulse on-off) electrical charge of high frequency current through the electrode to the workpiece. Both electrode and workpiece are immerged in a solution and some machining and system factors affect the micro-EDM performance, which is measured in terms of material removal rate, surface roughness, tool wear rate etc.

In this experiment, the performance of the mechanical process is improved by adding silicon powder in the dielectric fluid and is evaluated by measuring the surface roughness (measured in $μm$). The powder concentration of silicon carbide in the dielectric fluid (measured in $g/L$) together with the discharge energy during the machining process (measured in $μJ$) are the factors studied to explore their effect on the surface roughness.

Each factor has four levels, and based on a full 4×4 factorial design, 16 experimental trials were conducted. For pre-determined settings of the factors in each trial, micro channels were machined, and the surface roughness of the workpiece was measured by using Veeco optical surface profiler.

The measurements of the experiment are imported from the article’s excel file and are tabulated in Table 1:

Table 1

Factorial Designed Experiment and Measured Responses

In statistical terms, there are two independent variables:
• Powder Concentration• Energy

and one dependent variable:
• Surface Roughness, SR.

What is the multivariate technique that must be used to create a model that  describes both the effect of the input variables (Powder Concentration in dielectric liquid and Discharge Energy) on the output variable (Surface Roughness) and their relationship?

In this experiment, the researchers had set at their will the levels of Powder Concentration and Discharge Energy and then they measured the workpiece’s Surface Roughness, which is changing according to the factors’ settings. The purpose of the analysis is to determine how the independent variables influence the dependent variable.

Briefly, there are some factors (independent variables) whose variations influence another variable (dependent variable), and both of the following conditions are satisfied:

• there is a straightforward distinction between the independent and the dependent variables
• all the variables are metric.

In this case, the recommended technique for establishing relationships is the multiple regression. It must be noted that multiple regression provides a means of assessing simultaneously the magnitude and direction of the relationship of each independent variable with the dependent  variable and it also explains the nature of this relationship.

How can the regression equation be formed, i.e., how can the regression model be fitted to the data?

In multiple regression analysis, we establish a statistical equation that expresses the dependent variable as combination of the independent variables, their powers  and their interactions, if necessary.  The word statistical indicates that weights, predictions, and many more numbers related to the expression are estimations and not exact results like in an algebraic equation. The selection of the independent variables should be based on their theoretical relationship with the dependent variable established in the bibliography. In this mechanical process, it is recommended to form equations with variables raised to the second power, with the purpose to form the best predictor of the dependent variable. The model will be

$$estimated\ surface\ roughness=$$

$$=a_1+a_2\ast\ powder\ concentration+a_3\ast energy+a_4$$

$$\ast\left(\ powder\ concentration\right)^2+a_5\ast\left(energy\right)^2+a_6$$

$$\ast\left(\ powder\ concentration\right)\ast(energy)$$

where $a_1, a_2, a_3, a_4,$ … are numbers that are determined by mathematical methods and are given directly in the R-output.

One of the many uses of the multiple regression model is prediction: by substituting the values of a specific Powder Concentration and Energy setting in the model, we get a value for Surface Roughness, which is the predicted (or fitted value) of Surface Roughness at the given input factors setting.

The goal of the researchers is to form a model that makes “good” predictions, i.e., the predictions are close to the experimental values, or equivalently, the difference between the predictions and the experimental values are “small” numbers. “Small” number is a characterization related to the nature (magnitudes) of the data and we must clearly determine which numbers can be considered as “small numbers” or “large numbers”, as it will be mentioned subsequently.

Baseline Model

Many multiple regression models can be made starting with the linear one and then sequentially adding higher order polynomials.

A first task is to retrieve the package that helps in the creation of the models from the R-library, and then a secondary task is to insert our data from an Excel csv. file.

For the R-commands, we are going to use the following notation for the variables:

• Powder Concentration: $x1$
• Energy: $x2$
• Surface Roughness, $SR$.

Some important numbers of central tendency

for Surface Roughness are given below:

As a starting model, we can use, as always in our daily life, the mean model: we assume that the Surface Roughness value is equal to the mean of Surface Roughness ($\bar{SR}$), 12.195, at any setting of the input variables. So, the predicted value at any of the experimental setting based on this (straightforward)  model is equal to the mean of the output variable. To estimate how accurate is this model, a measure of the absolute distance of each observation from the mean of $\bar{SR}$ (the prediction value)  is required. Subsequently, the difference between experimental values and predictions must be evaluated. For the evaluation, we choose  the sum of the squared distance of each observation from the mean (the sum of squared distances is a reliable way to estimate the absolute distance between two measurements of the same entity), which is called total sum of squares and is denoted as SST:

$$SST=\ \left(14.31-12.195\right)^2+\left(5.38-12.195\right)^2+\ldots+\left(37.12-12.195\right)^2=2654.262$$

Since SST is a large number, i.e., the predictions based on the model are numbers far from the actual observations, therefore the sum of squared distances is a large number, the baseline model is not a good model.

The key idea of the regression analysis is to diminish as much as possible the differences between the predictions and observations by choosing better than the baseline models.

First Order Factors Model

We now form the first  regression model with linear terms only. We will call it $model1$:

Figure 1

R-Output for the Linear Regression Model

$$SR=-15.98+1.608\ast\ x1+0.00079*x2$$

The coefficients, found in the Estimate column of the Coefficients part of the R-output, Figure 1, give the average increase in $SR$ for each additional unit of one independent variable while the other one remains fixed: for each additional g/L unit of Powder Concentration and at a fixed level of Discharge Energy , Surface Roughness is higher on average by 1.608 $μm$. Similarly, for each additional $μJ$ unit of  Discharge Energy and at a fixed level of Powder Concentration, Surface Roughness will increase on average by 0.00079 $μm$. The purpose of the researchers is to minimize Surface Roughness. The extremely small  average change in Surface Roughness due to the change of Discharge Energy arises some suspicions; we will inspect more the Discharge Energy’s regression coefficient.

Statistical Significance Testing of Regression Coefficients

In the first part of the R-output, Figure 2, some statistically tests for the estimated regression coefficients are conducted.

Figure 2

R-Output for $model1$, with Linear Factors Only

The output of the software is extremely important, and it must not be used only for getting p-values and coefficients of determination, because many other parameters will clarify the accuracy of the model.

The estimates of the regression coefficients are computed based on the sample drawn from the population (formulas are not given because they are lengthy and always computed by R). Very probably, if another sample is drawn from the population, the estimates of the coefficients will be different. Are  the coefficients’ estimates of the model generalizable to the whole population?

To answer the question, firstly, the existence of the regression coefficient must be ensured: a statistical significance testing is needed to check if the coefficient of the variable $x1$ , for example, is different from zero. If the coefficient is zero, then the variable $x1$ is not needed in the model, i.e., it does not have any impact on the dependent variable. The null hypothesis $H_0$ , that has to be checked is

$H_0$: The real regression coefficient of $x1$ (appropriate to the population model) is equal to zero.

As it is common in all statistical tests, a test statistic must be computed. The test statistic must be the ratio of the estimated coefficient in the regression model, 1.608, divided by  the standard deviation of the regression coefficients. The standard deviation of the coefficients of a specific variable, in this case $x1$, is the standard deviation of the distribution formed by all the estimated coefficients in the models that are based on all the samples of the same size.  It is called the standard error of the coefficient and is found in the $Std.Error$ column of the R-output.

It has been proven that the test statistic of this type follows a $t\ distribution$ with

(sample size- number of the estimated parameters in the model) $degrees\ of\ freedom$.

Since in the model there are two estimated parameters namely, the intercept and the coefficient of the variable $x1$, the $t\ distribution$ has 14 $degrees\ of\ freedom$.

The $t-statistic$ is

$$t=\frac{1.608}{standard\ error\ of\ the\ regression\ coefficients\ of\ x1}=\frac{1.608}{0.4577}=3.513$$

and is given in the  column in the R-output.

In the $t\ distribution$ with 14 degrees of freedom, we locate the critical value to the right of where the estimated coefficients of models that are exceedingly rare to be selected are placed. In fact, we set that there is probability less than 5% to select samples whose associated models have an estimated coefficient greater than or equal to the chosen critical value.

The probability in $Pr(>|t|\ )$ column, $p-value$, indicates the probability of choosing a sample of size 16 that yields the regression estimate of 1.608, under the condition the null hypothesis is true. Formulated in other words: Under the condition the coefficient of Powder Concentration in the regression model for the whole population is zero, how probable is it to choose a sample from this population which will yield an estimate regression coefficient whose associated value t will be greater than or equal to 3.513? The probability of finding such a sample  is extremely small, so we have evidence that the null hypothesis is not true, and it has to be rejected. Therefore, the estimate does not occur by chance and the variable $x1$ has a considerable impact on the dependent variable.

Confidence Intervals for Regression Coefficients

Then, a second task is to determine a confidence interval for the regression coefficient, i.e.,  a range centered at the given non-zero regression coefficient, which will encompass the vast majority of the estimates of the regression coefficients that are coming from various samples. In statistics we always have approximations, so we must reformulate the above statement by mentioning that the range will be given to a certain degree of confidence.

The standard deviation of the coefficients, i.e., the square root of their variation which is called standard error of the regression coefficients and is given in the $Std.Error$ column of the output, gives an idea on how the coefficients are varying over all sample of a specific size.

Using the value of the standard error of the regression coefficients and the theorem that asserts that the regression coefficients of a variable calculated from models based on all samples of size 16 follow a $t\ distribution$ we can form a 95% confidence interval for the coefficients.

For example, the estimated coefficient of the variable $x1$ in  a linear model, will vary

from  1.608-2.145*0.4577=0.62 to 1.608+2.145*0.4577=2.6,

for most samples of size 16. The confidence interval for the regression coefficient is calculated by using the well-known formula 1.608+/- t*(standard error of the regression coefficient). In this case, $t$ is the $critical value$ which divides the $t-distribution$ into parts that cover some percentages of distribution; they are set at 95% and 5% and represent the level of confidence and significance, respectively.

Predictions or Fitted Values

If we substitute $x1$ and $x2$ with their actual measurements, Table 1, then we get a prediction (fitted) value for the output variable. The purpose of regression analysis is to find a “good” model that minimizes the distance between the fitted and observed values at each setting of input factors, i.e.,  it minimizes the sum of squares of the residuals. We can get the prediction values based of the above model and the difference between actual and prediction values:

Figure 3

The Difference Between Actual and Prediction Values for $model1$

ANOVA Output

Figure 4

R-output for ANOVA Investigation of $model1$

Some particularly important information about the predictions of $model1$ is given in the R-output (Figure 4)

When we examine the analysis of variance, the total sum of squares is divided into two parts: The sum of squared regression, SSR, which is the sum of the squared distances of each fitted value from the mean of $surface\ roughness$: $$SSR=\left(0.1242-12.195\right)^2+\left(0.1314-12.195\right)^2\ldots\left(24.2665-12.195\right)^2 =1292.7$$ and

The sum of the squared residuals, the squared differences of the actual and fitted values, which is the sum of squared errors, SSE: $$SSE=\left(0.1242-14.31\right)^2+\left(0.1314-5.38\right)^2\ldots\left(24.2665-37.12\right)^2 =1361.3$$ It is obviously true that: $$SST= SSR+SSE$$

So, SSE is a measure of the prediction errors and SSR a measure of the prediction success. Intuitively, we understand that we must divide each of these squared sums by an appropriate number for getting something similar to the average dispersion, i.e., the variance. This number is termed $degrees\ of\ freedom$ and has well determined values:

for the sum of squared regression (i.e., SSR) is equal to the number of the independent variables in the regression model

for the sum of squared errors or residuals (i.e., SSE) is equal to (the sample size- number of estimated coefficients in the model; number of estimated coefficients is the number of variables increased by1, because of the intercept in the model).

The division of the sum of the squared distances of the predictions around the mean by its degrees of freedom is equal to the variance of the regression terms:

1292.7/2= 646.34

The degrees of freedom of the residuals is 16-3 = 13. Their variance is 1361.6/13= 104.74

Are There Some Other Important Numbers in the First Part of the R-Output?

Yes! Definitely! If we do not take into consideration the residual standard error (Figure 2) an important part of information will be missed.

In the output, the range of residuals (difference between actual and prediction values) is being calculated. In this example, the residuals range from -14 to +14. The residuals’ large range is due to the remoteness of some observations from the regression line. A close examination of the values of Figure 4 leads to the same conclusion, but tables with the residuals are not easily handled, especially if we deal with a large sample size. However, there is a “unit” of measurement of the spread of residuals, the residual standard error, which is the standard deviation of the residuals about zero (the mean of the residuals is always zero). It is the squared root of the variance of the residuals (the standard deviation is always the square root of the variance) $$Residuals\ standard\ error=\sqrt{variance\ of\ the\ residuals}=\sqrt{SSE/df}=\sqrt{1361.6/13}=10.23$$ If the residual standard error is a large number compared to the given data, this is a first hint about the low predictive power of the model. In our example, the standard deviation of the residuals is 10.23 greater than the median of the Surface Roughness. So, this is a strong evidence that the model with the two independent variables is not an accurate one.

This number is the residuals standard error.

What is the ANOVA F-test?

To estimate the overall significance of the model i.e. if the model has truly a meaning, we perform the F-test , given in Figure 2.

In the F-test, the accuracy of the studying model over the baseline model is checked.

In other words, the F-test examines whether the polynomial model explains (predicts) the data better than the baseline model, where each prediction is equal to the mean of the dependent variable.

The F statistic is $$F= \frac{SSR/2}{SSE/13}=\frac{646.34}{104.7}=6.171$$ and is greater than the F-critical value of 3.81 with 2 and 13 degrees of freedom saying that the model must have a significant value in explaining the dependent variable, i.e. the reduction in error that is obtained when using the model with the two input factors to predict $Surface\ Roughness$ is not a random occurrence.

In fact, the $F-value$ means that, we can explain 6.171 times more variation when using the model with two predictors than when using the average; the test assures that this has not happened by chance.

It is important to observe that the denominator of the $F-statistic$ ratio is the square of the residual standard error, i.e., we compare the mean of sum of squared regression to the square of the residual standard error in order to decide if is a small or a large number.

Certainly, it is not a terribly interesting result since it provides a comparison against the model with no variables at all. Obviously, any polynomial is better than the baseline model.

What is the Coefficient of Determination, Denoted as R2?

The coefficient of determination $R^2$ is the ratio of the sum of squared regression to the total sum of squares and gives the percent of variation explained by the model.

$$R^2=\frac{sum\ of\ squared\ regression}{total\ sum\ of\ squares}=\frac{1291.7}{2654.262}=48.7%$$

Briefly, greater the coefficient of determination strongest the model will be since it will explain more of the total variance around the mean. The coefficient of determination is a number always positive (since it is the ratio of positives) and smaller or equal to 1 since the numerator is smaller than the denominator which represents the total variation. In this example, $R^2$ asserts that 48.7% of the variation of the $Surface\ Roughness$ is explained by the model of the two independent variables.

Is the Coefficient of Determination, R2, an Important Measure for the Accuracy of the Model?

Yes, but…

we must take into account the adjusted coefficient of determination as well, adjusted $(R^2).$ It is well known that the addition of a variable in the model always increases $R^2$. The adjusted. $(R^2)$ is a modified measure that considers the number of independent variables and the sample size and it always is smaller than $R^2$. Although the addition of an independent variable will cause $R^2$ to rise, the adjusted $(R^2)$ may fall if the new variable does not contribute a lot in the explanation of the dependent variable.

We also observe that the coefficient of the variable $x2$ is remarkably close to zero and is not statistically significant. The linear regression model with dependent variable $SR$ and independent variable $x2$ shows that the coefficient of determination is zero! (Figure 5). Also, the sum of squared regression is zero, which means that the variation of the Surface Roughness is not influenced by the variable Energy. So, by using the variable Energy gives no better predictions than using the mean of Surface Roughness. Very probably, the levels of Energy chosen are not the appropriate ones.

Figure 5

R-Output  of The Linear Model with One Independent Variable

Second  Order Factors Model

We continue by adding to the previous multiple regression model the input factors raised to the power of two. Then the corresponding model is

Figure 6

R-Output for $model2$, With Linear and Square Factors

$$SR=73.53-9.898\ast x1+0.037\ast x2+0.33\ast (x1)^2- 0.0004\ast (x2)^2$$

We will call the quadratic model, $model2$. From the R- output for the quadratic model (Figure 6) we see that the coefficients of $x1$ and ${(x1)}^2$ are statistically significant, although the coefficients of $x2$ and ${(x1)}^2$ remain statistically insignificant. The  coefficient of determination increases to 89.42%  since two new variables are added, and the adjusted $R^2$ increases substantially related to the adjusted $R^2$ of the linear model. The overall model is statistically significant since the probability of getting an $F-statistic$ equal to 23.24 is extremely small.

The $Residual\ standard$ error is 5.052 less than the half of the $Residual\ standard\ error$ of the linear model, $model1$ (10.23).

So, by using the linear model for predictions the absolute average distance of the difference between actual and prediction values is the double of the absolute distance of the difference between the actual and predicted values when using the quadratic model. This calculation shows the superiority of the quadratic over the linear model.

Also, the expected range of the regression coefficient of the variable $x1$ across multiple samples of size 16 can be calculated

A 95% confidence interval for the regression coefficient of the variable $x2$ will be centered at -9.89. The standard error of the regression coefficient is used as a measure, as a “safe” radius, for creating the range over which will lie the regression coefficients of most quadratic models associated with other samples of size 16. The formula that is used is: $$\left(-9.89-t_{0.025}\ast1.78,\ \ -9.89+t_{0.025}\ast1.78\right)=(-13.82,\ -5.97)$$

By substituting $x1$ and $x2$ with the actual values of the factors Powder Concentration and Energy we get the fitted values, based on the above squared model for each experimental settings. The difference between actual and fitted values gives a first insight for the accuracy of the model:

Figure 7

The Difference Between Actual and Predicted Values for the Quadratic Model

The R-output of ANOVA (Figure 8)  gives more insight for the quadratic model:

Figure 8

Analysis of Variance for the Quadratic Model

More detailed explanations:

The sum of squared regression, SSR, is the sum of squares of the differences between each predicted value and the mean of the surface roughness (Figure 7): $$SSR=\left(8.31-12.195\right)^2+\left(8.38-12.195\right)^2\ldots\left(32.45-12.195\right)^2 =2373.458$$ But the sum of squared regression due to the linear factors is 1292.7 (Figure 2). So, the sum of squared regression due to the factors of the second order is $$2373.458-1292.7=1080.8$$ In this case, the residuals have 11 degrees of freedom(sample size-number of estimated coefficients in the model=16-5) since in the new model there are 5 estimated coefficients namely, the intercept, the variables of first order and two variables of second order. $$SSE=\left(14.31-8.31\right)^2+\left(5.38-8.38\right)^2\ldots\left(37.12-32.45\right)^2 =280.80$$ As each new variable is entered into the regression equation, R performs a statistical test of nonlinear component which indicates if the new variable is really needed: the linear and quadratic terms in the ANOVA test are statistically significant because of the appearance of $x1$ although $x2$ remains statistically insignificant
The Full Model with Square Terms and Interactions

The full regression model, we will call it $model3$, is the one that includes the factors of the first and second order and their interactions:

Figure  9

R-Output for The Full Model

$$SR=103.51–11.61\ast x1-0.55\ast x2+0.33\ast\left(x1\right)^2-0.0003{\ast\left(x2\right)}^2\\ +\ 0.034\ast x1\ast x2$$

The new variable of interactions is not statistically significant (Figure 9), so not really needed; The same is true for the variables $x2$ and $({x2)}^2$ which were not significant in the previous model as well.

The coefficient of determination, $R^2$, increases to 92.12% and the adjusted $R^2$ is 88.18%. The coefficient of determination, $R^2$, of the full model $(model3)$ has increased related to the coefficient of determination of the quadratic model, and the percentage increase is almost the same as the increase of the adjusted coefficients of determination of the two models. So, the full model contributes substantially to the explanation of the variation about the mean.

The performance of the $F-test$ shows that the overall significance of the full model is statistically significant and the high value of the coefficient of determination cannot occur by chance.

Observing the residuals we see that their standard error, so the standard deviation of the distances between the observed and expected values, has been ameliorated but there are some residuals greater in magnitude related to the ones of the quadratic model (experimental trails 8 and 12)

The differences between the actual and predicted based of the full model values are ameliorated as we can observe:

Figure 10

The Difference Between Actual and Predicted Values for the Quadratic Model

ANOVA for model3

The new variable of interactions is not statistically significant, so not really needed in the model;  the sequential sum of squared regression is equal to only 71.70. The squared sum of errors decreases to 209.10 compared to 280.80 of the quadratic model, although their range is little larger than the range of the residuals of the quadratic model (Figure 11)

Figure 11

Analysis of Variance for $model3$

Must we Test the Assumptions for Multiple Regression after the Model has been Derived?

Yes. Testing the assumptions must occur not only in the initial phase of the multiple regression, but also after the model has been estimated. The four assumptions that must be checked are

• Linearity of the phenomenon measured
• Normality of the error terms
• Constant variance of the error terms (homoscedasticity)
• Independence of the error terms

The above assumptions are usually examined by observing the diagnostic plots, i.e., the plots of the residuals against the observed or predicted values of the dependent variable. The residuals are standardized in order to be comparable. In the following diagnostic plot, we check the linearity of the relationship between the dependent and all the independent variables in the three models that have been derived.

Diagnostic Plot 1

Linearity of the Phenomenon in $model1, model2, model3$

Standardized Residuals Plotted Against Fitted Values

In the first part of the diagnostic plot for linearity, the residuals of the linear model, $model1$, follow a curvilinear pattern, they are not randomly dispersed around the horizontal line, and this is an indication of a non-linear relations between the independent variables and $SR$. So, the statisticians’ choice of a second order model for the description of the relations between variables in micro-EDM experiments is supported by this diagnostic plot. The residuals of the diagnostic plot of the full model, $model3$, are much better than the residuals of $model2$; they are more dispersed and horizontal.

Diagnostic Plot 2

Normal Q-Q Plot. Testing the Normality of Residuals

The normal Q-Q plot detects if the residuals follow a normal distribution and satisfy the second assumption. The residuals are normally distributed if the points follow the dotted line closely. The residuals of the linear model and the full model pass the test of normality easily, although the residuals of $model2$ have some problems.

Diagnostic Plot 3

Scale Location Plot. Testing the Homoscedasticity of Residuals

Residuals Plotted Against Fitted Values

The third diagnostic plot check the third condition, i.e., the variance of the residuals should be reasonably equal across the predicted values. The condition is fulfilled if the points of the plot are spread without a distinct pattern, and if they are spread horizontally is an ideal situation for homoscedasticity. In this case, the residuals of the linear and full model show some patterns of heteroscedasticity, especially for the values up to 15.

Graphical Presentation of The Surface

In graphical terms, a two variables multivariate polynomial equation is portrayed by a surface with a peak or a valley. R offers tools, extremely handful, for creating the response surfaces of the models.  By observing the response surface, we can get an insight about the behavior of the variables and the way they are changing. We also can locate the ranges of the independent values which optimize the surface roughness

Graph 1

Graphical Presentation of The Surface Associated to the Full Model

Graph 2

Contour Plot Associated to the Full Model

error: Content is protected!