How To Determine If A Data Point Is An Outlier
Previously in Lesson 4 we mentioned two measures that we use to help identify outliers. They are:
- Residuals
- Standardized Residuals
We briefly review these measures here. However, this time, we add a little more detail.
Residuals
As you know, ordinary residuals are defined for each observation, i = 1, ..., n, as the deviation between the observed and predicted responses:
\[e_i=y_i-\hat{y}_i\]
For example, consider the following very small (contrived) data set containing n = 4 data points (x, y).
The column labeled "FITS1" contains the predicted responses, while the column labeled "RESI1" contains the ordinary residuals. As you can see, the first residual (-0.2) is obtained by subtracting 2.2 from 2; the second residual (0.6) is obtained by subtracting 4.4 from 5; and so on.
As you know, the major problem with ordinary residuals is that their magnitude depends on the units of measurement, thereby making it difficult to use the residuals as a way of detecting unusual y values. We can eliminate the units of measurement by dividing the residuals by an estimate of their standard deviation, thereby obtaining what are known as standardized residuals.
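The residual computation above can be sketched in a few lines. Only the first two observed/fitted pairs (2 vs. 2.2, and 5 vs. 4.4) are stated in the text, so this sketch uses just those two; the full four-point table is not reproduced here.

```python
import numpy as np

# Observed responses and fitted values for the first two observations,
# as given in the text; the remaining two points of the contrived
# data set are not shown there, so they are omitted here.
y = np.array([2.0, 5.0])      # observed responses y_i
yhat = np.array([2.2, 4.4])   # predicted responses (the "FITS1" column)

# Ordinary residuals: e_i = y_i - yhat_i (the "RESI1" column)
e = y - yhat
print(e)  # [-0.2  0.6]
```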
Standardized residuals
Standardized residuals (sometimes referred to as "internally studentized residuals") are defined for each observation, i = 1, ..., n, as an ordinary residual divided by an estimate of its standard deviation:
\[r_{i}=\frac{e_{i}}{s(e_{i})}=\frac{e_{i}}{\sqrt{MSE(1-h_{ii})}}\]
Here, we see that the standardized residual for a given data point depends not only on the ordinary residual, but also on the size of the mean square error (MSE) and the leverage \(h_{ii}\).
For example, consider again the (contrived) data set containing n = 4 data points (x, y):
The column labeled "FITS1" contains the predicted responses, the column labeled "RESI1" contains the ordinary residuals, the column labeled "HI1" contains the leverages \(h_{ii}\), and the column labeled "SRES1" contains the standardized residuals. The value of MSE is 0.40. Therefore, the first standardized residual (-0.57735) is obtained by:
\[r_{1}=\frac{-0.2}{\sqrt{0.4(1-0.7)}}=-0.57735\]
and the second standardized residual is obtained by:
\[r_{2}=\frac{0.6}{\sqrt{0.4(1-0.3)}}=1.13389\]
and so on.
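This worked example translates directly into code. The residuals, leverages, and MSE below are the values quoted above for the first two observations of the contrived data set; the other two observations are not listed in the text.

```python
import numpy as np

# Ordinary residuals e_i and leverages h_ii for the first two
# observations, taken from the worked example; MSE = 0.40.
e = np.array([-0.2, 0.6])
h = np.array([0.7, 0.3])
mse = 0.40

# Standardized (internally studentized) residuals:
#   r_i = e_i / sqrt(MSE * (1 - h_ii))
r = e / np.sqrt(mse * (1 - h))
print(np.round(r, 5))  # [-0.57735  1.13389]
```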
The good thing about standardized residuals is that they quantify how large the residuals are in standard deviation units, and therefore can be easily used to identify outliers:
- An observation with a standardized residual that is larger than 3 (in absolute value) is deemed by some to be an outlier. [It is technically more correct to reserve the term "outlier" for an observation with a studentized residual that is larger than 3 in absolute value; we consider studentized residuals in the next section.]
- Some statistical software flags any observation with a standardized residual that is larger than 2 (in absolute value).
Using a cutoff of 2 may be a little conservative, but perhaps it is better to be safe than sorry. The key here is not to take the cutoffs of either 2 or 3 too literally. Instead, treat them simply as red warning flags to investigate the data points further.
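A minimal flagging helper along these lines might look as follows; the function name and the sample residual values are hypothetical, chosen only to illustrate the two cutoffs.

```python
import numpy as np

def flag_outliers(standardized_residuals, cutoff=2.0):
    """Return the indices of observations whose standardized residual
    exceeds the cutoff in absolute value. These are red warning flags
    to investigate further, not automatic grounds for deletion."""
    r = np.asarray(standardized_residuals)
    return np.where(np.abs(r) > cutoff)[0]

# Hypothetical standardized residuals; only the third is extreme.
r = np.array([-0.6, 1.1, 3.68, -0.4])
print(flag_outliers(r, cutoff=2))  # [2]
print(flag_outliers(r, cutoff=3))  # [2]
```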
Example #2 (again)
Let's take another look at the following data set (influence2.txt).
In our previous look at this data set, we considered the red data point an outlier, because it does not follow the general trend of the rest of the data. Let's see what the standardized residual of the red data point suggests:
Indeed, its standardized residual (3.68) leads this software to flag the data point as an observation with a "Large residual."
Why should we care about outliers?
We sure spend an awful lot of time worrying about outliers. But why should we? What impact does their existence have on our regression analyses? One easy way to learn the answer to this question is to analyze a data set twice, once with and once without the outlier, and to observe differences in the results.
Let's try doing that with our Example #2 data set. If we regress y on x using the data set without the outlier, we obtain:
And if we regress y on x using the full data set with the outlier, we obtain:
What aspect of the regression analysis changes substantially because of the existence of the outlier? Did you notice that the mean square error MSE is substantially inflated from 6.72 to 22.19 by the presence of the outlier? Recalling that MSE appears in all of our confidence and prediction interval formulas, the inflated size of MSE would thereby cause a detrimental increase in the width of all of our confidence and prediction intervals. Yet, as noted in Section 9.1, the predicted responses, estimated slope coefficients, and hypothesis test results are not affected by the inclusion of the outlier. Therefore, the outlier in this case is not deemed influential (except with respect to MSE).
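The fit-twice comparison described above can be sketched as follows. Since the influence2.txt data are not reproduced here, this sketch uses a hypothetical data set with one point placed far off the trend line; the specific numbers are illustrative, not the 6.72 and 22.19 from the example.

```python
import numpy as np

def fit_mse(x, y):
    """Fit simple linear regression by least squares and return the
    coefficient vector (intercept, slope) and MSE = SSE / (n - 2)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid / (len(x) - 2)

# Hypothetical data on a linear trend, plus one outlier at x = 4
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2 + 3 * x + rng.normal(scale=1.0, size=20)
x_out = np.append(x, 4.0)
y_out = np.append(y, 40.0)  # well above the trend value near 14

beta_clean, mse_clean = fit_mse(x, y)
beta_full, mse_full = fit_mse(x_out, y_out)
print(mse_full > mse_clean)  # True: the outlier inflates MSE
```

Running the fit with and without the extra point shows the same qualitative effect as in the example: MSE is inflated, while the fitted slope moves comparatively little.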
Source: https://online.stat.psu.edu/stat462/node/172/