Data visualization toolbox
Bivariate data consist of paired samples or measurements of two quantitative variables. The general task with this type of data is to determine how the variables are related. In some cases the variables are a factor (independent variable) and a response (dependent variable). In other cases the variables are not functionally related, but their distributions can be compared.
This chapter includes examples of analyzing several data sets.
The workhorse method for curve fitting is least squares regression -- estimating the curve that minimizes the sum of the squares of the residuals. The most common practice is to attempt to fit a straight line to the data. This is just the first order case of polynomial fitting, which is the most popular family of parametric curve fits.
Statistical theory holds that the least squares estimators of the regression of Y on X with residuals R are the most efficient of all unbiased estimators, provided that: X has no error, the mean of R is 0, the variance of R does not depend on X, R is normally distributed, and the values of R are independent of one another. Since these conditions are rarely satisfied in full, it is important to visualize the fit to check that the regression is sensible.
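The tutorial's supporting scripts are MATLAB; purely as an illustration of the mechanics, here is a Python sketch (using NumPy, with made-up quadratic data rather than the ganglion measurements) of fitting first and second order polynomials by least squares and examining the residuals:

```python
import numpy as np

def poly_fit_residuals(x, y, degree):
    """Least squares polynomial fit; returns coefficients and residuals."""
    coeffs = np.polyfit(x, y, degree)   # minimizes the sum of squared residuals
    fitted = np.polyval(coeffs, x)
    return coeffs, y - fitted

# Made-up quadratic data: a first order fit leaves a systematic trend
# in the residuals, while a second order fit does not.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + 0.3 * x**2

_, r1 = poly_fit_residuals(x, y, 1)   # straight line: large structured residuals
_, r2 = poly_fit_residuals(x, y, 2)   # quadratic: residuals near zero
```

Plotting r1 against x would show the kind of bowed residual pattern that a residual dependence plot reveals for a too-simple fit.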
Figure 3.1 First and second order polynomial fits to the ganglion data. (book 3.5)
Figure 3.1 contains the ganglion data and the straight line fit which the experimenters chose to explain it. Clearly a second order fit is much more consistent with these data.
It is desirable to use the lowest order which fits the data. Attempts to use high order polynomial curve fitting often result in numerical instabilities and poor fits. The curves either don't track the data trends or they exhibit an excess of wiggles.
In this case, however, there is no motivation to select the first order fit rather than the second.
Figure 3.2 First order polynomial fit to the log transformed data. (book 3.16)
Log transformation of one or both variables can be useful for "straightening" curvature in the data. Transforming the CP ratio in the ganglion data leads to the plot in Figure 3.2. A linear fit works well with this transformed data.
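A sketch of the idea in Python with NumPy (the data here are synthetic exponential values, not the CP ratios): a straight line fits the log transformed response far better than the raw response.

```python
import numpy as np

# Synthetic curved data: y grows exponentially with x,
# so log(y) is exactly linear in x.
x = np.linspace(1.0, 5.0, 40)
y = 3.0 * np.exp(0.8 * x)

# A first order fit to the raw data leaves large curved residuals...
raw_resid = y - np.polyval(np.polyfit(x, y, 1), x)

# ...while a first order fit to the log transformed response is nearly exact.
log_resid = np.log(y) - np.polyval(np.polyfit(x, np.log(y), 1), x)
```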
(loess.m)
Figure 3.3 Sixth and thirteenth order polynomial fits to the melanoma data.
Polynomial curves are unable to track complicated data trends. The oscillatory data in Figure 3.3 provide an example.
Figure 3.4 Loess fit to the melanoma data.
Nonparametric methods have been developed for curve fitting data with complex shapes. A popular method is loess (also spelled lowess). This local regression smoother is able to follow the complex data trend in Figure 3.4.
The loess curve is calculated one point at a time. For each of these points, a value is computed based on a weighted regression using the neighboring data points. There are two loess parameters: the smoothness alpha (usually between 0 and 1) and the degree (usually 1 or 2). Consider a loess fit to n data points (X, Y). To determine the fit value at x0, compute and sort all the distances d_i = |X_i - x0|. Let d_(i) be the ith smallest distance. Let q = ceiling(alpha * n), where alpha is the smoothness parameter. Calculate a distance scale D = d_(q). The tricube weight function is T(u) = (1 - |u|^3)^3 for |u| < 1 and T(u) = 0 otherwise. The neighborhood weights for the data points are w_i = T(d_i / D). The fit value at x0 is then calculated by first or second degree weighted regression using these weights. This process is repeated for each of the fit points.
Examining the residuals is an important part of curve fitting. For a high quality fit, the residuals should not display a trend or a varying spread with the factor; and their distribution should be approximately normal. Residuals for three candidate fits to the ganglion data are displayed below. Loess curves are included to help visualize trends in the residuals.
Figure 3.5 Residual dependence plot of the first order polynomial fit to the ganglion data. (book 3.12)
(rdplot.m)
Figure 3.5 displays the residuals for the first order fit to the ganglion data. It reinforces the evidence in Figure 3.1 that the simple linear fit does not describe the data.
The second order fit in Figure 3.7 and the linear fit to the log transformed data in Figure 3.6 both display a satisfactory lack of residual trend with the factor.
Figure 3.6 Residual dependence plot of the first order polynomial fit to the log transformed ganglion data. (book 3.17)
Figure 3.7 Residual dependence plot of the second order polynomial fit to the ganglion data. (book 3.13)
(slplot.m)
Spread location plots in Figures 3.8 and 3.9 can be used to check for homogeneous residuals, in particular for monotone spread. Loess curves are included to help visualize trends in the residual spread. These figures suggest that the log transformation has enabled a better fit with no monotone spread of the residuals.
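As a sketch of what a spread location plot computes (Python with NumPy; the residuals are synthetic, built with deliberately increasing spread, and the square-root-of-absolute-residual vertical coordinate follows the usual convention for these plots):

```python
import numpy as np

def spread_location(fitted, residuals):
    """Spread location plot coordinates: sqrt(|residual|) against the
    fitted value, ordered by fitted value."""
    order = np.argsort(fitted)
    return fitted[order], np.sqrt(np.abs(residuals))[order]

# Synthetic heteroscedastic residuals: the spread grows with the fitted
# value, which appears as an upward trend in the spread location plot.
rng = np.random.default_rng(0)
fitted = np.linspace(1.0, 10.0, 200)
resid = fitted * rng.normal(size=200)      # monotone spread

xs, spread = spread_location(fitted, resid)
lower_half = float(spread[:100].mean())    # spread at small fitted values
upper_half = float(spread[100:].mean())    # spread at large fitted values
```

Here upper_half exceeds lower_half, the numeric signature of the monotone spread that the plot makes visible.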
Figure 3.8 Spread location plot for the residuals from the second order fit. (book 3.14)
Figure 3.9 Spread location plot for the residuals from the first order fit to the log transformed data. (book 3.18)
Figure 3.10 Normal Q-Q plot of the residuals from the linear fit to the log transformed data. (book 3.20)
The least squares approach is based on an assumption that the residuals are normally distributed. Figure 3.10 shows that this is appropriate for the log transformed case.
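A normal Q-Q plot pairs the sorted residuals with standard normal quantiles. A small Python sketch using only the standard library (the residuals here are constructed to be exactly normal-shaped, not taken from the ganglion fit):

```python
from statistics import NormalDist

def qq_points(residuals):
    """Normal Q-Q plot points: sorted residuals paired with standard
    normal quantiles at probabilities (i - 0.5) / n."""
    n = len(residuals)
    sample_q = sorted(residuals)
    theory_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theory_q, sample_q

# Residuals built from normal quantiles with mean 2 and standard
# deviation 0.5: the Q-Q points then fall on a straight line with
# slope 0.5 and intercept 2.
resid = [2.0 + 0.5 * NormalDist().inv_cdf((i - 0.5) / 100)
         for i in range(1, 101)]
tq, sq = qq_points(resid)
slope = (sq[-1] - sq[0]) / (tq[-1] - tq[0])
```

Departures from a straight line in such a plot indicate non-normal residuals.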
Figure 3.11 R-F spread plot for the linear fit to the log transformed data. (book 3.19)
The residual-fit spread plot (rfplot.m) consists of quantiles of the fitted values minus their mean, and quantiles of the residuals. Figure 3.11 indicates that this fit accounts for most of the data variation.
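A sketch of the quantities behind an r-f spread plot (plain Python; the numbers are toy values, not the ganglion fit):

```python
def rf_spread(fitted, residuals):
    """R-f spread plot coordinates: quantiles of (fitted - mean) and of
    the residuals, each at probabilities (i - 0.5) / n."""
    n = len(fitted)
    mean = sum(fitted) / n
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return probs, sorted(f - mean for f in fitted), sorted(residuals)

# Toy fit: the fitted values spread over 4 units while the residuals
# spread over only 0.2, so the fit accounts for most of the variation.
fitted = [1.0, 2.0, 3.0, 4.0, 5.0]
resid = [0.1, -0.1, 0.05, -0.05, 0.0]
probs, f_spread, r_spread = rf_spread(fitted, resid)
```

Plotting the two quantile sets side by side against probs shows at a glance whether the fitted-value spread dominates the residual spread.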
For the ganglion data, the linear fit to the log transformed data is the best of the candidate fits. It is consistent with the data. The residuals do not exhibit a trend or a spread with the factor. The residuals are approximately normally distributed. The fit explains most of the data variation.
(bisquare.m)
Experimental data often contain values which look like outliers. How should we deal with these values without being arbitrary? The bisquare method provides a robust approach for these cases. This iterative method weights the data points according to the sizes of their residuals from the current fit.
Figure 3.12 Dating data with simple unweighted and robust bisquare linear fits. (book 3.25)
Figure 3.12 contains the dating data. A simple unweighted linear regression is heavily influenced by the points at thorium ages of 17 and 27. The robust regression using bisquare is more consistent with the bulk of the data.
Bisquare can be used with both parametric and nonparametric fitting methods.
The bisquare method is usually considered a means to suppress the influence of outliers. Examination of the bisquare weights can also be used to identify outliers.
Bisquare fitting is an iterative least squares method which uses bisquare weights. Start with a standard least squares fit. Calculate the residual r for each data point. As a measure of spread, s, calculate the median absolute deviation of the residuals. The bisquare weight function, B, is given by B(u) = (1 - u^2)^2 for |u| < 1 and B(u) = 0 otherwise. The robustness weight, w, for each data point is given by w = B(r / (6s)). Use these weights to calculate a new fit with new residuals and repeat until the fit converges.
(jitter.m)
In many experiments the values are integers or they are rounded to a set of finite steps. This can make a simple scatter plot unsatisfactory. Figure 3.13 is a scatterplot of the fly data. There are 823 data points, but only 102 are visible. Figure 3.14 displays the same data set with both variables jittered.
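A jittering helper is simple to sketch (plain Python; the data below are invented integer counts, not the fly measurements):

```python
import random

def jitter(values, amount):
    """Add uniform noise in [-amount, amount] to each value, breaking up
    overplotted points. For display only -- analyze the original values."""
    return [v + random.uniform(-amount, amount) for v in values]

# Invented integer data with heavy overplotting: 300 points, but only
# 3 distinct positions would be visible in a raw scatterplot.
facet = [10, 10, 10, 11, 11, 12] * 50
jittered = jitter(facet, 0.2)
```

The jitter amount should be small relative to the spacing of the distinct values, so that points spread out without migrating between groups.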
Figure 3.13 Scatterplot of the fly data. (book 3.45)
Figure 3.14 Jittered plot of the fly data. (book 3.46)
The question is whether the facet number has a linear dependence on temperature. An earlier analysis using analysis of variance concluded that there is a departure from linearity. Figure 3.15 provides a picture to go with this conclusion. However, this use of a simple linear regression is inappropriate given the experimenters' report of uncertainty in the reported temperatures.
It is all too common for transcription errors to occur in real-world experiments. If we swap the 23 and 25 degree data, the result is quite linear. (Figure 3.16)
Figure 3.15 Mean values for the facet number at each temperature and a linear fit based on all data. (book 3.49)
Figure 3.16 The same data as Figure 3.15 with the 23 and 25 degree values switched.
Figure 3.17 Time series plot of the melanoma data. (book 3.64)
A special case of bivariate data is the time series. In the ideal time series, the values correspond to uniform time steps with no gaps. Time series often contain variation components at more than one time scale. The incidence of melanoma (a skin cancer) in Figure 3.17 appears to oscillate irregularly about a long term increasing trend.
Figure 3.18 Loess fit to the residuals from the trend fit to the melanoma data. (book 3.67)
An approximately linear loess curve was fit to these data to remove the long term trend and the residuals are plotted in Figure 3.18 together with a less smooth loess curve to show the remaining oscillations.
Is this oscillatory component related to solar activity? The loess curve from Figure 3.18 is plotted in Figure 3.19 together with the number of sunspots observed during the same period. It is easier to see the extent of the agreement between these two time series if we fit a loess curve using the same parameters to the sunspot data and shift it to align the cycles. The result is plotted in Figure 3.20.
Figure 3.19 Oscillatory component of the melanoma data compared with sunspot numbers. (book 3.71)

Figure 3.20 Oscillatory component of the melanoma data compared with smoothed and shifted sunspot numbers.
Some time series contain a seasonal component -- one which is quite regular on a certain time scale. The carbon dioxide data in Figure 3.21 contain a clear annual component as well as a long term trend.
(bank45.m)
Figure 3.21 Time series of the CO2 data plotted with aspect ratio set to 1. (book 3.73)
In Figure 3.21 the data have been plotted at an aspect ratio chosen so that the long term trend is displayed at an angle of approximately 45 degrees. This makes it easy to see that the long term trend is convex upward -- the rate of increase is increasing. Adjusting the aspect ratio of a graph to control the orientations of trends is called banking; it enhances the visual perception of line segment orientations. The aspect ratio in Figure 3.21 enhances the long term trend, but it does not reveal much about the annual cycles.
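One simple banking rule, the median absolute slope method, chooses the vertical scaling so that the median absolute segment slope is banked to 45 degrees. A Python sketch (the slope-2 line data are illustrative):

```python
def banking_aspect(x, y):
    """Vertical scale factor that banks the median absolute segment
    slope to 45 degrees (median absolute slope banking rule)."""
    slopes = [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(zip(x, y), zip(x[1:], y[1:]))
        if x1 != x0
    ]
    mags = sorted(abs(s) for s in slopes)
    n = len(mags)
    median = mags[n // 2] if n % 2 else 0.5 * (mags[n // 2 - 1] + mags[n // 2])
    return 1.0 / median   # multiplying y by this makes the median |slope| 1

# A line of slope 2 banks to 45 degrees when the y axis is halved.
aspect = banking_aspect([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0])
```

More refined banking algorithms average segment orientations rather than slopes, but the median rule conveys the idea.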
Figure 3.22 Time series of the CO2 data with local segments banked to 45 degrees. (book 3.72)
By banking the local curve segments to 45 degrees as in Figure 3.22, we can see that the annual cycles are not symmetric. Many of the figures in this tutorial use banking to improve the visualization.
(seasonaloess.m)
Figure 3.23 Cycle plot of the CO2 data. (book 3.75)
Seasonal loess can be used to extract a periodic component from a time series for detailed examination. In Figure 3.23 the seasonal component is displayed in a cycle plot -- the component values are grouped by month. This display reveals that the amplitude of the annual oscillation is increasing over time.
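Grouping the seasonal component by month is the core of the cycle plot. A sketch in plain Python (the monthly pattern and trend are invented, not the CO2 values):

```python
def cycle_groups(values, period=12):
    """Group a uniformly sampled series by position within the cycle
    (all Januaries together, all Februaries together, ...)."""
    groups = [[] for _ in range(period)]
    for i, v in enumerate(values):
        groups[i % period].append(v)
    return groups

# Invented monthly series: three years of a fixed annual pattern plus
# a linear trend of 0.1 per month.
pattern = [0, 1, 3, 5, 6, 5, 3, 1, 0, -2, -4, -2]
series = [pattern[t % 12] + 0.1 * t for t in range(36)]

by_month = cycle_groups(series)
january = by_month[0]   # one value per year, showing the within-month trend
```

Plotting each monthly subseries side by side, as in Figure 3.23, shows how each month's values evolve across years.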
Figure 3.24 Residuals from the seasonal fit to the CO2 data. (book 3.76)
The data with the seasonal component removed are plotted in Figure 3.24, which shows the increasing trend more clearly.