Data visualization toolbox

3 Bivariate Data

 

Bivariate data consist of paired samples or measurements of two quantitative variables. The general task with this type of data is to determine how the variables are related. In some cases the variables are a factor (independent variable) and a response (dependent variable). In other cases the variables are not functionally related, but their distributions can be compared.

Data sets

This chapter includes examples of analyzing several data sets:

Ganglion
Data about the distribution of ganglion cells across the surface of the retinas of cats as they develop. The goal of the analysis is to study how the ratio of the number of ganglion cells in the center of the retina to the number in the periphery depends on retinal area. The experimenters reported a linear relationship, which we will see is not accurate.
Dating
Carbon and thorium based ages for 19 coral samples. The analysis goal is to calibrate the carbon ages based on the thorium ages.
Fly
Observations of the number of eye facets on flies hatched from incubators at different temperatures. The goal of the experiment was to see how the number of facets depends on temperature. The experimenters concluded that the dependence is not linear.
Melanoma
Yearly incidences of melanoma in Connecticut. The goal is to see how the incidence varies through time.
Carbon dioxide
Monthly average CO2 concentrations at the Mauna Loa observatory in Hawaii. The goal is to see how CO2 varies over short and long time periods.

 

Curve Fitting: parametric and loess

The workhorse method for curve fitting is least squares regression -- estimating a curve which has the minimum sum of the squares of the residuals. The most common practice is to attempt to fit a straight line to the data. This is just the first order case of polynomial fitting, which is the most popular family for parametric curve fitting.

Statistical theory holds that the least squares estimators of the regression of Y on X with residuals R are the most efficient of all unbiased estimators provided that: X has no error, the mean of R is 0, the variance of R is independent of X, R is normally distributed, and the values of R are independent of one another. Since these conditions are rarely satisfied in full, it is important to visualize the fit to check that the regression is sensible.

Parametric curves

Figure 3.1  First and second order polynomial fits to the ganglion data.  (book 3.5)

Figure 3.1 contains the ganglion data and the straight line fit which the experimenters chose to explain it. Clearly a second order fit is much more consistent with these data.

It is desirable to use the lowest order which fits the data. Attempts to use high order polynomial curve fitting often result in numerical instabilities and poor fits. The curves either don't track the data trends or they exhibit an excess of wiggles.

In this case, however, the preference for low order is no reason to select the first order fit rather than the second: the second order fit is itself low order, and it is clearly more consistent with the data.
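
These comparisons are easy to reproduce numerically. The following Python fragment (using made-up curved data, not the ganglion measurements) fits first and second order polynomials by least squares and compares their residual sums of squares:

```python
import numpy as np

# Hypothetical data with mild curvature -- illustrative only, not the
# ganglion data from the figure
x = np.linspace(1.0, 10.0, 40)
y = 0.5 * x**2 - 1.0 * x + 2.0

# First and second order least squares polynomial fits
c1 = np.polyfit(x, y, 1)   # straight line
c2 = np.polyfit(x, y, 2)   # parabola

# Residual sums of squares: the quadratic term captures the curvature
# that the straight line misses
rss1 = np.sum((y - np.polyval(c1, x)) ** 2)
rss2 = np.sum((y - np.polyval(c2, x)) ** 2)
```

np.polyfit minimizes the sum of squared residuals over the polynomial coefficients, which is exactly the least squares criterion described above.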


Figure 3.2 First order polynomial fit to the log transformed data. (book 3.16)

Log transform

Log transformation of one or both variables can be useful for "straightening" curvature in the data. Transforming the CP ratio in the ganglion data leads to the plot in Figure 3.2. A linear fit works well with this transformed data.
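
The straightening effect of the log transform can be sketched as follows; the power-law data here are made up for illustration, not the actual CP ratios:

```python
import numpy as np

# Hypothetical power-law data: y = 2 * x^1.5 is curved on linear axes
x = np.linspace(1.0, 20.0, 50)
y = 2.0 * x ** 1.5

# After log transforming both variables the relationship is a straight
# line: log y = log 2 + 1.5 * log x
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
```

A linear fit in the log scale recovers the exponent as the slope and the multiplicative constant as the exponential of the intercept.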

Flexible curves  (loess.m)

Figure 3.3 Sixth and thirteenth order polynomial fits to the melanoma data.

Polynomial curves are unable to track complicated data trends. The oscillatory data in Figure 3.3 provide an example.

Figure 3.4  Loess fit to the melanoma data.

Nonparametric methods have been developed for curve fitting data with complex shapes. A popular method is loess (also spelled lowess). This local regression smoother is able to follow the complex data trend in Figure 3.4.

Details of loess  (loess.m)
The loess curve is calculated one point at a time. For each of these points, a value is computed based on a weighted regression using the neighboring data points. There are two loess parameters: smoothness (usually between 0 and 1) and degree (usually 1 or 2).

Consider a loess fit to n data points (X, Y). To determine the fit value at x0, compute and sort all the distances

    delta_i = | X_i - x0 |

Let d_i be the ith smallest distance. Let

    q = ceiling( alpha * n )

where alpha is the smoothness parameter. Calculate a distance scale

    d = d_q

The tricube weight function is

    T(u) = ( 1 - |u|^3 )^3    for |u| < 1
    T(u) = 0                  otherwise

The neighborhood weights for the data points are

    w_i = T( delta_i / d )

The fit value at x0 is then calculated by first or second degree weighted least squares regression using these weights. This process is repeated for each of the fit points.
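
The steps above can be sketched in Python (an illustrative version, not the toolbox's loess.m; the function name loess_at is made up):

```python
import numpy as np

def loess_at(x0, x, y, alpha=0.5, degree=1):
    """Loess fit value at x0 via tricube-weighted polynomial regression."""
    n = len(x)
    dist = np.abs(x - x0)                          # distances to x0
    q = max(degree + 1, int(np.ceil(alpha * n)))   # neighborhood size
    d = np.sort(dist)[q - 1]                       # distance scale
    u = dist / d
    w = np.where(u < 1.0, (1.0 - u ** 3) ** 3, 0.0)  # tricube weights
    # np.polyfit weights multiply the residuals, so pass sqrt(w) to get
    # weighted least squares with neighborhood weights w
    coeffs = np.polyfit(x, y, degree, w=np.sqrt(w))
    return np.polyval(coeffs, x0)
```

Evaluating loess_at over a grid of x0 values traces out the full loess curve; points at or beyond the distance scale d simply receive weight zero.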

 

Visualizing Residuals

Examining the residuals is an important part of curve fitting. For a high quality fit, the residuals should not display a trend or a varying spread with the factor, and their distribution should be approximately normal. Residuals for three candidate fits to the ganglion data are displayed below. Loess curves are included to help visualize trends in the residuals.
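
The trend that a residual dependence plot reveals can also be checked numerically. This Python sketch uses made-up curved data, not the ganglion measurements:

```python
import numpy as np

# Made-up curved data; a first order fit leaves a systematic trend
x = np.linspace(1.0, 10.0, 50)
y = x ** 2

resid1 = y - np.polyval(np.polyfit(x, y, 1), x)   # first order residuals
resid2 = y - np.polyval(np.polyfit(x, y, 2), x)   # second order residuals

# Correlating the first order residuals with a curvature term exposes
# the trend that a residual dependence plot would show visually
trend = np.corrcoef(resid1, (x - x.mean()) ** 2)[0, 1]
```

For the first order fit the residuals are strongly structured, while the second order fit leaves residuals near zero with no trend.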

Figure 3.5  Residual dependence plot of the first order polynomial fit to the ganglion data. (book 3.12)

Trend  (rdplot.m)

Figure 3.5 displays the residuals for the first order fit to the ganglion data. It reinforces the evidence in Figure 3.1 that the simple linear fit does not describe the data.

The second order fit in Figure 3.7 and the linear fit to the log transformed data in Figure 3.6 both display a satisfactory lack of residual trend with the factor.


Figure 3.6 Residual dependence plot of the first order polynomial fit to the log transformed ganglion data. (book 3.17)
Figure 3.7 Residual dependence plot of the second order polynomial fit to the ganglion data. (book 3.13)


Monotone spread  (slplot.m)

Spread location plots in Figures 3.8 and 3.9 can be used to check for homogeneous residuals, in particular for monotone spread. Loess curves are included to help visualize trends in the residual spread. These figures suggest that the log transformation has enabled a better fit with no monotone spread of the residuals.

Figure 3.8  Spread location plot for the residuals from the second order fit. (book 3.14)
Figure 3.9  Spread location plot for the residuals from the first order fit to the log transformed data. (book 3.18)

Figure 3.10 Normal Q-Q plot of the residuals from the linear fit to the log transformed data. (book 3.20)
 

Residual distribution

The least squares approach is based on an assumption that the residuals are normally distributed. Figure 3.10 shows that this is appropriate for the log transformed case.

 


Explanation of variation

Figure 3.11 R-F spread plot for the linear fit to the log transformed data. (book 3.19)

The residual fit spread plot (rfplot.m) consists of quantiles of the fitted values minus their mean and of the residuals. Figure 3.11 indicates that this fit accounts for most of the data variation.
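
The comparison the plot makes visually can be sketched numerically. This Python fragment uses simulated data with a strong linear signal and small noise (not the ganglion data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)  # signal plus small noise

fitted = np.polyval(np.polyfit(x, y, 1), x)
resid = y - fitted

# Spread of the fitted values about their mean versus spread of the
# residuals: when the first dwarfs the second, the fit explains most
# of the variation in the data
fit_spread = np.std(fitted - fitted.mean())
res_spread = np.std(resid)
```

In an r-f spread plot these two spreads appear as the widths of the two quantile traces.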

For the ganglion data, the linear fit to the log transformed data is the best of the candidate fits. It is consistent with the data. The residuals do not exhibit a trend or a spread with the factor. The residuals are approximately normally distributed. The fit explains most of the data variation.


Robust fitting  (bisquare.m)

Experimental data often contain values which look like outliers. How should we deal with these values without being arbitrary? The bisquare method provides a robust approach for these cases. This iterative method applies weights to the data points based on the sizes of their residuals relative to a robust measure of spread.

Figure 3.12  Dating data with simple unweighted and robust bisquare linear fits. (book 3.25)

Figure 3.12 contains the dating data. A simple unweighted linear regression is heavily influenced by the points at thorium ages of 17 and 27. The robust regression using bisquare is more consistent with the bulk of the data.

Bisquare can be used with both parametric and nonparametric fitting methods.

The bisquare method is usually considered a means to suppress the influence of outliers. Examination of the bisquare weights can also be used to identify outliers.


Details of bisquare  (bisquare.m)
Bisquare fitting is an iterative least squares method which uses bisquare weights. Start with a standard least squares fit. Calculate the residual r for each data point. As a measure of spread, s, calculate the median absolute deviation of the residuals.

The bisquare weight function, B, is given by

    B(u) = ( 1 - u^2 )^2    for |u| < 1
    B(u) = 0                otherwise

The robustness weight, w, for each data point is given by

    w = B( r / (6 s) )

Use these weights to calculate a new weighted least squares fit with new residuals, and repeat until the fit converges.
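
The iteration can be sketched in Python (an illustrative version, not the toolbox's bisquare.m; the function name is made up):

```python
import numpy as np

def bisquare_line(x, y, iters=10):
    """Straight-line fit by iteratively reweighted least squares with
    bisquare robustness weights."""
    w = np.ones_like(y, dtype=float)
    for _ in range(iters):
        coeffs = np.polyfit(x, y, 1, w=np.sqrt(w))
        r = y - np.polyval(coeffs, x)      # residuals
        s = np.median(np.abs(r))           # robust spread measure
        if s == 0.0:                       # essentially perfect fit: stop
            break
        u = r / (6.0 * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return coeffs
```

Points whose residuals exceed six times the robust spread receive weight zero, which is how a gross outlier loses its influence on the fit.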

 

Discrete values of a factor  (jitter.m)

In many experiments the values are integers or are rounded to a small set of discrete steps. This can make a simple scatterplot unsatisfactory. Figure 3.13 is a scatterplot of the fly data. There are 823 data points, but only 102 are visible. Figure 3.14 displays the same data set with both variables jittered.

Figure 3.13 Scatterplot of the fly data.  (book 3.45)
Figure  3.14  Jittered plot of the fly data.  (book 3.46)
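
Jittering itself is simple: add small uniform noise to the discrete values so that coincident points separate. A Python sketch (not the toolbox's jitter.m; the repeated-temperature data are made up):

```python
import numpy as np

def jitter(values, amount, seed=0):
    """Add small uniform noise so overplotted discrete values separate."""
    rng = np.random.default_rng(seed)
    return values + rng.uniform(-amount, amount, size=len(values))

# Made-up rounded data: 100 observations at only 4 distinct temperatures
temps = np.repeat([15.0, 20.0, 25.0, 30.0], 25)
jittered = jitter(temps, 0.3)
```

The jitter amount should stay well below the step size of the rounding so the groups remain visually distinct.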

The question is whether the facet number has a linear dependence on temperature. An earlier analysis using analysis of variance concluded that there is a departure from linearity. Figure 3.15 provides a picture to go with this conclusion. However, this use of a simple linear regression is inappropriate given the experimenters' report of uncertainty in the reported temperatures.

It is all too common for transcription errors to occur in real world experiments. If we swap the 23 and 25 degree data, the result is quite linear (Figure 3.16).

Figure 3.15 Mean values for the facet number at each temperature and a linear fit based on all data.  (book 3.49)
Figure 3.16 The same data as Figure 3.15 with the 23 and 25 degree values switched.

 


Time series

Figure 3.17 Time series plot of the melanoma data.  (book 3.64)

A special case of bivariate data is the time series. In the ideal time series, the values correspond to uniform time steps with no gaps. Time series often contain variation components at more than one time scale. The incidence of melanoma (a skin cancer) in Figure 3.17 appears to oscillate irregularly about a long term increasing trend.

Figure  3.18  Loess fit to the residuals from the trend fit to the melanoma data.   (book 3.67)

An approximately linear loess curve was fit to these data to remove the long term trend. The residuals are plotted in Figure 3.18 together with a less smooth loess curve that shows the remaining oscillations.

Is this oscillatory component related to solar activity? The loess curve from Figure 3.18 is plotted in Figure 3.19 together with the number of sunspots observed during the same period. It is easier to see the extent of the agreement between these two time series if we fit a loess curve using the same parameters to the sunspot data and shift it to align the cycles. The result is plotted in Figure 3.20.

Figure 3.19  Oscillatory component of the melanoma data compared with sunspot numbers. (book 3.71)
Figure 3.20  Oscillatory component of the melanoma data compared with smoothed and shifted sunspot numbers.


Seasonal components of time series

Some time series contain a seasonal component -- one which is quite regular on a certain time scale. The carbon dioxide data in Figure 3.21 contain a clear annual component as well as a long term trend.

Banking  (bank45.m)

Figure 3.21  Time series of the CO2 data plotted with aspect ratio set to 1. (book 3.73)

In Figure 3.21 the data have been plotted at an aspect ratio chosen so that the long term trend is displayed at an angle of approximately 45 degrees. This makes it easy to see that the long term trend is convex upward -- the rate of increase is increasing. Adjusting the aspect ratio of a graph to control the orientations of its line segments is called banking. It enhances the visual perception of line segment orientations. The aspect ratio in Figure 3.21 enhances the long term trend, but it does not reveal much about the annual cycles.

Figure 3.22 Time series of the CO2 data with local segments banked to 45 degrees.   (book 3.72)

By banking the local curve segments to 45 degrees as in Figure 3.22, we can see that the annual cycles are not symmetric. Many of the figures in this tutorial use banking to improve the visualization.
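
A simplified version of banking can be computed directly: choose the aspect ratio so that the median absolute segment slope is drawn at 45 degrees. This Python sketch is illustrative, not the toolbox's bank45.m, which may use a different banking criterion:

```python
import numpy as np

def banking_aspect(x, y):
    """Aspect ratio (drawn y-unit length per x-unit length) that banks
    the median absolute segment slope to 45 degrees."""
    slopes = np.diff(y) / np.diff(x)
    m = np.median(np.abs(slopes))
    return 1.0 / m   # e.g. in matplotlib: ax.set_aspect(1.0 / m)

x = np.linspace(0.0, 10.0, 101)
y = 4.0 * x                      # every segment has slope 4
aspect = banking_aspect(x, y)    # 0.25 banks those segments to 45 degrees
```

With this aspect ratio a data slope equal to the median is rendered at exactly 45 degrees; steeper and shallower segments fall on either side of that orientation.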

Seasonal loess  (seasonaloess.m)

Figure 3.23  Cycle plot of the CO2 data.  (book 3.75)

Seasonal loess can be used to extract a periodic component from a time series for detailed examination. In Figure 3.23 the seasonal component is displayed in a cycle plot -- the component values are grouped by month. This display reveals that the amplitude of the annual oscillation is increasing over time.
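
A simplified stand-in for the seasonal extraction can be sketched with plain monthly grouping (illustrative Python, not the toolbox's seasonaloess.m; the series is simulated, not the CO2 record):

```python
import numpy as np

# Simulated monthly series: linear trend plus a fixed annual cycle
months = np.arange(120)                         # 10 years, monthly
series = 0.1 * months + 3.0 * np.sin(2.0 * np.pi * months / 12.0)

# Remove the long term trend, then group the detrended values by
# calendar month -- the same grouping a cycle plot displays
detrended = series - np.polyval(np.polyfit(months, series, 1), months)
monthly_means = np.array(
    [detrended[months % 12 == m].mean() for m in range(12)]
)
```

Each group in monthly_means is one subseries of the cycle plot; plotting the individual values within each group over time is what reveals a changing amplitude.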

Figure 3.24  Residuals from the seasonal fit to the CO2 data. (book 3.76)

The data with the seasonal component removed are plotted in Figure 3.24, which shows the increasing trend more clearly.
