Data visualization toolbox

2 Univariate Data

 

Univariate data consist of samples or measurements of a single quantitative variable. A fundamental task with this type of data is to characterize its distribution. Other important tasks are comparing the distributions of samples from two or more populations and comparing the data distributions to standard distributions, especially the normal distribution. Transformations of the data are sometimes used to achieve a more desirable distribution.

Data sets

This chapter includes examples of analyzing two data sets:

Singer
Heights of singers in the New York Choral Society. The goal of the analysis is to determine whether voice part is related to height.
Fusion
Times to fuse a random dot stereogram for two groups of subjects. The VV group received visual information before the experiment; the NV group did not. The goal of the analysis is to determine whether the visual information led to a shorter fusion time. The experimenters reported that it did not.

 

Characterizing a distribution

Statistical parameters

Location

Mean
The mean is an excellent parameter for describing the location of a symmetric distribution. It can be misleading for a distribution with significant skew.
Median
The median is a robust measure of location. It is less affected by skew and outliers than the mean.
Trimmed mean  (trimmedmean.m)
The mean can be salvaged by stabilization. The trimmed mean of a distribution is calculated as the mean after a fraction of the largest and the smallest values have been removed. The most common fraction is 0.1.

Spread

Standard deviation
The standard deviation is the optimal measure of spread for a normal distribution. Although often used with other distributions, it can be uninformative in the presence of significant skew, outliers, or heavy tails.
Median absolute deviation  (mad.m)
The median absolute deviation is a more robust measure of spread than the standard deviation. It is the median of the absolute deviations of the data values from their median.
Trimmed standard deviation  (trimmedstd.m)
The standard deviation can also be salvaged by stabilization. The trimmed standard deviation of a distribution is calculated as the standard deviation after a fraction of the largest and the smallest values have been removed. The most common fraction is 0.1.
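
The robust location and spread measures above can be sketched in a few lines of Python. This is a minimal illustration, not the toolbox's trimmedmean.m, mad.m, or trimmedstd.m: the function names are hypothetical, the sample values are made up, and the sketch assumes the trimming fraction is removed from each tail separately.

```python
import statistics


def trim(values, fraction=0.1):
    """Drop `fraction` of the points from each end of the sorted data.

    Trimming each tail separately is an assumption of this sketch.
    """
    ordered = sorted(values)
    k = int(fraction * len(ordered))  # points removed per tail
    return ordered[k:len(ordered) - k]


def trimmed_mean(values, fraction=0.1):
    """Mean of the data after trimming both tails."""
    return statistics.mean(trim(values, fraction))


def trimmed_std(values, fraction=0.1):
    """Sample standard deviation of the data after trimming both tails."""
    return statistics.stdev(trim(values, fraction))


def mad(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


# Made-up heights in inches with one outlier (not the Singer data).
sample = [64, 66, 67, 68, 68, 69, 70, 70, 71, 90]
print(statistics.mean(sample), trimmed_mean(sample))
print(statistics.stdev(sample), trimmed_std(sample), mad(sample))
```

With the single outlier at 90, the ordinary mean and standard deviation move noticeably, while the trimmed versions and the median absolute deviation stay close to the bulk of the data.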
 

Histograms

Although it is common to view univariate data using histograms, these displays can be uninformative and ugly when the number of data values is small (Figure 2.1). The appearance of the histogram can be improved by kernel smoothing, at the cost of fiddling with the kernel and bin widths (Figure 2.2).

Figure 2.1. Histogram of the Bass 2 heights.
Figure 2.2. Smoothed histogram of the Bass 2 heights.


The data visualization approach described here uses displays based on quantiles instead of histograms.

 

Details of quantile  (quantile.m)
Consider a fraction f between zero and one. The quantile of data x corresponding to f, q_f(x), is a value on the measurement scale of x such that approximately a fraction f of the x data are less than or equal to q_f(x). A prudent definition of the fraction f for a data set of n samples, sorted from smallest to largest and indexed by i, is

$$f_i = \frac{i - 0.5}{n}, \qquad i = 1, \dots, n$$

This definition avoids fractions of 0 and 1, which are problematic for comparison with the normal distribution.
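
A minimal Python sketch of the quantile calculation follows. It assumes the common convention f_i = (i - 0.5)/n for the i-th smallest of n values and linear interpolation between those points. The function names are hypothetical and this is not a transcription of quantile.m.

```python
def f_values(n):
    """Fractions f_i = (i - 0.5) / n for i = 1..n (assumed convention)."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def quantile(values, f):
    """Quantile q_f by linear interpolation between the sorted data.

    Fractions below f_1 or above f_n return the smallest or largest value.
    """
    ordered = sorted(values)
    n = len(ordered)
    pos = f * n - 0.5  # 0-based position: f_i maps exactly to index i - 1
    if pos <= 0:
        return ordered[0]
    if pos >= n - 1:
        return ordered[-1]
    lo = int(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[lo + 1] - ordered[lo])


# Made-up heights in inches (not the Singer data).
data = [70, 64, 68, 66]
print(f_values(len(data)))   # fractions strictly between 0 and 1
print(quantile(data, 0.5))   # halfway between the two middle values
```

Plotting the sorted data against `f_values(n)` gives the points of a quantile plot.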

 

Figure 2.3. Quantile plot of the Tenor 1 data. (book 2.2)

Quantile Plots  (quantileplot.m)

A quantile plot, such as Figure 2.3, is a plot of the data quantiles against f, with linear interpolation between points. This presentation is equivalent to the usual display of the empirical cumulative distribution with its axes exchanged.

The quantile plot provides an ordered view of the data, showing its range, step size and clustering.

 


Figure 2.4. Quantile plot for two distributions.

Comparing distributions

Quantiles are very useful for comparing distributions. It is possible to include the quantiles of additional distributions in a quantile plot as in Figure 2.4.

This plot is one of many ways to display a shift between distributions.

 


Figure 2.5. Q-Q plot of Bass 2 height vs. Tenor 1 height. (book 2.3)

Q-Q Plots  (qqplot.m)

A quantile-quantile plot displays the quantiles of one distribution against the quantiles of another. It readily demonstrates the presence of a shift between two distributions.

The diagonal line in Figure 2.5 indicates equality between the axes. If the distributions of Bass 2 and Tenor 1 heights were essentially the same, the quantile points would lie near this line. In Figure 2.5 almost all of the points lie above the diagonal, along a line roughly parallel to it, indicating that the Bass 2 heights are generally greater than the Tenor 1 heights.
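
The pairing of quantiles behind a Q-Q plot can be sketched as below. When the samples differ in size, one common approach, assumed here, is to evaluate both sets of quantiles at the fractions of the smaller sample. The function names and the data are illustrative and do not come from qqplot.m or the Singer data.

```python
def quantile(values, f):
    """Quantile by linear interpolation, assuming f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    pos = f * n - 0.5
    if pos <= 0:
        return ordered[0]
    if pos >= n - 1:
        return ordered[-1]
    lo = int(pos)
    return ordered[lo] + (pos - lo) * (ordered[lo + 1] - ordered[lo])


def qq_pairs(x, y):
    """Return (x quantile, y quantile) pairs at the smaller sample's fractions."""
    n = min(len(x), len(y))
    fractions = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(quantile(x, f), quantile(y, f)) for f in fractions]


# Made-up heights in inches for two groups of unequal size.
tenor = [64, 66, 68, 70]
bass = [66, 68, 69, 71, 72, 74]
pairs = qq_pairs(tenor, bass)
for t, b in pairs:
    print(t, b, "above diagonal" if b > t else "on or below diagonal")
```

Plotting the second element of each pair against the first, together with the line y = x, gives the Q-Q display.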

 


Figure 2.6. Tukey mean difference plot of Bass 2 and Tenor 1 heights. (book 2.4)

Tukey Mean Difference Plots  (mdplot.m)

Estimating the amount of shift between two distributions from a Q-Q plot can be awkward, and some readers find Q-Q plots hard to interpret. An alternative display is the Tukey mean difference plot, which presents the differences of the two sets of quantiles against their means, as in Figure 2.6.

This figure makes it easy to see that the Bass 2 heights are typically about 2.5 inches greater than the Tenor 1 heights.
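
The mean difference calculation itself is small. A minimal sketch, assuming the paired quantiles have already been computed as for a Q-Q plot; the pair values below are made up for illustration and the function name is hypothetical, not taken from mdplot.m.

```python
import statistics


def mean_difference(pairs):
    """Map Q-Q pairs (x_q, y_q) to (mean, difference) points."""
    return [((xq + yq) / 2, yq - xq) for xq, yq in pairs]


# Made-up paired quantiles in inches: (first group, second group).
pairs = [(64.0, 66.5), (66.0, 68.75), (68.0, 71.25), (70.0, 73.5)]
md = mean_difference(pairs)
for mean, diff in md:
    print(mean, diff)

# A typical shift can be read off as the median of the differences.
typical_shift = statistics.median(diff for _, diff in md)
print(typical_shift)
```

Plotting the differences against the means turns the diagonal of the Q-Q plot into a horizontal line at zero, so the shift is read directly from the vertical axis.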

 


Box Plots

Figure 2.7. Boxplots of the singer distributions. (book 2.8)

Multiple distributions can be compared using many pairwise Q-Q plots, but box plots (Figure 2.7) are visually simpler.

A box plot indicates the median, the upper and lower quartiles, the upper and lower adjacent values, and any outside values. The median and quartiles are robust statistics which characterize the location and central spread of the distribution. The other box plot parameters are based on the fences. The upper fence is defined as the upper quartile plus 1.5 times the interquartile range. The lower fence is the same distance below the lower quartile. The adjacent values are the most extreme data points within the fences. The outside values are the data points beyond the fences.

In the box plot, the dot indicates the median. The right and left ends of the box are the upper and lower quartiles. The vertical lines beyond the box indicate the adjacent values. Any outside values are plotted as open circles.

In this figure there is a general shift of height distribution with voice part. The width of the distribution does not change consistently with voice part.
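
The box plot statistics defined above can be computed as in the following sketch. Quartile conventions differ between packages. The use of `statistics.quantiles` with `method="inclusive"` is an assumption of this example and may not match the toolbox. The data are made up.

```python
import statistics


def box_stats(values):
    """Median, quartiles, fences, adjacent values, and outside values."""
    ordered = sorted(values)
    # One of several quartile conventions (an assumption of this sketch).
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)  # 1.5 times the interquartile range
    lower_fence, upper_fence = q1 - step, q3 + step
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outside = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "quartiles": (q1, q3),
        "fences": (lower_fence, upper_fence),
        "adjacent": (inside[0], inside[-1]),  # most extreme points within fences
        "outside": outside,
    }


# Made-up heights in inches with one low and one high outside value.
heights = [59, 64, 65, 66, 66, 67, 68, 69, 70, 76]
stats = box_stats(heights)
for name, value in stats.items():
    print(name, value)
```

The whiskers are drawn to the adjacent values rather than to the fences, so the fences themselves never appear in the display.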

Normal Q-Q Plots  (normalqqplot.m)

Figure 2.8. Normal Q-Q plot of the Tenor 1 data.

The data quantiles can also be plotted against the quantiles of the normal distribution to indicate whether a normal approximation is appropriate (Figure 2.8). A straight line which passes through the upper and lower quartiles of both distributions is included to help reveal trends.

The pattern in this figure is nearly linear, so the distribution is well approximated by the normal. Under this approximation, the value of the line where the normal quantile is zero estimates the mean, and the slope of the line estimates the standard deviation.
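
A minimal sketch of the normal Q-Q calculation, assuming the fractions f_i = (i - 0.5)/n and the "inclusive" quartile convention of Python's `statistics` module. Both are assumptions of this example rather than details confirmed for normalqqplot.m, and the data are made up.

```python
import statistics
from statistics import NormalDist


def normal_qq(values):
    """Pair each sorted data value with a standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, ordered))


def quartile_line(values):
    """Slope and intercept of the line through the quartiles of both axes."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    z1 = NormalDist().inv_cdf(0.25)
    z3 = NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)   # rough estimate of the standard deviation
    intercept = q1 - slope * z1     # value at z = 0, rough estimate of the mean
    return slope, intercept


# Made-up heights in inches (not the Singer data).
heights = [59, 64, 65, 66, 66, 67, 68, 69, 70, 76]
points = normal_qq(heights)
slope, intercept = quartile_line(heights)
print(slope, intercept)
```

Because the line is anchored at the quartiles, the two outlying values in this sample have little effect on the estimated slope and intercept.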

 

 

 

 

 

Figure 2.9. Normal Q-Q plot of the singer data. (book 2.11)

Normal Q-Q plots of all of the singer data by voice part (Figure 2.9) show generally linear patterns. Many of the deviations are probably due to the heights being recorded in one-inch steps.

 

 

 


Figure 2.10. Dot plot of the sample means of the singer data. (book 2.12)

Dotplots  (dotplot.m)

The values of categorical variables are often compared using bar charts. An alternative method is the dot plot (Figure 2.10), which may make it easier to compare non-adjacent categories.


Figure 2.11. Residual-fit spread plot of the singer data. (book 2.17)

R-F Spread Plot  (rfplot.m)

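The residual-fit spread plot can be used to display the amount of variation in the data which is explained by the fit. It shows quantile plots of the fitted values minus their overall mean and of the residuals, side by side on a common scale.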

In this figure the spreads of the fit and residual distributions are comparable, so the fit is significant but does not explain all of the variation.
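
The two sets of values compared in an R-F spread plot can be computed as sketched below. The example assumes the fit is the mean of each group, so the residuals are the data minus their group mean. The function name and the data are hypothetical and are not taken from rfplot.m.

```python
import statistics


def rf_components(groups):
    """Return sorted (fit minus overall mean) values and sorted residuals.

    `groups` maps a group label to its list of values; the fit for each
    value is assumed to be the mean of its group.
    """
    all_values = [v for values in groups.values() for v in values]
    overall_mean = statistics.mean(all_values)
    fit_minus_mean = []
    residuals = []
    for values in groups.values():
        group_mean = statistics.mean(values)
        for v in values:
            fit_minus_mean.append(group_mean - overall_mean)
            residuals.append(v - group_mean)
    return sorted(fit_minus_mean), sorted(residuals)


# Made-up heights in inches for two groups.
groups = {"A": [64, 66, 68], "B": [69, 71, 73]}
fit_part, resid_part = rf_components(groups)
print(fit_part)    # spread explained by the group means
print(resid_part)  # spread left in the residuals
```

Each list would be drawn as its own quantile plot on a shared vertical scale, so the range of the fitted part can be compared directly with the range of the residuals.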

 


Transforming data

Transformations can often aid in the evaluation and characterization of data. In some cases transformed data may be well approximated by a classical distribution, which can aid in analysis. Transformations can be effective when the range of the data is sufficiently large.

Log transforms

Figure 2.12. Quantile plots of the fusion data. (book 2.19)

The Singer data set had a modest range, little skew in the distributions and reasonable evidence for additive shifts. The fusion data (Figure 2.12) do not share these characteristics. These data are skewed toward large values, particularly in the NV case.

 

 

 

 

Figure 2.13. Normal Q-Q plots of the fusion data. (book 2.22)

This skewness is also shown as an upward convex pattern in normal Q-Q plots (Figure 2.13).


Figure 2.14. Boxplot of the fusion data. (book 2.23)

This skewness is also evident in a box plot: the median point for the VV data is not centered in the quartile box (Figure 2.14). Both the median and the spread of the NV data are greater than the median and spread of the VV data. This behavior of spread increasing with location is called monotone increasing spread.


Figure 2.15. Normal Q-Q plots of the log transformed fusion data. (book 2.24)

An approach which works well for data skewed toward large values or with monotone increasing spread is to analyze logs of the data. For the fusion data, log transformation makes the distributions more nearly normal (Figure 2.15).


Details of log transform
It is often convenient to use base 2 or 10 logarithms, depending on the data range. These bases, although not "natural", produce values with integer steps corresponding to factors of 2 or 10, which are easy to appreciate.
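
The effect of a base 2 log transform on right-skewed values can be seen in a small example. The numbers below are made up so that each is twice the previous one. They are not the fusion data, and the helper name is hypothetical.

```python
import math
import statistics

# Made-up, right-skewed times in seconds (each value doubles the last).
times = [1.0, 2.0, 4.0, 8.0, 16.0]
log2_times = [math.log2(t) for t in times]


def tail_lengths(values):
    """Distances from the median to the smallest and largest values."""
    med = statistics.median(values)
    return med - min(values), max(values) - med


print(log2_times)                # each factor of 2 becomes a step of 1
print(tail_lengths(times))       # raw scale: upper tail much longer
print(tail_lengths(log2_times))  # log2 scale: tails equal
```

On the log scale a multiplicative change becomes an additive shift, which is why a difference of one unit in base 2 logs is easy to read as a doubling.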

 

S-L Plot  (slplot.m)

Figure 2.16. S-L plot of the fusion times. (book 2.25)

A sensitive look at monotone spread is provided by the spread-location plot, which shows whether the spread or variability of the data changes with location or typical value. In Figure 2.16 the fusion time medians for the two groups show location. Spread is shown by the square roots of the absolute residuals, where each residual is a data value minus its group median. The dashed line connects the group medians of these square root absolute residuals. In the figure this line slopes upward, indicating monotone increasing spread.

If all of the points in this figure had been plotted at their two median values, many of them would be obscured. To avoid this, the medians were jittered by adding small amounts of uniform random noise. (jitter.m)
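
The quantities plotted in a spread-location plot, including the jitter, can be sketched as follows. The function name, the jitter width, the fixed random seed, and the data are all assumptions of this example and are not taken from slplot.m or jitter.m.

```python
import math
import random
import statistics


def sl_points(groups, jitter=0.1, seed=0):
    """Return jittered (location, spread) points and per-group summaries.

    Location is the group median. Spread is the square root of the
    absolute residual of each value from its group median.
    """
    rng = random.Random(seed)  # fixed seed so the jitter is repeatable
    points = []
    summary = {}
    for label, values in groups.items():
        med = statistics.median(values)
        spreads = [math.sqrt(abs(v - med)) for v in values]
        for s in spreads:
            # Uniform noise keeps points at the same median from overlapping.
            points.append((med + rng.uniform(-jitter, jitter), s))
        summary[label] = (med, statistics.median(spreads))
    return points, summary


# Made-up data for two groups: the second has larger location and spread.
groups = {"low": [1, 2, 3, 4, 5], "high": [2, 6, 10, 14, 18]}
points, summary = sl_points(groups)
print(summary)
```

Connecting the summary points, here (3, 1.0) and (10, 2.0), gives the line whose upward slope indicates monotone increasing spread.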

 


Power Transformation  (powerxform.m)

Figure 2.17. Normal Q-Q plots of selected power transformations of the VV fusion times. (book 2.33)

The approach of data transformation also includes power transforms. These provide more flexible adjustment to the data distribution than log transformation alone. In Figure 2.17 the VV fusion data are transformed using a range of powers.

For these data the log transformation (power=0) appears to be the best.
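
A family of power transforms with the log at power 0 can be sketched as below. The form used here, x**p for nonzero p and log(x) for p = 0 with an optional offset, is an assumption of this example and may differ from powerxform.m. The function name and the data are hypothetical.

```python
import math


def power_transform(x, p, offset=0.0):
    """Assumed form: (x + offset)**p for p != 0, log(x + offset) for p == 0."""
    shifted = x + offset
    if shifted <= 0:
        raise ValueError("x + offset must be positive")
    if p == 0:
        return math.log(shifted)
    return shifted ** p


# Made-up positive values, transformed with a range of powers.
values = [1.0, 4.0, 9.0, 16.0]
for p in (-1, -0.5, 0, 0.5, 1):
    print(p, [round(power_transform(v, p), 3) for v in values])
```

In this simple form a negative power reverses the ordering of the data, which should be kept in mind when comparing Q-Q plots across powers.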


Details of power transform  (powerxform.m)
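The simplest version of the power transform of variable x for parameter p is given by

$$x^{(p)} = \begin{cases} x^{p}, & p \neq 0 \\ \log x, & p = 0 \end{cases}$$

This definition is restricted to positive values of x. If x has nonpositive values, it is possible to add an offset c so that x + c is positive. A more general version is

$$x^{(p)} = \begin{cases} (x + c)^{p}, & p \neq 0 \\ \log (x + c), & p = 0 \end{cases}$$
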
The tools shown here (quantiles, transforms, and special-purpose plots) all offer opportunities to learn more from the data and to make better inferences about it.

 
