Data visualization toolbox
Univariate data consist of samples or measurements of a single quantitative variable. A fundamental task with this type of data is to characterize its distribution. Other important tasks are comparing the distributions of samples from two or more populations and comparing the data distributions to standard distributions, especially the normal distribution. Transformations of the data are sometimes used to achieve a more desirable distribution.
This chapter includes examples of analyzing two data sets: the singer height data and the fusion time data.
(trimmedmean.m)
(mad.m)
(trimmedstd.m)
Although it is common to view univariate data using histograms, these displays can be uninformative and ugly when the number of data values is small (Figure 2.1). The appearance of the histogram can be improved by kernel smoothing, at the cost of fiddling with the kernel and bin widths (Figure 2.2).
Figure 2.1. Histogram of the Bass 2 heights.
Figure 2.2. Smoothed histogram of the Bass 2 heights.
The data visualization approach described here uses displays based on quantiles instead of histograms.
Figure 2.3. Quantile plot of the Tenor 1 data. (book 2.2)
(quantileplot.m)
A quantile plot, such as Figure 2.3, is a plot of the data quantiles against f, the approximate fraction of the data at or below each quantile, with linear interpolation between points. This presentation is equivalent to the usual discrete cumulative distribution plot with its axes exchanged.
The quantile plot provides an ordered view of the data, showing its range, step size and clustering.
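The coordinates of a quantile plot can be sketched in a few lines. The example below is in Python rather than the toolbox's MATLAB, and it is not a port of quantileplot.m: the function name is hypothetical, the heights are made up, and the f-value rule f_i = (i - 0.5) / n is an assumed convention that the toolbox may not share.

```python
def quantile_plot_points(data):
    """Return (f, q) coordinate lists for a quantile plot.

    Assumption: the i-th smallest value is the quantile at
    f_i = (i - 0.5) / n. Other conventions exist.
    """
    q = sorted(data)                      # the data quantiles
    n = len(q)
    f = [(i - 0.5) / n for i in range(1, n + 1)]
    return f, q


# Hypothetical heights in inches, not the actual Tenor 1 data.
heights = [68, 64, 66, 72, 67, 68, 70, 66, 71, 68]
f, q = quantile_plot_points(heights)

for fi, qi in zip(f, q):
    print(f"f = {fi:.2f}  quantile = {qi}")
```

Plotting q against f and joining the points with straight lines gives the display described above. Repeated values show up as flat runs, which is how the step size and clustering of the data become visible.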
Figure 2.4. Quantile plot for two distributions.
Quantiles are very useful for comparing distributions. It is possible to include the quantiles of additional distributions in a quantile plot as in Figure 2.4.
This plot is one of many ways to display a shift between distributions.
Figure 2.5. Q-Q plot of Bass 2 height vs. Tenor 1 height. (book 2.3)
(qqplot.m)
A quantile-quantile plot displays the quantiles of one distribution against the quantiles of another. It readily demonstrates the presence of a shift between two distributions.
The diagonal line in Figure 2.5 indicates equality between the axes. If the distributions of Bass 2 and Tenor 1 heights were essentially the same, the quantile points would be near this line. In Figure 2.5 almost all of the points lie above the diagonal, along a line roughly parallel to it, indicating that the Bass 2 heights are generally greater than the Tenor 1 heights.
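When the two samples differ in size, the quantiles of the larger sample have to be interpolated before they can be paired. The Python sketch below shows one way to do this. It is not the code in qqplot.m; the function names, the f-value rule f_i = (i - 0.5) / n, and the sample values are all assumptions made for illustration.

```python
def quantile(sorted_data, f):
    """Linearly interpolated quantile at fraction f.

    Assumption: the i-th smallest value sits at f_i = (i - 0.5) / n.
    """
    n = len(sorted_data)
    pos = f * n - 0.5                     # zero-based fractional index
    pos = min(max(pos, 0.0), n - 1.0)     # clamp to the range of the data
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_data[lo] + frac * (sorted_data[hi] - sorted_data[lo])


def qq_points(x, y):
    """Pair the quantiles of x and y at the f-values of the smaller sample."""
    xs, ys = sorted(x), sorted(y)
    n = min(len(xs), len(ys))
    fs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(quantile(xs, f), quantile(ys, f)) for f in fs]


# Hypothetical heights in inches, not the actual singer data.
tenor1 = [64, 66, 67, 68, 70]
bass2 = [66, 68, 69, 70, 71, 72, 74]

for t, b in qq_points(tenor1, bass2):
    side = "above" if b > t else "on or below"
    print(f"tenor {t:.2f}  bass {b:.2f}  ({side} the diagonal)")
```

Points with the second coordinate larger than the first fall above the diagonal, which is the pattern described for Figure 2.5.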
Figure 2.6. Tukey mean difference plot of Bass 2 and Tenor 1 heights. (book 2.4)
(mdplot.m)
Estimating the amount of shift between two distributions from a Q-Q plot can be a bit awkward, and some readers find Q-Q plots hard to interpret. An alternative display is the Tukey mean difference plot, which presents the differences of the two sets of quantiles vs. their means, as in Figure 2.6.
This figure makes it easy to see that the Bass 2 heights are typically about 2.5 inches greater than the Tenor 1 heights.
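The step from Q-Q coordinates to mean difference coordinates is a simple change of variables, sketched here in Python. The function name and the matched quantile values are hypothetical and are not taken from mdplot.m or from the singer data.

```python
def mean_difference_points(qq_pairs):
    """Convert Q-Q pairs (x_q, y_q) into (mean, difference) pairs."""
    return [((xq + yq) / 2.0, yq - xq) for xq, yq in qq_pairs]


# Hypothetical matched quantiles (Tenor 1, Bass 2) in inches.
pairs = [(64.0, 66.0), (66.0, 69.0), (68.0, 70.5), (70.0, 72.5), (72.0, 75.0)]
md = mean_difference_points(pairs)

# The median of the differences is one robust summary of the shift.
differences = sorted(d for _, d in md)
typical_shift = differences[len(differences) // 2]  # odd count: middle value
print(md)
print("typical shift:", typical_shift)
```

Because the differences are plotted directly on the vertical axis, the size of the shift can be read off without judging distances from a diagonal line.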
Figure 2.7. Boxplots of the singer distributions. (book 2.8)
Multiple distributions can be compared using many pairwise Q-Q plots, but boxplots (Figure 2.7) are visually simpler.
A box plot indicates the median, upper and lower quartiles, upper and lower adjacent values, and outside values. The median and quartiles are robust statistics which characterize the location and central spread of the distribution. The other box plot parameters are based on the fences. The upper fence is defined as the upper quartile plus 1.5 times the interquartile range. The lower fence is the same distance below the lower quartile. The adjacent values are the most extreme data points within the fences. The outside values are the data points beyond the fences.
In the box plot, the dot indicates the median. The right and left ends of the box are the upper and lower quartiles. The vertical lines beyond the box indicate the adjacent values. Any outside values are plotted as open circles.
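The box plot quantities defined above can be computed as in the following Python sketch. The function name and sample are hypothetical. The quartiles come from `statistics.quantiles` with `method="inclusive"`, which is an assumed quartile rule and may differ slightly from the one the toolbox uses.

```python
import statistics


def boxplot_stats(data):
    """Median, quartiles, adjacent values and outside values of a sample."""
    xs = sorted(data)
    # Assumed quartile rule; other interpolation rules give slightly
    # different quartiles for small samples.
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outside = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "lower_adjacent": inside[0],    # most extreme point within the fences
        "upper_adjacent": inside[-1],
        "outside": outside,             # points beyond the fences
    }


# Hypothetical heights in inches, with one low and one high outlier.
sample = [58, 64, 65, 66, 66, 67, 68, 69, 70, 80]
stats = boxplot_stats(sample)
print(stats)
```

For this sample the quartiles are 65.25 and 68.75, so the fences fall at 60 and 74. The adjacent values are 64 and 70, and the points 58 and 80 are outside values.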
In this figure there is a general shift of height distribution with voice part. The width of the distribution does not change consistently with voice part.
(normalqqplot.m)
Figure 2.8. Normal Q-Q plot of the Tenor 1 data.
The data quantiles can also be plotted against the quantiles of the normal distribution to indicate whether a normal approximation is appropriate (Figure 2.8). A straight line which passes through the upper and lower quartiles of both distributions is included to help reveal trends.
The pattern in this figure is nearly linear, so the distribution is well approximated by the normal. For this approximation method, the mean is the value of the line where the normal quantile is zero (its intercept) and the standard deviation is the slope of the line.
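A rough Python sketch of the normal Q-Q coordinates and the quartile line follows. It is not the code in normalqqplot.m. The function names, the f-value rule f_i = (i - 0.5) / n, the inclusive quartile rule, and the sample heights are assumptions made for illustration.

```python
from statistics import NormalDist, quantiles


def normal_qq(data):
    """Return (z, x): standard normal quantiles and sorted data."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()             # mean 0, standard deviation 1
    z = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return z, xs


def quartile_line(data):
    """Slope and intercept of the line through the quartiles of both axes."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # assumed rule
    z1 = NormalDist().inv_cdf(0.25)
    z3 = NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)         # estimate of the standard deviation
    intercept = q1 - slope * z1           # line value at z = 0: the mean
    return slope, intercept


# Hypothetical heights in inches, not the actual Tenor 1 data.
heights = [64, 65, 66, 66, 67, 67, 68, 68, 69, 70]
z, x = normal_qq(heights)
slope, intercept = quartile_line(heights)
print(f"estimated mean {intercept:.2f}, estimated sd {slope:.2f}")
```

Because the line is fixed by the quartiles rather than by least squares, a few extreme values in the tails do not move these estimates.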
Figure 2.9. Normal Q-Q plot of the singer data. (book 2.11)
Normal Q-Q plots of all of the singer data by voice part (Figure 2.9) demonstrate generally linear patterns. Many of the deviations are probably due to the quantization of height in one-inch steps.
Figure 2.10. Dot plot of the sample means of the singer data. (book 2.12)
(dotplot.m)
The values of categorical variables are often compared using bar charts. An alternative method is the dot plot (Figure 2.10), which may make it easier to compare non-adjacent categories.
Figure 2.11. Residual-fit spread plot of the singer data. (book 2.17)
(rfplot.m)
The residual-fit spread plot can be used to display the amount of variation in the data which is explained by the fit.
In this figure the spreads of the fit and residual distributions are comparable, so the fit explains a substantial part of the variation but not all of it.
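The two distributions compared in a residual-fit spread plot can be sketched as follows. The sketch assumes that the fit is the set of group means, so the fitted values minus the overall mean are compared with the residuals about the group means. The function name and the data are hypothetical and are not taken from rfplot.m.

```python
from statistics import fmean


def rf_spread(groups):
    """Return sorted (fitted - overall mean) values and sorted residuals.

    groups maps a group name to its list of values. Assumption: the
    fitted value for every observation is the mean of its group.
    """
    all_values = [v for vals in groups.values() for v in vals]
    overall_mean = fmean(all_values)
    fitted_minus_mean = []
    residuals = []
    for vals in groups.values():
        group_mean = fmean(vals)
        for v in vals:
            fitted_minus_mean.append(group_mean - overall_mean)
            residuals.append(v - group_mean)
    return sorted(fitted_minus_mean), sorted(residuals)


# Hypothetical heights in inches for two voice parts.
groups = {"Tenor 1": [66, 68, 70], "Bass 2": [69, 71, 73]}
fit_part, resid_part = rf_spread(groups)
print("fit spread:     ", fit_part[-1] - fit_part[0])
print("residual spread:", resid_part[-1] - resid_part[0])
```

Each of the two sorted lists would be drawn as its own quantile plot on a common vertical scale, so that the spread of the fit can be compared by eye with the spread of the residuals.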
Transformations can often aid in the evaluation and characterization of data. In some cases transformed data may be well approximated by a classical distribution, which can aid in analysis. Transformations can be effective when the range of the data is sufficiently large.
Figure 2.12. Quantile plots of the fusion data. (book 2.19)
The singer data set had a modest range, little skew in the distributions and reasonable evidence for additive shifts. The fusion data (Figure 2.12) do not share these characteristics. These data are skewed toward large values, particularly in the NV case.
Figure 2.13. Normal Q-Q plots of the fusion data. (book 2.22)
This skewness is also shown as an upward convex pattern in normal Q-Q plots (Figure 2.13).
Figure 2.14. Boxplot of the fusion data. (book 2.23)
This skewness is also evident in a boxplot (Figure 2.14): the median point for the VV data is not centered in the quartile box. Both the median and the spread of the NV data are greater than the median and spread of the VV data. This behavior of spread increasing with location is called monotone increasing spread.
Figure 2.15. Normal Q-Q plots of the log transformed fusion data. (book 2.24)
An approach which works well for data skewed toward large values or with monotone increasing spread is to analyze logs of the data. For the fusion data, log transformation makes the distributions more nearly normal (Figure 2.15).
It is often convenient to use base 2 or 10 logarithms, depending on the data range. These bases, although not "natural", produce values with integer steps corresponding to factors of 2 or 10, which are easy to appreciate.
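A small Python illustration of the base 2 case follows. The values are made up so that each is a power of 2 and are not the fusion times.

```python
import math

# Hypothetical times in seconds, chosen as exact powers of 2.
times = [1.0, 2.0, 4.0, 8.0, 32.0]

# Each unit step on the log2 scale corresponds to a doubling of the time.
log2_times = [math.log2(t) for t in times]
print(log2_times)  # [0.0, 1.0, 2.0, 3.0, 5.0]

# The long upper tail (8 -> 32) shrinks from a gap of 24 to a gap of 2.
raw_gap = times[-1] - times[-2]
log_gap = log2_times[-1] - log2_times[-2]
print(raw_gap, log_gap)
```

The same idea applies with `math.log10` when the data span several factors of 10.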
(slplot.m)
Figure 2.16. S-L plot of the fusion times. (book 2.25)
A sensitive look at monotone spread is provided by the spread-location plot, which shows whether the spread or variability of the data changes with location or typical value. In Figure 2.16 the fusion time medians for the two groups show location. The square roots of the absolute values of the residuals of the data minus the median indicate spread. The dashed line connects the medians of the square root absolute residuals. In the figure this line slopes upward, indicating monotone increasing spread.
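The quantities plotted in a spread-location plot follow directly from this description and can be sketched in Python. The function name and the two small samples are hypothetical; this is not the code in slplot.m and the values are not the fusion times.

```python
import math
from statistics import median


def spread_location(groups):
    """For each group: (median, sqrt(|x - median|) values, their median)."""
    result = {}
    for name, vals in groups.items():
        location = median(vals)
        spread = [math.sqrt(abs(v - location)) for v in vals]
        result[name] = (location, spread, median(spread))
    return result


# Hypothetical samples: the second has both a larger median and more spread.
groups = {"VV": [1, 2, 3, 4, 5], "NV": [2, 4, 6, 10, 22]}
sl = spread_location(groups)

for name, (location, spread, typical_spread) in sl.items():
    print(f"{name}: location {location}, median spread {typical_spread:.2f}")
```

Here the median spread rises from 1 to 2 as the location rises from 3 to 6, so a line joining the two median spreads would slope upward, which is the signature of monotone increasing spread.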
If all of the points in this figure had been plotted at their two median values, many of them would be obscured. To avoid this, the medians were jittered by adding small amounts of uniform random noise. (jitter.m)
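Jittering of this kind can be sketched in Python as below. The function name mirrors jitter.m, but the signature, the `width` and `seed` parameters, and the noise width of 0.4 are all assumptions for illustration.

```python
import random


def jitter(values, width, seed=None):
    """Add uniform noise in [-width/2, width/2] to each value.

    A fixed seed makes the jittered positions reproducible.
    """
    rng = random.Random(seed)
    half = width / 2.0
    return [v + rng.uniform(-half, half) for v in values]


# Five points at each of two hypothetical group medians.
locations = [3.0] * 5 + [6.0] * 5
jittered = jitter(locations, width=0.4, seed=0)
print(jittered)
```

The width should be small compared with the distance between the group locations, so that the jittered points still clearly belong to their group.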
(powerxform.m)
Figure 2.17. Normal Q-Q plots of selected power transformations of the VV fusion times. (book 2.23)
The approach of data transformation also includes power transforms. These provide more flexible adjustment to the data distribution than log transformation alone. In Figure 2.17 the VV fusion data are transformed using a range of powers.
For these data the log transformation (power=0) appears to be the best.
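Applying a range of powers to a sample can be sketched in Python as follows. The sketch assumes the common convention in which a power of 0 stands for the log transform and every other power p means raising the data to p. The function name, the list of powers, and the data are hypothetical and are not taken from powerxform.m.

```python
import math


def power_transform(data, p):
    """Return x**p for p != 0 and log(x) for p == 0 (assumed convention)."""
    if any(x <= 0 for x in data):
        raise ValueError("this sketch requires strictly positive data")
    if p == 0:
        return [math.log(x) for x in data]
    return [x ** p for x in data]


# Hypothetical positive values, not the VV fusion times.
times = [1.0, 4.0, 9.0, 16.0]
powers = [-1, -0.5, 0, 0.5, 1]

transformed = {p: power_transform(times, p) for p in powers}
for p in powers:
    print(p, [round(v, 3) for v in transformed[p]])
```

Each transformed sample would then be shown in its own normal Q-Q panel, and the power whose panel looks most nearly linear is the one to prefer.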
The simplest version of the power transform of variable x for parameter p is given by

$$
x^{(p)} =
\begin{cases}
x^{p}, & p \neq 0 \\
\log x, & p = 0
\end{cases}
$$

This definition is restricted to positive values of x. If x has nonpositive values, it is possible to add an offset. A more general version is
The tools shown here, quantiles, transforms and special purpose plots, all offer opportunities to learn more from and make better inferences about the data.