Statistical differences between histograms

[HSTATDIF]

                   +----------------------------------+
                   | CALL HDIFF (ID1,ID2,PROB*,CHOPT) |
                   +----------------------------------+
                                  
Action: Statistical test of compatibility in shape between two histograms using the Kolmogorov test. The histograms are compared and the probability that they could come from the same parent distribution is calculated.

The comparison may be done between two 1-dimensional histograms or between two 2-dimensional histograms. For further details on the method, see [more info] below.

Input parameters:
ID1
Identifier of first histogram to be compared.
ID2
Identifier of second histogram to be compared.
CHOPT
A character string specifying the options desired.
'D'
Debug printout: produces a blank line and two lines of information at each call, giving the identifiers ID1 and ID2, the number of events in each histogram, the value of PROB, and the maximum Kolmogorov distance between the two histograms. For 2-dimensional histograms there are two Kolmogorov distances (see below). If option 'N' is specified, a third line of output gives the values of PROB for shape alone and for normalization alone.
'F1'
Histogram ID1 has no errors (it is a function).
'F2'
Histogram ID2 has no errors (it is a function).
'N'
Include a comparison of the relative normalization of the two histograms, in addition to comparing the shapes. The output parameter PROB is then a combined confidence level taking into account absolute contents.
'O'
Overflow, requests that overflow bins be taken into account (also valid for 2-dim).
'U'
Underflow, requests that underflow bins be taken into account (also valid for 2-dim).
For 2-dimensional histograms only
'L'
Left, include X-underflows in the calculation.
'R'
Right, include X-overflows in the calculation.
'B'
Bottom, include Y-underflows in the calculation.
'T'
Top, include Y-overflows in the calculation.
Output parameter:
PROB
The probability of compatibility between the two histograms.

Remark:

  1. Options 'O' and 'U' can also refer to 2-dimensional histograms, so that, for example, the string 'UT' means that underflows in X and in Y and overflows in Y should be included in the calculation.
  2. The histograms ID1 and ID2 must exist and already have been filled before the call to HDIFF. They must also have identical binning (lower and upper limits as well as number of bins).
  3. The probability PROB is returned as a number between zero and one. A value close to one indicates very similar histograms; a value near zero means that it is very unlikely that the two arose from the same parent distribution.
  4. By default (no options selected with CHOPT) the comparison is done only on the shape of the two histograms, without consideration of the difference in number of events, and ignoring all underflow and overflow bins.
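The shape comparison described above rests on the Kolmogorov distance between the two cumulative distributions: each histogram is normalized to unit area, the cumulative sums are formed bin by bin, and the distance is the maximum absolute difference between them. The following sketch shows the idea for two 1-dimensional histograms with identical binning (Python, not HBOOK code; the function name is ours):

```python
# Illustrative sketch (not HBOOK code): the Kolmogorov distance on
# which HDIFF's shape comparison is based, computed from two binned
# histograms with identical binning.

def kolmogorov_distance(h1, h2):
    """Maximum distance between the two normalized cumulative distributions."""
    n1, n2 = sum(h1), sum(h2)
    dmax, c1, c2 = 0.0, 0.0, 0.0
    for b1, b2 in zip(h1, h2):
        c1 += b1 / n1          # cumulative fraction of histogram 1
        c2 += b2 / n2          # cumulative fraction of histogram 2
        dmax = max(dmax, abs(c1 - c2))
    return dmax

# Identical shapes give distance 0; completely disjoint shapes give 1:
print(kolmogorov_distance([1, 2, 3], [2, 4, 6]))  # -> 0.0
print(kolmogorov_distance([5, 0, 0], [0, 0, 5]))  # -> 1.0
```

Note that only the shapes enter: the two histograms are normalized before comparison, which is why the default comparison ignores the difference in number of events.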

    Weights and Saturation

    [HWEIGSAT]

    Weighted 1-dimensional histograms

    It is possible to compare weighted with weighted histograms, and weighted with unweighted histograms, but only if HBOOK has been instructed to maintain the necessary information by appropriate calls (before filling) to HBARX. However it is not possible to take into account underflow or overflow bins if the events are weighted.
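How HDIFF itself uses the error information maintained by HBARX is internal to HBOOK. Purely as an illustration, a common convention for the "effective" number of entries of a weighted histogram, which such error information makes it possible to estimate, is sketched below (Python, not HBOOK code; the function name is ours):

```python
# Illustrative sketch (not HBOOK code): a common convention for the
# "effective" number of entries of a weighted sample,
#     n_eff = (sum of weights)^2 / (sum of squared weights),
# which reduces to the plain event count when all weights are 1.

def effective_entries(weights):
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    return sw * sw / sw2

print(effective_entries([1.0, 1.0, 1.0, 1.0]))  # unweighted: 4 events -> 4.0
print(effective_entries([4.0]))                 # one heavy event -> 1.0
```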

    Saturated 1-dimensional histograms

    If there is saturation (more than the maximum allowed contents in one or more bins), the probability PROB is calculated as if the bin contents were exactly at their maximum value, ignoring the saturation. This will usually result in a higher value of PROB than would be the case if memory allowed the full contents to be stored, but not always. The results of HDIFF are therefore not accurate when there is saturation, and it is the user's responsibility to avoid this condition.

    2-dimensional histograms

    Routine HDIFF cannot work if the events are weighted, since, in the current version of HBOOK, the necessary information is not maintained. HDIFF will also refuse to compare 2-dimensional histograms if there is saturation, since it does not have enough information in this case.

    Statistical Considerations

    [HSTATCON]

    The Kolmogorov Test

    The calculations in routine HDIFF are based on the Kolmogorov Test (see, e.g., [bib-EADIE], pages 269-270). It is usually superior to the better-known Chisquare Test for the following reasons:

    1. It does not require a minimum number of events per bin, and in fact it is intended for unbinned data (this is discussed below).
    2. It takes account not only of the differences between corresponding bins, but also of their sign; in particular, it is sensitive to a sequence of consecutive deviations of the same sign.

      In discussing the Kolmogorov test, we must distinguish between the two most important properties of any test: its power and the calculation of its confidence level.

      The Power

      The job of a statistical test is to distinguish between a null hypothesis (in this case: that the two histograms are compatible) and the alternative hypothesis (in this case: that the two are not compatible). The power of a test is defined as the probability of rejecting the null hypothesis when the alternative is true. In our case, the alternative is not well-defined (it is simply the ensemble of all hypotheses except the null) so it is not possible to tell whether one test is more powerful than another in general, but only with respect to certain particular deviations from the null hypothesis. Based on considerations such as those given above, as well as considerable computational experience, it is generally believed that tests like the Kolmogorov or Smirnov-Cramer-Von-Mises (which is similar but more complicated to calculate) are probably the most powerful for the kinds of phenomena generally of interest to high-energy physicists. This is especially true for two-dimensional data where the Chisquare Test is of little practical use since it requires either enormous amounts of data or very big bins.

      The Confidence Level for 1-dimensional data

      Using the terms introduced above, the confidence level is just the probability of rejecting the null hypothesis when it is in fact true. That is, if you accept the two histograms as compatible whenever the value of PROB is greater than 0.05, then truly compatible histograms should fail the test exactly 5% of the time. The value of PROB returned by HDIFF is calculated such that it will be uniformly distributed between zero and one for compatible histograms, provided the data are not binned (or the number of bins is very large compared with the number of events). Users who have access to unbinned data and wish exact confidence levels should therefore not put their data into histograms, but should save them in ordinary Fortran arrays and call the routine TKOLMO, which is being introduced into the Program Library.

      On the other hand, since HBOOK is a convenient way of collecting data and saving space, the routine HDIFF has been provided, and we believe it is the best test for comparison even on binned data. However, the values of PROB for binned data will be shifted slightly higher than expected, depending on the effects of the binning. For example, when comparing two uniform distributions of 500 events in 100 bins, the values of PROB, instead of being exactly uniformly distributed between zero and one, have a mean value of about 0.56.

      Since we are physicists, we can apply a useful rule: as long as the bin width is small compared with any significant physical effect (for example the experimental resolution), the binning cannot have an important effect. Therefore, we believe that for all practical purposes the probability value PROB is calculated correctly, provided the user is aware that:

      1. The value of PROB should not be expected to have exactly the correct distribution for binned data.
      2. The user is responsible for seeing to it that the bin widths are small compared with any physical phenomena of interest.
      3. The effect of binning (if any) is always to make the value of PROB slightly too big. That is, setting an acceptance criterion of PROB > 0.05 will ensure that at most 5% of truly compatible histograms are rejected, and usually somewhat less.
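To make the discussion above concrete: the confidence level for a given Kolmogorov distance is conventionally obtained from the asymptotic Kolmogorov distribution. The sketch below (Python, not HBOOK code; HDIFF's exact implementation may differ in detail) uses the standard two-sample form, in which the distance is scaled by the effective sample size n1*n2/(n1+n2):

```python
import math

# Illustrative sketch (not HBOOK code): the standard asymptotic formula
# for the Kolmogorov confidence level.  For two samples of n1 and n2
# (unweighted) events with maximum distance d, the test statistic is
#     z = d * sqrt(n1*n2/(n1+n2))
# and
#     PROB = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2).

def kolmogorov_prob(d, n1, n2, terms=100):
    z = d * math.sqrt(n1 * n2 / (n1 + n2))
    if z == 0.0:
        return 1.0                       # identical distributions
    s = 0.0
    for k in range(1, terms + 1):
        s += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    return min(1.0, max(0.0, 2.0 * s))   # clamp to [0, 1]

# A small distance between two large samples is quite probable ...
print(kolmogorov_prob(0.02, 1000, 1000))
# ... while a large distance is extremely improbable:
print(kolmogorov_prob(0.20, 1000, 1000))
```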

        The Confidence Level for 2-dimensional data

        The Kolmogorov Test for 2-dimensional data is not as well understood as for one dimension. The basic problem is that it requires the unbinned data to be ordered, which is easy in one dimension, but is not well-defined (i.e. not scale-invariant) in higher dimensions. Paradoxically, the binning which was a nuisance in one dimension is now very useful, since it enables us to define an obvious ordering. In fact there are two obvious orderings (horizontal and vertical) which give rise to two (in general different) Kolmogorov distance measures. Routine HDIFF takes the average of the two distances to calculate the probability value PROB, which gives very satisfactory results. The precautions necessary for 1-dimensional data also apply to this case.
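The two orderings described above can be sketched as follows (Python, not HBOOK code; the function names are ours): each flattening of the bins defines a 1-dimensional cumulative distribution, and the two resulting Kolmogorov distances are averaged:

```python
# Illustrative sketch (not HBOOK code): the two Kolmogorov distances
# for binned 2-dimensional data.  The horizontal ordering scans the
# bins row by row, the vertical ordering column by column; the two
# distances are averaged, and the average is then converted into PROB
# as in the 1-dimensional case.

def distance_1d(seq1, seq2):
    n1, n2 = sum(seq1), sum(seq2)
    dmax, c1, c2 = 0.0, 0.0, 0.0
    for b1, b2 in zip(seq1, seq2):
        c1 += b1 / n1
        c2 += b2 / n2
        dmax = max(dmax, abs(c1 - c2))
    return dmax

def distance_2d(h1, h2):
    """h1, h2: lists of rows (identical binning assumed)."""
    rowwise = lambda h: [b for row in h for b in row]
    colwise = lambda h: [row[j] for j in range(len(h[0])) for row in h]
    d_horizontal = distance_1d(rowwise(h1), rowwise(h2))
    d_vertical = distance_1d(colwise(h1), colwise(h2))
    return 0.5 * (d_horizontal + d_vertical)

h = [[1, 2], [3, 4]]
print(distance_2d(h, h))                              # identical -> 0.0
print(distance_2d([[4, 0], [0, 0]], [[0, 0], [0, 4]]))  # disjoint -> 1.0
```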