In analyzing experimental data, it is sometimes very important to be able to parametrize relationships among data by means of linear combinations of elementary functions. One of these expansions is the well known Taylor series which is an expansion of an analytic function using as elementary functions (regressors) the powers of the independent variables. In the case of the Taylor series, the coefficients of the linear combination are a well known expression involving the derivatives of the function to be represented at a given (initial) point. Even if the Taylor series is very powerful for its noticeable features and its analytical simplicity, from the point of view of the numerical analist it might be not the best expansion in terms of elementary functions. Ideally one would like to use the smallest possible number of regressors, in order to describe the relationship among data with a given precision, and this is not always guaranteed using the Taylor series. Another feature of the monomials is that they do not form a complete orthonormal set of functions, in the sense of the Hilbert spaces, while other functions, which possess this feature, may be more computationally convenient to use.
The entries HPARAM (for histograms and scatter plots) and HPARMN (for n-dimensional distributions) perform regressions on the data in order to find the best parametrization in terms of elementary functions (regressors).
Let us suppose that we have collected NP
sets of NDIM+1
data, being (
) with
NP
the
set. Here
is the value observed at the coordinates
. NDIM
must be greater or equal to
1 and it
is the dimensionality of our parametrization problem, that is
the number of independent variables that we want to use in
order to parametrize our data.
In the case of a histogram NDIM
is clearly 1, x is the centre
of a histogram channel, y is the contents of the corresponding channel
and NP
is the number of channels of the histogram.
In the case of
a scatter plot NDIM
is 2,
and
are the ordinate and abscissa of
the centre of a cell, y is the contents of the cell and NP
is
NX*NY
, the total number of cells.
In case of histograms or scatter plots
the entry HPARAM should be used.
For data which are not contained in an
HBOOK
histogram or scatter plot and for higher dimensionality sets (NDIM
) the entry HPARMN has to be used.
We could express a possible
expansion of our data in elementary functions in the following way:
| ||
| ||
| ||
| ||
|
where we are summing NCO
elementary functions
each one multiplied by its own coefficient.
In other words we are pretending
that our data points are fitted by the given expansion but for a residual,
namely
.
The expansion itself is the linear combination of
the regressors
computed at the
data point.
The strategy
of the two entries is to minimize the residual variance from fitting
various possible regressors out of a set which is either system or user-defined.
The previous expression can be rewritten in a more synthetical notation:
with
array of |NP| data points | |
matrix of regressors (|NP| lines times |NCO| columns) | |
array of |NCO| coefficients | |
array for the |NP| residuals |
As already said, we want to use the smallest possible number of regressors
for a given set of data points, which yields the desired fit (in the
terms explained below).
That is to say that the rank of the matrix F should be NCO
.
If it were smaller, then some (at least one) of the columns
could be represented as a linear combination of others, and so, rearranging
the coefficients, we could get rid of these regressors.
The residual variance is minimized under the following constraints:
NP
values of the residuals should have a mean of 0.
D
).
The coefficients and regressors are determined iteratively until the
convergence criteria for the fit are reached.
The user has the choice to specify
that the criteria are met either when the residual variance stops
decreasing or when the multiple correlation coefficient reaches the
value specified via the input parameter R2MIN
(see later).
Once the list of all possible regressors has been established,
at each step the one which gives the highest drop in
the residual sum of squares is added to the expansion.
In addition, all previously chosen regressors are tested for
possible rejection, using standard test statistics (F-tests),
as it can happen that a regressor becomes superfluous
at a later stage of the fitting process.
The routines offer three choices for the regressors: standard
elementary functions, user-given elementary functions or user-given regressors.
In the first two cases each regressor is the product of NDIM
elementary
functions, one for each variable (the constant function may be included
in the set).
This means that in the first two cases we will have a matrix of
NDIM*NCO
elementary functions, and the array of regressors will
be the result of the scalar product of the NDIM
columns of this
matrix.
In the last case each regressor is user-defined, and we have just
NCO
regressors to be specified as user functions (see below).
The NCO
regressors have to be linearly combined with the NCO
coefficients
contained in the output array COEFF
to obtain the parametrization.
Each elementary function, either standard or user-defined is identified
by a three digit code, let's say nnm
.
The last digit, m
, is a one digit
code specifying the kind of regressor used, with the following meaning:
NDIM*NCO
of these codes arranged
in an array, so that the first NDIM
codes are the code of the
elementary functions to be multiplied to obtain the first regressor.
For system defined elementary
functions nn
is the degree of the polynomial.
For user-provided regressors,
the meaning of this code is user-defined, in the sense that the user is
free to give any meaning he wishes to this code in the routine he has to supply.
In the case of user-provided regressors, we will have just NCO
regressor
codes in the array ITERM
at the end of the fitting process.
Only one code is needed to define a user-given regressor,
whereas NDIM
codes are
needed to specify a regressor composed of elementary functions.
The parametrization routines will select the most efficient subset of regressors out of the set selected by the user, in the sense of the criteria specified above.
Once the initial set of regressors has been specified, the user still has the opportunity to tell the system how to make a further selection among those regressors, before the fitting process actually starts:
MAXPOW
(array of dimension
NDIM
). This input parameter is mandatory and there is no default
provided for it.
PSEL
(via HSETPR)
to a positive value <NDIM
. PSEL
controls the exclusion
of the combinations of standard polynomials which give the higher order
regressors (see the description of this parameter).
ITERM
(under control of IC
below) may
be used as input to the routines to contain the list of the codes of
the acceptable regressors.
As soon as the best parametrization is obtained, the user may
call the DOUBLE PRECISION
function
[HRVAL]HRVAL(X),
which returns the value of the
parametrization, computed at point X
, where X
is an array of dimension NDIM
.
The program can handle up to 10
dimensions (i.e. NDIM
)
Up to 50 regressors may appear in the final expression.
This is controlled
by the PNCX
parameter set by the routine HSETPR (see below).
Memory requirements: a 1-dimensional histogram with 100 channels and
with the maximum number of regressors PNCX
set to 10, requires in
DOUBLE PRECISION
version, less than 5000 words;
for large values of PNCX
and a large
number of points, memory behaves roughly as
2*NP*PNCX (DOUBLE PRECISION)
.