Next: Histograms and plots Up: Fittingparameterization and Previous: Deprecated fitting routines

Parametrization

In analyzing experimental data, it is sometimes very important to be able to parametrize relationships among data by means of linear combinations of elementary functions. One of these expansions is the well known Taylor series which is an expansion of an analytic function using as elementary functions (regressors) the powers of the independent variables. In the case of the Taylor series, the coefficients of the linear combination are a well known expression involving the derivatives of the function to be represented at a given (initial) point. Even if the Taylor series is very powerful for its noticeable features and its analytical simplicity, from the point of view of the numerical analist it might be not the best expansion in terms of elementary functions. Ideally one would like to use the smallest possible number of regressors, in order to describe the relationship among data with a given precision, and this is not always guaranteed using the Taylor series. Another feature of the monomials is that they do not form a complete orthonormal set of functions, in the sense of the Hilbert spaces, while other functions, which possess this feature, may be more computationally convenient to use.

The entries HPARAM (for histograms and scatter plots) and HPARMN (for n-dimensional distributions) perform regressions on the data in order to find the best parametrization in terms of elementary functions (regressors).

Let us suppose that we have collected NP sets of NDIM+1 data, being ( $y$ _i,x_1,i...x_NDIM,i ) with $1≤i≤$ NP the $i$ ^th set. Here $y$ _i is the value observed at the coordinates $x$ _1,i...x_NDIM,i . NDIM must be greater or equal to 1 and it is the dimensionality of our parametrization problem, that is the number of independent variables that we want to use in order to parametrize our data. In the case of a histogram NDIM is clearly 1, x is the centre of a histogram channel, y is the contents of the corresponding channel and NP is the number of channels of the histogram. In the case of a scatter plot NDIM is 2, $x$ ₁ and $x$ ₂

are the ordinate and abscissa of the centre of a cell, y is the contents of the cell and NP is NX*NY, the total number of cells. In case of histograms or scatter plots the entry HPARAM should be used. For data which are not contained in an HBOOK histogram or scatter plot and for higher dimensionality sets (NDIM $≤10$ ) the entry HPARMN has to be used. We could express a possible expansion of our data in elementary functions in the following way:

$y$ ₁ $=$ $c$ ₁*F₁(x_1,1, ..., x_NDIM,1) + ...

$2cm + c$ _NCO*F_NCO(x_1,1, ..., x_NDIM,1) + u₁

$2cm$

$y$ _NP $=$ $c$ ₁*F₁(x_1,NP, ..., x_NDIM,NP) + ...

$2cm + c$ _NCO*F_NCO(x_1,NP, ..., x_NDIM,NP)+u_NP

where we are summing NCO elementary functions each one multiplied by its own coefficient. In other words we are pretending that our data points are fitted by the given expansion but for a residual, namely $u$ _i . The expansion itself is the linear combination of the regressors $F$ _j computed at the $j$ ^th data point. The strategy of the two entries is to minimize the residual variance from fitting various possible regressors out of a set which is either system or user-defined. The previous expression can be rewritten in a more synthetical notation:

$y = Fc + u$

with

$y$ array of |NP| data points
$F$ matrix of regressors (|NP| lines times |NCO| columns)
$c$ array of |NCO| coefficients
$u$ array for the |NP| residuals

As already said, we want to use the smallest possible number of regressors for a given set of data points, which yields the desired fit (in the terms explained below). That is to say that the rank of the matrix F should be NCO. If it were smaller, then some (at least one) of the columns could be represented as a linear combination of others, and so, rearranging the coefficients, we could get rid of these regressors.

The residual variance is minimized under the following constraints:

The NP values of the residuals should have a mean of 0.
The residuals should be independently distributed. This means that their covariance matrix has the form of a positive diagonal matrix (we call it D).
The regressors should be linearly independent.

Hypothesis 2 implies that

(F

^tF)^-1 exists, where

D (F

^tF)^-1 is the covariance matrix of the coefficients.

The coefficients and regressors are determined iteratively until the convergence criteria for the fit are reached. The user has the choice to specify that the criteria are met either when the residual variance stops decreasing or when the multiple correlation coefficient reaches the value specified via the input parameter R2MIN (see later). Once the list of all possible regressors has been established, at each step the one which gives the highest drop in the residual sum of squares is added to the expansion. In addition, all previously chosen regressors are tested for possible rejection, using standard test statistics (F-tests), as it can happen that a regressor becomes superfluous at a later stage of the fitting process.

The routines offer three choices for the regressors: standard elementary functions, user-given elementary functions or user-given regressors. In the first two cases each regressor is the product of NDIM elementary functions, one for each variable (the constant function may be included in the set). This means that in the first two cases we will have a matrix of NDIM*NCO elementary functions, and the array of regressors will be the result of the scalar product of the NDIM columns of this matrix. In the last case each regressor is user-defined, and we have just NCO regressors to be specified as user functions (see below). The NCO regressors have to be linearly combined with the NCO coefficients contained in the output array COEFF to obtain the parametrization. Each elementary function, either standard or user-defined is identified by a three digit code, let's say nnm. The last digit, m, is a one digit code specifying the kind of regressor used, with the following meaning:

0: Standard elementary function
1: User-provided elementary function
2: User-provided regressor

The first two digits define the function number. In the first two cases, we have NDIM*NCO of these codes arranged in an array, so that the first NDIM codes are the code of the elementary functions to be multiplied to obtain the first regressor. For system defined elementary functions nn is the degree of the polynomial. For user-provided regressors, the meaning of this code is user-defined, in the sense that the user is free to give any meaning he wishes to this code in the routine he has to supply. In the case of user-provided regressors, we will have just NCO regressor codes in the array ITERM at the end of the fitting process. Only one code is needed to define a user-given regressor, whereas NDIM codes are needed to specify a regressor composed of elementary functions.

The parametrization routines will select the most efficient subset of regressors out of the set selected by the user, in the sense of the criteria specified above.

Once the initial set of regressors has been specified, the user still has the opportunity to tell the system how to make a further selection among those regressors, before the fitting process actually starts:

The maximum degrees of the standard polynomials for each independent variable are given by MAXPOW (array of dimension NDIM). This input parameter is mandatory and there is no default provided for it.
The total degree of a regressor made up from the product of standard polynomials, may be limited by setting PSEL (via HSETPR) to a positive value <NDIM. PSEL controls the exclusion of the combinations of standard polynomials which give the higher order regressors (see the description of this parameter).
Each initial regressor, whose code is passed to the logical function HSELBF, written by the user, may be accepted or rejected according to user criteria
The argument ITERM (under control of IC below) may be used as input to the routines to contain the list of the codes of the acceptable regressors.

As soon as the best parametrization is obtained, the user may call the DOUBLE PRECISION function [HRVAL]HRVAL(X), which returns the value of the parametrization, computed at point X, where X is an array of dimension NDIM.

The program can handle up to 10 dimensions (i.e. NDIM $≤10$ ) Up to 50 regressors may appear in the final expression. This is controlled by the PNCX parameter set by the routine HSETPR (see below).

Memory requirements: a 1-dimensional histogram with 100 channels and with the maximum number of regressors PNCX set to 10, requires in DOUBLE PRECISION version, less than 5000 words; for large values of PNCX and a large number of points, memory behaves roughly as 2*NP*PNCX (DOUBLE PRECISION).

Next: Histograms and plots Up: Fittingparameterization and Previous: Deprecated fitting routines

Last update: Tue May 16 09:09:27 METDST 1995

$y$ ₁	$=$	$c$ ₁*F₁(x_1,1, ..., x_NDIM,1) + ...
		$2cm + c$ _NCO*F_NCO(x_1,1, ..., x_NDIM,1) + u₁
		$2cm$
$y$ _NP	$=$	$c$ ₁*F₁(x_1,NP, ..., x_NDIM,NP) + ...
		$2cm + c$ _NCO*F_NCO(x_1,NP, ..., x_NDIM,NP)+u_NP

$y$	array of \|NP\| data points
$F$	matrix of regressors (\|NP\| lines times \|NCO\| columns)
$c$	array of \|NCO\| coefficients
$u$	array for the \|NP\| residuals