HBOOK -- H2Parametrization

Parametrization

In analyzing experimental data, it is sometimes very important to be able to parametrize relationships among data by means of linear combinations of elementary functions. One of these expansions is the well known Taylor series which is an expansion of an analytic function using as elementary functions (regressors) the powers of the independent variables. In the case of the Taylor series, the coefficients of the linear combination are a well known expression involving the derivatives of the function to be represented at a given (initial) point. Even if the Taylor series is very powerful for its noticeable features and its analytical simplicity, from the point of view of the numerical analist it might be not the best expansion in terms of elementary functions. Ideally one would like to use the smallest possible number of regressors, in order to describe the relationship among data with a given precision, and this is not always guaranteed using the Taylor series. Another feature of the monomials is that they do not form a complete orthonormal set of functions, in the sense of the Hilbert spaces, while other functions, which possess this feature, may be more computationally convenient to use.

The entries HPARAM (for histograms and scatter plots) and HPARMN (for n-dimensional distributions) perform regressions on the data in order to find the best parametrization in terms of elementary functions (regressors).

Let us suppose that we have collected NP sets of NDIM+1 th data, being (y ,x . ..x ) with 1< =i< =NP the i set. i 1,i NDIM,i

Here y is the value observed at the coordinates x .. .x . i 1,i NDIM,i NDIM must be greater or equal to 1 and it is the dimensionality of our parametrization problem, that is the number of independent variables that we want to use in order to parametrize our data. In the case of a histogram NDIM is clearly 1, x is the centre of a histogram channel, y is the contents of the corresponding channel and NP is the number of channels of the histogram. In the case of a scatter plot NDIM is 2, x and x are the ordinate and abscissa 1 2 of the centre of a cell, y is the contents of the cell and NP is NX*NY, the total number of cells. In case of histograms or scatter plots the entry HPARAM should be used. For data which are not contained in an HBOOK histogram or scatter plot and for higher dimensionality sets (NDIM <=10) the entry HPARMN has to be used. We could express a possible expansion of our data in elementary functions in the following way:

       y     =    c  *F (x   ,. .., x      )+ ...
        1          1   1  1,1        NDIM,1
                             +c   * F   (x   ,. .., x      )+ u
                               NCO   NCO  1,1        NDIM,1    1
                             .
                             .
      y      =    c  *F (x    ,. .., x       )+ ...
       NP          1   1  1,NP        NDIM,NP

                             +c   * F   (x    ,. .., x       )+ u
                               NCO   NCO  1,NP        NDIM,NP    NP

where we are summing NCO elementary functions each one multiplied by its own coefficient. In other words we are pretending that our data points are fitted by the given expansion but for a residual, namely u . i The expansion itself is the linear combination of the regressors

th F computed at the j data point. The strategy of the two entries is to j minimize the residual variance from fitting various possible regressors out of a set which is either system or user-defined. The previous expression can be rewritten in a more synthetical notation:

                                 y= Fc+u

                y    array of NP data points
                F    matrix of regressors (NP lines
                     times NCO columns)
                c    array of NCO coefficients
                u    array for the NP residuals

As already said, we want to use the smallest possible number of regressors for a given set of data points, which yields the desired fit (in the terms explained below). That is to say that the rank of the matrix F should be NCO. If it were smaller, then some (at least one) of the columns could be represented as a linear combination of others, and so, rearranging the coefficients, we could get rid of these regressors.

The residual variance is minimized under the following constraints:

The NP values of the residuals should have a mean of 0.
The residuals should be independently distributed. This means that their covariance matrix has the form of a positive diagonal matrix (we call it D).
The regressors should be linearly independent.

t -1 t -1 Hypothesis 2 implies that (F F) exists, where D(F F) is the covariance matrix of the coefficients.

The coefficients and regressors are determined iteratively until the convergence criteria for the fit are reached. The user has the choice to specify that the criteria are met either when the residual variance stops decreasing or when the multiple correlation coefficient reaches the value specified via the input parameter R2MIN (see later). Once the list of all possible regressors has been established, at each step the one which gives the highest drop in the residual sum of squares is added to the expansion. In addition, all previously chosen regressors are tested for possible rejection, using standard test statistics (F-tests), as it can happen that a regressor becomes superfluous at a later stage of the fitting process.

The routines offer three choices for the regressors: standard elementary functions, user-given elementary functions or user-given regressors. In the first two cases each regressor is the product of NDIM elementary functions, one for each variable (the constant function may be included in the set). This means that in the first two cases we will have a matrix of NDIM*NCO elementary functions, and the array of regressors will be the result of the scalar product of the NDIM columns of this matrix. In the last case each regressor is user-defined, and we have just NCO regressors to be specified as user functions (see below). The NCO regressors have to be linearly combined with the NCO coefficients contained in the output array COEFF to obtain the parametrization. Each elementary function, either standard or user-defined is identified by a three digit code, let's say nnm. The last digit, m, is a one digit code specifying the kind of regressor used, with the following meaning:

0: Standard elementary function
1: User-provided elementary function
2: User-provided regressor

The first two digits define the function number. In the first two cases, we have NDIM*NCO of these codes arranged in an array, so that the first NDIM codes are the code of the elementary functions to be multiplied to obtain the first regressor. For system defined elementary functions nn is the degree of the polynomial. For user-provided regressors, the meaning of this code is user-defined, in the sense that the user is free to give any meaning he wishes to this code in the routine he has to supply. In the case of user-provided regressors, we will have just NCO regressor codes in the array ITERM at the end of the fitting process. Only one code is needed to define a user-given regressor, whereas NDIM codes are needed to specify a regressor composed of elementary functions.

The parametrization routines will select the most efficient subset of regressors out of the set selected by the user, in the sense of the criteria specified above.

Once the initial set of regressors has been specified, the user still has the opportunity to tell the system how to make a further selection among those regressors, before the fitting process actually starts:

The maximum degrees of the standard polynomials for each independent variable are given by MAXPOW (array of dimension NDIM). This input parameter is mandatory and there is no default provided for it.
The total degree of a regressor made up from the product of standard polynomials, may be limited by setting PSEL (via HSETPR) to a positive value . PSEL controls the exclusion of the combinations of standard polynomials which give the higher order regressors (see the description of this parameter).
Each initial regressor, whose code is passed to the logical function HSELBF, written by the user, may be accepted or rejected according to user criteria
The argument ITERM (under control of IC below) may be used as input to the routines to contain the list of the codes of the acceptable regressors. As soon as the best parametrization is obtained, the user may call the DOUBLE PRECISION function HRVAL(X), which returns the value of the parametrization, computed at point X, where X is an array of dimension NDIM. The program can handle up to 10 dimensions (i.e. NDIM<=10) Up to 50 regressors may appear in the final expression. This is controlled by the PNCX parameter set by the routine HSETPR (see below). Memory requirements: a 1-dimensional histogram with 100 channels and with the maximum number of regressors PNCX set to 10, requires in DOUBLE PRECISION version, less than 5000 words; for large values of PNCX and a large number of points, memory behaves roughly as 2*NP*PNCX (DOUBLE PRECISION). Histograms and plots +----------------------------------------------------+ |CALL HPARAM (ID,IC,R2MIN,MAXPOW,COEFF*,ITERM*,NCO*) | +----------------------------------------------------+ Input parameters: ID Histogram or plot identifier. IC Control word (see below) R2MIN Maximum required value of multiple correlation coefficient MAXPOW Maximum degrees of standard polynomials (INTEGER array of size NDIM) ITERM Acceptable function codes (see explanation above) (INTEGER array of size <=PNBF, see explanation above) Output parameters: NCO number of regressors in final expression. COEFF Coefficients of terms (DOUBLE PRECISION array of size NCO) ITERM Accepted regressors codes (INTEGER array of size NDIM*NCO or NCO, see explanation above) IC is a coded integer with the following fields: 7 6 5 4 3 2 IC= N*10 +R* 10 + C*10 +B*10 +T* 10 + W*10 + P*10 +S S This switch controls the superimposition of the result when printing the histogram, it is effective only for 1-dimensional histograms 1 Resulting parametric fit superimposed on histogram 0 No superposition P This switch controls the amount of information printed during the fitting process 0 Minimal output: the residual sum of squares is printed 1 Normal output: in addition, the problem characteristics and options are printed; also the standard deviations and confidence intervals of the coefficients 2 extensive output: the results of each iteration are printed with the normal output W This switch controls the weights to applied to the data during the fit 0 weights on histogram contents are already defined via HBARX or HPAKE. If not they are taken to be equal to the square-root of the contents 1 weights are equal to 1 T This switch controls the system-defined elementary functions set to use 0 Monomials will be selected as the standard elementary functions 1 Chebyshev polynomials with a definition region: [-1,1] 2 Legendre polynomials with a definition region: [-1,1] 3 shifted Chebyshev polynomials with a definition region: [0,1] 4 Laguerre polynomials with a definition region: [0,+1] 5 Hermite polynomials with a definition region: [-1,+1] B This switch controls the kind of regressor used in the fit 0 Regressors are products of standard polynomials (see preceeding switch) 1 Regressors are products of user-defined elementary functions. The user should write a DOUBLE PRECISION function HELEFT(IEF,X) where IEF is the elementary function code in the sense explained above: 10 times the function code plus 1. The function code number can vary from 1 to PNEF, number of user supplied elementary functions. PNEF has to be specified by the user before calling HPARAM, via a call to the routine HSETPR described below. The seconf parameter X is the point where the function should be calculated. 2 Regressors are user-defined, in this case the variable PNBF, number of basic functions, must be set by the user via a call to HSETPR (see below). User regressors values are to be returned by the user-supplied DOUBLE PRECISION function HBASFT(IBF,X) where IBF is the user-defined regressor code in the sense explained above: 10 times the regressor code plus 2. The regressor code number can vary from 1 to PNBF, number of basic functions used for the parametrization. PNBF has to be specified by the user before calling HPARAM, via a call to the routine HSETPR described below, in the case that the switches B or C of the control parameter IC have the value 2. The parameter X is an array of length NDIM defining the coordinate where the regressor should be calculated. C This switch controls the selection of the regressors 0 Regressors are selected by the system. The parameter PSEL and the array MAXPOW help the user in controlling the total degree of each regressor (see below). 1 for each elementary function or for each regressor, the logical function HSELBF(ELEF) gets called once before the beginning of the actual fitting procedure to set up the table of available elementary functions or regressors. IELEF is the code of the regressor or of the function being tested for acceptance. A value of .TRUE. returned will cause the regressor/elementary function to be accepted. The default version of this function which is supplied in HBOOK always returns the value .TRUE., while the user may want to write his own version of this function to exclude some of the regressors or of the elementary functions (according to the value of the B switch he has selected). 2 The input array ITERM contains a list of selected regressors or elementary functions according to the value of the switch B. The array has to be at least NDIM*PNBF words and the variable PNBF, maximum number of elementary functions or regressors, should have been set via a call to HSETPR. In the case of elementary functions, the element ITERM(IDIM+(IBF-1)*NDIM) is the function of the variable x to be multiplied by the coefficent number IDIM IBF, given a maximum expected number of PNBF coefficents. This can be of course reduced by the fitting program under considerations of linear dependency, as stated above, returning only NCO coefficents, thus after the fit ITERM will identify which regressors were actually used. In the case of user-defined regressors, only the first PNBF position of the ITERM array need to be filled, and the array will contain directly the code of the selected user-defined regressors. R This switch controls the system selection mechanism for the regressors: 0 Stepwise regression, the regression is calculated as explained before, and selected regressors may be eliminated in a further step of the procedure 1 Forward regression. The backward stage is suppressed: a regressor included at one time will always remain in the regression later on 2 Fixed-term regression: all selected regressors will appear in the final expression N This switch controls the normalization of the X range during the computation. 0 No normalization of X-range 1 X scaled in the range [-1,1] 2 X scaled in the range [0,1] 3 X scaled in the range [0,+1] The value of R2MIN is used to determine the satisfactory end of the fitting process. If 0<R2MIN<1, this value will be compared to the current sum of the squares of the residuals to determine whether the fit should be stopped or continued. If R2MIN>=1, the fitting process will be halted when the residual variance is minimal. Various parameters which are relevant for the parametrization can be set via the routine HSETPR. +----------------------------+ | CALL HSETPR (CHNAME,VALUE) | +----------------------------+ Input parameters: CHNAME Name of the parameter to be set (of type CHARACTER) VALUE Value assigned to parameter CHNAME (of type REAL) Possible values for CHNAME are: 'PNEF' This parameter specifies to the system the number of elementary functions that are supplied by the user, in case the switch B of the IC parameter is set to 1. The system will build PNBF basic regressors made up by products of NDIM elementary functions, possibly user selected, as specified by the switch C of the parameter IC. 'PNBF' This parameter must be specified in case user regressors are used (B switch of IC set to 2), or in case the ITERM contains on input the selected regressors or basic elementary functions (C switch of IC parameter set to 2). The parameter is ignored in the other cases. This is the total number of user-defined regressors or elementary functions which the system will use for the parametrization process. Please note that in case of user given regressors, if the C switch of the IC parameter is either 0 or 1, the logical function HSELBF(ELEF) will always be called in order to verify the inclusion of the regressor IELEF in the list for the current fitting operation. 'PSEL' This parameter is used only in case of system-supplied elementary functions. It is the maximum limit for the sum of the terms POW(I)/MAXPOW(I) in each regressor, where POW(I) is the power of the system-supplied elementary function th for the I variable (1<=I<NDIM) and MAXPOW(I) is the user supplied maximum value for the degree of the system-supplied th elementary functions for the I variable. This value must be 0. Note that setting PSEL to 0 selects the constant term for all the regressors. Setting PSEL to a value >=NDIM has the effect of removing any limitation on the total degree of the regressors, leaving MAXPOW(I) as the only effective limitation on the degree of the the elementary functions. This means that the total degree of the regressors can be equal to the sum of the NDIM elements of MAXPOW(I). The system supplied default is 1. 'PFLV' This parameter determines the F-test significance level for rejection of already included regressors; by default it is set to 1. A higher value makes it more difficult for regressors to remain in the regression. The value of PFLV has to be >0. 'PLUN' This parameter indicates the logical unit for writing the Fortran code of the function FPARAM(X). which gives the value of the parametrization at point X. This Fortran code can be compiled and used in subsequent jobs. By default no code is written. 'PNBX' Maximum number of regressors that may be entered as input to the fitting process (after selection). By default PNBX is set to 500; setting PNBX to an appropriate value will avoid wasting space in dynamic area /PAWC/. This parameter may be set to the same value of PNBF or PNEF whichever is appropriate to save space. 'PNCX' Maximum number of regressors that may appear in the final expression of the parametrization (this number has to be <=50). Default setting is 50. The same remark applies as for PNBX. Distributions +----------------------------------------------------------------+ |CALL HPARMN (X,Y,EY,NP,NVAR,IC,R2MIN,MAXPOW,COEFF*,ITERM*,NCO*) | +----------------------------------------------------------------+ Input parameters: X Coordinates of data points (array of size NP*NVAR) Y Data to fit (array of length NP) EY Errors on data (array of length NP) NP Number of points to fit NVAR Dimension of X-space MAXPOW see HPARAM IC see HPARAM R2MIN see HPARAM Output parameters: See HPARAM. Remark: See HPARAM The only difference concerns option W of IC W = 0 errors are taken from EY W = 1 errors are set to 1

Parametrization

Histograms and plots

Distributions