LINTRA -- Principal components analysis program
Authors: R. Brun DD/EE/79-5
M. Hansroul DD/US/70
J. Kubler Revised 80.05.10
1. INTRODUCTION
In many applications of various fields of research, the treat-
ment of large amounts of data requires powerful techniques capable of
rapid data reduction and analysis. Usually, the quantities most con-
veniently measured by the experimentalist are not necessarily the most
significant for the classification and analysis of his observations.
It is then very helpful to have a way of selecting an optimal set of
variables necessary for the recognition process and reducing the di-
mensionality of the problem resulting in easier classification proce-
dure.
This paper describes the implementation at CERN of one such meth-
od of feature selection, namely the principal components analysis.
This multi-dimensional technique is well known in the field of pattern
recognition (Ref. 1) and its use in Particle Physics has been docu-
mented elsewhere (Refs. 2, 3 and 4).
1.2 EXAMPLE OF APPLICATION
______________________
Let us consider the case where the prototypes in the sample are
trajectories of particles passing through a spectrometer. If one meas-
ures the passage of the particle at say, 8 fixed planes, the trajecto-
ry is then described by an 8-component vector
õ
x = (x(1),...,x(8)) in an 8-dimensional pattern space.
One proceeds then by generating a representative tracks sample
and building up the covariance matrix C. Its eigenvectors and eigenva-
lues are computed by standard methods and thus a new basis is obtained
for the original 8-dimensional space. The expansion of the prototypes
õ 8 õ õt õ
x(m) = ÷ a(m,i)e(i) with a(m,i) = x(m).e(i)
i=1
allows the study of the behaviour of the coefficients a(m,i) for all
the tracks in the sample. The eigenvectors which are insignificant for
the trajectory description in the expansion will have their corre-
sponding coefficients a(m,i) close to zero for all the prototypes. On
one hand a reduction of the dimensionality is then obtained by omit-
ting these least significant vectors in the subsequent analysis. On
the other hand, in the analysis of real data these least significant
variables can be used for the pattern recognition problem of extract-
ing the valid combinations of coordinates describing a true trajectory
from the set of all possible wrong combinations.
The program described here performs this principal components
analysis on a sample of data provided by the user. It computes the co-
variance matrix, its eigenvalues and corresponding eigenvectors and
exhibits the behaviour of the principal components [a(m,i)], thus pro-
viding to the user all the means of understanding his data.
A short outline of the method of Principal Components is given in
the Appendix.
1.3 STRUCTURE OF THE PACKAGE LINTRA
_______________________________
The LINTRA package has been written in FORTRAN and can readily be
implemented on various computers*. The block diagram below shows the
structure of LINTRA :
+----------------+
| |----õ+-----------------------------------------+
| L I N T R A | | LINIT : initialisation of control |
___________
| | | variables |
| |Ê----+-----------------------------------------+
| |
| |----õ+-----------------------------------------+
| | | LCREAD : data cards reading |
| |Ê----+-----------------------------------------+
| Initialisation |
| |----õ+-----------------------------------------+
| | | LZBOOK : initialisation of data blocks |
| |Ê----+-----------------------------------------+
| |
| |----õ+----------------------------+--õ+--------+
| | | LINEAR: computes covariance| | LINONE |
| | | matrix, its eigenvector|Ê--+--------+
| | | and eigenvalues |
| |Ê----+----------------------------+
| |
| |----õ+-----------------------------------------+
| | | LHBOOK : booking of histograms |
| |Ê----+-----------------------------------------+
| |
| Computation |----õ+-----------------------------------------+
| | | LMATOU : prints results of LINEAR |
| |Ê----+-----------------------------------------+
| of |
| |----õ+-----------------------------------------+
| | | LPUNCR : writes a FORTRAN subroutine |
| transformation | | performing the principal |
| | | components transformation |
| |Ê----+-----------------------------------------+
| |
| |----õ+-----+
| | |LTEST|
| | | +-|--õ+-------------------------------+
| | | | | | LINONE : reads one event |
| | | | |Ê--+-------------------------------+
| | | | |
| | | | |--õ+-------------------------------+
| Test | | | | | LTOXSI : computes principal |
| | | | | | components |
| | | | |Ê--+-------------------------------+
| | | | |
| of | | | |--õ+-------------------------------+
| | | | | | LTOX : computes original event|
| | | | | | from principal compo- |
| | | | | | nents |
| transformation | | | |Ê--+-------------------------------+
| | | | |
| | | | |--õ+-------------------------------+
| | | | | | LUSER : user subroutine |
| | | | |Ê--+-------------------------------+
| | | | |
| | | | |--õ+-------------------------------+
| | | | | | LSUMSQ : computes residuals |
| | | +-|Ê--+-------------------------------+
| |Ê----+-----+
| |
| |----õ+-----------------------------------------+
| | | LQUIT : prints final results |
| |Ê----+-----------------------------------------+
+----------------+
* The source code of LINTRA is available from the CERN Program Li-
brary.
LINTRA is essentially divided into 3 parts :
i) the initialisation
ii) the principal components analysis of the training sample
leading to the linear transformation to go from the original
vectors to the new description of the prototypes in terms of
the principal components
iii) the application of the transformation to a test sample.
The package utilises several utilities available from the CERN Pro-
gram Library. Among them, let us mention
- the histogramming package HBOOK (Ref. 5)
- the dynamic memory management system ZBOOK (Ref. 6)
- the free format input card reading subroutine FFREAD (Ref. 7).
The usage of these facilities is completely transparent and the
user need not know much about them in order to use LINTRA.
In case the user has asked for histograms, a simple call to the
subroutine HISTDO of the package HBOOK after a call to LINTRA is suf-
ficient to obtain the printing of the histograms (see Section 3.4).
When LINTRA is to be called from an existing program where the
blank common is used, it is important to know that HBOOK also utilises
the blank common. In that case, it is necessary to make a call to the
subroutine HISTGO of HBOOK in order to define the starting address of
the HBOOK area in the blank common.
1.4 USAGE
_____
The subroutine LINTRA can be called from any existing user program.
LINTRA uses two labelled common blocks :
i) the common LFLAG has a length of 115 words and contains main-
ly control variables
ii) the common LINT is dimensionned at 2000 words and contains
the data blocks created by the package ZBOOK.
Its length is sufficient for prototype vectors of at most 20
_______
components.
When more than 20 component vectors are to be treated, the
length of the common LINT should be defined by the calling
program according to the formula :
Length = 10 PNT Nvar + 4 ù Nvarý + 200
where NVAR is the number of components in the original proto-
types (see example in Section 3.4).
In addition, care must be taken that the logical files needed by
LINTRA (see data cards PUNC and OUTP in Section 3.2) must be declared
in the calling program.
In a first approach, the user has to provide a small subprogram
(LINONE) whose action is to read in one element from the sample (Chap-
ter 2) and also a small set of compulsory data cards (Chapter 3).
Additional data cards are available in order to select a methodo-
logical option or some different output options (Section 3.2).
A subroutine (LUSER) may also be optionally provided by the user,
in case he wishes to interact with the program and output some more
information (Section 3.3).
2. USER SUBPROGRAM
2.1 SUBROUTINE LINONE
_________________
The task of this subroutine is to provide one element of the
training sample to the principal components analysis program.
õ
An element consists of the vector X of its corresponding NVAR
variables.
In order to compute the transformation matrix, the subroutine is
called as many times as decided by the user. When the number of ele-
ments requested by the user has been reached, the logical variable
EOFDAT has to be set to .TRUE.
The subroutine is called again in order to apply the transforma-
tion matrix to a sample of data and fill various histograms. This
sample can be the same as the previous one.
The call to this subroutine has the form :
CALL LINONE(X,NVAR,EOFDAT)
where the meaning of the parameters is as follows :
NVAR = number of variables used by the principal
components analysis (set by the user via
data card)
(X(I),I=1,NVAR) = values of the NVAR variables for the
element read
EOFDAT = logical variable to be set .TRUE. when
the number of elements requested has
been reached (EOFDAT is initially set
to .FALSE.).
2.2 BLOCK DIAGRAM OF ROUTINE LINONE
_______________________________
LINONE(X,NVAR,EOFDAT)
+--------------------+
| Initialise |
| |
| |
| LUNIT = 1 Ê--------- corresponds to the logical file
| | TAPE1 declared in the program MAIN
| |
| NV = 0 Ê--------- counter set to zero
| |
| NMAXEV = n Ê--------- maximum number of events to be read
| | from TAPE1 - defined by the user
+--------------------+
|
|
¾
+----------------------------+
| read from the logical unit |
| LUNIT one event of the |
| form |
| |
| (X(I),I=1,NVAR) |
| |
| NV=NV+1 Ê--------------- increment the counter
+----------------------------+
|
|
¾
+----------------------------+
| |
| if (NV<NMAXEV) then RETURN |
| |
| |
| else set EOFDAT=.TRUE. |
| NV=0 ‡ |
| ‡ Ê----------- reset the counter for the test
| NMAXEV=m ‡ | sample and redefine the maximum
| | number of events to be read
| | from it
+----------------------------+
|
|
¾
+----------------------------+
| |
| if the test sample is the |
| same as the training sample|
| |
| then |
| REWIND LUNIT |
| else |
| LUNIT=2 Ê------------- corresponds to the logical file
| | TAPE2 declared in the program
| | MAIN
+----------------------------+
|
|
¾
+--------------+
| RETURN |
+--------------+
2.3 FORTRAN EXAMPLE
_______________
SUBROUTINE LINONE(X,NVAR,EOFDAT)
C
DIMENSION X(NVAR)
C
LOGICAL EOFDAT
C
C INITIALISATION
C
DATA LUNIT,NV,NMAXEV/1,0,1000/
C
READ (LUNIT) (X(I),I=1,NVAR)
NV=NV+1
C
IF (NV.LT.NMAXEV) GO TO 999
C
C RESET VARIABLES FOR TEST SAMPLE
C
EOFDAT=.TRUE.
NV=0
NMAXEV=500
C
C IN THIS EXAMPLE THE TEST SAMPLE IS READ ON UNIT 2
C
LUNIT=2
999 RETURN
END
3. DATA CARDS
The program uses the subroutine FFREAD in order to read the data
cards in free format.
The first four characters are used as a keyword by the program,
in order to identify the group of variables set in the remaining part
of the card.
The remaining pieces of information are values to be assigned to
certain control variables used by the program.
The data cards are divided into a compulsory set and an optional
one. The END card must be read last, all other cards may be placed
in any order.
3.1 COMPULSORY DATA CARDS
_____________________
+--------------------------------------------------------+
| | |
| KEYWORD | VARIABLES |
| | |
|---------|----------------------------------------------|
| | | |
| NVAR | NVAR | Number of variables (< 100) |
| | | |
|---------|---------|------------------------------------|
| | | |
| TEST | | Number of principal components |
| | NTEST | used to reconstruct the original |
| | | variables. |
| | | (by default, NTEST is set to NVAR) |
| | | |
|---------|---------|------------------------------------|
| | | |
| END | | Terminates the data cards reading. |
| | | (this card has to be the last one) |
| | | |
+--------------------------------------------------------+
3.2 OPTIONAL DATA CARDS
___________________
Methodological option :
+----------------------------------------------------+
| | |
| KEYWORD | VARIABLES |
| | |
|---------|------------------------------------------|
| | |
| NORM | Allows the user to transform original |
| | variables to normalized variables i.e. |
| | |
| | X(I) = (X(I) - X(I) ) / X(I) |
| | NEW OLD AVERAGE STANDARD |
| | DEVIATION |
| | |
| | This option is useful in the case |
| | the ranges of the variables are very |
| | different. |
| | |
+----------------------------------------------------+
Output options :
+---------------------------------------------------------+
| | |
| KEYWORD | VARIABLES |
| | |
|---------|-----------------------------------------------|
| | | |
| DEBU | 1 | Debug of the events between event |
| | (IDEMIN)| number IDEMIN and event number |
| | 2 | IDEMAX. For each event, original |
| | (IDEMAX)| variables, the coordinates of the |
| | | event on each principal component |
| | | and the reconstructed variables |
| | | using the NTEST first components |
| | | are successively printed |
|---------|---------|-------------------------------------|
| | | |
| HIST | 1 | Output of the histograms of the |
| | 2 | principal components number 1..2... |
| | 3 | ..NVAR (see example). |
| | . | |
| | . | |
|---------|---------|-------------------------------------|
| | | |
| RESI | | Output of the sum of squares of |
| | | the residuals for each variable, |
| | | taking into account from 1 to NVAR |
| | | principal components for the |
| | | reconstruction |
| | | |
|---------|---------|-------------------------------------|
| | | |
| PUNC | LPUNCH | writes on logical unit LPUNCH the |
| | | fortran code of the subroutine |
| | | XTOXSI which computes the coordina- |
| | | tes for each event in the new system|
| | | of axes defined by the principal |
| | | components |
| | | |
|---------|---------|-------------------------------------|
| | | |
| OUTP | LPRINT | writes on logical unit LPRINT all |
| | | the results of the principal |
| | | components analysis. |
| | | Default = line printer file |
| | | |
+---------------------------------------------------------+
The logical file numbers set by the data cards PUNC and OUTP
Remark :
must of course be defined by the PROGRAM card of the main
program supplied by the user (see 2.1) on CDC, or in the JCL
on IBM.
Example of data cards :
LIST ACTIVATES THE PRINTING OF THE DATA CARDS
NVAR 8 NUMBER OF INITIAL VARIABLES PER EVENT
TEST 5 NUMBER OF VARIABLES USED FOR THE RECONSTRUCTION
DEBU 1,10 DEBUG OF EVENTS ONE TO TEN
RESI THE CALCULATION OF THE RESIDUALS IS MADE
HIST 1=1 3=1 5=1 HISTOGRAM OF PRINC. COMP. ONE, THREE, FIVE
NORM THE VARIABLES ARE NORMALIZED
PUNC 4 FORTRAN CODE WRITTEN ON LOGICAL UNIT FOUR
END
3.3 OPTIONAL SUBROUTINE
___________________
The user subroutine LUSER allows one to display some more infor-
mation (such as histograms ..) and/or to perform some computation.
The subroutine LUSER is called during the second phase of the
program, when the test of the adjustment is made. In this case, the
data sample used by the program is the test sample. For each event
read by the program from the test sample, the subroutine gives access
to the coordinates of this event and to each of the NVAR principal
components.
The call to this subroutine has the form :
CALL LUSER(X,XSI,NVAR)
where the meaning of X, XSI and NVAR is described below.
Example :
SUBROUTINE LUSER(X,XSI,NVAR)
DIMENSION X(1),XSI(1)
C
C NVAR = NUMBER OF ORIGINAL VARIABLES
C (X(I),I=1,NVAR) = VALUE OF THE NVAR ORIGINAL VARIABLES FOR
C THE EVENT READ FROM THE TEST SAMPLE
C (XSI(I),I=1,NVAR) = COORDINATES OF THE ORIGINAL EVENT IN THE
C NEW SYSTEM OF AXES DEFINED BY THE NVAR
C PRINCIPAL COMPONENTS
C THIS ROUTINE COMPUTES THE DISTRIBUTION OF THE DISTANCE
C TO THE HYPERPLANE FORMED BY THE NTEST FIRST
C PRINCIPAL COMPONENTS (NTEST < NVAR)
C
DATA NTEST /5/
DATA IENTRY /0/
IF (IENTRY.NE.0) GO TO 200
C
C FIRST ENTRY
C
IENTRY=1
CALL HBOOK1(999,29HDISTRIBUTION OF THE DISTANCE$,100,0.,0.,0.)
C
C SECOND ENTRY
C
200 CONTINUE
C
C COMPUTE THE DISTANCE TO THE HYPERPLANE OF DIMENSION NTEST
C
D=0.
NTT=NTEST+1
DO 10 I=NTT,NVAR
10 D=D+XSI(I)**2
D=SQRT(D)
C
C FILLING OF THE HISTOGRAM
C
CALL HFILL(999,D,0.,1.)
RETURN
END
at this point, the user may conveniently use the subroutine
Remark :
LTOXSI and LTOX available in the LINTRA package. The respec-
tive calls have the form
CALL LTOXSI(X,XSI,NVAR,1,NVAR)
CALL LTOX(XSI,X,NVAR,NTEST)
where X, XSI,NVAR and NTEST are defined as in the example
above. LTOXSI computes the NVAR principal components XSI
from the original vector X.
LTOX reconstructs the original vector X from the NTEST first
principal components XSI.
3.4 EXAMPLE AND DECK SET-UP (CERN CDC 7600 SYSTEM)
______________________________________________
JOB,T10.
ACCOUNT,user,div,account.
FTN,L,R.
FIND,LLIB,LINTRALIB,ID=DE100.
CERNLIB,LLIB,GENLIB.
FILE,TAPE4,RT=Z,FL=80.
FIND,TAPE1,DATASAMPLE,ID=USERID.
FIND,TAPE2,TESTSAMPLE,ID=USERID.
LGO.
CATALOG,TAPE4,XTOXSI,ID=userid,ST=CCP,RP=999.
End-of-record
_____________
PROGRAM MAIN(INPUT,OUTPUT,TAPE1,TAPE2,TAPE4,..)
COMMON /LINT/ DUMMY(3000)
C THE COMMON LINT HAS A LENGTH SUFFICIENT FOR
C 25-DIMENSIONAL VECTORS
CALL LINTRA
C PRINT HISTOGRAMS
CALL HISTDO
END
SUBROUTINE LINONE(X,NVAR,EOFDAT)
.
.
.
END
SUBROUTINE LUSER(X,XSI,NVAR)
.
.
.
END
.
.
.
End-of-record
_____________
NVAR 25
TEST 6
DEBU 1 10
RESI
HIST 1=1 4=1 6=1 15=1
NORM
PUNC 4
END
End-of-file
___________
3.5 EXAMPLE AND DECK SET-UP (CERN IBM SYSTEM)
_________________________________________
//uuunn JOB uuu$gg
// EXEC NFORTCG,
// LLB1='DL.PUB.LINTRA.LIB',
// LLB2='CR.PUB.GENLIB'
//C.SYSIN DD *
COMMOM//B(5000)
COMMON/LINT/DUMMY(3000)
CALL HLIMIT(5000)
CALL LINTRA
CALL HISTDO
END
:
:
:
END
/*
//G.SYSIN DD *
LIST
:
:
:
END
/*
//G.FT01F001 DD DSN=gg.uuu.name1,DISP=(OLD)
//G.FT02F001 DD DSN=gg.uuu.name2,DISP=(OLD)
//G.FT04F001 DD DSN=gg.uuu.name3,
// DISP=(NEW,CATLG,DELETE),
// UNIT=SYSDA,
// DCB=(RECFM=F,BLKSIZE=133,LRECL=133),
// SPACE=(TRK,(1,1),RLSE)
4. REFERENCES
1. Andrews H. C.
Introduction to Mathematical techniques in pattern recognition,
Wiley Interscience, 1972.
2. H. Grote, M. Hansroul, J. C. Lassalle, P. Zanella
Identification of Digitized Particle Trajectories,
CERN - DD/73/23.
3. M. Hansroul, D. Townsend, P. Zanella
The application of Multi-dimensional techniques to the process-
ing of event data from large spectrometers,
CERN - DD/73/31.
4. H. Wind
Function Parametrization, Proc. of the 1972 CERN Computing and
data processing school, CERN 72-21.
5. HBOOK : HBOOK User Guide, CERN DD/EE/81-1, DD/US/72.
6. ZBOOK : ZBOOK User Guide, DD/US/73.
7. FFREAD : FFREAD User Guide, CERN DD/EE/78-2, DD/US/71.
Acknowledgements
________________
We would like to express our thanks to F. G. De Bilio to prepare
this document, which was produced with the text-editing program
SCRIPT/SYSPUB.
Appendix A
PRINCIPAL COMPONENTS METHOD
Let us consider a sample of M prototypes each being characterized
by P variables X1,....,Xp. Each prototype is a point or a column vec-
tor in a P-dimensional pattern space
_____________
õ t
x = (x(1),x(2),...,x(P))
where each x(n) represents the particular value associated with the
n-th dimension and t means "transpose".
Those P variables are the quantities accessible to the experimen-
ter but are not necessarily the most significant for the classifica-
tion purpose.
The Principal Components Method consists in applying a linear
____________________________ ______
transformation to the original variables. This transformation is de-
scribed by an orthogonal matrix and is equivalent to a rotation of the
original pattern space to a new set of coordinates vectors, which
hopefully provide easier feature identification and dimensionality re-
duction.
Let us define the covariance Matrix
õõt
C = <yy> (1)
õ õ õ
where y = x - <x> and the brackets indicate mean value over the sam-
ple of M prototypes.
This matrix is real positive definite and symmetric and will have
all its eigenvalues greater than zero. It will now be shown that,
among the family of all complete orthonormal bases in the pattern
space, the basis formed by the eigenvectors of the covariance matrix
and belonging to the largest eigenvalues corresponds to the most sig-
nificant features for the description of the original prototypes.
Let the prototypes be expanded into the set of N basis vectors
õ
(e(n), n=1,..,N,N+1,..,P)
õ N õ
y(i) = ÷ a(i,n)e (n) i = 1,..,M
(2)
n=1 N < P
õ
The "best" feature coordinates e(n) will be obtained by minimizing the
error due to this truncated expansion, i.e.
+ õ N õ +”
minimize E(N) = <| y (i) -÷ a(i,n)e (n) | >
(3)
+ n=1 +
õ õ
with the orthonormal conditions e(k).e(j) = 1 if k=j
(4)
= 0 if kØj
õt
Multiplying (2) by e(n) and using (4), one gets
õt õ
a(i,n) = y(i).e(n)
(5)
The error becomes
+ P õ + + P õt õ õ +
E(N) = <| ÷ a(i,n) e(n) | > = <| ÷ (y(i).e(n))e(n) | >
+ n=N+1 + + n=N+1 +
P õt õ õt õ p õt õ
= < ÷ e(n)y(i)y(i)e(n) > = ÷ e(n)Ce(n)
(6)
n=N+1 n=N+1
õt õ
The minimisation of the sum in (6) is obtained when each term
e(n)Ce(n) is minimum, as C is positive definite. By the method of La-
grange multipliers and the condition (4), one gets
P õt õ õt õ
E(N) = ÷ (e(n)Ce(n) - l(n)e(n).e(n) + l(n))
n=N+1
õ t
The minimum condition dE(N) / de(n) = 0 leads to the equation
õ õ
Ce(n) = l(n)e(n)
(7)
õ
which shows that e(n) is an eigenvector of the covariance matrix C and
l(n) the corresponding eigenvalue. The estimated minimum error is then
given by
P õt õ P
E(N) = ÷ e(n).l(n)e(n) = ÷ l(n)
(8)
n=N+1 n=N+1
where l(n), n=N+1,...,P are the eigenvalues associated with the omit-
ted eigenvectors in the expansion (2). Thus, by choosing the N largest
eigenvalues and their associated eigenvectors, the error E(N) is mini-
mized.
The transformation matrix to go from the pattern space to the feature
õ õ
space consists of the ordered eigenvectors e(1),...,e(p) for its rows
+ õt +
| e(1) |
| |
| õt |
| e(2) |
| |
| . |
| |
T = | . |
| |
| . |
| |
| . |
| |
| õt |
| e(p) |
+ +
This is an orthogonal transformation or rotation of the pattern space
and the feature selection results in ignoring certain coordinates in
the transformed space.