DATA ANALYSIS AND
INFORMATICS, V
E. Diday (editor)
© Elsevier Science Publishers B.V. (North-Holland), 1988
461
THE
LANGUAGE FOR INTERROGATING DATA - L.I.D. -
Jean-Marc BERNARD*, Robert
BALDY** and Henry ROUANET*
* Groupe
Mathématiques et Psychologie (CNRS, UA 1201), Université René Descartes, 12 rue Cujas,
75005 Paris, France.
** Institute of
Psychiatry, University of London, Great-Britain; now at Quantime Ltd., London,
Great-Britain.
The LID language can be
characterised as an interrogation language for a statistical data base. In this paper we present the underlying
concepts of LID. Starting from a basic data set, a LID request first generates
a derived data set by restriction, pooling, averaging,
and residual derivations, then specifies the operations to be performed on it: representation as a table or as a graph,
statistical calculations, etc. Last we sketch the program EYE-LID 1 † in
which emphasis h as been put on graphical] representations.
I.
INTRODUCTION
The language for interrogating data (L.I.D.)
presented in this communication constitutes the command language of the computer program EYE-LID 1. This program,
recently developed by R.. Baldy and J.-M. Bernard, is
a multivariate data analysis package in which the emphasis has been put on descriptive statistics and graphical representations. This paper
is mainly concerned with the LID
language itself, as the program EYE-LID 1 will be presented within the software
demonstrations of these meetings.
The LID language
can be described as a language for the analysis of a statistical structured data base (protocol); the user proceeds to requests of
analysis, through the formulae of the language, which initiate the search of relevant data
and their representation (tables, graphs, etc.), as well as the statistical
procedures to be applied to them.
The LID language is the last one of a series of
command languages devised for the statistical analysis of structured data, such as the ones
implemented in the VAR3 (see [l] and [2]) and
the STEL programs (see [3]). It is to be noted that VAR3 has become a
"routine" analysis of variance program for a whole community of experimental psychologists in this country. This
experience has demonstrated the
interest in the availability of a data interrogation language in which the
researcher can accurately translate
his own questions, within the frame of the experimental design. While
elaborating the LID language we have enlarged the set of
acceptable types of data sets, enriched the
language itself, and merged the descriptive approach of exploring
data with the traditional inferential approach of analysis of variance.
We will first present the underlying concepts of the
LID language (protocol, factors, variables), then the request formulae
(set-theoretic or linear), and the options for the representation of the
results. Last we will briefly
describe the program EYE-LID 1.
2. FUNDAMENTAL CONCEPTS
2.1. The
"INOP" data set, an example
The fundamental concepts on which the LID language is based will be
exemplified on a simple statistical data set borrowed from the "INOP"
survey (see [4] and [5]). Twenty subjects (school children) are divided into 4 groups of 5 pupils
each, according to teaching method (modern vs
† The program EYE-LID 1 presented in this paper has
been partly supported by a French-British (CNRS
- SSRC) joint project (ATP #
95.51.91).
462
J.-M.
Bernard. R. Baldy and H.
Rouanet
traditional) and social
environment (underprivileged vs privileged). The
children have been tested on 2 occasions (middle and end of school
year). On each occasion 3 variables have been measured yielding a
combinatorial, a probability and a logic score.
2.2. Protocol, factors, units and variables
The description of this data structure, that is the protocol,
first consists in the identification of the
protocol factor and of their modalities:
. factor "subjects" S
with 20 modalities: (s1 to s20)
. factor
"teaching method" A with 2 modalities: modern(a1), traditional (a2)
. factor "social environment" B with 2 modalities:
underprivileged (b1), privileged (b2)
. factor
"occasions" T with 2 modalities: middle (t1), end (t2) of school year
The set of all relations between
factors constitutes the protocol design. This design can be indicated by a LID formula by means of the factors, the nesting symbol
"<>", the crossing
symbol "*", and the composition
symbol "&" (see section 4.). Several designs may be envisaged
depending on the specification of relations: for example the design
S&A&B&T only states that data are indexed by the composition of
factors S, A, B and T, whereas the design S<A*B>*T tells us, in addition,
that S, nested in the crossing of A and B, is itself crossed with T. On the
right side of this formula one indicates the
variables designated by V1, V2, V3, hence the protocol design: S<A*B>*T→V1,V2,V3.
Each observed combination of the
factor modalities is called a unit; the set of these 40 units (20 modalities for S crossed with 2
modalities for T) constitutes the protocol support. With each unit is associated its identification by factor modalities, its values for all
variables, and its weight (equal to 1
for an elementary protocol). The following diagram summarizes all the
necessary components for the definition of a protocol:
|
|
|
|
|
|
|
|
|
Number of |
|
|
|
|||||||
|
|
1 |
|
|
nf |
|
units |
|
factors |
|
variables |
|
|||||||
factors |
|
S |
A |
B |
T |
|
nu |
20 |
nf |
4 |
nv |
3 |
|||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||
number of modalities |
|
20 |
2 |
2 |
2 |
|
design |
|
S<A*B>*T → |
V1,V2,V3 |
|
|
|||||||
|
|
|
|
|
|
|
weight |
|
values |
of |
variables |
|
||||
|
|
1 |
|
|
nf |
|
|
|
|
1 |
|
nv |
||||
description |
1 |
1 |
1 |
1 |
1 |
|
1 |
|
1 |
3.9 |
3.2 |
7.5 |
||||
of units |
|
1 |
1 |
1 |
2 |
|
1 |
|
|
5.5 |
6.0 |
7.4 |
||||
according |
|
: |
: |
: |
: |
|
: |
|
|
: |
: |
: |
||||
to factors |
|
: |
: |
: |
: |
|
: |
|
|
: |
: |
: |
||||
|
nu |
20 |
2 |
2 |
2 |
|
1 |
|
nu |
5.4 |
8.9 |
7.1 |
||||
3.
GENERAL FEATURES OF THE LID LANGUAGE
3.1. The LID request* of analysis
Once a basic protocol has been declared
according to the preceding structure, the LID language enables one to proceed
to requests of analysis. In other words, the basic protocol is a statistical
data base whose indexation System is constituted by the factors, and LID a
language for interrogating such a data base. Briefly stated, a request defines
a derived protocol, the operations to reach it and what is to be done
with it. The general syntax of a LID request is "keyword
formula → varlist":
The
language for interrogating data - L. I. D.
463
. the
formula characterizes the derived protocol, that is on one hand its
structure (units, factors) and on the other hand the type of derivation to be
used to calculate the variable values from the
basic protocol.
. the varlist part specifies which
variables of the basic protocol are selected for this derived protocol.
. the
keyword indicates which procedure to apply to the derived protocol, in
particular: representation as a graph or as a table, various statistical
calculations, sending to a file, etc.
3.2. Notion of derived protocol
3.2.1. Structure of a derived
protocol
As any protocol, a derived protocol is constituted by
a set of units. With each of these derived units
is associated the part of all basic units that correspond to it; what we
call a present derived unit is one for which this part is not empty. The structure of a derived protocol is
then obtained by the two following
elementary operations: restriction
of the set of basic units and pooling
of the remaining basic units for each derived unit.
Each derived unit
has a label expressed in terms of the factors of the derived protocol
(that are a subset of the basic factors), by a structure such as the one showed
in the following table, which may be read "a1
and b2 and t1":
S |
A |
B |
T |
/ |
a1 |
b2 |
t1 |
The part of the basic units associated with
this derived unit is constituted by all basic units simultaneously indexed by
the modalities a1, b2 and t1, so 5 basic units in this case: "s6a1b2t1", "s7a1b2t1", "s8a1b2t1", "s9a1b2t1" and "s10a1b2t1".
3.2.2. Calculation of values associated with a derived unit
As far as set-theoretic LID formulae are concerned (see section
4.) the numerical value for a derived unit and a given variable will
typically be a central value of the part of the associated basic units: mean, mode, median, for example. The choice among
these options will be introduced in LID using another keyword on the right side
of a request. For the time being only the weighted mean option is considered.
As we will see in section 5., linear formulae
introduce other modes of derivation, such as residual derivations (i.e.
deviations from means).
3.3. Symbols of the LID language
A LID request is made of symbols that are either
single characters or character strings. These symbols
may be grouped in 5 categories:
Operands: fact =
factor designated by an upper letter, such as "A"
Mod =
modality of a factor, made of a lower letter
followed by
a modality number, such as "al",
var =
variable designated by the letter "V* and the
variable
number, for example "V2".
Set-theoretic ^ = modality concatenation
("and"); by definition
operators:
this symbol is not written in actual formulae.
_ = modalities pooling
("or") in a part.
, =
parts separator in a family of parts.
& = composition of 2 family of parts.
* =
crossing of 2 family of parts.
<> =
nesting of a factor within a family of parts.
Linear operators: ( ) =
within derivation.
. = interaction dérivation.
: = contrast
dérivation.
Selection operator: → =
variables sélection.
464
J.-M. Bernard, R. Baldy
and H. Rouanet
Keyword (this list
is not exhaustive): "TAB", "GRA", "HIST",
"WEIG", "MEAN", "MIN",
"MAX", "MED"
"STD", "DF", "SSP", "MSP",
"CORR", "FILE", etc.
The interpretation of a formula
is performed accord in g to the relative priorities of the operators, as no symbol allows to force these
priorities such as the parentheses in a mathematical formula. The following table shows these
priorities:
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
→ |
( ) |
<> |
: |
. |
* |
& |
, |
- |
^ |
We will now present the LID
requests, first considering the core of the request, i.e. the formula, which can be either
set-theoretic or linear. We will then consider the varlist component and at
last the possible operations on a
derived protocol indicated by the keyword. The language will first
be presented intuitively with
examples; its formal synthetic description will later be given in section
8.
4.
SET-THEORETIC FORMULAE
A set-theoretic (ST) formula
only uses as operators the set-theoretic ones given in section 3.3. As mentioned earlier, a ST formula
leads to the calculation of weighted means associated with each derived unit. For this reason we
will here only tell how the formula determines the labels of these derived units.
The operators "^",
"_" and "," and the operands of a "mod" type
define an elementary set-theoretic language that generates family of parts. Let us go through these
elementary formulas and their interpretation:
s1 designates an elementary
modality, i.e. a modality of a basic factor; generates 1 derived unit with label "s1".
s1t1 the concatenation operation
allows to create compound modalities, obtained by the composition of different factors; generates 1
derived unit with label "s1 and t1".
s1,s2 the "_" operation
allows the pooling of elementary or compound modalities that pertain to the same factors; it defines a part;
generates 1 derived unit with label "s1 or s2".
s1,s2_s3 the "," operator
separates parts (i.e. derived units) and
defines a family of parts; the labels of the different parts must contain the
same factors; this formula generates 2 units with respective labels
"s1" and "s2 or s3".
The others set-theoretic elements
of the language don't introduce any other formal object but allow a more concise definition of
family of parts.
A designates the family of parts
constituted by the enumeration of the factor modalities; this formula is equivalent to
"al,a2".
a1,a3&T the "&" operation
combines two family of parts that do not include the same factors into a single
one; each part of this new family is obtained by combining a part of the first one and a part of the
second one but only the present derived units are finally kept; this formula can then be rewritten as
"s1t1,s3t1,s1t2,s3t2".
A*T the crossing operator
"*" operates as the composition one "&" but also
induces the checking of
the crossing relation between the two families of parts, i.e. that each generated derived unit is
present.
S<a1,a2> the nesting operator "<>"
is not symmetrical and operates on a factor (the
"nested" written outside the "<>") and a family of parts (the "nesting"
written inside); again it operates
as "&" but also induces the checking of the nesting relation,
i.e. that each
The
language (or interrogating data- L. t. 0.
465
modality
of the factor generates a present derived unit for, at most, one part of the
nesting family of parts.
5.
LINEAR FORMULAE
The three linear operators "( )",
"." and ":* introduce other modes of derivation than weighted
averaging. A linear formula may be
divided into two components: an associated ST formula that defines the labels
of the derived units, and a particular mode of derivation for the calculation
of values associated with each unit. The validity of a linear formula is
only determined by that of the associated
ST one.
S(A) the within derivation operator "( )" is not
symmetrical; it requires a factor (outside the "( )") and a family of
parts (inside the "( )").
One obtains the associated ST formula by replacing "( )" by
"&". This derived protocol structure is then the same as that of "S&A".
The values for "S(A)" are calculated by difference between those of
the derived protocols "S&A" and "A".
A B the interaction derivation operator "."
operates on any couple of formulae. The associated ST formula is obtained by
replacing"." by "*", as well as the two composing formulae
by their own associated ST one. The structure of this derived protocol is then the same as that of " A*B".
Values are the residual values of "A*B" once the effects of "A"
and "B" have been taken out (for a definition of interaction see
[l]).
S:tl,t£ the contrast derivation operator ":" is not
symmetrical and operates on any formula (on its left side) and a family of 2
parts (on its right side). Apart from that, it may be used anywhere a
"*" can be. The family of two parts defines a contrast, here on the T factor. The associated ST formula is
the one associated with the left formula, here "S" Values are
calculated by difference between those of the two derived protocols "S*t1"
and "S*t2".
6.
SELECTION OF VARIABLES
Any formula is followed by the "→"
operator and the list of variables (separated by ",") to be considered for the derived protocol, for example
"S → V1,V3".
Considering again the definition
of a protocol, we see that with each
unit and each factor is associated the
modality number of this factor for this unit, just as with each unit and each
variable is associated a numeric
value; we may then speak about the variable associated with a factor, which we
will denote as the factor itself.
These variables may equally appear in the "varlist". In that case
though, the fundamental distinction between factors and
variables is still kept since averaging other modality numbers is not allowed.
7.
SOME EXAMPLES OF SET-THEORET1C AND LINEAR FORMULAE
S<A*B>*T→ V1,V2,V3 The basic protocol.
S<a1b1,a2b2>*t1→ V1 Values of V1 for the occasion t1 for each
subject of groups
a1b1 and a2b2.
A*B→A,V1
For each group albl, alb2, a2bl, a2b2, values of variables
A
and V1; the graph of this
protocol is the classical
"interaction
diagram" between A and B
for V1.
t1,t2 → V3,V2
Grand mean of variables V3 and
V2.
S(a1b2) → V1
For variable V1, deviation of the mean of each
subject of
Group
a1b2 from the grand
mean of this group.
S(A&B).T → V1
For variable V1, within-group residual protocol.
T:a1b1,a1b2 →
V2 For variable V2, for each occasion t1 and t2,
difference
between
means of groups a1b1 and a1b2.
A*B:t1,t2 → V1
For variable V1, for each of the four groups of
subjects,
Difference between mean scores of occasions
t2 and t1.
466
J.-M. Bernard. R. Baldy and H. Rouanet
8. SUMMARY
OF THE FORMAL SYNTAX OF LID REQUESTS
We now give a more formal description of the LID language that shows how
to construct a request from the various objects of the language and these from
the initial objects given in section 3.3.
Formal rewriting |
rule |
Conditions of
validity |
modcomp →
→ part
→
→ famipar → → → → → → |
mod modcomp^mod modcomp part_modcomp part famipar ,part fact fact<famipar> famipar*famipar famipar&famipar |
different factors (albl) same factors (a1b1_a2b2) same factors
(a1b1,a2b2) nesting relation crossing relation different factors |
formula → → → → → |
famipar fact(famipar)
formula. formula formula*formula formula:famipar |
factor not present in famipar crossing relation crossing relation crossing rel., family of 2 parts |
varlist → → → |
fact var varlist, varlist |
|
request → |
keyword formula → varlist |
9. REPRESENTATION OF A DERIVED
PROTOCOL
We have just seen how a LID formula defines a derived
protocol. A keyword appearing on the left side of a formula allows the
selection of a mode of representation for this derived protocol or to realise
some operations on it. These keywords may not be considered as proper elements of
the LID language as they basically
constitute output commands of a computer program using LID, EYE-LID1 in the
present case. For this reason we will only give here a few typical keywords
implemented in EYE-LID 1.
Representation
of the derived protocol:
‘TAB’ The derived protocol is represented as a table
of values indexed by derived units
labels.
‘GRA’ The derived protocol is
displayed as a graph; it is then possible to modify this
graph interactively, for example to select graphical
attributes according to the
labels
of the derived units.
HIST Histogram of the values of the derived protocol for a
univariate protocol.
Statistical calculation on the derived protocol:
'MEAN' ('MED') Mean (median) for each variable
'MIN' ('MAX') Minimum (maximum) value for each variable
'Q1' ('Q3') First (third) quartile for each variable
'SD'
Standard deviation for each variable
'DF'
Number of degrees of freedom of the formula
'SSP' Su m» of squares and products matrix
'MSP' Mean squares and products matrix
'CORR' Correlation matrix
The language for interrogating data - L.I.D.
467
Re-use of the derived protocol:
‘FILE’ The derived protocol is sent to a disk file for a later use by the
program as a new
basic protocol.
10. BRIEF PRESENTATION OF
EYE-LID 1
The program EYE-LID 1 that has been developed by two
of the authors (see [6]), is the first computer program that fully implements
the LID language with all the specifications that have been presented here.
Three major points deserve to be mentioned.
First, for the implementation of LID into the
program, tools such trees, lists and recursivity have been constantly used;
this has led us to choose the C programming language: a recursive language that
allows definition of structures and the use of pointers. It also allows a
completely dynamic storage allocation, and so EYE-LID 1 does not put a priori
limits on the size or the complexity of data and requests.
Secondly, the formal objects of the LID language are
still present inside the various representation modules. In particular, the
graphic module gives access to a language of graphic commands which is
compatible with LID and allows for interactive modifications of the graphs.
In the end, for the specific features of the program
let us mention its window system and its availability for both compatible micro
computers (operating under MS-DOS) and larger computers (operating under UNIX).
Let us conclude with three output samples from EYE-LID
1:
TAB
S<A*b1>:t2,t1 ->V1,V2,V3
│
COMBINATORIAL PROBABILITY
LOGIC
A
a1 a2 a1 a2 a1 a2
B
b1 b1 b1 b1 b1 b1
S
s1 |
1.600 | 2.800 | -0.100
s2 |
2.000 | 1.300 | 1.200
s3 |
0.900 | -1.200 | -0.200
s4 |
1.400 | 2.300 | 0.200
s5 |
-0.200 | 0.300 | 2.300
s11 | -1.200 | -0.800 |
-2.800
s12 | 4.700 | 1.500 |
1.200
s13 | 0.500 | 0.300 |
0.700
s14 | -0.700 | 2.000 |
-0.200
s15 | 1.000 |
2.600 | 1.500
SSP
S(A*B).T ->V1,V2,V3
Sums of
squares & products matrix:
COMBINAT PROBABIL
LOGIC
COMBINAT |19.02600
PROBABIL | 4.70200
|22.72800
LOGIC | 1.39500 | 3.16800
|27.86400
468
J.-M. Bernard, R.
Baldy and H. Rouanet
REFERENCES
[1] Rouanet H., Lépine D. (1977) « L'analyse des comparaisons
pour le traitement des données expérimentales. »,
Informatique et Sciences Humaines, 33-34.
[2] Rouanet H., Lépine D., Lebeaux M.-O.
(1977) « L'approche algébrique de l'analyse des données expérimentales : principales
réalisations informatiques. », Journées "Analyse des données et Informatique", INRIA.
[3J Duquenne V. (1976) "Un programme de
description de données.", Cahiers de Psychologie, 19, ppl09-118.
[4] Rouanet H., Lépine D., Pelnard-Considére J. (1976) "Bayes-fiducial
procedures as practical substitutes for misplaced significance
testing: An application to educationnal data". In D. N. M. de Gruijter
& L. J. T. Van der Kamp (Eds.), Advances in psychological and educationnal measurement, New York: Wiley.
[5] Rouanet H. (1988)
"Some aspects of Bayesian multivariate analysis", Communication to the Multivariate section of the
Royal Statistical Society, London.
[6] Bernard J.-M., Baldy R.
(1986) "EYE-LID 1: A new program for graphical inspection of multivariate
data.", Seminar of the Biometric's Unit, Institute
of Psychiatry, University of London.