DATA ANALYSIS AND INFORMATICS, V

E. Diday (editor)

THE LANGUAGE FOR INTERROGATING DATA - L.I.D. -

Jean-Marc BERNARD*, Robert BALDY** and Henry ROUANET*

* Groupe Mathématiques et Psychologie (CNRS, UA 1201), Université René Descartes, 12 rue Cujas, 75005 Paris, France.

** Institute of Psychiatry, University of London, Great-Britain; now at Quantime Ltd., London, Great-Britain.

The LID language can be characterised as an interrogation language for a statistical data base. In this paper we present the underlying concepts of LID. Starting from a basic data set, a LID request first generates a derived data set by restriction, pooling, averaging, and residual derivations, then specifies the operations to be performed on it: representation as a table or as a graph, statistical calculations, etc. Finally, we sketch the program EYE-LID 1 †, in which emphasis has been put on graphical representations.

1. INTRODUCTION

The language for interrogating data (L.I.D.) presented in this communication constitutes the command language of the computer program EYE-LID 1. This program, recently developed by R. Baldy and J.-M. Bernard, is a multivariate data analysis package in which the emphasis has been put on descriptive statistics and graphical representations. This paper is mainly concerned with the LID language itself, as the program EYE-LID 1 will be presented during the software demonstrations of this meeting.

The LID language can be described as a language for the analysis of a structured statistical data base (protocol): the user issues requests of analysis, written as formulae of the language, which initiate the search for the relevant data and their representation (tables, graphs, etc.), as well as the statistical procedures to be applied to them.

The LID language is the latest in a series of command languages devised for the statistical analysis of structured data, such as the ones implemented in the VAR3 (see [1] and [2]) and STEL (see [3]) programs. It is to be noted that VAR3 has become a "routine" analysis of variance program for a whole community of experimental psychologists in this country. This experience has demonstrated the value of a data interrogation language in which researchers can accurately translate their own questions, within the frame of the experimental design. While elaborating the LID language we have enlarged the set of acceptable types of data sets, enriched the language itself, and merged the descriptive approach of exploring data with the traditional inferential approach of analysis of variance.

We will first present the underlying concepts of the LID language (protocol, factors, variables), then the request formulae (set-theoretic or linear), and the options for the representation of the results. Finally, we will briefly describe the program EYE-LID 1.

2. FUNDAMENTAL CONCEPTS

2.1. The "INOP" data set, an example

The fundamental concepts on which the LID language is based will be exemplified on a simple statistical data set borrowed from the "INOP" survey (see [4] and [5]). Twenty subjects (school children) are divided into 4 groups of 5 pupils each, according to teaching method (modern vs traditional) and social environment (underprivileged vs privileged). The children have been tested on 2 occasions (middle and end of school year). On each occasion 3 variables have been measured, yielding a combinatorial, a probability and a logic score.

† The program EYE-LID 1 presented in this paper has been partly supported by a French-British (CNRS - SSRC) joint project (ATP # 95.51.91).

2.2. Protocol, factors, units and variables

The description of this data structure, that is the protocol, first consists in the identification of the protocol factors and of their modalities:

. factor "subjects" S with 20 modalities: (s1 to s20)

. factor "teaching method" A with 2 modalities: modern (a1), traditional (a2)

. factor "social environment" B with 2 modalities: underprivileged (b1), privileged (b2)

. factor "occasions" T with 2 modalities: middle (t1), end (t2) of school year

The set of all relations between factors constitutes the protocol design. This design can be indicated by a LID formula by means of the factors, the nesting symbol "<>", the crossing symbol "*", and the composition symbol "&" (see section 4.). Several designs may be envisaged depending on the specification of relations: for example the design S&A&B&T only states that data are indexed by the composition of factors S, A, B and T, whereas the design S<A*B>*T tells us, in addition, that S, nested in the crossing of A and B, is itself crossed with T. On the right side of this formula one indicates the variables designated by V1, V2, V3, hence the protocol design: S<A*B>*T→V1,V2,V3.

Each observed combination of the factor modalities is called a unit; the set of these 40 units (20 modalities for S crossed with 2 modalities for T) constitutes the protocol support. With each unit is associated its identification by factor modalities, its values for all variables, and its weight (equal to 1 for an elementary protocol). The following diagram summarizes all the necessary components for the definition of a protocol:

   Number of: units (nu), factors (nf), variables (nv)
   Factors and their numbers of modalities
   Design: S<A*B>*T → V1,V2,V3

            description of units
            according to factors      weight    values of variables
   unit     1    ...    nf                      1     ...     nv
   1        1    1    1    1          1         3.9    3.2    7.5
   2        1    1    1    2          1         5.5    6.0    7.4
   :        :    :    :    :          :         :      :      :
   nu       20   2    2    2          1         5.4    8.9    7.1
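As an illustration of the components listed above, the support of the INOP protocol can be sketched in Python (a language chosen here for illustration only; EYE-LID 1 itself is written in C). The names `GROUPS`, `OCCASIONS` and `build_support` are hypothetical, and the allocation of subjects to groups (s1 to s5 in a1b1, s6 to s10 in a1b2, s11 to s15 in a2b1, s16 to s20 in a2b2) is an assumption consistent with the examples of this paper. Values of the variables are left out because the full data set is not reproduced here.

```python
from itertools import product

# Assumed allocation of the 20 subjects to the 4 groups (5 pupils each).
GROUPS = {
    ("a1", "b1"): range(1, 6),
    ("a1", "b2"): range(6, 11),
    ("a2", "b1"): range(11, 16),
    ("a2", "b2"): range(16, 21),
}
OCCASIONS = ("t1", "t2")


def build_support():
    """Return the units of the design S<A*B>*T, one dict per unit."""
    units = []
    for (a, b), subjects in GROUPS.items():
        for s, t in product(subjects, OCCASIONS):
            # Weight 1: every unit of an elementary protocol counts once.
            units.append({"S": f"s{s}", "A": a, "B": b, "T": t, "weight": 1})
    return units


support = build_support()
print(len(support))   # number of units: 20 subjects x 2 occasions
print(support[0])     # first unit, identified by its factor modalities
```

Each subject appears in exactly one group (nesting of S in A*B) and on both occasions (crossing with T), which is what the design formula states.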

3. GENERAL FEATURES OF THE LID LANGUAGE

3.1. The LID request of analysis

Once a basic protocol has been declared according to the preceding structure, the LID language enables one to proceed to requests of analysis. In other words, the basic protocol is a statistical data base whose indexing system is constituted by the factors, and LID is a language for interrogating such a data base. Briefly stated, a request defines a derived protocol, the operations needed to reach it, and what is to be done with it. The general syntax of a LID request is "keyword formula → varlist":


. the formula characterizes the derived protocol, that is, on the one hand its structure (units, factors) and on the other hand the type of derivation to be used to calculate the variable values from the basic protocol.

. the varlist part specifies which variables of the basic protocol are selected for this derived protocol.

. the keyword indicates which procedure to apply to the derived protocol, in particular: representation as a graph or as a table, various statistical calculations, sending to a file, etc.

3.2. Notion of derived protocol

3.2.1. Structure of a derived protocol

As any protocol, a derived protocol is constituted by a set of units. With each of these derived units is associated the part of all basic units that correspond to it; what we call a present derived unit is one for which this part is not empty. The structure of a derived protocol is then obtained by the two following elementary operations: restriction of the set of basic units and pooling of the remaining basic units for each derived unit.

Each derived unit has a label expressed in terms of the factors of the derived protocol (which are a subset of the basic factors), by a structure such as the one shown in the following table, which may be read "a1 and b2 and t1":

The part of the basic units associated with this derived unit is constituted by all basic units simultaneously indexed by the modalities a1, b2 and t1, so 5 basic units in this case: "s6a1b2t1", "s7a1b2t1", "s8a1b2t1", "s9a1b2t1" and "s10a1b2t1".
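The association between a derived unit and its part of basic units can be sketched as a simple filter. This is only an illustration in Python: the helper `part_of` and the allocation of subjects to groups in `GROUPS` are assumptions made for the sketch, not elements of LID.

```python
# Assumed allocation of subjects to groups; only factor modalities are stored.
GROUPS = {("a1", "b1"): range(1, 6), ("a1", "b2"): range(6, 11),
          ("a2", "b1"): range(11, 16), ("a2", "b2"): range(16, 21)}
SUPPORT = [{"S": f"s{s}", "A": a, "B": b, "T": t}
           for (a, b), subjects in GROUPS.items()
           for s in subjects for t in ("t1", "t2")]


def part_of(support, **modalities):
    """Basic units simultaneously indexed by all the given modalities."""
    return [u for u in support
            if all(u[factor] == m for factor, m in modalities.items())]


# Derived unit labelled "a1 and b2 and t1".
part = part_of(SUPPORT, A="a1", B="b2", T="t1")
labels = ["".join(u[f] for f in "SABT") for u in part]
print(labels)
```

A derived unit is "present" exactly when the list returned by `part_of` is not empty.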

3.2.2. Calculation of values associated with a derived unit

As far as set-theoretic LID formulae are concerned (see section 4.) the numerical value for a derived unit and a given variable will typically be a central value of the part of the associated basic units: mean, mode, median, for example. The choice among these options will be introduced in LID using another keyword on the right side of a request. For the time being only the weighted mean option is considered.
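The weighted mean option can be sketched as follows. The function name `weighted_mean` is hypothetical, and the numerical values are purely illustrative: they are not taken from the INOP data, and a weight other than 1 is used only to exercise the formula.

```python
def weighted_mean(units, variable):
    """Weighted mean of a variable over the basic units of a part."""
    total_weight = sum(u["weight"] for u in units)
    if total_weight == 0:
        # An absent derived unit has an empty part: nothing to average.
        raise ValueError("absent derived unit: no basic unit to average")
    return sum(u["weight"] * u[variable] for u in units) / total_weight


# Illustrative part of three basic units (values invented for the example).
part = [
    {"weight": 1, "V1": 4.0},
    {"weight": 1, "V1": 6.0},
    {"weight": 2, "V1": 5.0},
]
print(weighted_mean(part, "V1"))   # (4 + 6 + 2*5) / 4
```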

As we will see in section 5., linear formulae introduce other modes of derivation, such as residual derivations (i.e. deviations from means).

3.3. Symbols of the LID language

A LID request is made of symbols that are either single characters or character strings. These symbols may be grouped in 5 categories:

Operands:

   fact = factor designated by an upper-case letter, such as "A".
   mod = modality of a factor, made of a lower-case letter followed by a modality number, such as "a1".
   var = variable designated by the letter "V" and the variable number, for example "V2".

Set-theoretic operators:

   ^ = modality concatenation ("and"); by definition this symbol is not written in actual formulae.
   _ = modalities pooling ("or") in a part.
   , = parts separator in a family of parts.
   & = composition of 2 families of parts.
   * = crossing of 2 families of parts.
   <> = nesting of a factor within a family of parts.

Linear operators:

   ( ) = within derivation.
   . = interaction derivation.
   : = contrast derivation.

Selection operator:

   → = variables selection.


Keywords (this list is not exhaustive): "TAB", "GRA", "HIST", "WEIG", "MEAN", "MIN", "MAX", "MED", "STD", "DF", "SSP", "MSP", "CORR", "FILE", etc.

A formula is interpreted according to the relative priorities of the operators; the language has no symbol, such as the parentheses of a mathematical formula, for overriding these priorities. The following table shows these priorities:

   Priority    1    2     3    4    5    6    7    8    9    10
   Operator    →   ( )   <>    :    .    *    &    ,    _    ^

We will now present the LID requests, first considering the core of the request, i.e. the formula, which can be either set-theoretic or linear. We will then consider the varlist component and finally the possible operations on a derived protocol indicated by the keyword. The language will first be presented intuitively with examples; its formal synthetic description will be given later in section 8.

4. SET-THEORETIC FORMULAE

A set-theoretic (ST) formula only uses as operators the set-theoretic ones given in section 3.3. As mentioned earlier, an ST formula leads to the calculation of weighted means associated with each derived unit. For this reason we will only describe here how the formula determines the labels of these derived units.

The operators "^", "_" and "," and the operands of the "mod" type define an elementary set-theoretic language that generates families of parts. Let us go through these elementary formulae and their interpretation:

s1          designates an elementary modality, i.e. a modality of a basic factor; generates 1 derived unit with label "s1".

s1t1        the concatenation operation creates compound modalities, obtained by the composition of different factors; generates 1 derived unit with label "s1 and t1".

s1_s2       the "_" operation pools elementary or compound modalities that pertain to the same factors; it defines a part; generates 1 derived unit with label "s1 or s2".

s1,s2_s3    the "," operator separates parts (i.e. derived units) and defines a family of parts; the labels of the different parts must contain the same factors; this formula generates 2 units with respective labels "s1" and "s2 or s3".

The other set-theoretic elements of the language do not introduce any other formal object but allow a more concise definition of families of parts.
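The elementary language built on "^", "_" and "," can be sketched as a small parser. This is a minimal illustration in Python, not the implementation used in EYE-LID 1; the function names `parse_modcomp`, `parse_family` and `label` are hypothetical. Splitting on "," before "_" reflects the fact that "," separates whole parts while "_" pools modalities inside one part.

```python
import re

MOD = re.compile(r"([a-z])(\d+)")   # a modality: lower-case letter + number


def parse_modcomp(text):
    """'s1t1' -> {'s': 1, 't': 1}; concatenation is the implicit 'and'."""
    mods = MOD.findall(text)
    if not mods or "".join(letter + num for letter, num in mods) != text:
        raise ValueError(f"not a compound modality: {text!r}")
    factors = [letter for letter, _ in mods]
    if len(set(factors)) != len(factors):
        raise ValueError(f"a factor is repeated in {text!r}")
    return {letter: int(num) for letter, num in mods}


def parse_family(text):
    """Split on ',' (parts), then on '_' (modalities pooled within a part)."""
    family = [[parse_modcomp(m) for m in part.split("_")]
              for part in text.split(",")]
    # All labels of a family must be expressed on the same factors.
    if len({frozenset(m) for part in family for m in part}) != 1:
        raise ValueError("all labels must involve the same factors")
    return family


def label(part):
    """Readable label of a part, e.g. 's2 or s3' or 's1 and t1'."""
    return " or ".join(" and ".join(f"{f}{n}" for f, n in m.items())
                       for m in part)


print([label(p) for p in parse_family("s1,s2_s3")])
print(label(parse_family("s1t1")[0]))
```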

A           designates the family of parts constituted by the enumeration of the factor modalities; this formula is equivalent to "a1,a2".

s1,s3&T     the "&" operation combines two families of parts that do not include the same factors into a single one; each part of this new family is obtained by combining a part of the first one and a part of the second one, but only the present derived units are finally kept; this formula can then be rewritten as "s1t1,s3t1,s1t2,s3t2".

A*T         the crossing operator "*" operates as the composition operator "&" but also induces the checking of the crossing relation between the two families of parts, i.e. that each generated derived unit is present.

S<a1,a2>    the nesting operator "<>" is not symmetrical and operates on a factor (the "nested" one, written outside the "<>") and a family of parts (the "nesting" one, written inside); again it operates as "&" but also induces the checking of the nesting relation, i.e. that each modality of the factor generates a present derived unit for, at most, one part of the nesting family of parts.
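The three operators "&", "*" and "<>" differ only in the check they apply to the combined parts, which the following Python sketch tries to make explicit. It is an illustration under assumptions: the function names are hypothetical, the allocation of subjects to groups is assumed, and a part is represented as a list of compound modalities (dicts mapping a factor to a modality).

```python
# Support of S<A*B>*T (factor modalities only); assumed group allocation.
GROUPS = {("a1", "b1"): range(1, 6), ("a1", "b2"): range(6, 11),
          ("a2", "b1"): range(11, 16), ("a2", "b2"): range(16, 21)}
SUPPORT = [{"S": f"s{s}", "A": a, "B": b, "T": t}
           for (a, b), subjects in GROUPS.items()
           for s in subjects for t in ("t1", "t2")]


def factor_family(factor):
    """Family of parts enumerating the modalities of a factor (formula 'A')."""
    modalities = dict.fromkeys(u[factor] for u in SUPPORT)  # first-seen order
    return [[{factor: m}] for m in modalities]


def is_present(part):
    """A part is present if some basic unit matches one of its modalities."""
    return any(all(u[f] == m for f, m in comp.items())
               for u in SUPPORT for comp in part)


def candidates(fam1, fam2):
    """Combine every part of fam1 with every part of fam2 (fam2 slowest)."""
    return [[{**c1, **c2} for c1 in p for c2 in q] for q in fam2 for p in fam1]


def compose(fam1, fam2):
    """Operator '&': keep only the present derived units."""
    return [part for part in candidates(fam1, fam2) if is_present(part)]


def cross(fam1, fam2):
    """Operator '*': as '&', but every generated unit must be present."""
    parts = candidates(fam1, fam2)
    if not all(is_present(part) for part in parts):
        raise ValueError("crossing relation not satisfied")
    return parts


def nest(factor, family):
    """Operator '<>': as '&', but each modality of the factor may be
    present in at most one part of the nesting family."""
    parts = compose(factor_family(factor), family)
    nested = [part[0][factor] for part in parts]
    if len(nested) != len(set(nested)):
        raise ValueError("nesting relation not satisfied")
    return parts


def label(part):
    return "_".join("".join(comp.values()) for comp in part)


s1_s3 = [[{"S": "s1"}], [{"S": "s3"}]]
print([label(p) for p in compose(s1_s3, factor_family("T"))])
print(len(cross(factor_family("A"), factor_family("T"))))
print(len(nest("S", factor_family("A"))))
```

With this support, crossing S with A or nesting S in T would raise an error, since each subject belongs to a single teaching method and is observed on both occasions.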

5. LINEAR FORMULAE

The three linear operators "( )", "." and ":" introduce other modes of derivation than weighted averaging. A linear formula may be divided into two components: an associated ST formula that defines the labels of the derived units, and a particular mode of derivation for the calculation of values associated with each unit. The validity of a linear formula is only determined by that of the associated ST one.

S(A)        the within derivation operator "( )" is not symmetrical; it requires a factor (outside the "( )") and a family of parts (inside the "( )"). One obtains the associated ST formula by replacing "( )" by "&". The structure of this derived protocol is then the same as that of "S&A". The values for "S(A)" are calculated by difference between those of the derived protocols "S&A" and "A".

A.B         the interaction derivation operator "." operates on any pair of formulae. The associated ST formula is obtained by replacing "." by "*", as well as the two composing formulae by their own associated ST ones. The structure of this derived protocol is then the same as that of "A*B". Values are the residual values of "A*B" once the effects of "A" and "B" have been taken out (for a definition of interaction see [1]).

S:t1,t2     the contrast derivation operator ":" is not symmetrical and operates on any formula (on its left side) and a family of 2 parts (on its right side). Apart from that, it may be used anywhere a "*" can be. The family of two parts defines a contrast, here on the T factor. The associated ST formula is the one associated with the left formula, here "S". Values are calculated by difference between those of the two derived protocols "S*t1" and "S*t2".
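The three modes of derivation can be sketched numerically as follows. This Python illustration rests on several assumptions: the function names are hypothetical, all values are invented for the example, the interaction residual is written for a balanced design with equal weights (cell mean minus row mean minus column mean plus grand mean), and the sign convention of the contrast (first part minus second part) is a choice made for the sketch.

```python
def within(values, group_of, group_values):
    """S(A): value of each unit of S&A minus the value of its group in A."""
    return {u: v - group_values[group_of[u]] for u, v in values.items()}


def interaction(cell_means):
    """A.B, balanced case: m_ab - m_a. - m_.b + m_.. for each cell (a, b)."""
    a_levels = sorted({a for a, _ in cell_means})
    b_levels = sorted({b for _, b in cell_means})
    grand = sum(cell_means.values()) / len(cell_means)
    row = {a: sum(cell_means[a, b] for b in b_levels) / len(b_levels)
           for a in a_levels}
    col = {b: sum(cell_means[a, b] for a in a_levels) / len(a_levels)
           for b in b_levels}
    return {(a, b): m - row[a] - col[b] + grand
            for (a, b), m in cell_means.items()}


def contrast(first, second):
    """F:p1,p2: difference, unit by unit, between the values of F*p1 and F*p2
    (assumed sign convention: first minus second)."""
    return {u: first[u] - second[u] for u in first}


# Invented values, for illustration only.
print(within({"s1": 3.0, "s2": 5.0}, {"s1": "a1", "s2": "a1"}, {"a1": 4.0}))
cells = {("a1", "b1"): 4.0, ("a1", "b2"): 6.0,
         ("a2", "b1"): 5.0, ("a2", "b2"): 9.0}
print(interaction(cells))
print(contrast({"s1": 5.0, "s2": 4.0}, {"s1": 3.5, "s2": 4.5}))
```

In each case the labels of the derived units are those of the associated ST formula; only the way the values are computed changes.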

6. SELECTION OF VARIABLES

Any formula is followed by the "→" operator and the list of variables (separated by ",") to be considered for the derived protocol, for example "S → V1,V3".

Considering again the definition of a protocol, we see that with each unit and each factor is associated the modality number of this factor for this unit, just as with each unit and each variable is associated a numeric value; we may then speak about the variable associated with a factor, which we will denote by the same symbol as the factor itself. These variables may equally appear in the "varlist". In that case, though, the fundamental distinction between factors and variables is still kept, since averaging over modality numbers is not allowed.

7. SOME EXAMPLES OF SET-THEORETIC AND LINEAR FORMULAE

S<A*B>*T → V1,V2,V3     The basic protocol.

S<a1b1,a2b2>*t1 → V1    Values of V1 for the occasion t1 for each subject of groups a1b1 and a2b2.

A*B → A,V1              For each group a1b1, a1b2, a2b1, a2b2, values of variables A and V1; the graph of this protocol is the classical "interaction diagram" between A and B for V1.

t1_t2 → V3,V2           Grand mean of variables V3 and V2.

S(a1b2) → V1            For variable V1, deviation of the mean of each subject of group a1b2 from the grand mean of this group.

S(A&B).T → V1           For variable V1, within-group residual protocol.

T:a1b1,a1b2 → V2        For variable V2, for each occasion t1 and t2, difference between means of groups a1b1 and a1b2.

A*B:t1,t2 → V1          For variable V1, for each of the four groups of subjects, difference between mean scores of occasions t2 and t1.


8. SUMMARY OF THE FORMAL SYNTAX OF LID REQUESTS

We now give a more formal description of the LID language that shows how to construct a request from the various objects of the language and these from the initial objects given in section 3.3.

Formal rewriting rule                    Conditions of validity

modcomp  →  mod
         →  modcomp^mod                  different factors (a1b1)

part     →  modcomp
         →  part_modcomp                 same factors (a1b1_a2b2)

famipar  →  part
         →  famipar,part                 same factors (a1b1,a2b2)
         →  fact
         →  fact<famipar>                nesting relation
         →  famipar*famipar              crossing relation
         →  famipar&famipar              different factors

formula  →  famipar
         →  fact(famipar)                factor not present in famipar
         →  formula.formula              crossing relation
         →  formula*formula              crossing relation
         →  formula:famipar              crossing relation, family of 2 parts

varlist  →  fact
         →  var
         →  varlist,varlist

request  →  keyword formula → varlist
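The last rule, "keyword formula → varlist", can be illustrated by a small function that splits a request into its three components. This Python sketch is only a first step before parsing the formula itself; the name `split_request` is hypothetical, the keyword set gathers the keywords quoted in this paper (both "STD" and "SD" appear in the text, so both are accepted here), and "->" is accepted as an ASCII spelling of "→" because the output samples of EYE-LID 1 use it.

```python
import re

# Keywords quoted in this paper; the actual list is not exhaustive.
KEYWORDS = {"TAB", "GRA", "HIST", "WEIG", "MEAN", "MED", "MIN", "MAX",
            "Q1", "Q3", "STD", "SD", "DF", "SSP", "MSP", "CORR", "FILE"}

REQUEST = re.compile(r"^\s*([A-Z][A-Z0-9]*)\s+(.+?)\s*(?:→|->)\s*(.+?)\s*$")
VAR_OR_FACT = re.compile(r"V\d+|[A-Z]")   # a variable "V2" or a factor "A"


def split_request(text):
    """Split 'keyword formula → varlist' into (keyword, formula, variables)."""
    match = REQUEST.match(text)
    if match is None:
        raise ValueError(f"not a request: {text!r}")
    keyword, formula, varlist = match.groups()
    if keyword not in KEYWORDS:
        raise ValueError(f"unknown keyword: {keyword!r}")
    variables = [v.strip() for v in varlist.split(",")]
    for v in variables:
        if not VAR_OR_FACT.fullmatch(v):
            raise ValueError(f"not a variable or a factor: {v!r}")
    return keyword, formula.replace(" ", ""), variables


print(split_request("TAB S<A*b1>:t2,t1 ->V1,V2,V3"))
print(split_request("GRA A*B → A,V1"))
```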

9. REPRESENTATION OF A DERIVED PROTOCOL

We have just seen how a LID formula defines a derived protocol. A keyword appearing on the left side of a formula selects a mode of representation for this derived protocol or an operation to be performed on it. These keywords should not be considered as proper elements of the LID language, as they basically constitute output commands of a computer program using LID, EYE-LID 1 in the present case. For this reason we will only give here a few typical keywords implemented in EYE-LID 1.

Representation of the derived protocol:

'TAB' The derived protocol is represented as a table of values indexed by derived unit labels.

'GRA' The derived protocol is displayed as a graph; it is then possible to modify this graph interactively, for example to select graphical attributes according to the labels of the derived units.

'HIST' Histogram of the values of the derived protocol for a univariate protocol.

Statistical calculation on the derived protocol:

'MEAN' ('MED') Mean (median) for each variable

'MIN' ('MAX') Minimum (maximum) value for each variable

'Q1' ('Q3') First (third) quartile for each variable

'SD' Standard deviation for each variable

'DF' Number of degrees of freedom of the formula

'SSP' Sums of squares and products matrix

'MSP' Mean squares and products matrix

'CORR' Correlation matrix


Re-use of the derived protocol:

'FILE' The derived protocol is sent to a disk file for later use by the program as a new basic protocol.

10. BRIEF PRESENTATION OF EYE-LID 1

The program EYE-LID 1, developed by two of the authors (see [6]), is the first computer program that fully implements the LID language with all the specifications that have been presented here. Three major points deserve to be mentioned.

First, for the implementation of LID in the program, tools such as trees, lists and recursion have been used constantly; this has led us to choose the C programming language, a recursive language that allows the definition of structures and the use of pointers. It also allows completely dynamic storage allocation, and so EYE-LID 1 does not put a priori limits on the size or the complexity of data and requests.

Secondly, the formal objects of the LID language are still present inside the various representation modules. In particular, the graphic module gives access to a language of graphic commands which is compatible with LID and allows for interactive modifications of the graphs.

Finally, among the specific features of the program, let us mention its window system and its availability for both compatible microcomputers (operating under MS-DOS) and larger computers (operating under UNIX).

Let us conclude with three output samples from EYE-LID 1:

TAB S<A*b1>:t2,t1 ->V1,V2,V3

     | COMBINATORIAL    | PROBABILITY      | LOGIC
A    |      a1       a2 |      a1       a2 |      a1       a2
B    |      b1       b1 |      b1       b1 |      b1       b1
s1   |   1.600          |   2.800          |  -0.100
s2   |   2.000          |   1.300          |   1.200
s3   |   0.900          |  -1.200          |  -0.200
s4   |   1.400          |   2.300          |   0.200
s5   |  -0.200          |   0.300          |   2.300
s11  |           -1.200 |           -0.800 |           -2.800
s12  |            4.700 |            1.500 |            1.200
s13  |            0.500 |            0.300 |            0.700
s14  |           -0.700 |            2.000 |           -0.200
s15  |            1.000 |            2.600 |            1.500

SSP S(A*B).T ->V1,V2,V3

Sums of squares & products matrix:

         | COMBINAT | PROBABIL | LOGIC
COMBINAT | 19.02600 |
PROBABIL |  4.70200 | 22.72800 |
LOGIC    |  1.39500 |  3.16800 | 27.86400
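As an illustration of how the matrix keywords relate to one another, a correlation matrix can be derived from the sums of squares and products matrix printed above by dividing each entry by the square root of the product of the two corresponding diagonal entries. The Python sketch below uses that standard formula; it is an assumption that the 'CORR' keyword of EYE-LID 1 computes exactly this quantity, and the names `SSP`, `NAMES` and `correlations` are introduced for the example. The upper triangle is filled in by symmetry.

```python
from math import sqrt

NAMES = ["COMBINAT", "PROBABIL", "LOGIC"]

# SSP matrix of the request "SSP S(A*B).T ->V1,V2,V3", completed by symmetry.
SSP = [
    [19.026, 4.702, 1.395],
    [4.702, 22.728, 3.168],
    [1.395, 3.168, 27.864],
]


def correlations(ssp):
    """Correlation matrix: ssp[i][j] / sqrt(ssp[i][i] * ssp[j][j])."""
    n = len(ssp)
    return [[ssp[i][j] / sqrt(ssp[i][i] * ssp[j][j]) for j in range(n)]
            for i in range(n)]


corr = correlations(SSP)
for name, row in zip(NAMES, corr):
    print(name.ljust(8), " ".join(f"{r:7.3f}" for r in row))
```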


REFERENCES

[1] Rouanet H., Lépine D. (1977) "L'analyse des comparaisons pour le traitement des données expérimentales.", Informatique et Sciences Humaines, 33-34.

[2] Rouanet H., Lépine D., Lebeaux M.-O. (1977) "L'approche algébrique de l'analyse des données expérimentales : principales réalisations informatiques.", Journées "Analyse des données et Informatique", INRIA.

[3] Duquenne V. (1976) "Un programme de description de données.", Cahiers de Psychologie, 19, pp. 109-118.

[4] Rouanet H., Lépine D., Pelnard-Considère J. (1976) "Bayes-fiducial procedures as practical substitutes for misplaced significance testing: An application to educational data". In D. N. M. de Gruijter & L. J. T. Van der Kamp (Eds.), Advances in psychological and educational measurement, New York: Wiley.

[5] Rouanet H. (1988) "Some aspects of Bayesian multivariate analysis", Communication to the Multivariate section of the Royal Statistical Society, London.

[6] Bernard J.-M., Baldy R. (1986) "EYE-LID 1: A new program for graphical inspection of multivariate data.", Seminar of the Biometrics Unit, Institute of Psychiatry, University of London.