IBM CAMBRIDGE SCIENTIFIC CENTER

TECHNICAL REPORT NO. 320-2094


MAY 1973




DESIGN CONSIDERATIONS FOR
INTEGRATED TEXT PROCESSING SYSTEMS



Charles F. Goldfarb






IBM CORPORATION
CAMBRIDGE SCIENTIFIC CENTER
545 TECHNOLOGY SQUARE
CAMBRIDGE, MASSACHUSETTS 02139




ABSTRACT

The processes involved in text processing applications are analyzed without regard to artificial distinctions between "publishing" and "information retrieval" activities. Functions performed by a system are distinguished from the means of invoking them, so that areas of commonality between processes can clearly be identified. An integrated system should have

  1. a set of operators for all generally required functions,
  2. a language and environment which allow combination and extension of operators to perform applications, and
  3. an application-independent representation of text on which the operators can operate in any sequence.

I. INTEGRATED TEXT PROCESSING

The areas of publishing and information retrieval are often thought of as comprising distinct applications for a computer system. In fact, the functions of the two overlap significantly, and both could be served best by an integrated system. (2)

This functional overlap can be observed in the manual creation of a technical report, a typical example of text processing. For example, the author's research frequently requires searching for documents on the same subject as his report. He then enters his thoughts on paper, correcting any minor or typographical mistakes as he writes. Usually, some alteration of those basic thoughts takes place as the author rearranges or modifies the text.

He then decides on an acceptable format and calculates the space requirements of any graphs or charts to be inserted. The document might then be analyzed to determine appropriate subject terms which are sorted alphabetically for an index. At any point, an error or change in thought may necessitate reiteration of any or all of these processes.

The final version of the output is a clean typescript, which will be reproduced and distributed. The report might be updated periodically, which would require repetition of the processes for preparation of addenda or a revised report.

All text processing, whether manual or computer-aided, involves the same basic processes identified in this example:

  1. Entry & Correction: Converting a document into the form processed by the system (and making necessary changes while doing so).
  2. Alteration: Changing a document that is in the system.
  3. Formatting: Determining the visual form a document will have when it is output from the system and displayed.
  4. Search: Determining which documents are associated with particular information.
  5. Analysis & Association: Deriving information about a document to be associated with it for subsequent search.
  6. Sorting.
  7. Calculation.
  8. Output: Converting a document to a form used outside of the system (for display or external processing).

While the foregoing list is not rigorous ("sorting," for example, can be subsumed under "calculation"), it provides a useful tool for analyzing text processing applications in a way which recognizes their essential commonality.

A particular text processing application, then, would emphasize one or more of these processes, sometimes tailored for a particular class of user or type of textual material. For example, the composition application emphasizes formatting and output, editing emphasizes entry and alteration, and information retrieval emphasizes search and analysis. (A tutorial analysis of these applications is presented in (1).)

Specialized information retrieval programs have been produced, for example, for military law and for technical literature; composition programs have been tailored for technical manuals and freight tariffs. The basic processes, however, are the same in all cases.

It follows, therefore, that in designing computer systems for integrated text processing, one must support these processes in a way which exploits the underlying functional commonality. At the same time, one must provide flexibility for differing application emphasis and user interface idiosyncrasies.


II. THE BASIC PROCESSES

A system design effort must begin with an examination of the basic processes. The following discussion briefly identifies the functions involved in each, and attempts to differentiate these from the means of invoking them. The latter are more likely to change from one application or user to another, while the functions themselves remain constant.

Entry & Correction

Entry & Correction refers to the complete process by which a unit of information, called a "document," is converted from a form used outside the system into the system representation of text.

Entry can involve initial keying or scanning of character-strings or pictorial material (graphics) on hard copy or microform, and then a series of transmission and translation steps. The original source could also be a computer file of formatted data, or text not in the system representation.

Entry can be offline, onto a medium (i.e., a buffer) that is later transmitted in bulk; typical devices are paper tape perforators and magnetic medium typewriters (e.g., IBM's CMCST or MTST). It can also be online via a terminal.

Alteration

Alteration is the process of making changes within a document that is in the system. It involves inserting, replacing, and deleting parts of the text.

The user might identify parts to be altered by content ("contextual search"), as well as by "line" number or similar physical addresses. When CRT terminals are employed for character-string alteration, he can use cursor positioning as well.

For altering page layouts, illustrations, and other pictorial material, a CRT terminal provides the most convenient means of expressing user requests and verifying their performance. It should be buffered, and have the equivalent of vector display capabilities, a tablet, and an alphameric keyboard.

Formatting

Formatting is the process of determining the visual representation a document will have when it is output from the system and displayed.

Operations are performed on the various structural elements of the text (headings, paragraphs, etc.) in accordance with graphic specifications supplied by the user. The resulting text contains codes which are translated during output (see below) into instructions for particular composition devices. These codes indicate such aspects of the visual representation as line endings, page endings, and hyphenation points.

Formatting processors vary greatly in complexity, depending upon such factors as the number and power of the user graphic specifications supported (e.g., font changes, tabbing, skewing, floating keeps). User requirements range from simple manuscript formatting capability, to high quality book production, to specialized production of documents such as freight tariffs and technical manuals.

A display terminal like that described in connection with alteration is also useful in formatting. It can facilitate the entry and alteration of page layout specifications, which are frequently too complex to describe conveniently with character strings. It also aids interactive proofreading of pagination results.

Search

Search is the process of analyzing a user's request for information and determining which documents satisfy his request. (This should be distinguished from "retrieval," which is the mechanical process of accessing the documents identified by the search.)

The request is formalized as Boolean combinations of statistical, positional and bibliographic attributes. The user may enter the formal request directly, or the system may derive it from analysis (see below) of a "free form" description of the information desired.

A search request can also specify a ranking or sorting algorithm for ordering of the search results.
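The formal request described above can be sketched in Python as a composition of small tests. This is only an illustration under stated assumptions: the helper names (`attribute`, `has_term`, `AND`, `OR`, `NOT`, `search`) and the three sample documents are hypothetical, and only bibliographic attributes and subject terms are modeled, not statistical or positional ones.

```python
# Hypothetical document attributes, keyed by document ID.
documents = {
    "D1": {"author": "Smith", "year": 1971, "terms": {"editing", "formatting"}},
    "D2": {"author": "Jones", "year": 1972, "terms": {"retrieval", "indexing"}},
    "D3": {"author": "Smith", "year": 1969, "terms": {"retrieval", "formatting"}},
}

def attribute(name, value):
    """Test a bibliographic attribute for equality."""
    return lambda doc: doc[name] == value

def has_term(term):
    """Test whether a subject term is associated with the document."""
    return lambda doc: term in doc["terms"]

def AND(*tests):
    return lambda doc: all(test(doc) for test in tests)

def OR(*tests):
    return lambda doc: any(test(doc) for test in tests)

def NOT(test):
    return lambda doc: not test(doc)

def search(request, documents, rank=None):
    """Return the IDs of documents satisfying request, optionally ordered by rank."""
    hits = [doc_id for doc_id, doc in documents.items() if request(doc)]
    if rank is not None:
        hits.sort(key=lambda doc_id: rank(documents[doc_id]))
    return hits

request = AND(OR(has_term("retrieval"), has_term("editing")),
              NOT(attribute("author", "Jones")))
oldest_first = search(request, documents, rank=lambda doc: doc["year"])
```

Note that `search` returns only identifiers; fetching the documents themselves would be the separate "retrieval" step.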

Analysis & Association

This is the process of deriving information about a document and associating it with the document, so that the information can be used to identify the document in a subsequent search. (This should be distinguished from the mechanical process of actually storing the fact that the document and the information are associated.)

Recall that the term "document" is relative, and simply means a logical addressable unit of information. The analysis and association process is logically the same, whether identifying one report among many in a data base, or one page among many in a book.

The associated information may be subject terms and/or bibliographic information. It can be supplied entirely by the user, the system, or some combination of the two.

For example, a concordance processor may create a "document" which is actually a list of the significant words (and their locations) in some other document. The concordance can be altered by inserting or deleting terms, until it contains only those which the user wishes to associate with the document.

Both analysis and search involve the maintenance and accessing of wordlists. The lists may contain stopwords, synonyms, concept classes, controlled vocabularies, etc. Although an application's response to a "hit" in a wordlist lookup will vary according to the nature of its contents, the lookup and maintenance functions are the same.
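As a rough illustration of the concordance example, the Python sketch below builds a list of significant words and their locations, consulting a stopword list on the way. The function name `concordance`, the definition of a word as a run of letters and apostrophes, and the use of character offsets as locations are all assumptions made for the example.

```python
import re

def concordance(text, stopwords=frozenset()):
    """Map each significant word to the character offsets at which it occurs."""
    entries = {}
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        if word not in stopwords:          # wordlist lookup: a "hit" suppresses the word
            entries.setdefault(word, []).append(match.start())
    return entries

index_terms = concordance("The system formats the text", stopwords={"the"})

# The concordance is itself a "document" and can be altered like any other,
# here by deleting a term the user does not wish to associate.
del index_terms["formats"]
```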

Sorting

Sorting and collating functions are used for a range of purposes, from alphabetizing index terms to sequencing search results.

Calculation

Arithmetic and logical operations are used in frequency calculations, search request processing, copy fitting, statistics on search argument effectiveness, etc.

Output

Output is the process by which a document in the system representation is converted to a form used outside the system, such as for display on an output device. Like entry, it may involve a series of translation and transmission operations.

These may be to an offline device, an external program, another user, or to an online terminal. The device-independent description of the external representation, which was created by formatting, is translated into a sequence of instructions and data that will cause a particular device to create the desired representation.

If the output is to an external program, formatting and output cause the creation of source data for that program.
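The translation step can be pictured with a small Python sketch. The codes `<LE>` and `<PE>`, the device names, and the control sequences assigned to each device are hypothetical; they stand in for whatever device-independent codes a formatter actually produces.

```python
LINE_END, PAGE_END = "<LE>", "<PE>"    # hypothetical device-independent codes

# Hypothetical translation tables, one per output device.
DEVICES = {
    "terminal": {LINE_END: "\n", PAGE_END: "\n\n"},
    "printer": {LINE_END: "\r\n", PAGE_END: "\f"},
}

def output(formatted_text, device):
    """Translate formatting codes into instructions for one particular device."""
    for code, instruction in DEVICES[device].items():
        formatted_text = formatted_text.replace(code, instruction)
    return formatted_text

formatted = "Title<LE>First page<PE>Second page<LE>"
```

The same formatted text serves both devices; only the table consulted during output differs.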


III. TEXT PROCESSING OPERATORS

We have seen that a text processing application can be analyzed as a series of processes. A process is a grouping of functions which are unchanged from one application to another, although the application user's view of them may vary.

Applications, Processes, and Operators

While the basic processes are a useful tool for application analysis, an application system design should have its "building blocks" at the function level. This is particularly true because the same functions are frequently involved in more than one process.

For example, in the formatting process, the width of each word in the text must be determined. The text is scanned for word delimiters, and successive words are located for examination. The same thing occurs during analysis and association, when a concordance is being created. We can therefore identify a "NEXTWORD" function as being involved in both processes.

NEXTWORD could be defined precisely as an operation which returns the next word encountered in a string of text, starting at a designated current location. In the cases described it would probably be embodied in some larger program, although it could conceivably be useful in an interactive indexing session as well.

But regardless of the process being performed, the means of invocation, or the type of document or application, the function performed by NEXTWORD is unchanged. This means it could be implemented as a service routine (i.e., command, subroutine, called function, or whatever) with a formal argument list (i.e., calling sequence, parameter string, etc.).

(Since the terminology for this concept varies from one command language or programming language to another, we will use the terms "operator" and "arguments." The value returned by an operator is its "result." All three are directly addressable units of information, or "entities.")
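A minimal Python sketch of NEXTWORD as an operator with explicit arguments and a result follows. The definition of a word (a run of letters, digits, or apostrophes) and the choice to return the new current location along with the word are assumptions; the report does not specify either.

```python
import re

# Assumption: a word is a maximal run of letters, digits, or apostrophes.
_WORD = re.compile(r"[A-Za-z0-9']+")

def nextword(text, start=0):
    """Return (word, next_location), or (None, len(text)) if no word remains."""
    match = _WORD.search(text, start)
    if match is None:
        return None, len(text)
    return match.group(), match.end()

def words(text):
    """Apply nextword repeatedly, as a formatter or concordance builder might."""
    found, location = [], 0
    while True:
        word, location = nextword(text, location)
        if word is None:
            return found
        found.append(word)
```

For example, `nextword("  The quick fox", 0)` yields the word `"The"` and the location just past it, which can be supplied as the starting location of the next invocation.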

Figure 1 illustrates the relationship between NEXTWORD, another operator called LOOKUP, and the analysis and formatting processes for a group of applications. The arguments of LOOKUP are a word and a two-dimensional table; its result is the data associated with that word in the table. The type of data, and the reason for the association, are immaterial. In formatting, the data may be the hyphenation breakpoints. In analysis, it might be the concept class of which the word is a member.

OPERATOR    PROCESS       APPLICATION
--------    ----------    ---------------
NEXTWORD    ANALYSIS      KWIC-INDEX
NEXTWORD    FORMATTING    TECH MANUALS
NEXTWORD    FORMATTING    KWIC-INDEX
LOOKUP      ANALYSIS      THESAURUS
LOOKUP      FORMATTING    BOOK PUBLISHING

Figure 1. Relationship between operators, processes, and applications
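LOOKUP can be sketched in the same spirit. Here the two-dimensional table is modeled, as an assumption, by a Python dictionary from words to associated data, and the table entries are invented for illustration.

```python
def lookup(word, table):
    """Return the data associated with word in table, or None if there is none."""
    return table.get(word.lower())

# Hypothetical tables; the entries are illustrative only.
hyphenation = {"processing": "proc-ess-ing", "system": "sys-tem"}
concept_class = {"processing": "COMPUTATION", "system": "ORGANIZATION"}

breakpoints = lookup("System", hyphenation)      # use in formatting
concept = lookup("System", concept_class)        # use in analysis
```

The operator is identical in both uses; only the table supplied as an argument, and what the caller does with the result, differ.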

Instruction Set

A functional analysis of the applications to be performed is essential in designing a set of primitive operators for a text processing system. It is also helpful to examine the instruction sets of context editors (9), composition systems (15,16), string and list processors (4), document retrieval systems (18,19), and interactive command languages (12,13,20). Representative examples are cited in the references.

Operators in the following categories will be required:

The processes, then, can be defined in terms of operators which overlap significantly from one process to another. So, too, can the operators themselves be defined in terms of a small group of "atomic" operators.

The set of primitive operators supplied in an integrated text processing system should not be restricted to the atomic operators. It should also include all of the "derived" operators which are generally applicable to the anticipated use of the system. To meet individual user requirements, a means by which the set can be extended must also be provided (see Part IV).

Data Structures

A precise definition of text processing operators requires an equally precise definition of the arguments and results -- in other words, of the view of information, or "data model," supported by the system.

Recall that each uniquely identifiable (i.e., directly accessible) piece of information is called an "entity." "Entity" is thus technically synonymous with "document," although the latter term is usually reserved for entities one would intuitively consider to be documents: those containing text and/or illustrations.

An entity can have many attributes. One of these is its unique identifier, or "ID." Another is its actual contents, or "data-item." The data-item can be a single component of information, such as a character or numeric value. It can be a structure, such as a string, which contains many components.

A structure whose components are the IDs of other structures is called a "compound structure." In particular, a string whose components are the IDs of other strings is called a "file."

In most text processing systems, documents are represented by files. The syntax of the processing language must make it clear whether an operation on a file is to be performed on the "pointer structure" of the file -- i.e., the string of IDs -- or on a "member" -- a string whose ID is in the pointer structure.

Frequently, this is done by having two different names for the operator. For example, in the CMS Editor (13,14), the pointer structure is altered by the DELETE, INSERT, and MOVE operators. The analogous operations on members are performed with the CHANGE operator.
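The distinction between the pointer structure and a member can be made concrete with a short Python sketch. The `store` dictionary, the integer IDs, and the functions `create`, `delete_member`, and `change_member` are hypothetical; they are loosely suggested by the DELETE and CHANGE distinction and do not reproduce the CMS Editor.

```python
import itertools

_next_id = itertools.count(1)
store = {}                         # maps each entity ID to its data-item

def create(data_item):
    """Create an entity and return its unique ID."""
    entity_id = next(_next_id)
    store[entity_id] = data_item
    return entity_id

def delete_member(file_id, index):
    """Pointer-structure operation: remove one ID from the file."""
    del store[file_id][index]

def change_member(file_id, index, old, new):
    """Member operation: alter the string whose ID is at the given index."""
    member_id = store[file_id][index]
    store[member_id] = store[member_id].replace(old, new)

line1 = create("Integrated text procesing")
line2 = create("A scratch line")
doc = create([line1, line2])       # a "file": a string of IDs

change_member(doc, 0, "procesing", "processing")
delete_member(doc, 1)
```

After these two operations the file contains a single ID, while the entity for the deleted line still exists in the store; only the pointer structure was altered.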

More often, an operation is only available for one type of argument. Thus, CMS permits the catenation and separation of pointer structures through the COMBINE and SPLIT operators, but a similar operator is not available for members.

The logical structure (as distinguished from the structure of a particular representation in a system) of many documents is a hierarchy, or ordered tree. However, this structure is rarely supported as such by the data model of text processing systems. Techniques such as the IMBED facility of the SCRIPT/370 formatter (15) are used to support ordered trees at a lower level. The trees are represented in the system as files whose members are themselves files, and so on, to any desired height.
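The representation of an ordered tree as files whose members are files can be sketched as follows. The store layout and the function `flatten` are assumptions for illustration and are not a description of the IMBED facility itself.

```python
# Hypothetical store: strings are leaves, lists of IDs are files.
tree_store = {
    1: "Title",
    2: "Paragraph A",
    3: "Paragraph B",
    4: [2, 3],          # a chapter: a file of paragraphs
    5: [1, 4],          # the document: a file whose second member is itself a file
}

def flatten(entity_id, store):
    """Walk the tree in order and return its leaf strings."""
    data_item = store[entity_id]
    if isinstance(data_item, str):
        return [data_item]
    leaves = []
    for member_id in data_item:
        leaves.extend(flatten(member_id, store))
    return leaves

text = flatten(5, tree_store)
```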

Matrices and higher-order arrays are important structures for the calculation process, and for the representation of tables. They are supported quite generally in implementations of APL(20), but only in a limited way -- if at all -- in text processing systems. APL also provides powerful structural operators, but these are of limited text processing value in the absence of support for files and other compound structures.

A document can involve structures that are more complex than any described here so far. The representation of footnotes, page layouts, and illustrations are examples. Such structures can be created and altered with complete generality, with only a small set of operators, by viewing them as relations between the components (5).

This view has been implemented in an experimental storage control system called Relational Memory (RAM), and has been used successfully for a variety of applications (7,8). However, the relational view of data is not presently supported in any general-purpose text processing system.

The power of a text processing instruction set, then, is affected by the variety of data structures for which it is defined. The syntax of the processing language must allow the user to indicate whether an entire data-item, or only a designated group of its components, is the argument of an operator. When the entity is a compound structure, it must also be possible to distinguish between the pointer structure and the members.

Data Types

The non-structural, or "semantic," attributes of the arguments also affect the utility of the operator set. These are the attributes related to the meaning, or "type," of an entity. An operator might be defined for only one type of argument, or may accept a number of types, but perform variations of its basic function for each one.

Some examples of entity types are characters and numeric values. Mathematical operators would be defined only for numeric entities, and Boolean operators only for the subset of numeric entities that contain only zeros and ones.

While a structural operator, by definition, can accept an operand argument regardless of its entity type, those of its arguments which are actually control parameters must conform to type requirements. For example, an operator which splits a designated number of components from the front of a string could be invoked for a string of any type. The argument which designates the number of components, though, must be numeric.
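The splitting operator just described might look like this in Python. The name `split_front` and the particular errors raised are assumptions; the point is only that the operand may be of any type while the control parameter is checked.

```python
def split_front(count, string):
    """Split count components from the front of a string of any type."""
    # The control parameter must be numeric (here, an integer).
    if isinstance(count, bool) or not isinstance(count, int):
        raise TypeError("count must be numeric")
    if not 0 <= count <= len(string):
        raise ValueError("count exceeds the length of the string")
    return string[:count], string[count:]

head, tail = split_front(2, "abcde")          # a string of characters
first, rest = split_front(1, [10, 20, 30])    # a string of numeric values
```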

Another entity type commonly found in text processing systems is the counter. This is an entity whose data-item is a numeric value which is checked and altered automatically by the system when it is referenced in certain ways. A counter is most frequently employed to maintain the current index, or position, within a structure.

In text editing programs, the current "line number" is a counter entity. Operators such as NEXT, UP, and DOWN actually consist of an addition or subtraction operator which accepts counter entities as arguments, and an alteration operator, which checks that its new value is a valid index of the structure with which it is associated. If not, an error condition may be signaled (e.g., "end of file"), or the value may be modified to come within the permitted range ("wraparound").
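A counter entity of this kind can be sketched in Python with a property whose setter performs the automatic check. The class name `CounterEntity`, the zero-based index, and the functions `down` and `up` are assumptions made for the example.

```python
class CounterEntity:
    """A numeric value that is validated whenever it is altered."""

    def __init__(self, structure, wraparound=False):
        self.structure = structure
        self.wraparound = wraparound
        self._value = 0

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new):
        size = len(self.structure)
        if 0 <= new < size:
            self._value = new
        elif self.wraparound and size:
            self._value = new % size          # bring the value back into range
        else:
            raise IndexError("end of file")   # signal the error condition

def down(counter, n=1):
    """Addition followed by the checked alteration."""
    counter.value += n

def up(counter, n=1):
    """Subtraction followed by the checked alteration."""
    counter.value -= n

lines = ["first", "second", "third"]
position = CounterEntity(lines)
down(position, 2)
```

A further `down(position)` would raise the "end of file" condition and leave the value unchanged, whereas a counter created with `wraparound=True` would return to the start of the structure.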

Formatting programs also use counter entities, usually to maintain the current page number. A formatting operator for page skipping implicitly invokes alteration of the counter.

Text processing systems usually support operations on multiple entity types by providing corresponding multiple operator names. In most formatters, therefore, we find operators like change font, set tab stops, indent left margins, etc. These involve nothing more than alteration of an entity whose value must satisfy certain constraints, and/or the insertion of control characters into the text.

The 1130 Composition System (17), on the other hand, is a formatter which explicitly recognizes such functions to be alteration of system variables. The variables can therefore be used as arguments for arithmetic and other operators, including the specification of other variables. This affords the user great flexibility in defining the processing which will produce the desired visual representation.

Patterns are another entity type encountered in text processing. The data-item of a pattern can contain special components ("control elements") which enable it to describe a class of entities which conform to the pattern. Examples are control elements which denote repetition of a component, all possible values of a component, or the absence of a designated component.

When a pattern is the search argument for a matching operator, a number of possible strings could cause a match. If the same data-item were in a character entity, rather than a pattern, the control elements would have no particular significance, and only a string identical to the search argument would satisfy the match.

The logical effect of a pattern argument, then, is to cause the match to be attempted repeatedly with all of the strings that can be generated from (i.e., that conform to) the pattern. Other aspects of the matching operation are unchanged.
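The difference between a pattern and a character entity holding the same data-item can be shown with a short Python sketch. Python's regular expression syntax is used here as a stand-in for the control elements described above, which is an assumption, as are the class names `Chars` and `Pattern`.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Chars:
    """A character entity: its data-item is taken literally."""
    data_item: str

@dataclass(frozen=True)
class Pattern:
    """A pattern entity: its data-item may contain control elements."""
    data_item: str

def matches(argument, candidate):
    """The matching operator; only the entity type of the argument changes its effect."""
    if isinstance(argument, Pattern):
        return re.fullmatch(argument.data_item, candidate) is not None
    return argument.data_item == candidate

as_pattern = Pattern("colou?r")    # "?" marks the "u" as optional
as_chars = Chars("colou?r")        # the same data-item, with no special meaning
```

Used as a pattern, the data-item matches both "color" and "colour"; used as a character entity, it matches only the literal string "colou?r".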

As with counter entities, patterns tend to be given specialized treatment in text processing systems, even with respect to operations in which they should be treated no differently from normal character or numeric entities.

The ability (or lack of it) to ignore an entity's type when it is not logically significant is most important where operator entities are concerned. An operator should be treated specially only when it is being executed. For purposes of creation and maintenance, its structure should be of a kind supported for data entities.

Where this is the case, as in CMS, the full power of the system's instruction set is available for operator creation. CMS operator entities can be "modules," which are produced through normal programming techniques, or "EXEC" procedures. The latter are files, created in the same way as documents, whose members are CMS commands.

APL/360 user-created operators are also files, but this structure is not supported for data entities. As a result, a special "function definition" language must be used, which is much less powerful than APL itself. As with CMS EXEC files, however, the members are normal APL commands. Both systems thus permit an end-user to extend the instruction set without learning a special programming language.


IV. PROCESSING LANGUAGE AND ENVIRONMENT

We have seen that the functions required for text processing can be defined as a set of operators, in a way that makes them independent of the processes and applications in which they will be used. The arguments and results of operators must conform to a data model which requires distinct identification of structural and semantic attributes of information.

We now discuss the considerations in designing a command language and processing environment in which such operators can be implemented. A design must distinguish between the function provided, the user view of the function, and the means by which the real resources of a computer system are employed to implement the function.

Functional Capabilities

In integrated text processing, there is no standard sequence of operations -- the particular operators used and their order can vary from one task to the next. The user must be able to control not only the sequence, but the extent to which that sequence is predetermined ("batched"), or is responsive to user interaction during execution. Such control must be exercisable conversationally, since text processing is inherently subjective, and requires free interaction between the user and the system.

The user must be able to execute an operator simply by typing a statement containing its name and the names of its arguments. Such a statement is a "command" in the system language.

It must be possible to invoke a number of operators in the same statement. Thus, the result returned by one operator could immediately be used as the argument for another. A command language interpreter must analyze the statement, determine the sequence of operators to be executed, and the arguments of each.
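A toy interpreter suggests how one statement can invoke several operators. The prefix syntax, the three operators, and the fixed number of arguments per operator are all assumptions chosen to keep the sketch short; the report does not prescribe a syntax.

```python
# Hypothetical operator table: name -> (number of arguments, function).
OPERATORS = {
    "CATENATE": (2, lambda a, b: a + b),
    "LENGTH": (1, len),
    "UPPER": (1, str.upper),
}

def execute(statement, workarea):
    """Evaluate a prefix statement; names that are not operators denote entities."""
    tokens = statement.split()

    def evaluate(position):
        name = tokens[position]
        position += 1
        if name not in OPERATORS:
            return workarea[name], position
        count, function = OPERATORS[name]
        arguments = []
        for _ in range(count):
            value, position = evaluate(position)   # a result becomes an argument
            arguments.append(value)
        return function(*arguments), position

    result, end = evaluate(0)
    if end != len(tokens):
        raise SyntaxError("unexpected text after statement")
    return result

workarea = {"title": "Text ", "body": "processing"}
length = execute("LENGTH CATENATE title body", workarea)
```

Here the result of CATENATE is passed directly to LENGTH without being named or stored by the user.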

A flexible approach to extensibility is required to make it easy to combine and customize standard operators in order to support applications. It must be possible to evaluate the tradeoffs between ease of creation and speed of execution in each individual case, and choose the optimum combination of two methods of operator creation:

  1. Operator definition, in which operators are written in the text processing command language. This would permit relatively rapid creation and debugging, since individual statements could be tested interactively.
  2. Operator programming, in which operators are written in assembler language or a compiled language. This would be relatively costly in terms of creation and debugging, but might improve execution-time performance.

With these capabilities, a user or application developer could also create defined operators by combining programmed operators in a desired sequence, using the arithmetic and logical operators to do testing and branching between them. This is the "batching" referred to above. It need not affect execution times significantly.

The language should view the user's information as a "workarea" in which all of his operator and data entities are stored, and to which all operators have access. Parameter values could then be passed to operators by altering designated entities, which will be examined during execution. With such "default values," one can build "processing environments" for interactive applications, and thereby reduce argument lists to convenient lengths.
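The use of workarea entities as default values can be sketched as follows. The entity names `LINELENGTH` and `INDENT` and the operator `format_paragraph` are hypothetical; the standard `textwrap` module does the line filling.

```python
import textwrap

# Hypothetical workarea: designated entities hold the default parameter values.
workarea = {"LINELENGTH": 20, "INDENT": 2}

def format_paragraph(text, workarea, linelength=None, indent=None):
    """Fill a paragraph, taking any omitted argument from the workarea."""
    if linelength is None:
        linelength = workarea["LINELENGTH"]
    if indent is None:
        indent = workarea["INDENT"]
    pad = " " * indent
    return textwrap.fill(text, width=linelength,
                         initial_indent=pad, subsequent_indent=pad)

sample = "the quick brown fox jumps over the lazy dog"
wide = format_paragraph(sample, workarea)

workarea["LINELENGTH"] = 12        # altering the entity changes later invocations
narrow = format_paragraph(sample, workarea)
```

The argument list of each invocation stays short; the processing environment is established once by altering the designated entities.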

Although the workarea effectively shields the user from the computer system, the language can still make the resources outside of the workarea available. Entities could be associated with external objects, such as I/O devices, background programs, shared data bases, and other users. Each would have an appropriate entity type. Alteration of such an entity would cause a system program to send the data to the external object, in the same way, for example, that alteration of a counter entity causes a program to check the value.
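One way to picture an entity associated with an external object is a Python property whose setter forwards the data. The class name `ExternalEntity` is hypothetical, and an in-memory `io.StringIO` buffer stands in for a real device.

```python
import io

class ExternalEntity:
    """An entity whose alteration sends the new data to an external object."""

    def __init__(self, send):
        self._send = send          # callable that delivers data to the object
        self._data_item = None

    @property
    def data_item(self):
        return self._data_item

    @data_item.setter
    def data_item(self, value):
        self._data_item = value
        self._send(value)          # the system program invoked on alteration

device = io.StringIO()             # stand-in for an I/O device
printer = ExternalEntity(device.write)
printer.data_item = "first line\n"
printer.data_item = "second line\n"
```

From the user's point of view the alteration looks the same as altering any entity in the workarea.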

If entities are self-describing, the syntax of the language can be kept quite simple, since special operator names will not be required as a method of designating operand attributes. External objects could therefore be referenced in the same way as those within the workarea.

Furthermore, as long as the syntax permits an arbitrarily long argument list, user-created operators can have the same syntax as primitives. These features mean that a user who knows only one application can perform another simply by learning additional operators -- not a whole new language.

It must be possible to save the status of a workarea, and subsequently to restore it. This permits a lengthy processing procedure to be interrupted part way and later resumed, and can also serve as a checkpoint/restart capability.

The user would also save the status of a workarea when it reached a condition of logical integrity during extensive alteration of a document. He would restore it only if irrecoverable errors occurred prior to reaching a subsequent state of integrity.
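Saving and restoring can be sketched with deep copies of the workarea. The functions `save` and `restore` and the contents of the workarea are assumptions; a real system would presumably write the saved status to secondary storage.

```python
import copy

def save(workarea):
    """Return an independent snapshot of the workarea."""
    return copy.deepcopy(workarea)

def restore(workarea, checkpoint):
    """Replace the contents of the workarea with those of a snapshot."""
    workarea.clear()
    workarea.update(copy.deepcopy(checkpoint))

workarea = {"doc": ["line one", "line two"], "LINE": 0}
checkpoint = save(workarea)        # a state of logical integrity

workarea["doc"].pop()              # extensive alteration goes wrong
workarea["LINE"] = 5

restore(workarea, checkpoint)      # return to the saved state
```

Because the snapshot is copied again on restoration, the same checkpoint can be restored more than once.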

User Interface

The design of an integrated text processing system must maintain the distinctions among the following:

  1. Purpose: the user's objective, his application, the thing he really wants done for which he requires computer assistance.
  2. Service: the thing a computer can do. The user analyzes his objective and determines what services the computer can perform to help him achieve it.
  3. Request: what the computer recognizes to be a reference to a service. After determining the services, the user must determine the proper requests to invoke them.
  4. Communication: how the user creates the request and gets it to the computer's attention.
  5. Implementation: how the computer actually performs the service.

The means of formulating and communicating a request have the greatest effect on the human factors acceptability of the system. For applications in text processing, human factors have an extra significance because the information is similar to that dealt with for personal reasons. Users have deeply ingrained habits of a lifetime, and are confident of the correctness of their way of doing things.

This is quite different from business and scientific data processing, where the user -- even without a computer -- must adapt to specialized information and a specialized way of handling it. There he is more willing to accept communication and language constraints imposed by the computer system.

The text processing language must therefore make it easy to think of the sequence of operators needed to perform an application, and to request them from the computer.

The ability to create new operators, particularly by combining existing ones, assures a sufficiently rich operator set. However, as noted previously, identical functions are frequently perceived as being different when they are used in different applications. A system should therefore provide a general facility for equating names to entity IDs, with synonyms and abbreviations permitted.

Primitive operators could refer to entities by their IDs, while users employed whatever names they found appropriate for their application or working environment. Names, and the tables equating them to entities, should themselves be data entities processable by system operators.
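A name table with synonyms and abbreviations might be sketched as below. The table contents, the integer IDs, and the rule that an abbreviation is any unambiguous leading portion of a name are assumptions.

```python
# Hypothetical name table: itself an ordinary data entity mapping names to IDs.
names = {"DELETE": 7, "REMOVE": 7, "DEFINE": 9}   # REMOVE is a synonym of DELETE

def resolve(name, names):
    """Return the entity ID for a full name or an unambiguous abbreviation."""
    key = name.upper()
    if key in names:
        return names[key]
    candidates = {entity_id for full, entity_id in names.items()
                  if full.startswith(key)}
    if len(candidates) == 1:
        return candidates.pop()
    raise KeyError(f"unknown or ambiguous name: {name}")

operator_id = resolve("del", names)
```

Since the table is ordinary data, a user can add names suited to his own application without altering the operators themselves.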

Terminal devices can be used to simplify the formulation and transmittal of commands, as discussed in Part II. Typical examples are the use of "programmed function" keys as single-keystroke operator names, and light-pen or cursor selection of operator and argument names from a displayed "menu."

In general, observance of these design principles permits a single operator set to serve a variety of purposes. The user's view of the computer's services is changed as needed, to customize the system for a particular application or working environment.

System Interface

The functional view of an integrated text processing system should be independent of any particular implementation. This permits the implementor to accommodate changes in device configurations and software without affecting the application developer or user.

The text processing language interpreter can itself be somewhat system-independent. It determines which operator and data entities must be accessed, as described above, and requests them from the workarea management component of the system.

The latter translates the request for entities into demands on the real main and secondary storage resources of the computer. Data entities are brought into main memory, and their addresses returned to the interpreter. Programmed operator entities are loaded and executed as requested, and their results returned as entities.

The size and complexity of the workarea management program will depend upon the richness of the data model supported by the language. It is also affected by the operating system.

In many cases, functions of the text processing system can be implemented in specialized loosely-coupled processors, rather than software. For example, consider a display terminal designed for newspaper copy editing. Using it as a terminal, the user enters a command which invokes the "newsedit" operator, designating a particular file as the operand.

When the interpreter requests that "newsedit" be loaded and executed, the file is transmitted to the terminal's control unit, which is given control. The terminal conducts a dialogue with the user, and on completion gives control back to the interpreter, returning the edited file as the result of the "newsedit" operator.

In most installations, many users will need to share the computer system simultaneously. Since text processing systems do not have unique requirements in this respect, sharing is best done by the operating system or other general-purpose time-sharing support. A virtual machine facility is particularly appealing for this purpose, because it makes resource sharing transparent even to the workarea management implementation (10,11).


V. SYSTEM REPRESENTATION OF TEXT

The computer representation of text in an integrated processing system must permit information to be shared by different users for different applications in varying sequences. This requires a generalized markup language and character-set conventions which provide for application-independent document description, iterative processing, multiple character sets, and easy transferability.

Logical and System Structure

The logical attributes of a document and the attributes of its representation in a system will not always coincide. For example, as we saw earlier, a document whose logical structure is an ordered tree may actually be stored as a file. In that case, the reason was that the system data model did not support trees.

Such a disparity can also arise when a system, although recognizing a structure in its data model, is not implemented in a way that supports it efficiently. The document is therefore stored in a simpler representation and the true structure is created dynamically as parts of it are processed.

Another reason for storing a simpler representation of a document is to maintain a correlation with an external representation. In text editing, for example, it is easier for the user when the structure he is altering corresponds to that displayed.

An integrated text processing system must therefore incorporate some means of document annotation, or "markup," which will preserve a document's true attributes, regardless of its system representation. This requires a substantial modification to the way markup is usually performed today.

The Markup Process

In present computerized text processing systems, markup consists of interspersing processing commands in the contents of a document. This serves two purposes:

  1. it breaks the document into logical components which can serve as operands for the system's operators; and
  2. it specifies the application function which is to be performed on each component.

For example, consider marking up a technical report for formatting. The user (often without recognizing it) first analyzes the information structure of the text. That is, he identifies each given section of text as a paragraph, a heading, a table, or some other predefined type. He must then determine, either from memory or a style book, the commands which will produce the format desired for that type of component.

Unfortunately, his next step is to put the commands themselves into the text, thereby losing the information about the document structure.

For example, if he decides to center headings and photo captions in this application, all that will appear in the text is a center command, with no indication whether the text is a heading or a caption. If in a future formatting application he decides to left-justify the headings, he will have to mark up the text again by hand.

Similarly, if he wishes to use the document for information retrieval, he will not be able to distinguish the text of headings (which might be very significant in information content) from the text of anything else that was centered.

Generalized Markup

This analysis of the markup process suggests that it should be possible to design a generalized markup language so that markup would be useful for more than one application or computer system.

Such a language would restrict markup within the document to identification of the document's structure and other attributes. This could be done, for example, with mnemonic "tags." The designation of a component as being of a particular type would mean only that it will be processed identically to other components of that type. The actual processing commands, however, would not be included in the text, since these could vary from one application to another, and from one processing system to another.

Instead, application functions would be implemented by creating application-oriented operators, using the operator definition facilities of the integrated processing system. The association between a component type, and the operator which will apply to components of that type, would be made dynamically when the application is run.
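
A small sketch may make the idea concrete. It is written in Python purely for illustration; the tag names, the two "styles," and the function names are assumptions of this example and not part of any defined markup language. The document carries only component types, and each application supplies its own association of type to operator when it is run.

```python
# The document is marked up only with component types (hypothetical tag names).
document = [
    ("heading", "System Interface"),
    ("paragraph", "The functional view should be independent of implementation."),
    ("caption", "Figure 1. Workarea management"),
]

WIDTH = 40  # assumed line width for the formatting operators


def center(text):
    return text.center(WIDTH).rstrip()


def flush_left(text):
    return text


# Each application binds component types to operators at run time.
report_style = {"heading": center, "caption": center, "paragraph": flush_left}
memo_style = {"heading": flush_left, "caption": center, "paragraph": flush_left}


def format_document(doc, bindings):
    """Apply to each component the operator bound to its type."""
    return [bindings[kind](text) for kind, text in doc]


def headings(doc):
    """A retrieval application: headings remain distinguishable from captions."""
    return [text for kind, text in doc if kind == "heading"]


report = format_document(document, report_style)
memo = format_document(document, memo_style)
```

Changing the treatment of headings from centered to left-justified requires a different binding table, not a second pass over the text by hand, and the retrieval function can still tell a heading from a caption even though one style formats them identically.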

To be independent of the system representation, a markup language must convey all of the structural information reflected in the format of the printed page. This means identifying and relating all textual information, including page layouts, illustrations, and non-coded (i.e., device-specific) information, as well as character strings.

Non-structural attributes -- those meaningful only with respect to information outside the document -- must also be describable. Examples are publication and revision dates, editors' names, catalogue numbers, etc. The principle of separating document description from application function makes it possible to describe the attributes common to all documents of the same type, such as the different daily editions of a newspaper or the different volumes of a series of books. This would allow substantial economies to be effected in the markup of individual documents (6).

Furthermore, the availability of such "type descriptions" could add new function to the text processing system. Programs could supply markup for an incomplete document, or interactively prompt a user in the entry of a document by displaying the markup.
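
As a rough illustration of prompting from a type description, consider the following Python fragment. The "memo" type, its tag names, and the function name are hypothetical, and a real type description would presumably carry far more than an ordered list of tags.

```python
# Hypothetical type description: the components of a memo, in order.
MEMO_TYPE = ["to", "from", "subject", "body"]


def next_prompt(type_description, entered):
    """Return the tag of the first component not yet entered, or None if complete."""
    for tag in type_description:
        if tag not in entered:
            return tag
    return None


partial = {"to": "All staff"}
prompt = next_prompt(MEMO_TYPE, partial)
```

A program entering a document interactively could display the returned tag as its prompt, so that the user supplies only content and the markup comes from the type description.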

A generalized markup language, then, would permit full information about a document to be preserved, regardless of the way the document is used or represented. Such markup would be more useful, and probably less expensive, than that currently employed.

Iterative Processing

In present text processing programs, the representation of the text before it is processed (the "input format") is different from the representation of the text after it is processed (the "output format"). For many applications, this places the burden on the user of maintaining both an "input" and an "output" copy of his text. Furthermore, he is unable to use the output of one program as input to another unless the programs were designed to be run in that particular sequence at all times.

For integrated text processing, where programs (i.e., operators) must operate in varying sequences, depending upon the needs of the user's particular task, there can be only a single representation of text. This implies that a program must be able to process its own output as input -- hence, iterative processing. It requires that no information be removed from the text by any program, and that any information added to the text be identified as having been added by a program, so that it can be ignored by the same program if the same text is re-processed.
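
The requirement can be illustrated with a toy operator, written here in Python. The representation of text as (origin, kind, text) triples and the operator name are assumptions made for this sketch only. The operator tags everything it adds with its own name, and on a later run it ignores those additions and generates them afresh, so that it can accept its own output as input without disturbing anything the user entered.

```python
def number_headings(pieces):
    """Add heading numbers, tagging each addition with this operator's name.

    `pieces` is a list of (origin, kind, text) triples. Pieces whose origin is
    this operator were added by an earlier run and are ignored, which lets the
    operator process its own output.
    """
    me = "number_headings"
    result, count = [], 0
    for origin, kind, text in pieces:
        if origin == me:
            continue  # ignore our own earlier additions
        if kind == "heading":
            count += 1
            result.append((me, "number", f"{count}."))
        result.append((origin, kind, text))
    return result


text = [
    ("user", "heading", "Generalized Markup"),
    ("user", "paragraph", "Markup should describe structure."),
    ("user", "heading", "Iterative Processing"),
]
once = number_headings(text)
twice = number_headings(once)
```

Here the second pass yields the same result as the first, and the user's own pieces survive both passes unchanged.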

Multiple Character Sets

The system representation of text must provide for a variety of character sets, for foreign-language alphabets, mathematical symbols, and the like. Each would contain certain common values, which might include the digits, function codes (end line, program-supplied hyphen, etc.), and a code allowing escape to a new character set.

Character sets should be numbered, with certain numbers reserved for standard sets to be defined by ANSI or industry groups, other numbers reserved for computer and peripheral manufacturer-defined sets, and the remainder for "private" user-defined sets. It must be possible to associate meanings with character set numbers (for example, 32 is French, 1 is English, etc.), so programs can distinguish foreign languages.

In order to handle multiple character sets correctly, the system must recognize the distinction between a character set, a keyboard, and an output font. If the English alphabet is considered to be a character set, then the character "a" has the same semantic value regardless of whether, in a visual representation, it is printed in a serif or sans-serif font.

A symbol in a font, then, is determined by a character in a character set, plus certain graphic specifications. A keyboard, on the other hand, can only enter characters in a character set, not graphic values. The latter are supplied by the formatting process as a means of portraying the structure of the text.
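
These distinctions can be modeled in a few lines of Python. The sketch below is illustrative only: the set numbers 1 and 32 follow the example given above, but the escape code value, the code tables, and all names are invented for the purpose.

```python
from collections import namedtuple

ESCAPE = 0xFF  # hypothetical code, common to every set, meaning "switch character set"

CHARSET_NAMES = {1: "English", 32: "French"}  # meanings associated with set numbers
CODE_TABLES = {                                # invented code assignments
    1: {0x01: "a", 0x02: "e"},
    32: {0x01: "a", 0x02: "é"},
}

# A character is a position in a numbered character set.
Character = namedtuple("Character", "charset code")
# A symbol is a character plus graphic specifications supplied by formatting.
Symbol = namedtuple("Symbol", "character typeface")


def decode(stream, initial_set=1):
    """Turn a coded stream into Characters, following escapes to new sets."""
    current, out = initial_set, []
    codes = iter(stream)
    for code in codes:
        if code == ESCAPE:
            current = next(codes)  # the code after an escape names the new set
        else:
            out.append(Character(current, code))
    return out


def glyph(character):
    """Look up the written form of a character in its set's code table."""
    return CODE_TABLES[character.charset][character.code]


chars = decode([0x01, ESCAPE, 32, 0x02])
serif = Symbol(chars[0], "serif")
sans = Symbol(chars[0], "sans-serif")
```

The two symbols differ only in typeface; the underlying character, which is all that a keyboard could have entered, is the same in both.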

Transferability

Integrated text files must be readily transferable, because of the different functional emphasis among establishments, and the fragmentation of text processing across organizational boundaries.

For example, book publishers send copy to book printers to be composed, governmental and business establishments utilize both in-plant and commercial printers, and publishing and printing are often separated within an organization. Furthermore, indexes and abstracts which form part of published material are often prepared by outside information analysis establishments.

Integrated files should be storable in any medium which supports sequential files. Text could then be transported between different computer systems, even those with different operating systems. The processing system would interpret the generalized markup language and store the document in the most appropriate way permitted by the system's data model.
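
The following Python sketch shows one way a tree-structured document might be flattened into sequential records and rebuilt on a receiving system. The BEGIN/TEXT/END record syntax is invented for this example and is not a proposed markup language; the point is only that structure can survive a medium which supports nothing more than sequential files.

```python
def to_sequential(node):
    """Flatten a (tag, children) tree into a list of records, one per line."""
    tag, children = node
    records = [f"BEGIN {tag}"]
    for child in children:
        if isinstance(child, str):
            records.append(f"TEXT {child}")
        else:
            records.extend(to_sequential(child))
    records.append(f"END {tag}")
    return records


def from_sequential(records):
    """Rebuild the tree from records written by to_sequential."""
    stack = [("root", [])]  # sentinel node that collects the top-level component
    for record in records:
        kind, _, rest = record.partition(" ")
        if kind == "BEGIN":
            stack.append((rest, []))
        elif kind == "TEXT":
            stack[-1][1].append(rest)
        elif kind == "END":
            finished = stack.pop()
            stack[-1][1].append(finished)
    return stack[0][1][0]


report = ("report", [
    ("heading", ["Transferability"]),
    ("paragraph", ["Files must move between systems."]),
])
records = to_sequential(report)
rebuilt = from_sequential(records)
```

A receiving system whose data model does not support trees could simply keep the records as they arrive, and one that does could rebuild the tree as shown.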


VI. EXPERIMENTAL WORK

This paper is not intended as a report on current experimental work. However, it is helpful in understanding the concepts introduced to consider their application. The discussion will be confined to a brief comment on previously reported work, and a general outline of the scope of our present efforts.

The INTIME Project

Our earliest experience in applying the concepts discussed here to the implementation of text processing applications is described in (3). That system, called INTIME (Interactive Textual Information Management Experiment), ran in a CMS environment under CP-67. (The CP-67/CMS system environment is substantially similar to that obtainable under the more widely available VM/370 and its CMS.)

INTIME used the CMS EDIT and utility programs, a simple SCRIPT formatter (predecessor of the more powerful SCRIPT/370), and a modification of the Document Processing System (18). They served as implementations of seven of the "basic processes." The eighth, calculation, was not available per se, but calculation functions were of course employed by the others.

Although INTIME was able to accomplish useful work, it fell short of being an integrated system in many respects.

For one thing, the alteration, formatting, and search processes each created its own specialized "sub-environment" within the overall CMS environment. Hence, most operators were limited to a single process, which made it impossible to attain complete integration. As a concomitant of this, many operators were implemented more than once, frequently with minor differences in their definition, and usually with different names.

These difficulties were overcome to some extent, however, through use of the "STACK" facility of CMS EXEC. This permitted processing to occur successively in a number of sub-environments, within the same defined operator.

Another problem with INTIME was the lack of a system representation of text which met the criteria described in this paper. Although all documents were accessible by all processes, application-dependent markup was used. Furthermore, the logical structure of the document was in part described by the representation in the system.

The lack of a calculation facility at the processing language level was another difficulty, which had greater impact than we had anticipated. It was not then clear how important "numerical" operators were to what was traditionally considered "non-numerical data processing."

Current Research Activity

The experimental system currently used by the Integrated Text Processing Project at the Cambridge Scientific Center is a considerable improvement over INTIME. Formatting and editing are now at the VM/370 level, which provides many additional features. Among these are macro facilities which make it possible to interpret a generalized markup language.

The APL(CMS) program (21) now provides a calculation facility, albeit in a unique sub-environment. However, it is possible to pass information between APL workspaces and CMS files, and this has proven useful, for example, in the publication of survey results. Data reduction and analysis were performed interactively with APL, and the results were then stored in a CMS file, marked up, and formatted.

We are in the process of replacing the Document Processing System search component with the corresponding part of STAIRS (19). STAIRS offers additional function, and -- unlike Document Processing System -- was designed to be installed under a variety of interactive hosts.

The system, in its present state, is in daily use for the production of Scientific Center reports (including this one), office correspondence, and the like. It supports our research into the definition of text processing instruction sets, the design of a generalized markup language, and the use of the integrated text processing methodology in solving complex application problems.


VII. CONCLUSION

In designing systems for text processing applications, it is necessary to ignore customary distinctions between publishing and information retrieval activities, since a given application usually involves both. Instead, the work to be performed should be analyzed in terms of the basic processes of entry, alteration, formatting, analysis, search, sorting, calculation, and output.

To avoid problems of "bridging" between processes, with the attendant expense of multiple data bases, conversion, and redundant programming, the processes themselves should be implemented at a "functional building block" level. This requires a system with the following features:

  1. The system should have a well-defined set of primitive operators which can be combined as needed to perform the processes. The data model recognized by the operators should encompass operators themselves as a data type. This permits the system to be used to create new operators, by defining them in terms of already existing ones.
  2. These building blocks should be made available in a processing environment which shields the application developer and user from the real computer system. It must support his own private data storage, while providing access to a public data base and other external processes and devices. Its command language must permit conversational invocation of operators in any desired sequence, using names designated by the user.
  3. Documents must be represented in the system so that their logical attributes are fully described, independently of the applications to be performed on them.
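
The first two features can be illustrated with a short Python sketch. The operator table, the three primitive operators, and the names "shout" and "s" are hypothetical; the sketch shows only that when operators are themselves data, new operators can be defined from existing ones and invoked under names chosen by the user.

```python
# Operators are ordinary data: entries in a table, keyed by name.
operators = {
    "upper": str.upper,
    "strip": str.strip,
    "exclaim": lambda text: text + "!",
}


def define(name, *existing):
    """Create a new operator as the left-to-right combination of existing ones."""
    steps = [operators[n] for n in existing]

    def combined(text):
        for step in steps:
            text = step(text)
        return text

    operators[name] = combined


def alias(new_name, old_name):
    """Make an operator available under a name designated by the user."""
    operators[new_name] = operators[old_name]


define("shout", "strip", "upper", "exclaim")
alias("s", "shout")
result = operators["s"]("  stop the presses ")
```

Nothing in the table distinguishes the defined operator from the primitives, so it can in turn serve as a building block for further definitions.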

The integrated text processing methodology, then, differs from the usual "total system," or "envelope," approach to application support. An integrated design does not try to anticipate everything that might be done, in every possible sequence, and include a program for it.

Instead, an integrated text processing system provides a non-restrictive nucleus of basic operators, which can be combined and extended to suit specific application requirements.


VIII. ACKNOWLEDGMENTS

The original version of this paper was prepared in November, 1971, but was never published. I would like to thank Bob Poland, Byron Jackson, and Pat Santarelli for their contribution to that work.

Many of the ideas herein stem from the work done by Edward J. Mosher, Dr. Andrew J. Symonds, Raymond A. Lorie, and myself on the Cambridge Scientific Center Integrated Text Processing Project. I am grateful to them for many technical conversations and thoughtful criticism.

I am especially indebted to Ed Mosher. He and I have worked together actively since the original INTIME project, and he has influenced the formulation of much of the substance of this paper, most notably in connection with the system representation of text.


BIBLIOGRAPHY

Papers published as Cambridge Scientific Center reports are available from the IBM Cambridge Scientific Center, 545 Technology Square, Cambridge, Massachusetts 02139.

IBM Manuals are available from IBM branch offices.

    General Publications
  1. Charles F. Goldfarb, Edward J. Mosher, Theodore I. Peterson, "Integrated Text Processing for Publishing and Information Retrieval," IBM Cambridge Scientific Center Report G320-2065, April 1971.

  2. Charles F. Goldfarb, Edward J. Mosher, Theodore I. Peterson, "An Online System for Integrated Text Processing," Proc. American Association for Information Science 7, 147-150 (1970). (Reprinted in Reference 1, 21-27.)

  3. Charles F. Goldfarb, Edward J. Mosher, Theodore I. Peterson, "Integration of the Text Processing Functions in an Interactive Environment," Proc. Fourth Hawaii International Conference on System Sciences. Honolulu: University of Hawaii, 1971. (Reprinted in Reference 1, 29-33.)

  4. R.E. Griswold, J.F. Poage, I.P. Polonsky, "The SNOBOL4 Programming Language." Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1968.

  5. R.A. Lorie, A.J. Symonds, "Use of a Relational Access Method under APL." Proceedings of Courant Computer Science Symposium 6, Randall Rustin Editor, Englewood Cliffs, N.J.: Prentice-Hall Inc. (1972), 99-124.

  6. Edward J. Mosher, "Information system economies with a generalized markup language," presented at American Society for Information Science, First mid-year regional conference, May 1972.

  7. Theodore I. Peterson, Edward J. Mosher, Charles F. Goldfarb, "Programming of Professional Society Meetings with an Integrated Text Processing System," IBM Cambridge Scientific Center Report G320-2074, Sept. 1971.

  8. J. Ravin, M. Schatzoff, "An interactive graphics system for the analysis of business decisions under uncertainty," to be published in the IBM Systems Journal.

  9. "Context Editors, Part I: A Conversational Context-Directed Editor," IBM Cambridge Scientific Center Report No. G320-2041, March 1969.

    VM/370, CP-67, and CMS
  10. IBM Virtual Machine Facility/370: Introduction, IBM Corporation Publication GC20-1800.

  11. CP-67/CMS System Description Manual, IBM Corporation Publication GH20-0802.

  12. IBM Virtual Machine Facility/370: Command Language User's Guide, IBM Corporation Publication GC20-1804.

  13. CP-67/CMS User's Guide, IBM Corporation Publication GH20-0859.

  14. IBM Virtual Machine Facility/370: EDIT Guide, IBM Corporation Publication GC20-1805.

    Formatters
  15. Script/370 Text Processing Facility: Program Description/Operations Manual, IBM Corporation Publication SH20-1114.

  16. PAGINATION/360 Application Description, IBM Corporation Publication GE20-0328.

  17. 1130 Composition System: Program Description/Operations Manual, IBM Corporation Publication SB21-0206.

    Document Retrieval Programs
  18. IBM System/360 Document Processing System: Program Description and Operations Manual, IBM Corporation Publication GE20-0477.

  19. Storage and Information Retrieval System (STAIRS): Program Reference Manual, IBM Corporation Publication SH12-5407.

    APL Systems
  20. APL/360 User's Manual, IBM Corporation Publication GH20-0683.

  21. APL(CMS): Program Description / Operations Manual, IBM Corporation Publication SH20-1088.



Thanks to Dr M. A. Message, St Catharine's College, Cambridge, United Kingdom, for encouraging me to find this paper and make it available. He also performed the OCR conversion and created the original HTML markup.