A proposal for structured indexing with the Meta-Tag


Version: March 18th 1996, Author: Heinrich C. Kuhn


Overview

  1. Reasons for this proposal and core of this proposal:
    1. Why we need structured indexing information
    2. Davide Musella's draft
    3. general traits of a proposal to improve this draft
  2. Applications of this proposal:
    1. structured indexing of Keywords and related indexing information
    2. structured indexing of information on authors
    3. structured indexing of abstracts
    4. structured miscellaneous information
  3. What user-agents could do with such more structured information
  4. Conclusion

  5. Related information: Paper read at the 20th annual meeting of the Gesellschaft für Klassifikation (Society for Classification) at Freiburg, March 9th 1996 (the paper and the report about the discussion are in German)


Reasons for this proposal and core of this proposal

Need for structured indexing information

The number of documents on the web that contain information relevant to one or several scholarly communities is rising rapidly. Documents of this type resemble journal articles more than monographs. For printed scholarly documents of both kinds, well-structured indexing information has long been necessary to let readers retrieve the material they are looking for. With printed documents, librarians and library users differentiate between several types of authorship, and the indexers of databases such as Medline add controlled indexing information. Bare author names and mere "author's keywords" have proven insufficient, and it is to be expected that this sort of "minimal information" will not be sufficient for scholarly documents on the web for a very long time to come. We should therefore look for ways to give scholarly documents on the web the type of structured indexing that has already proven necessary in the "World of Printed Documents".


Davide Musella's proposal

Davide Musella proposes in his INTERNET DRAFT on The META Tag of HTML some standard names to use with the META-tag, in order to map certain types of content (e.g. abstracts) to certain flags for that type of content. In my opinion this is certainly useful and probably necessary.

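Although I think Davide Musella's proposal is a good one, I feel it does not (yet) permit the amount of structuring of meta-information that is necessary in some cases. Such cases, each treated in a section of its own below, are keywords and related indexing information, information on authors, abstracts, and miscellaneous information.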


I therefore propose to permit a use of the META-tag that lets authors give further, more structured information about their documents.

This could be done either by permitting named anchors as the content of a META-item; the content of this named anchor would then be the section of the document that contains the more structured indexing information.
E.g.:


<META NAME="IndexInfo" 
CONTENT="#IndexPartOfDoc">
...
Diverse content
...
<a name="IndexPartOfDoc"> 
This section would then contain the structured indexing information:
- Information about authors
- Keywords and related indexing information
- abstracts
- miscellaneous indexing information
 </a>
...
Diverse content
...

Alternatively, this could be done by introducing a new pair of tags (<index> and </index>) and a new boolean value for the META-tag that says "Yes" if the document contains such a section with structured indexing information. E.g.:


<META NAME="IndexInfo" CONTENT="Yes">
...
Diverse content
...
<index>
This section would then contain the structured indexing information:
- Information about authors
- Keywords and related indexing information
- abstracts
- miscellaneous indexing information
</index>
...
Diverse content
...
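
As a rough illustration (not part of the proposal itself), a user agent might locate the indexing section for either variant along the following lines. This is a minimal Python sketch under stated assumptions: the function name find_index_section is hypothetical, the META-tag is assumed to be written with NAME before CONTENT and double-quoted values as in the examples above, and regular expressions stand in for a real HTML parser.

```python
import re

# Assumption: the META-tag is written as in the examples above
# (NAME first, then CONTENT, both double-quoted).
META = re.compile(
    r'<meta\s+name="IndexInfo"\s+content="([^"]*)"\s*>', re.IGNORECASE)


def find_index_section(html):
    """Return the raw indexing section announced by META "IndexInfo", or None."""
    meta = META.search(html)
    if not meta:
        return None  # the document does not announce indexing information
    content = meta.group(1)
    if content.startswith("#"):
        # Variant 1: CONTENT names an anchor; tag names match in any case,
        # the anchor name itself is compared exactly.
        name = re.escape(content[1:])
        pattern = rf'(?i:<a\s+name=)"{name}"\s*>(.*?)(?i:</a>)'
    elif content.lower() == "yes":
        # Variant 2: CONTENT is the boolean flag; look for <index>...</index>.
        pattern = r"(?i:<index>)(.*?)(?i:</index>)"
    else:
        return None
    section = re.search(pattern, html, re.DOTALL)
    return section.group(1).strip() if section else None


if __name__ == "__main__":
    doc = ('<META NAME="IndexInfo" CONTENT="Yes">\n'
           'Diverse content\n<index>\n<DDC>1.2.3.</DDC>\n</index>\n')
    print(find_index_section(doc))
```

A document without the META-tag yields None even if it happens to contain an <index> section, which mirrors the idea that the META-tag announces the presence of structured indexing information.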

See below for further information on how this might look.



Structured indexing of keywords and related indexing information

Many scientific disciplines and larger libraries cannot rely on mere "author's keywords" to index their information in a way that permits focussed retrieval by readers. They have therefore introduced means of classification and indexing such as the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), Medical Subject Headings (MeSH), controlled keywords and the like. More than one of these ways of indexing can be used for one and the same document.

An application for an electronic document could e.g. look like this:

<index>

<RSWK> "Schlagwort / Kette / Eins", 
"Kette / Schlagwort / Eins", 
"Schlagwort / Kette / Zwei", 
"Kette / Schlagwort / Zwei"</RSWK> 

<LOC-SH> Subject-Heading 1, 
Subject-Heading 2 </LOC-SH>

<DDC>1.2.3.</DDC>

<MeSH> Meshterm_1, *Meshterm_2, 
Meshterm_3</MeSH>

<BiosisBioCode>Biocode1, *Biocode2, Biocode3, 
*Biocode4 </BiosisBioCode>
<BiosisConceptCode>ConceptCode1, ConceptCode2 
</BiosisConceptCode>

<CARef> 123456, 123457, 123458 </CARef>

<AuthorsKeywords> Keyword1, Keyword2, Keyword3 
</AuthorsKeywords>

<MPG-GV-AZ>25842, 2535 </MPG-GV-AZ>

</index>

The abbreviations used for the various types of classification are in many cases more or less standard and well known to the members of the scholarly community using them. For the few cases where this might not be true, it should be left to the community or communities in question to decide on an abbreviation; in my opinion an INTERNET DRAFT is not the place to do such a thing.
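
To illustrate how such a section might be read by a program, here is a minimal Python sketch that turns each classification element into a list of terms. Everything about it is an assumption for illustration: the function name parse_terms is hypothetical, terms are assumed to be separated by commas as in the example above, and a leading asterisk is assumed to mark a term the indexer wishes to emphasise (the example uses asterisks without defining them).

```python
import re

# Matches <Scheme> ... </Scheme>; the tag name is taken as the scheme label.
ELEMENT = re.compile(r"<([A-Za-z][\w-]*)>(.*?)</\1>", re.DOTALL)


def parse_terms(index_body):
    """Map each scheme label to a list of (term, starred) tuples."""
    result = {}
    for scheme, body in ELEMENT.findall(index_body):
        for raw in body.split(","):
            # Collapse line breaks and drop surrounding double quotes.
            term = " ".join(raw.split()).strip('"').strip()
            if not term:
                continue
            # Assumption: a leading "*" flags an emphasised term.
            starred = term.startswith("*")
            result.setdefault(scheme, []).append((term.lstrip("*"), starred))
    return result


if __name__ == "__main__":
    sample = "<MeSH> Meshterm_1, *Meshterm_2, \nMeshterm_3</MeSH>"
    print(parse_terms(sample))
```

Because the scheme label is simply the tag name, a scheme the program has never seen before is parsed in the same way as a familiar one.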



Structured information on authors

Structured information on authors permits differentiation between first and secondary authors and between different types of authors, and it can carry information on their institutional affiliation, etc. Structured information on authors might look like this:

<index>

<author-last-name>Kuhn</author-last-name>
<author-first-name>Heinrich C.</author-first-name>
<author-affilation>Max-Planck-Gesellschaft / 
Generalverwaltung, München </author-affilation>

<secondary-author-last-name>Meier
</secondary-author-last-name>
<secondary-author-first-name>Martin
</secondary-author-first-name>
<secondary-author-affilation> Institut für 
Bibliothekswesen, Kleinkarlbach 
</secondary-author-affilation>
<secondary-author-last-name>Müller
</secondary-author-last-name>
<secondary-author-first-name>Manuel
</secondary-author-first-name>
<secondary-author-affilation> Arbeitskreis für 
Bibliothekswesen, Untergiesing 
</secondary-author-affilation>
<secondary-author-last-name>Huber
</secondary-author-last-name>
<secondary-author-first-name>Harald
</secondary-author-first-name>
<secondary-author-affilation> Kolleg für 
Sacherschließung, Borghorst 
</secondary-author-affilation>

</index>
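
The example above lists the author fields as a flat sequence, so a program reading it has to decide which fields belong together. The following Python sketch shows one possible rule, assumed here for illustration only: each last-name element starts a new author record, and the fields that follow are attached to it. The function name parse_authors is hypothetical.

```python
import re

# Matches <author-FIELD> or <secondary-author-FIELD> with its closing tag.
FIELD = re.compile(
    r"<((?:secondary-)?author)-([a-z-]+)>(.*?)</\1-\2>", re.DOTALL)


def parse_authors(index_body):
    """Group the flat author fields into one dictionary per author."""
    authors = []
    for role_tag, field, value in FIELD.findall(index_body):
        role = "secondary" if role_tag.startswith("secondary-") else "primary"
        # Assumption: a last-name element opens a new author record.
        if field == "last-name" or not authors:
            authors.append({"role": role})
        authors[-1][field] = " ".join(value.split())
    return authors


if __name__ == "__main__":
    sample = ("<author-last-name>Kuhn</author-last-name>\n"
              "<author-first-name>Heinrich C.</author-first-name>\n"
              "<secondary-author-last-name>Meier\n"
              "</secondary-author-last-name>")
    for author in parse_authors(sample):
        print(author)
```

The field names are taken from the tags exactly as written, so the sketch does not depend on any particular set of author fields.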



Structured indexing of abstracts

There are cases where a document comes with more than one abstract, or with abstracts in "unexpected" languages. For such cases the possibility of giving structured information about abstracts and their contents might be welcome.

An application of this could look like this:

<index>

<abstract>
<abstract-deutsch>Kurze Zusammenfassung 
des Dokumenten-Inhalts auf Deutsch
</abstract-deutsch>
<abstract-english>Short summary of the document, 
in English
</abstract-english>
</abstract>

</index>
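
With abstracts tagged by language in this way, a program could show a reader the abstract in the language he or she prefers. The following Python sketch illustrates the idea; the function names are hypothetical, and it assumes that the language is given as the part of the tag name after "abstract-", as in the example above.

```python
import re

# Matches <abstract-LANGUAGE> ... </abstract-LANGUAGE>, but not the
# enclosing <abstract> element itself.
ABSTRACT = re.compile(
    r"<abstract-([a-z]+)>(.*?)</abstract-\1>", re.DOTALL | re.IGNORECASE)


def abstracts_by_language(index_body):
    """Map each language label (lower-cased) to its abstract text."""
    return {lang.lower(): " ".join(text.split())
            for lang, text in ABSTRACT.findall(index_body)}


def pick_abstract(index_body, preferred):
    """Return the first abstract in a preferred language, else any abstract."""
    found = abstracts_by_language(index_body)
    for lang in preferred:
        if lang in found:
            return found[lang]
    # Fall back to the first abstract in document order, if there is one.
    return next(iter(found.values()), None)


if __name__ == "__main__":
    sample = ("<abstract>\n"
              "<abstract-deutsch>Kurze Zusammenfassung</abstract-deutsch>\n"
              "<abstract-english>Short summary</abstract-english>\n"
              "</abstract>")
    print(pick_abstract(sample, ["english", "deutsch"]))
```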



Structured miscellaneous information

There might be cases where other, miscellaneous information is of interest as well, e.g. the full, long title of a baroque document made available on the net. You would not want to put something like the following title of a book by Nicolaus Taurellus between <h1> and </h1> on the user's screen: Philosophiae Triumphus, hoc est Metaphysica Philosophandi Methodus, Qua Divinitus Inditis menti notitiis, humanae rationes eo deducuntur, ut firmissimis unde constructis demonstrationibus, aperte rei veritas elucescat, & quae diu Philosophorum sepulta fuit authoritate, Philosophia victrix erumpat: Quaestionibus enim vel sexcentis, ea quibus cum revelato nobis veritate Philosophia pugnare videtur, adeo vere conciliantur, ut non fidei solum servire dicenda sit, sed eius esse fundamentum.

An application might look like this:

<index>

<full-title>
Very long title of the document, which is a type 
of title very much en vogue in the late 
renaissance and in the baroque era, but which 
can be found in the case of contemporary German 
doctoral theses up to the present day, 
displaying a type of title that often tends 
to make use of subtitles as well 
</full-title>

<date-creation>19951222</date-creation>
<date-update>19960226</date-update>

<technical-info>
HTML document with 
search interface to 
database and links to FTP resources
</technical-info>

</index>
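
The two date fields in this example look like eight-digit dates in the order year, month, day. Under that assumption (the proposal does not state the format), a program could read them as follows; the function name parse_dates is hypothetical.

```python
import re
from datetime import date, datetime

# Matches <date-creation> and <date-update> holding exactly eight digits.
DATE_FIELD = re.compile(r"<date-(creation|update)>\s*(\d{8})\s*</date-\1>")


def parse_dates(index_body):
    """Map "creation" and "update" to date objects, reading YYYYMMDD values."""
    return {kind: datetime.strptime(digits, "%Y%m%d").date()
            for kind, digits in DATE_FIELD.findall(index_body)}


if __name__ == "__main__":
    sample = ("<date-creation>19951222</date-creation>\n"
              "<date-update>19960226</date-update>")
    dates = parse_dates(sample)
    print(dates["creation"].isoformat(), dates["update"].isoformat())
```

Read this way, the dates can be compared directly, for instance to list only documents updated after a given day.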



What user-agents could do with such more structured information

User agents (search engines, programs collecting information gathered by various robots, crawlers, and the like) could use structured indexing information in the following ways:

  1. If they are programmed to "know" a certain type of information (e.g. indexing according to DDC), they might put the respective content into the appropriate field of their indexing information on the respective document.
  2. If they come across indexing information of a type not "known" to them (e.g. something between <index> and </index> that is surrounded by <xyz> and </xyz>), they might put it into a field for "other indexing information".

Somebody trying to find out what scholarly indexed information in the biomedical field is out there on the net would then just have to ask the search engine she or he uses for all documents having certain contents within the indexing fields for MeSH and the relevant BiosisBioCodes.
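
The two rules and the kind of query described above could be sketched in Python roughly as follows. This is an illustration under assumptions rather than a description of any existing search engine: the set KNOWN_SCHEMES, the function names build_record and search, and the example URLs are all hypothetical.

```python
import re

# Hypothetical: the schemes this particular user agent "knows".
KNOWN_SCHEMES = {"DDC", "MeSH", "BiosisBioCode"}

ELEMENT = re.compile(r"<([A-Za-z][\w-]*)>(.*?)</\1>", re.DOTALL)


def build_record(index_body):
    """Sort indexing terms into known fields and an "other" field."""
    record = {"other": {}}
    for scheme, body in ELEMENT.findall(index_body):
        terms = [t.strip().lstrip("*") for t in body.split(",") if t.strip()]
        if scheme in KNOWN_SCHEMES:
            record.setdefault(scheme, []).extend(terms)      # rule 1
        else:
            record["other"].setdefault(scheme, []).extend(terms)  # rule 2
    return record


def search(records, **wanted):
    """Return the URLs whose record holds every requested term in its field."""
    return [url for url, rec in records.items()
            if all(term in rec.get(field, []) for field, term in wanted.items())]


if __name__ == "__main__":
    records = {
        "http://example.org/a": build_record(
            "<MeSH>Meshterm_1, *Meshterm_2</MeSH>"
            "<BiosisBioCode>Biocode1</BiosisBioCode><xyz>foo</xyz>"),
        "http://example.org/b": build_record("<MeSH>Meshterm_3</MeSH>"),
    }
    print(search(records, MeSH="Meshterm_2", BiosisBioCode="Biocode1"))
```

Unknown schemes are kept rather than discarded, so a later version of the user agent that learns a new scheme could still make use of what was collected earlier.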



Conclusion

This type of information worked well for printed documents in cases where "author's keywords" alone did not. It is to be hoped that it will work in the electronic context as well.

The main difference is that in the electronic context it will in many cases be up to the authors to index their documents themselves, while printed documents were indexed by experts in such indexing.

As most authors know the types and terms of indexing relevant to their narrower field of research rather well, and as good indexing helps authors find the desired readers for their documents, it is to be hoped that relying on the authors to do (at least part of) the indexing is possible.

Comments and critique are always welcome; please address them to Heinrich C. Kuhn. Thanks a lot in advance!

