A proposal for an RDF API

Ron Daniel (RDaniel@DATAFUSION.net)
Thu, 3 Jun 1999 11:08:44 -0700 


Hi all,

The RDF Model and Syntax specification did not specify an API for
manipulating RDF models. It seems that having such an API might be a good
way of getting towards reusable RDF software. The development of
the SAX API is the real inspiration here. So, this rather long
message is an attempt to start an equivalent development effort
around an RDF API.

I have made no attempt to look at implementations of RDF processors
such as SiRPAC or RDF4J. They may have already provided something
more suitable than the one suggested below. But that's OK; if they
do, someone should extract those APIs and put them forth as
alternative proposals.

Radix - A community-developed RDF API effort

This document sets out a brief scenario showing some future
uses of RDF we would like to enable, derives some
requirements from that scenario, and sketches out an initial
API. All of these sections are subject to enhancement and
correction. Feedback from interested parties is requested.

I picked the name "Radix" because it has letters drawn from
the acronyms "RDF", "API", and "XML"; and because it sounds
cool to me. (Un?)Fortunately, there is no reasonable way to
torture the strings "RDF", "API", and "XML" into a saying
for which Radix can be made an acronym. Other suggested names
for the effort are welcome.

A usage scenario:
=================

Susan Smith has been tasked with developing a metadata specification
for her company's documents. She decides to develop it in
conformance with RDF. She discovers the Dublin Core schema,
makes some minor additions to it, and produces a new schema.
She then downloads a freely available editor (such as Reggie) that
reads the schema and helps her prepare sample RDF descriptions
for some of her company's resources. She downloads a free RDF
parser that parses those descriptions and stores the information
into an Access database on her laptop. Using that, she develops a
quick mockup of a fielded web search
form that offers searches by Author, Title, and Subject for her
company's web pages. Based on the strength of that demo, her manager
authorizes her to continue the effort. She contracts with an external
company to automatically generate RDF descriptions for pages on
her company's Intranet, including an automatically assigned
classification code from the Dewey Decimal System.
Loading all the new descriptions into Access, Susan
finds performance to be unacceptable. She downloads a different
RDF storage backend that uses DBM files. She installs it on a UNIX
machine at the office, loads it with the data, and modifies the
search form. The manager uses the tool and approves it for wider use
in the office. A user requests that the tool be extended to allow
searches over the daily subscription of news stories. Those stories
come in with XMLNews markup. Susan obtains an add-on parser that
understands the XMLNews markup and begins adding the data to the
RDF storage system. She also configures that system to delete
records from the newsfeed after 4 weeks.

Initial Requirements:
=====================

("must", "should", and "may" have the IETF meanings).

The RDF API must be independent of the particular means for storing the
RDF data. It must be possible to develop a variety of storage mechanisms
for the RDF information. (Storage mechanisms of particular interest are
relational databases, systems based on DBM files, in-memory models that
may be loaded by parsing files, and PROLOG-like environments).

The RDF API must allow for different front-end parsers to be constructed
that will parse a variety of input formats and generate RDF assertions
to populate the RDF storage. (Parsing the RDF serialization format is
of course the first job for a parser, and we would like for there to be
several to choose from. But it is possible to identify mappings from
other storage formats to the RDF model, so we want to allow parsers that
deal with those other formats to be able to read documents and store
facts from them as RDF information).
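
A minimal sketch of how a front-end parser could be decoupled from
storage, loosely in the spirit of SAX callbacks. The StatementSink
interface, the parseLines method, and the one-triple-per-line input
format are all assumptions made for illustration; none of them is part
of the proposal.

```java
import java.util.ArrayList;
import java.util.List;

final class ParserSketch {
    /** Hypothetical callback that any storage backend could implement. */
    interface StatementSink {
        void statement(String subject, String predicate, String object);
    }

    /** Toy front end: one whitespace-separated triple per line. */
    static void parseLines(String input, StatementSink sink) {
        for (String line : input.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // The third part keeps any embedded spaces (e.g. a name).
            String[] parts = trimmed.split("\\s+", 3);
            if (parts.length < 3) {
                throw new IllegalArgumentException("Not a triple: " + trimmed);
            }
            sink.statement(parts[0], parts[1], parts[2]);
        }
    }

    public static void main(String[] args) {
        // The "storage" here is just a list; a real backend would differ.
        List<String> store = new ArrayList<>();
        parseLines("doc1 author Susan Smith\ndoc1 title Annual Report\n",
                (s, p, o) -> store.add(s + " | " + p + " | " + o));
        System.out.println(store);
    }
}
```

A parser for another input format would only need to call the same
sink, which is the independence the requirement asks for.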

The RDF API must provide for:
  - Storage of RDF nodes and arcs
  - Queries over the RDF nodes and arcs
  - Deletion of RDF nodes and arcs
  - Meta-description of properties and resources.
    (For example, being able to say that a statement was made on a
    particular date so it can be automatically deleted).
  - The grouping of related statements into a "model" (see example
    above).
    (Really need a different term here than model because we already
    use that term for the RDF data model of nodes and arcs).
  - Meta-description of "models".
  - Insertion, query, and deletion of models.

The RDF API may provide for:
  - The management of multiple RDF storage systems by the same thread
    of control.
  - The validation of RDF statements against an RDF schema.

Reference implementations of the RDF API should provide:
  - More than one storage mechanism (or an interface to more than one).
  - At least one input format parser.
  - At least one output format for RDF models.

Again, readers are reminded that the requirements listed above are a
first pass. Please suggest additions, deletions, modifications, etc.

Overall design:
===============
The API provides classes for Resource, Statement, Model, and ModelStore.

Just about everything in RDF is a "Resource", defined as something
that has a URI. 

The Statement Class has methods for getting and setting the Subject,
Predicate, and Object of the Statement. A Statement is also a Resource,
in that it has a URI. The URI can either be assigned by the system or
set explicitly. If that URI is used as the subject of a
statement then the statement is effectively reified. (The precise
way the API exposes reification is one of the major open issues.
I suggest one way below but I don't think it is especially great.
Suggestions on this point are actively solicited.)
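
A minimal sketch of the reification idea described above: because a
Statement has a URI, it can itself be the subject of another Statement.
The Res and Stmt classes are simplified stand-ins for the proposed
Resource and Statement, and the example.org URIs are made up for
illustration.

```java
final class ReificationSketch {
    /** Stand-in for the proposed Resource: anything with a URI. */
    static class Res {
        final String uri;
        Res(String uri) { this.uri = uri; }
    }

    /** Stand-in for the proposed Statement, which is itself a Resource. */
    static final class Stmt extends Res {
        final Res subject;
        final Res predicate;
        final Object object;

        Stmt(String uri, Res subject, Res predicate, Object object) {
            super(uri);
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        /** True when this statement describes another statement. */
        boolean isAboutStatement() {
            return subject instanceof Stmt;
        }
    }

    public static void main(String[] args) {
        Res page = new Res("http://example.org/page.html");
        Res author = new Res("http://example.org/terms/author");
        Res assertedOn = new Res("http://example.org/terms/assertedOn");

        // An ordinary statement with a system-style relative URI.
        Stmt claim = new Stmt("#s1", page, author, "Susan Smith");
        // A statement whose subject is the first statement.
        Stmt provenance = new Stmt("#s2", claim, assertedOn, "1999-06-03");

        System.out.println(provenance.uri + " is about " + provenance.subject.uri);
    }
}
```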

A group of statements can be bundled together in a "Model". Models
have methods for adding and deleting Resources and Statements. Models
also have a Query method that returns a Model. (This is analogous to
the Relational database model, where the result of querying a table
is another table.) A Model is a resource (it has a URI) and Statements
can be made about it like any other resource.

Models are held in a "ModelStore". The ModelStore is itself a Model.
Specifically, it is the Model that holds the statements about the
import time, creator, etc. of the models it contains.

(There is another analogy here to RDBMS. They use system tables to
manage user tables. I'm using an RDF Model to manage the other models.
This needs more elaboration in the design and in the documentation).
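
A minimal sketch of how import-time metadata held by a ModelStore could
drive the "delete after 4 weeks" behaviour from the usage scenario. The
map below stands in for statements of the form (model URI, import time,
value); the class, its method names, and the millisecond timestamps are
assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ModelStoreSketch {
    static final long FOUR_WEEKS_MILLIS = 28L * 24 * 60 * 60 * 1000;

    // Model URI -> import time in milliseconds. In the proposed design
    // this information would be statements in the ModelStore itself.
    private final Map<String, Long> importTimes = new HashMap<>();

    void addModel(String modelUri, long importedAtMillis) {
        importTimes.put(modelUri, importedAtMillis);
    }

    boolean contains(String modelUri) {
        return importTimes.containsKey(modelUri);
    }

    /** Removes every model imported before the cutoff; returns their URIs. */
    List<String> expireOlderThan(long cutoffMillis) {
        List<String> removed = new ArrayList<>();
        for (Map.Entry<String, Long> e : importTimes.entrySet()) {
            if (e.getValue() < cutoffMillis) {
                removed.add(e.getKey());
            }
        }
        importTimes.keySet().removeAll(removed);
        return removed;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        ModelStoreSketch store = new ModelStoreSketch();
        store.addModel("#newsfeed-old", now - FOUR_WEEKS_MILLIS - 1);
        store.addModel("#newsfeed-new", now);
        System.out.println("expired: " + store.expireOlderThan(now - FOUR_WEEKS_MILLIS));
    }
}
```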

A quick hack API:
=================

This defines the interface for the classes:
Resource
Statement
Model
ModelStore 
Query

This is only to start discussions about what the real API should look
like. I'm using Java interfaces to define the API; it would be great
for others to look at the translation to other languages as we evolve
the API.

/** Resource.java - A node in an RDF Model. Also the base class
* that all RDF things are derived from.
* 
* We don't actually DO much with a resource, other than get
* or set its URI.
*
* One issue to be discussed is the use of relative and system-
* assigned URIs. Since every Statement and every Model is a
* Resource, the RDF implementation will be generating lots of
* URIs. Many of them will never be reused, so simple little
* relative identifiers like an integer counter are desired for
* space and time reasons. How to deal with the Base URI for
* those needs to be developed.
*/

public interface Resource {

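/** Create a new Resource.
* A Java interface cannot declare constructors, so these are the
* constructors that implementing classes are expected to provide.
*/
// Resource(); // System assigns a URI
// Resource(URI theURI);
// Resource(String theURI); // String to be converted to a URI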

// Getters and setters
public String getID();
public void setID(String theURI);

// Question - what exceptions should be thrown?
}
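
A minimal sketch of the system-assigned identifier idea raised in the
comment above: a cheap integer counter produces short relative
identifiers, which are expanded against a base URI only when needed.
The class name, the "#prefixN" identifier shape, and the example.org
base URI are assumptions for illustration, not part of the proposal.

```java
import java.net.URI;
import java.util.concurrent.atomic.AtomicLong;

final class SysIdSketch {
    private final URI base;
    private final AtomicLong counter = new AtomicLong();

    SysIdSketch(URI base) {
        this.base = base;
    }

    /** Returns a short relative identifier such as "#stmt1". */
    String genSysID(String prefix) {
        // AtomicLong keeps the counter safe across threads.
        return "#" + prefix + counter.incrementAndGet();
    }

    /** Expands a relative identifier against the base URI. */
    URI resolve(String relativeId) {
        return base.resolve(relativeId);
    }

    public static void main(String[] args) {
        SysIdSketch ids = new SysIdSketch(URI.create("http://example.org/models/m1"));
        String id = ids.genSysID("stmt");
        System.out.println(id + " -> " + ids.resolve(id));
    }
}
```

Storing only the short form keeps per-statement overhead small, at the
cost of having to carry the base URI alongside the model.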

/** Statement.java - A statement in an RDF Model. 
*
* This has getter and setter methods for the Subject, Predicate,
* and Object of the Statement (as well as for the URI of the
* Statement since it is a Resource).
*
* RDF allows the Object of the Statement to be a Resource or a
* Literal. I'm not sure of the best way to deal with Literals
* for forward-compatibility with typed values (ints, dates, ...).
* The methods below deal with it as a Java "Object". This means
* implementations will need to do lots of "instanceof" comparisons
* if they want to optimize the storage of things. It also does
* not translate well to C. Suggestions?
*/

public interface Statement extends Resource {

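/** Constructors that implementing classes are expected to provide
* (a Java interface cannot declare constructors).
*/
// Statement(); // Null constructor
// Statement(URI pid, Resource subj, Resource pred, Object obj);
// // Create a new, named, statement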


// Getters and Setters
// Getting and setting the statement id (its URI) is
// handled by the superclass

public void setSubject(Resource r);
public Resource getSubject();
public void setPredicate(Resource r);
public Resource getPredicate();
public void setObject(Object o);
public Object getValue(); // It is up to the caller to figure out if
// the value is a Resource or a String or
// ...
}
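
A minimal sketch of the "instanceof" dispatch the comment above warns
about, showing what a caller of getValue() would have to do when the
object of a Statement is typed as a plain Java Object. The Res class is
a simplified stand-in for the proposed Resource, and the describe
method is hypothetical.

```java
final class ObjectKindSketch {
    /** Stand-in for the proposed Resource; only the ID matters here. */
    static final class Res {
        final String id;
        Res(String id) { this.id = id; }
    }

    /** Classifies a statement object the way a storage backend might. */
    static String describe(Object value) {
        if (value instanceof Res) {
            return "resource:" + ((Res) value).id;
        } else if (value instanceof String) {
            return "literal:" + value;
        } else if (value instanceof Number) {
            return "number:" + value;
        }
        // Anything else is outside what this sketch knows how to store.
        return "unknown:" + value;
    }

    public static void main(String[] args) {
        System.out.println(describe(new Res("http://example.org/a")));
        System.out.println(describe("Moby Dick"));
        System.out.println(describe(Integer.valueOf(42)));
    }
}
```

Every new literal type adds another branch, which is the maintenance
cost the comment is asking for suggestions to avoid.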

/** Model.java - Abstract class for all the stuff we can do to
* a Model.
*
* This should be implemented by classes that implement particular
* storage strategies - such as Java Hashtables vs. Relational
* Databases...
*
* Should this be an abstract class rather than an Interface?
*/

public interface Model extends Resource {

/** Add a resource to the model if it is not already there.
* The resource will have its URI set to the given string.
* Repeated calls to this routine with
* the same argument must return the same Resource object.
*/
public Resource addResource(String resourceID);
public Resource addResource(Resource r);

/** Add a statement to the model if it is not already there.
* Repeated calls to this routine with the same arguments
* will return references to the same Statement object.
*
* If the Resource arguments are not already in the model they
* are added.
*
* The 's' argument is a Boolean flag
* to say if a Statement is 'structural', as opposed to part
* of the model. Structural statements are ones that should not
* be output if the model is serialized, because they are ones
* created when parsing the model. The default value for it
* is 'false'.
*
* The 'b' argument is weird, probably controversial, and related
* to reification. It stands for whether a statement is 'believed'
* in the context of this model or not. If a statement is believed
* then it appears in the model for purposes of queries, output, etc.
* If a statement is not believed then it is invisible to queries
* and output. However, a reified version of the statement can be
* used.
*
* A StatementConflictException may be thrown if the new statement
* contradicts an existing statement. (Is this a good or a bad idea?)
*/
public Statement addStatement(Statement s)
throws StatementConflictException;
public Statement addStatement(Statement stmt, boolean s, boolean b)
throws StatementConflictException;

/** Convenience method to construct a Statement and add it to the
* model. The newly constructed Statement is returned.
* If an equivalent Statement already exists in the model, it is
* returned and a new Statement is not made.
*
* The subj, pred, and obj arguments are the obvious parts of
* an RDF statement.
*
* The s and b booleans are as described above.
* Should a null ID be allowed to tell the system to just assign
* its own URI?
*/
public Statement addStatement(URI id, Resource subj,
Resource pred, Resource obj, boolean s, boolean b)
throws StatementConflictException;


/** Delete a Resource from the model.
* If the Resource does not exist in the model a
* NoSuchResourceException is thrown. If the Resource does exist in
* the model but it is used in Statements in the model then the
* ResourceInUseException is thrown.
*/
public abstract void deleteResource(Resource r)
throws NoSuchResourceException, ResourceInUseException;

/** Delete a statement from the model.
* If the statement is not part of the model the
* NoSuchStatementException is thrown.
*
* (Is this needed? Since Statements are resources we could probably
* just use the deleteResource() method.)
*/
public abstract void deleteStatement(Statement s)
throws NoSuchStatementException;

/** Enumerate through the statements in this model.
*/
public Enumeration getStatements() throws Exception; 

/** Enumerate through the statements in this model, ordered
* according to the StatementObject field.
*
* Probably need a more general ordering criterion.
*/
public Enumeration getOrderedProperties() throws Exception; 

/** Add the contents of another model to this one.
* All the Resources and Statements of the imported Model are
* now part of this model.
*/
public void importModel(Model m) throws ModelImportException;

/** Return a new model that is the subset of this model that
* matches the specified query.
*/
public Model select(Query q);


/** Generate a system identifier, for whenever one is needed for
* statements or nodes.
* (Should this generate a URI instead? Right now you can
* use the getBaseURI() and combine it with the generated
* System ID to get a URI for some internal model resources).
*
* Implementations should make this method thread-safe; the
* 'synchronized' modifier is not allowed on an interface method.
*/
public String genSysID(String s);

/** Set the base URI for the model. */
public void setBaseURI(URI b);

/** Get the base URI for the model. */
public URI getBaseURI();

/** Dump the content of the model to a String for debugging
* purposes. The implementations should wrap output lines
* so that they are no longer than 'width' characters.
*
* (This is a dump to the RDF serialization format).
*/
public String dump(int width);
}
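
A minimal in-memory sketch of two behaviours described in the Model
comments: repeated addResource() calls with the same ID return the same
object, and statements that are not 'believed' stay invisible to
enumeration. The class and its simplified Res and Stmt types are
assumptions for illustration; duplicate-statement detection, the
'structural' flag, and the exceptions are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InMemoryModelSketch {
    static final class Res {
        final String id;
        Res(String id) { this.id = id; }
    }

    static final class Stmt {
        final Res subject;
        final Res predicate;
        final Object object;
        final boolean believed;

        Stmt(Res subject, Res predicate, Object object, boolean believed) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.believed = believed;
        }
    }

    private final Map<String, Res> resources = new HashMap<>();
    private final List<Stmt> statements = new ArrayList<>();

    /** Same ID in, same Res object out. */
    Res addResource(String id) {
        return resources.computeIfAbsent(id, Res::new);
    }

    /** Adds subject and predicate resources if they are missing. */
    Stmt addStatement(String subj, String pred, Object obj, boolean believed) {
        Stmt s = new Stmt(addResource(subj), addResource(pred), obj, believed);
        statements.add(s);
        return s;
    }

    /** Only believed statements are visible to enumeration. */
    List<Stmt> getStatements() {
        List<Stmt> visible = new ArrayList<>();
        for (Stmt s : statements) {
            if (s.believed) {
                visible.add(s);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        InMemoryModelSketch m = new InMemoryModelSketch();
        m.addStatement("http://example.org/a", "http://example.org/terms/title", "Report", true);
        m.addStatement("http://example.org/a", "http://example.org/terms/title", "Draft", false);
        System.out.println("visible statements: " + m.getStatements().size());
    }
}
```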

/** Routines around the storage of Models.
*/ 
public interface ModelStore extends Model
{
// Return types were not given in the original sketch; void is assumed.
public void addModel(Model m);
public void deleteModel(Model m);
}

/** A Query is a pattern applied to a Model in order to select a
* subset of it for returning. This really needs to be elaborated
* to allow for combining multiple queries. Perhaps we should use
* SQL as an analogy. 
*
* The Query below is really just a Statement. If something in
* the statement is specified, then the parts of the Model that match
* it are selected. For example, if the ID of the statement is given,
* then the Resource that matches that ID is selected from the model.
* (Probably not very useful without wildcards). More useful is to
* specify the Predicate of the Statement (the result would be a new
* model containing all the Statements of this model with the specified
* Predicate). If you specify the Subject and the Predicate, the result
* would be even smaller.
*/
public interface Query extends Statement
{
// No extra methods needed here?
}
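
A minimal sketch of the query-by-pattern idea described above: a
pattern is just a statement whose unspecified parts are left null, and
selection returns the matching subset as a new collection. The Triple
class, the use of null as the "unspecified" marker, and the select
signature are assumptions for illustration; strings stand in for
Resources.

```java
import java.util.ArrayList;
import java.util.List;

final class QuerySketch {
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** A null field in the pattern matches anything. */
    static boolean matches(Triple pattern, Triple t) {
        return (pattern.subject == null || pattern.subject.equals(t.subject))
            && (pattern.predicate == null || pattern.predicate.equals(t.predicate))
            && (pattern.object == null || pattern.object.equals(t.object));
    }

    /** Returns a new list holding the statements that match the pattern. */
    static List<Triple> select(List<Triple> model, Triple pattern) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : model) {
            if (matches(pattern, t)) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> model = new ArrayList<>();
        model.add(new Triple("doc1", "author", "Susan Smith"));
        model.add(new Triple("doc2", "author", "Ron Daniel"));
        model.add(new Triple("doc1", "title", "Annual Report"));

        // Predicate only, then subject plus predicate (a smaller result).
        System.out.println(select(model, new Triple(null, "author", null)).size());
        System.out.println(select(model, new Triple("doc1", "author", null)).size());
    }
}
```

Because the result has the same shape as the input, a second select
can be applied to it, which gives a crude way to combine queries.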

Ron Daniel Jr.
DATAFUSION, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107
415.222.0100 fax 415.222.0150 
rdaniel@datafusion.net