A proposal for an RDF API

Ron Daniel (RDaniel@DATAFUSION.net)
Thu, 3 Jun 1999 11:08:44 -0700

Hi all,

The RDF Model and Syntax specification did not specify an API for
manipulating RDF models. It seems that having such an API might be a
good way of getting towards reusable RDF software. The development of
the SAX API is the real inspiration here. So, this rather long message
is an attempt to start an equivalent development effort around an RDF
API.

I have made no attempt to look at implementations of RDF processors
such as SiRPAC or RDF4J. They may have already provided something more
suitable than the one suggested below. But that's OK; if they do,
someone should extract those APIs and put them forth as alternative
proposals.

Radix - A community-developed RDF API effort

This document sets out a brief scenario showing some future uses of
RDF we would like to enable, derives some requirements from that
scenario, and sketches out an initial API. All of these sections are
subject to enhancement and correction. Feedback from interested
parties is requested.

I picked the name "Radix" because it has letters drawn from the
acronyms "RDF", "API", and "XML"; and because it sounds cool to me.
(Un?)Fortunately, there is no reasonable way to torture the strings
"RDF", "API", and "XML" into a saying for which Radix can be made an
acronym. Other suggested names for the effort are welcome.

A usage scenario:
=================

Susan Smith has been tasked with developing a metadata specification
for her company's documents. She decides to develop it in conformance
with RDF. She discovers the Dublin Core schema, makes some minor
additions to it, and produces a new schema.
She then downloads a freely available editor (such as Reggie) that
reads the schema and helps her prepare sample RDF descriptions for
some of her company's resources. She downloads a free RDF parser that
parses those descriptions and stores the information into an Access
database on her laptop. Using that, she develops a quick mockup of a
fielded web search form that offers searches by Author, Title, and
Subject for her company's web pages.

Based on the strength of that demo, her manager authorizes her to
continue the effort. She contracts with an external company to
automatically generate RDF descriptions for pages on her company's
Intranet, including an automatically assigned classification code
from the Dewey Decimal System.

Loading all the new descriptions into Access, Susan finds performance
to be unacceptable. She downloads a different RDF storage backend that
uses DBM files. She installs it on a UNIX machine at the office, loads
it with the data, and modifies the search form. The manager uses the
tool and approves it for wider use in the office.

A user requests that the tool be extended to allow searches over the
daily subscription of news stories. Those stories come in with XMLNews
markup. Susan obtains an add-on parser that understands the XMLNews
markup and begins adding the data to the RDF storage system. She also
configures that system to delete records from the newsfeed after 4
weeks.

Initial Requirements:
=====================

("must", "should", and "may" have the IETF meanings).

The RDF API must be independent of the particular means for storing
the RDF data. It must be possible to develop a variety of storage
mechanisms for the RDF information. (Storage mechanisms of particular
interest are relational databases, systems based on DBM files,
in-memory models that may be loaded by parsing files, and PROLOG-like
environments).

The RDF API must allow for different front-end parsers to be
constructed that will parse a variety of input formats and generate
RDF assertions to populate the RDF storage. (Parsing the RDF
serialization format is of course the first job for a parser, and we
would like there to be several to choose from. But it is possible to
identify mappings from other storage formats to the RDF model, so we
want to allow parsers that deal with those other formats to be able to
read documents and store facts from them as RDF information).

The RDF API must provide for:

  - Storage of RDF nodes and arcs
  - Queries over the RDF nodes and arcs
  - Deletion of RDF nodes and arcs
  - Meta-description of properties and resources. (For example, being
    able to say that a statement was made on a particular date so it
    can be automatically deleted).
  - The grouping of related statements into a "model" (see example
    above). (Really need a different term here than model because we
    already use that term for the RDF data model of nodes and arcs).
  - Meta-description of "models".
  - Insertion, query, and deletion of models.

The RDF API may provide for:

  - The management of multiple RDF storage systems by the same thread
    of control.
  - The validation of RDF statements against an RDF schema.

Reference implementations of the RDF API should provide:

  - More than one storage mechanism (or an interface to more than
    one).
  - At least one input format parser.
  - At least one output format for RDF models.

Again, readers are reminded that the requirements listed above are a
first pass. Please suggest additions, deletions, modifications, etc.

Overall design:
===============

The API provides classes for Resource, Statement, Model, and
ModelStore. Just about everything in RDF is a "Resource", defined as
something that has a URI.

The Statement Class has methods for getting and setting the Subject,
Predicate, and Object of the Statement. A Statement is also a
Resource, in that it has a URI.
The URI can either be assigned by the system or set explicitly. If
that URI is used as the subject of a statement then the statement is
effectively reified. (The precise way the API exposes reification is
one of the major open issues. I suggest one way below but I don't
think it is especially great. Suggestions on this point are actively
solicited.)

A group of statements can be bundled together in a "Model". Models
have methods for adding and deleting Resources and Statements. Models
also have a Query method that returns a Model. (This is analogous to
the Relational database model, where the result of querying a table is
another table.) A Model is a resource (it has a URI) and Statements
can be made about it like any other resource.

Models are held in a "ModelStore". The ModelStore is itself a Model.
Specifically, it is the Model that holds the statements about the
import time, creator, etc. of the models it contains. (There is
another analogy here to RDBMS. They use system tables to manage user
tables. I'm using an RDF Model to manage the other models. This needs
more elaboration in the design and in the documentation).

A quick hack API:
=================

This defines the interface for the classes:

  Resource
  Statement
  Model
  ModelStore
  Query

This is only to start discussions about what the real API should look
like. I'm using Java interfaces to define the API; it would be great
for others to look at the translation to other languages as we evolve
the API.

/** Resource.java - A node in an RDF Model. Also the base class
 * that all RDF things are derived from.
 *
 * We don't actually DO much with a resource, other than get
 * or set its URI.
 *
 * One issue to be discussed is the use of relative and system-
 * assigned URIs. Since every Statement and every Model is a
 * Resource, the RDF implementation will be generating lots of
 * URIs. Many of them will never be reused, so simple little
 * relative identifiers like an integer counter are desired for
 * space and time reasons.
 * How to deal with the Base URI for those needs to be developed.
 */
public interface Resource {
    // Java interfaces cannot declare constructors, so the intended
    // constructors are listed here for implementing classes to provide:
    //   Resource();               // System assigns a URI
    //   Resource(URI theURI);
    //   Resource(String theURI);  // String to be converted to a URI

    // Getters and setters
    public String getID();
    public void setID(String theURI);
    // Question - what exceptions should be thrown?
}

/** Statement.java - A statement in an RDF Model.
 *
 * This has getter and setter methods for the Subject, Predicate,
 * and Object of the Statement (as well as for the URI of the
 * Statement since it is a Resource).
 *
 * RDF allows the Object of the Statement to be a Resource or a
 * Literal. I'm not sure of the best way to deal with Literals
 * for forward-compatibility with typed values (ints, dates, ...).
 * The methods below deal with it as a Java "Object". This means
 * implementations will need to do lots of "instanceof" comparisons
 * if they want to optimize the storage of things. It also does
 * not translate well to C. Suggestions?
 */
public interface Statement extends Resource {
    // Intended constructors, to be provided by implementing classes:
    //   Statement();   // Null constructor
    //   Statement(URI pid, Resource subj, Resource pred, Object obj);
    //                  // Create a new, named, statement

    // Getters and Setters
    // Getting and setting the statement id (its URI) is
    // handled by the superclass
    public void setSubject(Resource r);
    public Resource getSubject();
    public void setPredicate(Resource r);
    public Resource getPredicate();
    public void setObject(Object o);
    public Object getValue(); // It is up to the caller to figure out if
                              // the value is a Resource or a String or
                              // ...
}

// Imports for Model.java. Which URI class is meant is not stated in
// this proposal; java.net.URI is an assumption.
import java.net.URI;
import java.util.Enumeration;

/** Model.java - Abstract class for all the stuff we can do to
 * a Model.
 *
 * This should be implemented by classes that implement particular
 * storage strategies - such as Java Hashtables vs. Relational Databases...
 *
 * Should this be an abstract class rather than an Interface?
 */
public interface Model extends Resource {
    /** Add a resource to the model if it is not already there.
     * The resource will have its URI set to the given string.
     * Repeated calls to this routine with
     * the same argument must return the same Resource object.
     */
    public Resource addResource(String resourceID);
    public Resource addResource(Resource r);

    /** Add a statement to the model if it is not already there.
     * Repeated calls to this routine with the same arguments
     * will return references to the same Statement object.
     *
     * If the Resource arguments are not already in the model they
     * are added.
     *
     * The 's' argument is a Boolean flag
     * to say if a Statement is 'structural', as opposed to part
     * of the model. Structural statements are ones that should not
     * be output if the model is serialized, because they are ones
     * created when parsing the model. The default value for it
     * is 'false'.
     *
     * The 'b' argument is weird, probably controversial, and related
     * to reification. It stands for whether a statement is 'believed'
     * in the context of this model or not. If a statement is believed
     * then it appears in the model for purposes of queries, output, etc.
     * If a statement is not believed then it is invisible to queries
     * and output. However, a reified version of the statement can be
     * used.
     *
     * A StatementConflictException may be thrown if the new statement
     * contradicts an existing statement. (Is this a good or a bad idea?)
     */
    // The Statement parameter is named 'stmt' so that it does not
    // clash with the boolean 's' flag.
    public Statement addStatement(Statement stmt)
        throws StatementConflictException;
    public Statement addStatement(Statement stmt, boolean s, boolean b)
        throws StatementConflictException;

    /** Convenience method to construct a Statement and add it to the
     * model. The newly constructed Statement is returned.
     * If an equivalent Statement already exists in the model, it is
     * returned and a new Statement is not made.
     *
     * The subj, pred, and obj arguments are the obvious parts of
     * an RDF statement.
     *
     * The s and b booleans are as described above.
     * Should a null ID be allowed to tell the system to just assign
     * its own URI?
     */
    public Statement addStatement(URI id, Resource subj, Resource pred,
                                  Resource obj, boolean s, boolean b)
        throws StatementConflictException;

    /** Delete a Resource from the model.
     * If the Resource does not exist in the model a NoSuchResourceException
     * is thrown. If the Resource does exist in the model but it is used
     * in Statements in the model then the ResourceInUseException is
     * thrown.
     */
    public void deleteResource(Resource r)
        throws NoSuchResourceException, ResourceInUseException;

    /** Delete a statement from the model.
     * If the statement is not part of the model the NoSuchStatementException
     * is thrown.
     *
     * (Is this needed? Since Statements are resources we could probably
     * just use the deleteResource() method.)
     */
    public void deleteStatement(Statement s)
        throws NoSuchStatementException;

    /** Enumerate through the statements in this model.
     */
    public Enumeration getStatements() throws Exception;

    /** Enumerate through the statements in this model, ordered
     * according to the StatementObject field.
     *
     * Probably need a more general ordering criterion.
     */
    public Enumeration getOrderedProperties() throws Exception;

    /** Add the contents of another model to this one.
     * All the Resources and Statements of the imported Model are
     * now part of this model.
     */
    public void importModel(Model m) throws ModelImportException;

    /** Return a new model that is the subset of this model that
     * matches the specified query.
     */
    public Model select(Query q);

    /** Whenever a system-generated identifier is needed for
     * statements or nodes.
     * (Should this generate a URI instead? Right now you can
     * use the getBaseURI() and combine it with the generated
     * System ID to get a URI for some internal model resources).
     *
     * Implementations should make this method synchronized; the
     * modifier is not allowed on an interface method declaration.
     */
    public String genSysID(String s);

    /** Set the base URI for the model.
     */
    public void setBaseURI(URI b);

    /** Get the base URI for the model.
     */
    public URI getBaseURI();

    /** Dump the content of the model to a String for debugging
     * purposes. The implementations should wrap output lines
     * so that they are no longer than 'width' characters.
     *
     * (This is a dump to the RDF serialization format).
     */
    public String dump(int width);
}

/** Routines around the storage of Models.
 */
public interface ModelStore extends Model {
    // No return types were given for these two methods; void is
    // assumed here.
    public void addModel(Model m);
    public void deleteModel(Model m);
}

/** A Query is a pattern applied to a Model in order to select a
 * subset of it for returning. This really needs to be elaborated
 * to allow for combining multiple queries. Perhaps we should use
 * SQL as an analogy.
 *
 * The Query below is really just a Statement. If something in
 * the statement is specified, then the parts of the Model that match
 * it are selected. For example, if the ID of the statement is given,
 * then the Resource that matches that ID is selected from the model.
 * (Probably not very useful without wildcards). More useful is to
 * specify the Predicate of the Statement (the result would be a new
 * model containing all the Statements of this model with the specified
 * Predicate). If you specify the Subject and the Predicate, the result
 * would be even smaller.
 */
public interface Query extends Statement {
    // No extra methods needed here?
}

Ron Daniel Jr.
DATAFUSION, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107
415.222.0100
fax 415.222.0150
rdaniel@datafusion.net
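
To make the Query-as-pattern idea concrete (a Query is just a Statement
whose unspecified parts act as wildcards, and select() returns a new
Model), here is a minimal in-memory sketch. It is only an illustration
under stated assumptions, not a proposed implementation: the names
RadixSketch, Triple, and MemoryModel are hypothetical, subjects and
predicates are plain strings rather than Resource objects, and null is
assumed to mean "unspecified".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these class names are not part of the proposed API. */
final class RadixSketch {

    /** A subject/predicate/object triple. When used as a query pattern,
     *  a null field is assumed to act as a wildcard. */
    static final class Triple {
        final String subject;
        final String predicate;
        final Object object; // a resource URI or a literal value

        Triple(String subject, String predicate, Object object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        /** True if every non-null field of this pattern equals the
         *  corresponding field of the candidate. */
        boolean matches(Triple candidate) {
            return (subject == null || subject.equals(candidate.subject))
                && (predicate == null || predicate.equals(candidate.predicate))
                && (object == null || object.equals(candidate.object));
        }
    }

    /** In-memory storage: a list of fully specified triples, no duplicates. */
    static final class MemoryModel {
        private final List<Triple> statements = new ArrayList<>();

        /** Add a statement unless an equal one is already present. */
        void add(String subject, String predicate, Object object) {
            if (subject == null || predicate == null || object == null) {
                throw new IllegalArgumentException("stored statements must be complete");
            }
            Triple t = new Triple(subject, predicate, object);
            for (Triple existing : statements) {
                // With no null fields, matches() is a plain equality test.
                if (t.matches(existing)) {
                    return;
                }
            }
            statements.add(t);
        }

        /** Return a new model holding the statements that match the pattern. */
        MemoryModel select(Triple pattern) {
            MemoryModel result = new MemoryModel();
            for (Triple t : statements) {
                if (pattern.matches(t)) {
                    result.statements.add(t);
                }
            }
            return result;
        }

        int size() {
            return statements.size();
        }
    }

    public static void main(String[] args) {
        MemoryModel m = new MemoryModel();
        m.add("http://example.org/doc1", "dc:creator", "Susan Smith");
        m.add("http://example.org/doc1", "dc:title", "Metadata Plan");
        m.add("http://example.org/doc2", "dc:creator", "Susan Smith");

        // Specify only the Predicate: all dc:creator statements.
        MemoryModel byCreator = m.select(new Triple(null, "dc:creator", null));
        // Query the result again by Subject: the selection gets smaller.
        MemoryModel narrowed =
            byCreator.select(new Triple("http://example.org/doc1", null, null));

        System.out.println(byCreator.size() + " " + narrowed.size()); // prints "2 1"
    }
}
```

Because select() returns another model, a result can be queried again,
which is one simple way to combine queries until something richer is
designed.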