Libraries and linked data #1: What are linked data?

[Note: An introduction to this and the following five blog posts, all under the general title Libraries and linked data, is given in the previous post.]

Linked data and RDF

‘Linked data’ are data encoded and published on the Web using three simple rules:

People, places, and other things under discussion are identified using HTTP names (i.e. Uniform Resource Identifiers; URIs).
Information is expressed in the form of simple relationships between pairs of things.
Information is encoded in a standard format.

This is achieved by using Semantic Web standards, particularly the Resource Description Framework (RDF), a general-purpose data model, language and encoding format developed by the World Wide Web Consortium. Information published on the Web in RDF is ‘linked data’.

Richard Cyganiak has created a map of linked data resources that grows in complexity year by year as new sources are added. Because of its generality, the central node in the map is dbpedia, an RDF representation of the structured information that exists on Wikipedia pages, initially created in 2007 by Chris Bizer of the Free University of Berlin and his colleagues.

The principles of RDF are very simple:

Each RDF statement expresses a single simple relationship between two entities, forming a subject–predicate–object ‘triple’, for example “Chris Bizer created dbpedia”.
Each subject entity is identified by a unique URI defining either a Web resource or a member of an ontology class, for example
```
<http://www.w3.org/wiki/ChrisBizer>
```
Each object entity may be identified by a unique URI defining either a Web resource or a member of an ontology class, or alternatively may be identified as a literal value, signified as a character string presented in double quotation marks, for example
```
<http://dbpedia.org> or "Berlin"
```
Each predicate is specified by a unique URI defining a property within an ontology, either an object property or a data property. Object properties are those that take as their object a Web resource or a member of an ontology class defined by a URI, while data properties are those that take as their object a literal value. An example of an object property is
```
<http://purl.org/dc/terms/creator>
```
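, the Dublin Core metadata term meaning “has creator”: its subject is the thing that was created, and its object is the creator of that thing.

Thus an RDF statement expressing the relationship “Chris Bizer created dbpedia” takes dbpedia as its subject and Chris Bizer as its object:
```
<http://dbpedia.org>
       <http://purl.org/dc/terms/creator> <http://www.w3.org/wiki/ChrisBizer> .
```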
[Note: A brief introduction to the Turtle syntax used here to represent RDF is given in the second paper in this series, entitled Libraries and linked data #2: A Rough Guide to Turtle.]

This simplicity of course limits the sophistication of the things that can be said, but enables information to be encoded in a standardized, machine-readable manner.

In a second example, which uses ontology prefixes such as “rdf:” as abbreviations for full URIs (in this case for <http://www.w3.org/1999/02/22-rdf-syntax-ns#>), the following three triples form a small RDF graph, expressed in Turtle notation (see the note on serialization below), describing three facts about my journal article (doi:10.1186/2041-1480-1-S1-S6) entitled CiTO, the Citation Typing Ontology [1]. In turn, these three triples define its nature, its author and its title:
```
 <http://dx.doi.org/10.1186/2041-1480-1-S1-S6>  
     rdf:type   fabio:JournalArticle  .
 <http://dx.doi.org/10.1186/2041-1480-1-S1-S6> 
     dc:creator "David Shotton"  .
 <http://dx.doi.org/10.1186/2041-1480-1-S1-S6> 
     dc:title "CiTO, the Citation Typing Ontology" .
```
Here, rdf:type is an object property, while dc:creator and dc:title are data properties, taking literal objects.
Ontology URIs reference publicly available and commonly accepted structured vocabularies (ontologies), in which the meanings of terms (such as ‘Journal Article’) are uniquely and unambiguously defined on the Web using unique URIs.
A group of RDF triple statements having subjects or objects in common constitutes a labelled directed graph, in which the subjects and objects form the nodes (vertices) and the properties form the links (edges). Such a graph is not necessarily acyclic, since a chain of statements may lead back to the node from which it started.
An RDF graph can be written out (‘serialized’) as a series of simple machine-readable RDF statements. Various syntaxes exist for this purpose, including RDF/XML, an XML syntax for RDF that is hard for humans to read, and Turtle, which is much easier for humans to read.
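The graph structure described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not an RDF toolkit: the names `Triple`, `Literal`, `expand`, `to_ntriples` and `nodes` are hypothetical, and the namespace URIs bound to `dc:` and `fabio:` are assumptions, since this post does not spell them out. The sketch stores the three triples about the CiTO article, expands the prefixes, and writes each statement in the line-based N-Triples syntax.

```python
from typing import NamedTuple, Union

# Prefix table. The rdf: URI is given in the text; the dc: and fabio:
# URIs are assumptions made for this sketch.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "fabio": "http://purl.org/spar/fabio/",
}


class Literal(str):
    """A literal value (a character string), as opposed to a URI."""


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: Union[str, Literal]


def expand(term: str) -> str:
    """Expand a prefixed name such as rdf:type; leave full URIs unchanged."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    subject = f"<{expand(triple.subject)}>"
    predicate = f"<{expand(triple.predicate)}>"
    if isinstance(triple.obj, Literal):
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{expand(triple.obj)}>"
    return f"{subject} {predicate} {obj} ."


def nodes(graph):
    """Subjects and objects are the nodes; predicates label the edges."""
    return {t.subject for t in graph} | {t.obj for t in graph}


ARTICLE = "http://dx.doi.org/10.1186/2041-1480-1-S1-S6"
GRAPH = [
    Triple(ARTICLE, "rdf:type", "fabio:JournalArticle"),
    Triple(ARTICLE, "dc:creator", Literal("David Shotton")),
    Triple(ARTICLE, "dc:title", Literal("CiTO, the Citation Typing Ontology")),
]

if __name__ == "__main__":
    for triple in GRAPH:
        print(to_ntriples(triple))
```

Because all three statements share one subject, the graph has a single hub node (the article) with three outgoing edges, one per property.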

An explanation and example of encoding richer bibliographic data in RDF is given in the third paper in this series, Libraries and linked data #3: Encoding bibliographic records in RDF.

How to create and publish linked data?

Authoritative guidance on creating linked data is available in a series of blog posts by Jeni Tennison [2-6]. An excellent book describing the creation, publication and consumption of linked data is freely available on the Web [7].

One of the best ways to publish RDF triple statements is to put them in a Web-accessible ‘triple store’ (a database dedicated to storing RDF triples) that has a SPARQL endpoint, i.e. a Web service that accepts queries written in SPARQL, the Semantic Web query language. Such an endpoint typically offers both a query form for human users and an API that permits the same queries to be submitted automatically from another computer.
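Querying an endpoint automatically amounts to sending the query text over HTTP. The sketch below follows the SPARQL 1.1 Protocol convention of passing the query in a `query` parameter of a GET request and asking for results in JSON. The endpoint address `http://example.org/sparql` is a placeholder, and `build_sparql_request` and `run_query` are hypothetical helper names, not part of any particular triple store's API.

```python
import json
import urllib.parse
import urllib.request

# Placeholder address: substitute the endpoint of a real triple store.
ENDPOINT = "http://example.org/sparql"


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return the result rows (needs network access)."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # SELECT results in the SPARQL JSON format live under results.bindings.
    return payload["results"]["bindings"]


QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Only the request is built here; `run_query` would need to be pointed at a live endpoint before it returns anything.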

An example of SPARQL

SPARQL is a query language for RDF triples, and stands in relation to them as SQL does to relational database data. For bibliographic information encoded in RDF, such as that exemplified in the third paper in this series, entitled Libraries and linked data #3: Encoding bibliographic records in RDF, the following SPARQL:

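PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?paper ?doi ?pubmedid ?citation
WHERE {
  ?paper fabio:hasPubMedId ?pubmedid ;
         prism:doi ?doi ;
         dcterms:bibliographicCitation ?citation ;
         dcterms:publisher [ foaf:name ?publisher ] ;
         prism:publicationDate ?date .
  FILTER regex(?publisher, "PubMed Central", "i")
  FILTER (?date >= "2010-01-01"^^xsd:date && ?date <= "2010-12-31"^^xsd:date)  # assumes dates are typed as xsd:date
}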

expresses a typical query, which, in normal English, reads:

“Give me the PubMed ID, the DOI and the bibliographic citation for any paper that has PubMed Central as its publisher and was published in 2010.”
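To make the two filters concrete, the same selection can be mimicked in plain Python over a handful of records. The records below are invented purely for illustration (they are not real papers), and `select_2010_pmc` is a hypothetical name; a real triple store would evaluate the query itself.

```python
import re
from datetime import date

# Invented sample records; none of these identifiers refers to a real paper.
PAPERS = [
    {"paper": "ex:paper1", "doi": "10.0000/example.1", "pubmedid": "00000001",
     "citation": "Example citation 1", "publisher": "PubMed Central",
     "date": date(2010, 6, 15)},
    {"paper": "ex:paper2", "doi": "10.0000/example.2", "pubmedid": "00000002",
     "citation": "Example citation 2", "publisher": "pubmed central",
     "date": date(2009, 12, 31)},
    {"paper": "ex:paper3", "doi": "10.0000/example.3", "pubmedid": "00000003",
     "citation": "Example citation 3", "publisher": "Example Press",
     "date": date(2010, 3, 1)},
]

REQUIRED = ("paper", "doi", "pubmedid", "citation", "publisher", "date")


def select_2010_pmc(papers):
    """Mimic the query: match the pattern, apply both filters, project."""
    rows = set()
    for record in papers:
        # Like a graph pattern, a record missing any property does not match.
        if any(key not in record for key in REQUIRED):
            continue
        # Case-insensitive match on the publisher name.
        if not re.search("PubMed Central", record["publisher"], re.IGNORECASE):
            continue
        # Publication date must fall within 2010.
        if not date(2010, 1, 1) <= record["date"] <= date(2010, 12, 31):
            continue
        rows.add((record["paper"], record["doi"],
                  record["pubmedid"], record["citation"]))
    return sorted(rows)  # the set plays the role of DISTINCT


print(select_2010_pmc(PAPERS))
```

Only the first record survives: the second has the right publisher but the wrong year, and the third has the right year but the wrong publisher.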

What are open linked data?

Open linked data are simply linked data that are published under a Creative Commons Zero (CC0) open data waiver, which essentially places the data in the public domain, free of legal and copyright restrictions, or a similar license, so that potential users are assured that they are free to re-use the data for any purpose.

The importance of explicitly stating the license under which linked data are published cannot be overstated. The license should be stated both in human-readable terms, with an Open Data label that can be inserted into an HTML page using the following markup:

<!-- Open Data Link -->
<a href="http://opendefinition.org/">
     <img alt="This material is Open Data" border="0"
     src="http://assets.okfn.org/images/ok_buttons/od_80x15_blue.png" />
</a>
<!-- /Open Data Link -->

and also in a machine-readable RDF license statement, thus:

 :this-dataset dcterms:license
     <http://creativecommons.org/publicdomain/zero/1.0/> .
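A consumer that integrates data from many sources can look for such a statement before reusing a dataset. The following is a small sketch under stated assumptions: triples are held as plain `(subject, predicate, object)` tuples of full URIs, the dataset URI `http://example.org/this-dataset` is a placeholder, and the function names are hypothetical.

```python
DCTERMS_LICENSE = "http://purl.org/dc/terms/license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def license_statement(dataset_uri: str, license_uri: str = CC0) -> str:
    """Return the license triple for a dataset as one N-Triples line."""
    return f"<{dataset_uri}> <{DCTERMS_LICENSE}> <{license_uri}> ."


def declared_licenses(triples, dataset_uri: str) -> set:
    """Collect every license URI asserted for the dataset."""
    return {obj for subj, pred, obj in triples
            if subj == dataset_uri and pred == DCTERMS_LICENSE}


def is_cc0(triples, dataset_uri: str) -> bool:
    """True only if CC0 is the sole license declared for the dataset."""
    return declared_licenses(triples, dataset_uri) == {CC0}


DATASET = "http://example.org/this-dataset"  # placeholder URI
TRIPLES = [(DATASET, DCTERMS_LICENSE, CC0)]

print(license_statement(DATASET))
print(is_cc0(TRIPLES, DATASET))
```

A dataset with no license statement at all is treated as not open here, which matches the point that the license must be stated explicitly.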

It is equally important that a genuinely open data license such as CC0 is used. The Creative Commons Attribution License (CC-By), widely and appropriately used for licensing copyright documents and photographs, is inappropriate for published data and metadata, because its legal requirement to attribute the source of the data potentially leads to ‘attribution stacking’ when linked data from many different sources are integrated automatically.

Where data are reused in a manner that permits scholarly acknowledgement of the source(s), this of course should be done, following community norms, just as bibliographic citation references are made at the end of scholarly articles to previously published papers upon which new articles are based, without any legal license requirement that the author does this.

References

[1] Shotton, David (2010). CiTO, the Citation Typing Ontology. J. Biomedical Semantics 1 (Suppl. 1): S6. http://dx.doi.org/10.1186/2041-1480-1-S1-S6.

[2] Tennison, Jeni. Creating Linked Data – Part I: Analysing and Modelling. http://www.jenitennison.com/blog/node/135.

[3] Tennison, Jeni. Creating Linked Data – Part II: Defining URIs. http://www.jenitennison.com/blog/node/136.

[4] Tennison, Jeni. Creating Linked Data – Part III: Defining Concept Schemes. http://www.jenitennison.com/blog/node/137.

[5] Tennison, Jeni. Creating Linked Data – Part IV: Developing RDF Schemas. http://www.jenitennison.com/blog/node/138.

[6] Tennison, Jeni. Creating Linked Data – Part V: Finishing Touches. http://www.jenitennison.com/blog/node/139.

[7] Tom Heath and Chris Bizer (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool. doi:10.2200/S00334ED1V01Y201102WBE001; ISBN: 9781608454303 (paperback); ISBN: 9781608454310 (ebook). Freely available on the Web in HTML at http://linkeddatabook.com/editions/1.0/.
