The creation of the dataset is motivated by several factors, one being the desire to have more real-world RDF datasets of reasonable size. Wikipedia assembles a wealth of information created and maintained by people all over the globe - opening up that rich pool of data or even only a small part of it to the semantic web seems like a worthy pursuit.
The Wikipedia³ dataset currently combines structural information like link and category relationships with basic per-page metadata. The dataset is based on the Wikimedia dump of the English Wikipedia (enwiki, currently from 2006-03-26) and consists of roughly 47 million triples (47'054'407). We provide Wikipedia³ in all major RDF serialization formats: RDF/XML, Turtle and N-Triples.
Download here:
A very brief excerpt of example data is also available: see samples.rdf (RDF/XML) and samples.ttl (Turtle). The dataset uses a custom ontology (www.systemone.at/2006/03/wikipedia) which is closely modelled on WikiOnt. We re-use elements from the Dublin Core and Simple Knowledge Organisation System (SKOS) ontologies where possible. A description of our ontology is coming soon; for now, please consult the source.
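Because N-Triples puts one triple on each line, a dataset of this size can be inspected without loading it into memory. The following is only a minimal sketch of that idea: it counts how often each predicate occurs. The file name wikipedia3.nt, the function name count_predicates and the simplified line pattern are assumptions made for illustration, not part of the dataset or its tooling; a real N-Triples parser should be used for anything beyond a quick look.

```python
import re
from collections import Counter

# Simplified N-Triples line pattern (an assumption, not a full parser):
# subject (IRI or blank node), predicate (IRI), object (rest of line), final dot.
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def count_predicates(lines):
    """Count predicate IRIs in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith('#'):
            continue
        match = TRIPLE_RE.match(line)
        if match is None:
            # Skip lines the simplified pattern cannot handle.
            continue
        # Drop the surrounding angle brackets from the predicate IRI.
        counts[match.group(2)[1:-1]] += 1
    return counts


# Usage (hypothetical file name), streaming the file line by line:
#
#     with open("wikipedia3.nt", encoding="utf-8") as handle:
#         for predicate, n in count_predicates(handle).most_common(10):
#             print(n, predicate)
```

Streaming line by line keeps memory use proportional to the number of distinct predicates rather than to the 47 million triples.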
We are committed to achieving and maintaining very high quality standards for this dataset. However, due to the sheer size of the dataset, a number of glitches may have escaped our testing. Please give us feedback via mail or blog comment - and we'd also be very happy to hear about projects or experiments based on this dataset. Be sure to drop us a line!
Scope: At the moment, only Wikipedia pages in the Article (NS 0) and Category (NS 14) namespaces are extracted. Further, "internal link" and "redirects to" relations are limited to targets in those two namespaces. We don't extract page text at the moment, as this would severely increase the dataset's size, but we intend to provide a separate dataset containing only text triples in the future.
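The scope rule can be pictured as a simple filter over link records. This is a hedged sketch, not the extraction code actually used: the Link record and the links_in_scope function are hypothetical names introduced here, and only the namespace numbers 0 and 14 come from the description above.

```python
from typing import Iterable, Iterator, NamedTuple

ARTICLE_NS = 0    # Article namespace, as stated in the scope description
CATEGORY_NS = 14  # Category namespace, as stated in the scope description
IN_SCOPE = frozenset({ARTICLE_NS, CATEGORY_NS})


class Link(NamedTuple):
    """Hypothetical record for an "internal link" or "redirects to" relation."""
    source_title: str
    source_ns: int
    target_title: str
    target_ns: int


def links_in_scope(links: Iterable[Link]) -> Iterator[Link]:
    """Yield only relations whose source page and target are both in scope."""
    for link in links:
        if link.source_ns in IN_SCOPE and link.target_ns in IN_SCOPE:
            yield link
```

Under these assumptions, a link from an article to a Talk or Template page would be dropped, while article-to-article and article-to-category links would be kept.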
Roadmap:
- interwiki link extraction (soon!)
- external link extraction (with MediaWiki 1.6)
- an accompanying dataset solely containing the full article text
- a public browser interface (e.g. OINK, Sesame, Longwell)
- a public SPARQL endpoint
- ...
Kudos for most of the hard work to the incredible earl & lowi! We will provide regular updates based on the latest Wikipedia snapshots. The Wikipedia³ datasets are of course licensed under the GFDL. Enjoy!


