A data language for Wikibase
2019-01-16
This document contains no official, established, or final specification but a request for comments. See
Kukulu (named after the Hawaiian word kūkulu: to build, construct) is a formal language to express, query, and model data in the database model of Wikibase. Wikibase is primarily known for the knowledge base Wikidata. Its database model has official serializations in JSON and in RDF. Kukulu defines an alternative syntax with extensions to express queries and rules.
Existing methods to express Wikibase data such as JSON and RDF can be complex to read and write (see data bindings). This also applies to query languages and methods to express rules and constraints (see query and rule languages).
The goal of Kukulu is to provide a simple data language designed for the Wikibase database model. It is inspired by QuickStatements format among other influences. The data language is not intended to replace all existing alternatives but to cover most typical use cases.
Features of the Kukulu data language can be divided into three levels:
A serialization language to express Wikibase content
A query language to match patterns against Wikibase content
A rule language to check or enforce simple if-then-rules in Wikibase content
The language is illustrated with several editable examples:
# Syntax like QuickStatements
Q4115189 P31 Q1
Q41576278 P373 "Antoni Ignacy Mietelski" # Strings
Q1214098 P1476 "Krzyżacy"@pl # Monolingual text
Q41576483 P569 1839-00/year # Time
Q3033 P856 https://www.goettingen.de/ # URL
# Alternative syntax like YAML
Q3033:
  P625: @51.533888/51.533888 # Coordinate
  P1082: 119177 # Quantity
  P576: novalue # special values
# Qualifiers and references
Q41577083 P570 +1586/7 P1319 +1586/9 S248 Q52 # like QuickStatements
Q41577083 P570 +1586/7: # more readable
  P1319: +1586/9
  references:
    P248: Q52
The current draft of Kukulu does not fully support the following elements that might be considered part of the Wikibase database model. Support may be added in the future:
The database model of Wikibase (also referred to as conceptual data model of Wikibase) is implemented canonically in PHP and described at MediaWiki.org. The model is most visible through the Wikibase user interface. This specification assumes basic knowledge of the Wikibase database model and its terminology.
A good starting point to learn about the Wikibase database model in practice is the Wikidata introduction and help pages such as Help:Items, Help:Statements, and Help:Lexemes. The best way to get familiar with data in Wikibase is regularly contributing to Wikidata.
Official serializations of the Wikibase database model exist in JSON and in RDF. Additional serializations exist as part of the tools QuickStatements, GraphQL API, and wikidata-cli. Data bindings in addition to the PHP sources of Wikibase are available as part of programming libraries at least in Lua (MediaWiki Wikibase Client), JavaScript (wikidata-sdk), Java (Wikidata Toolkit), Python (Wikidata for Python), and .NET (Wiki Client Library).
Wikibase instances can be queried in many ways.
This section needs to be extended
Kukulu supports all Wikibase data types including the WikibaseLexeme extension:
All data types are reserved keywords:
WikibaseDataType ::= 'Item' | 'Property' | 'Lexeme' | 'Sense' | 'Form' |
'String' | 'Text' | 'Math' | 'Time' | 'Id' | 'Url' |
'Quantity' | 'Coordinate' | 'Shape' | 'Media' | 'Tabular'
Kukulu defines additional data types:
KukuluDataType ::= 'Bool' | 'Set' | 'Range' | 'DataType' |
                   'LanguageTag' | 'LanguageSet'
Instances of entity types (Item, Property, Lexeme, Sense, and Form) are referenced by their plain ID:
Q42 an Item
P31 a Property
L7 a Lexeme
L7-S1 a Sense
L7-F4 a Form
IdNumber ::= [1-9] [0-9]*
ItemId ::= 'Q' IdNumber
PropertyId ::= 'P' IdNumber
LexemeId ::= 'L' IdNumber
SenseId ::= LexemeId '-' 'S' IdNumber
FormId ::= LexemeId '-' 'F' IdNumber
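The ID grammar above can be checked mechanically. The following is a minimal sketch in Python, not part of Kukulu itself; the function name `entity_type` is hypothetical, and the regular expressions are transcribed by hand from the rules above.

```python
import re

# IdNumber ::= [1-9] [0-9]*  (transcribed from the grammar above)
ID_NUMBER = r"[1-9][0-9]*"

# One pattern per entity type keyword; names follow the reserved keywords.
ENTITY_PATTERNS = {
    "Item": re.compile(rf"Q{ID_NUMBER}"),
    "Property": re.compile(rf"P{ID_NUMBER}"),
    "Lexeme": re.compile(rf"L{ID_NUMBER}"),
    "Sense": re.compile(rf"L{ID_NUMBER}-S{ID_NUMBER}"),
    "Form": re.compile(rf"L{ID_NUMBER}-F{ID_NUMBER}"),
}


def entity_type(entity_id):
    """Return the data type keyword for an entity ID, or None if it is no valid ID."""
    for type_name, pattern in ENTITY_PATTERNS.items():
        if pattern.fullmatch(entity_id):
            return type_name
    return None
```

Because `IdNumber` must start with a non-zero digit, strings such as `Q0` or `Q042` are rejected by this sketch.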
Entities can always be followed by an annotation.
Entities have additional read-only attributes:
id gives the entity id as String
uri gives the entity URI as Url
bool gives the Bool value True if the entity exists and False otherwise
Q42.id === "Q42"
Q42.uri === <http://www.wikidata.org/entity/Q42>
Q42.bool === True
Q6.bool === False # does not exist in Wikidata
Items have attributes labels, descriptions, aliases, claims, and sitelinks. The read-only attributes lastrevid and modified give the internal revision id (as Quantity) and the timestamp of last modification (as Time).
Properties have the same attributes as items.
Lexemes have attributes lemmas, category, language, claims, senses, and forms. The attribute category corresponds to the lengthy key lexicalCategory in the JSON data binding.
L7:
  lemmas:
    en: cat
  category: Q1084'substantive'
  language: Q1860
  ...
# TODO: exemplify senses and forms
…
…
…
Strings (reserved word String) can be expressed quoted by double quotes ("...") or unquoted. Unquoted strings are possible following : or := if they start with a letter or digit and don’t contain the character sequence #.
Quoted strings use the same escape rules as the JSON grammar, except that escape sequences also include \'.
String ::= QuotedString | PlainString
QuotedString ::= '"' (StringCharacter | EscapedCharacter)* '"'
StringCharacter ::= [#x20-#x10ffff] - ["\] # exclude U+22 (") and U+5C (\)
EscapedCharacter ::= '\' ( '"' | "'" | '\' | '/' | 'b' | 'n' | 'r' | 't' | 'u' Hex Hex Hex Hex )
Hex ::= [0-9A-Fa-f]
A third type of strings is used for annotations.
The attribute length of a string gives its number of Unicode characters after NFKC normalization.
Casting to String is done with the str attribute:
String(?x) === ?x.str
Monolingual text (reserved word Text) can be expressed by a quoted string directly followed by a language tag.
"love"@en
"حب"@ar
MonolingualText ::= QuotedString LanguageTag
The read-only attribute value gives the string value and the attribute language gives the language tag:
?text == "xxx"@und
<=>
?text.value == "xxx" && ?text.language == @und
External identifiers (reserved word Id) are expressed as strings. To explicitly state that an identifier is not a string, use a condition on its type:
"foo"
"12345" an Id
ExternalId ::= String
Mathematical expressions (reserved word Math) are expressed as strings. To explicitly state that a mathematical expression is not a string, use a condition on its type.
"e^{i \pi} + 1 = 0"
"e = mc^2" a Math
MathExpression ::= String
Values of data type Url can be expressed as strings or unquoted URLs.
"https://www.wikidata.org/"
https://www.wikidata.org/
<https://www.wikidata.org/>
http://example.org
URL ::= PlainURL | QuotedURL
PlainURL ::= [a-z]+ '://' [^\s<>"{}|^`\]+
QuotedURL ::= '<' PlainURL '>'
Casting to Url is done with the uri attribute:
Url(?x) === ?x.uri
Values of data type Time are represented with the attributes time, timezone, precision, before, after, and calendarmodel (see Wikibase database model for details). The following example expresses the date 2001-12-31, explicitly giving the default value of each optional attribute:
time: +2001-12-31T00:00:00Z # mandatory
timezone: 0
precision: 11
before: 0
after: 0
calendarmodel: Q1985727
The same time can also be expressed in any of the following forms:
+2001-12-31T00:00:00Z # full form
+2001-12-31T00:00:00+00:00 # explicit timezone
+2001-12-31T00:00:00Z/11 # explicit precision
+2001-12-31T00:00 # some optional parts left out
2001-12-31 # all optional parts left out
If days are omitted or set to zero, the default precision is changed to 10 (month):
2013-12-00
2013-12
2013-12/10
If days and month are omitted or set to zero, the default precision is changed to 9 (year):
# equivalent:
2013-00-00
2013-00
2013+00
2013/9
2013+00/9
# not a year but a quantity:
2013
A simple year cannot be abbreviated as a plain integer value unless it is explicitly given as the value of attribute time:
- time: 2013 # value of type Time with time set to +2013-00-00T00:00:00Z and precision 9
- 2013 # value of type Quantity
Time ::= DateValue TimeValue? TimePrecision?
DateValue ::= [+-]? YearValue ( '-' [0-9][0-9] ( '-' [0-9][0-9] )? )?
YearValue ::= [0-9][0-9][0-9][0-9]+
TimeValue ::=
More readable precisions, e.g. 2013-12-01/month
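The default-precision rules for abbreviated dates can be sketched in Python as follows. This is an illustration under assumptions, not the Kukulu parser: the function name `parse_date` is hypothetical, only date-only forms with an optional numeric precision are handled (time-of-day parts and named precisions such as /month are left out), and a bare year without month or precision is rejected because the text above treats it as a Quantity.

```python
import re

# Date-only subset of the Time syntax: sign, year, optional month and day,
# optional numeric precision after a slash.
DATE_RE = re.compile(
    r"(?P<sign>[+-])?(?P<year>[0-9]{4,})"
    r"(?:-(?P<month>[0-9]{2})(?:-(?P<day>[0-9]{2}))?)?"
    r"(?:/(?P<precision>[0-9]+))?"
)


def parse_date(text):
    """Expand an abbreviated date into its time and precision attributes."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        return None
    if m["month"] is None and m["precision"] is None:
        return None  # a plain number such as 2013 is a Quantity, not a Time
    month = m["month"] or "00"
    day = m["day"] or "00"
    if m["precision"] is not None:
        precision = int(m["precision"])  # explicit precision wins
    elif month == "00":
        precision = 9   # year
    elif day == "00":
        precision = 10  # month
    else:
        precision = 11  # day
    sign = m["sign"] or "+"
    return {
        "time": f"{sign}{m['year']}-{month}-{day}T00:00:00Z",
        "precision": precision,
    }
```

For instance, `parse_date("2013-12")` yields precision 10 and `parse_date("2013-00")` yields precision 9, while `parse_date("2013")` returns `None`.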
Values of data type Quantity (known as Quantity in the Wikibase ontology) are represented with the attributes amount, lowerBound, upperBound, and unit (see Wikibase database model for details). The attributes lowerBound and upperBound are optional and have no default values. The attribute unit is optional; it has the special default value 1 and is of data type Item otherwise.
- amount: 42 # 42
  unit: 1
- amount: 42 # 42±0 (distinct from 42)
  lowerBound: 42
  upperBound: 42
- amount: 99 # 99 bottles of beer
  unit: Q23668
- amount: 10.38 # 10.38±0.005 km²
  upperBound: 10.385
  lowerBound: 10.375
Quantities can be expressed in abbreviated form:
Quantity ::= QuantityValue Unit?
QuantityValue ::= Number Tolerance?
Number ::= Decimal Exponent?
Decimal ::= [+-]? ( [0-9]* '.' )? [0-9]+
Exponent ::= [eE] Integer
Integer ::= [+-]? [0-9]+
Tolerance ::= '~' | '!' | PlusMinus Number | '[' Number ',' Number ']'
PlusMinus ::= '±' | '+/-' | '+-'
Unit ::= 'U' IdNumber Annotation?
The tolerances ~ and ! are interpreted as follows:
42~ === 42+-0.5
0.1~ === 0.1±0.05
42! === 42±0
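The two tolerance shortcuts can be expanded into explicit bounds. The sketch below assumes, based only on the examples above, that ~ means half a unit of the last written digit and that ! means a tolerance of zero; the function name `expand_tolerance` is hypothetical.

```python
from decimal import Decimal


def expand_tolerance(text):
    """Return (amount, lowerBound, upperBound) for '<number>~' or '<number>!'."""
    number, marker = text[:-1], text[-1]
    amount = Decimal(number)
    if marker == "!":
        half = Decimal(0)  # exact value: 42! === 42±0
    elif marker == "~":
        # Half a unit of the last written digit: 42 -> 0.5, 0.1 -> 0.05
        exponent = amount.as_tuple().exponent
        half = Decimal(5).scaleb(exponent - 1)
    else:
        raise ValueError(f"no tolerance marker in {text!r}")
    return amount, amount - half, amount + half
```

`Decimal` is used instead of `float` so that the number of written digits, which determines the tolerance, is preserved.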
Note that every number in Kukulu is a Quantity:
?string.length in 12+-2 # length is 10 to 14
Values of data type geographic coordinate (reserved word Coordinate, known as GlobeCoordinate in the Wikibase ontology) are represented with the attributes latitude, longitude, precision, and globe (see Wikibase database model for details). The globe is a value of type Item and set to Q2 by default.
Coordinates can be expressed in abbreviated form:
Q3669835 P625 @043.26193/010.92708
CoordinateValue ::= '@' Decimal '/' Decimal
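As an illustration, the abbreviated coordinate form can be expanded into its attributes with a few lines of Python. The function name `parse_coordinate` is hypothetical, the `Decimal` pattern is transcribed from the Quantity grammar, and the precision attribute is left out because the text gives no default for it.

```python
import re

# Decimal ::= [+-]? ( [0-9]* '.' )? [0-9]+
DECIMAL = r"[+-]?(?:[0-9]*\.)?[0-9]+"
# CoordinateValue ::= '@' Decimal '/' Decimal
COORDINATE_RE = re.compile(rf"@(?P<lat>{DECIMAL})/(?P<lon>{DECIMAL})")


def parse_coordinate(text):
    """Expand '@lat/lon' into coordinate attributes, or return None."""
    m = COORDINATE_RE.fullmatch(text)
    if m is None:
        return None
    return {
        "latitude": float(m["lat"]),
        "longitude": float(m["lon"]),
        "globe": "Q2",  # default globe as stated above
    }
```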
Values of data type commons media (reserved word Media, known as CommonsMedia in the Wikibase ontology) …
Values of data type tabular data (reserved word Tabular) …
Values of data type geographic shape (reserved word Shape, known as GeoShape in the Wikibase ontology) …
See also operator in
The Bool data type is returned for boolean operators. The reserved words True and False hold predefined instances of this data type.
?isItem := ?x.type === Item
?isItem.type === Bool
True.type === Bool
Casting to Bool is done with the attribute bool:
Bool(?x) === ?x.bool
Sets can be defined by set variables and set operators.
# extended type constraint on property P26: if A is spouse of B, then both must
# be instance of human, fictional character, person, or mythical character
?A P26 ?B => ?A & ?B P31 Q5 | Q95074 | Q215627 | Q4271324
# equivalent with prefix set operators:
?A P26 ?B => all(?A ?B) P31 any(Q5 Q95074 Q215627 Q4271324)
The attribute length of a set gives the number of elements in a set as Quantity.
# works with more than 100 authors
?work P50 ?*authors ; ?authors.length > 100
# Entity := Item | Property | Lexeme | Sense | Form
Entity.length === 5
See also operator in.
The reserved keyword Empty denotes the empty set.
String, Time, and Quantity can be combined to ranges with the range operator:
"a"..."z"
1901-01-01...2000-12-31
1...42
Individual values can be checked whether they are part of a range, for instance:
?date in 1901-01-01...2000-12-31
is equivalent to
?date >= 1901-01-01
?date <= 2000-12-31
The attributes upper and lower give the upper and lower bound of a range, respectively.
See also operator in
Language codes are used in values of type monolingual text and for annotations.
@ar
@zh-yue
@mis-x-Q36790
LanguageTag ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
Additional constraints may apply.
See https://www.wikidata.org/wiki/Help:Monolingual_text_languages, https://meta.wikimedia.org/wiki/Language_codes, and special language codes such as mis-x-Q36790 (specified where?).
A language set is an infinite set of language tag values.
LanguageSet ::= LanguageTag '-'
?en := @en- # @en | @en-US | @en-GB | ...
?misc := @mis-x- # @mis-x-Q36790 | ...
The size of a LanguageSet is not defined.
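Since a language set is infinite, membership has to be tested by prefix rather than by enumeration. The following Python sketch shows one way to do this; the function name `in_language_set` is hypothetical, and including the bare tag itself (so that `@en` is a member of `@en-`) is an assumption taken from the comment in the example above.

```python
import re

# LanguageTag ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
LANGUAGE_TAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")


def in_language_set(tag, language_set):
    """Check whether a language tag belongs to a language set such as '@en-'."""
    base = language_set[:-1]
    if not language_set.endswith("-") or not LANGUAGE_TAG.fullmatch(base):
        raise ValueError(f"not a language set: {language_set!r}")
    if not LANGUAGE_TAG.fullmatch(tag):
        return False
    # Either the base tag itself or the base tag extended by further subtags.
    return tag == base or tag.startswith(language_set)
```

Matching on the full prefix including the trailing hyphen prevents `@eng` from being treated as a member of `@en-`.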
All data type keywords have data type DataType. The read-only attribute uri gives the URI of a data type in Wikibase ontology as Url. The read-only attribute str gives the keyword as String.
Shape.str === "Shape"
Shape.uri === <http://wikiba.se/ontology#GeoShape>
Kukulu defines some reserved keywords for predefined expressions.
The keywords True and False are defined as instances of data type Bool.
The keyword Entity is defined as Set of the data types Item, Property, Lexeme, Sense, and Form.
Entity === Item | Property | Lexeme | Sense | Form
The keyword Empty is defined as the empty Set.
Empty.length === 0
See grammar for additional syntax rules
A Kukulu script consists of a newline-separated sequence of sentences.
Script ::= Sentence? ( EOL Sentence? )*
A sentence is serialized in one logical line, optionally followed by an indented block. Logical lines, blank lines, indentation, and comments follow Python syntax (see lexical analysis in Python). In addition, it is possible to join logical lines with the semicolon operator.
Entities with their labels, aliases, claims etc. can be serialized in key-value form, in line-based form, and mixed form.
The key-value form is inspired by YAML (but differs in several ways).
# entity identifier as root key
Q4115189:
  # labels, descriptions, and aliases with language tag as key of each value
  labels:
    ar: حب
    en: love
    es: amor
  descriptions:
    ar: شعور انجذاب عاطفي تجاه شخص
    en: strong, positive emotion based on affection
  aliases:
    # aliases can be given as list...
    ar:
      - محبة
      - حُب
    # ...or as single values
    es: amar
  sitelinks:
    arwiki: حب
    enwiki: Love
    eswiki: Amor
  claims:
    # single value
    P2002: bulgroz
    P31: Q5
    # list of values
    P1775:
      - Q3576110
      - Q12206942
    # qualifiers, references, and ranks
    P369:
      value: Q12345
      # single value
      qualifiers:
        P580: 1970-01-01
      # multiple values
      references:
        - P854: http://example.org/
          P1932: Example
        - P248: Q28549308
      rank: normal # either preferred, normal (default) or deprecated
    # values can also be given by their data type attributes
    P1450: # e.g. monolingual text...
      text: bulgroz
      language: fr
    P1106: # ...or quantity
      amount: 42
      unit: Q42
The example has partly been adopted from an example of wikidata-cli that uses a similar structure.
Q4115189: # colons are optional before an indented block or list
  # claims do not need to be put under key 'claims'
  P31: Q5
  P369: Q12345 # implicit 'value' key
    # qualifiers do not need to be put under key 'qualifiers'
    P580: 1970-01-01
    # one reference with properties starting with 'S' instead of 'P'
    S854: http://example.org/
    S1932: Example
Repeated keys are always merged. Repeated values are merged into lists.
Q316
  labels
    en: love
  P31: Q9415
Q4115189
  P31: Q5
Q316
  labels
    de: Liebe
  P31: Q840396
  P31: Q170774
<=> # equivalent
Q4115189
  P31: Q5
Q316
  labels
    en: love
    de: Liebe
  P31
    - Q9415
    - Q840396
    - Q170774
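The merging rule (repeated keys are merged, repeated values are collected into lists) can be sketched over plain Python dictionaries. This is an illustration only: the function name `merge` and the representation of a parsed script as nested dictionaries are assumptions, and the case where the same key holds a nested block in one place and a plain value in another is not treated specially.

```python
import copy


def merge(target, source):
    """Merge source into target in place and return target."""
    for key, value in source.items():
        if key not in target:
            # First occurrence of the key: copy so later merges do not alter source
            target[key] = copy.deepcopy(value)
        elif isinstance(target[key], dict) and isinstance(value, dict):
            # Repeated key with nested blocks: merge recursively
            merge(target[key], value)
        else:
            # Repeated values: collect into one list, keeping their order
            existing = target[key] if isinstance(target[key], list) else [target[key]]
            incoming = value if isinstance(value, list) else [value]
            target[key] = existing + copy.deepcopy(incoming)
    return target


# The three blocks of the example above, as they might look after parsing
blocks = [
    {"Q316": {"labels": {"en": "love"}, "P31": "Q9415"}},
    {"Q4115189": {"P31": "Q5"}},
    {"Q316": {"labels": {"de": "Liebe"}, "P31": ["Q840396", "Q170774"]}},
]
document = {}
for block in blocks:
    merge(document, block)
```

After the loop, `document["Q316"]["P31"]` holds all three values in their original order and the labels of both blocks are combined under one key.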
Merging is not applied between sections of a document. Sections are separated with an explicit section separator (--- or on a line) and with rule operators.
Values must be quoted if they don’t start with a letter or digit or if they contain the character sequence #. See String for details.
The line-based form is inspired by QuickStatements syntax:
Q316 Len "love"
Q316 Den "strong, positive emotion based on affection"
Q316 Aes "amores"
Q316 Sarwiki "حب"
# TODO: exemplify claims, including ranks and special values
Q316 P31 Q9415
# Value and qualifier
Q41577083 P570 1586/7 P1319 1586/9
TODO: Check whether this syntax is fully compatible with QuickStatements (is QuickStatements syntax a subset of Kukulu?).
Key-value form and line-based form can be mixed. In fact, they are only two ends of a spectrum.
# qualifiers as key-values
Q41577083 P570 1586/7:
  P1319: 1586/9
# values and qualifiers in key-value syntax
Q41577083 P570:
  - value: 1586/7 # value
    P1319: 1586/9 # qualifier
# TODO: more variants...
Ranks can be expressed with ^ (preferred rank), ~ (deprecated rank), and * (any rank) prepended to a property:
?person P463 ?organization # truthy member-of (default)
?person ~P463 ?organization # deprecated member-of
?person ^P463 ?organization # preferred member-of
?person *P463 ?organization # member-of (all statements)
Note that predicates of qualifiers and references cannot have ranks.
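The effect of the rank prefixes can be sketched as a filter over the statements of one property. The sketch assumes that "truthy" has the same meaning as in the Wikidata RDF export (preferred statements if there are any, otherwise normal ones, never deprecated ones); the function name `select_statements` and the dictionary layout are hypothetical.

```python
def select_statements(statements, prefix=""):
    """Filter statements of one property by rank prefix.

    prefix "" selects truthy statements, "^" preferred ones,
    "~" deprecated ones, and "*" statements of any rank.
    """
    if prefix == "*":
        return list(statements)
    if prefix == "^":
        return [s for s in statements if s["rank"] == "preferred"]
    if prefix == "~":
        return [s for s in statements if s["rank"] == "deprecated"]
    if prefix == "":
        preferred = [s for s in statements if s["rank"] == "preferred"]
        if preferred:
            return preferred
        return [s for s in statements if s["rank"] == "normal"]
    raise ValueError(f"unknown rank prefix: {prefix!r}")
```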
An attribute can be referenced inline, prepended by a dot, or indented, followed by a colon. The attribute name must start with a lowercase letter, optionally followed by lowercase and uppercase letters:
Q42.id == "Q42"
Q42:
  id: Q42
AttributeName ::= [a-z] [a-zA-Z]*
The names novalue and somevalue are forbidden as attribute names.
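Both constraints on attribute names can be combined in one check. The sketch below follows the prose description (a lowercase letter followed by any number of letters); the function name `is_attribute_name` is hypothetical.

```python
import re

# A lowercase letter, optionally followed by lowercase and uppercase letters
ATTRIBUTE_NAME = re.compile(r"[a-z][a-zA-Z]*")
# Names of the special values cannot be used as attribute names
FORBIDDEN_NAMES = {"novalue", "somevalue"}


def is_attribute_name(name):
    """Check whether a name is a valid attribute name."""
    return bool(ATTRIBUTE_NAME.fullmatch(name)) and name not in FORBIDDEN_NAMES
```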
A serialization can be used as a query to check whether the specified entities with given labels, aliases, claims etc. exist. Variables can be used as placeholders for unknown entities, properties, and values.
Two types of variables exist:
Simple variables are bound to one value. They are referenced with a question mark followed by the variable name:
?entity:
  labels:
    en: ?label
  aliases:
    en:
      - ?alias1
      - ?alias2
  claims:
    P31: ?class
    ?property: ?value # any property except P31
?work P463 ?organization:
  P580 ?start # qualifier
  S248 ?source # reference
A variable can be made optional with two question marks:
?human P31 Q5
  P569 ??birthdate
SPARQL equivalent:
SELECT ?human ?birthdate WHERE {
  ?human wdt:P31 wd:Q5 .
  OPTIONAL { ?human wdt:P569 ?birthdate }
}
Set variables are bound to multiple values at once. A set variable is referenced with a question mark followed by plus or asterisk:
?*humans P31 Q5 # bound to the set of all humans
?human P31 Q5 # bound to each human
?human P40 ?+children # bound to each parent and to the set of its children
?human # bound to
  P31 Q5 # each human
  ~P569 ?*birth # and optionally its deprecated dates of birth
See set and set operators for extended usage of set variables.
Core elements of entities should be accessible via their common (JSON) name:
Q42.labels.en == "Douglas Adams"
L7.lemmas.en == "cat"
Methods should be provided for each data type, e.g.
Q42.type == Item
Q42.id
?text.language == @en
?lexeme.lexicalCategory == Q1084 # noun
?tabular.fields
2018-12-31.precision == 11 # days
A very small subset of these methods is available in SPARQL but not beyond XSD data types. Wikibase schema language should also do implicit type casting (also useful for comparison operator ==):
?uri.length instead of strlen(str(?uri))
Some data types can be converted to each other by implicit or explicit type casting.
?instance-of := P31
?+instance-or-subclass-of := P31 | P279
This corresponds to BIND in SPARQL.
Variables can only be assigned once.
Simple statements can be expressed in QuickStatements syntax extended by variables:
?human P31 Q5 # variable item, property, value
?human P569 1952-03-11 # same with date as value
Q42 P27 ? # country of citizenship (value not bound to variable)
? ? ? # all possible statements
Property paths inspired by SPARQL are useful:
?work P50 ? => # if item has an author
  ?work P31/P279* Q17537576 # then it must be an instance of (a subclass of) creative work
Queries can be evaluated against a Wikibase instance.
> Q7 # does not exist in Wikidata
Empty
> Q42 an Item
True
> Q42.labels.en
"Douglas Adams"
> Q42 P31 Q5
True
Constraints can be expressed with rule operators.
?work P50 ? => ... # if item has an author
# all items must be instance or subclass of something
?item an Item => ?item P31|P279 ?
# in contrast
?item P31|P279 ? # an item that is an instance or subclass of something
The semicolon operator can be used to merge logical lines. The logical line after the semicolon has the same indentation level as the logical line before the operator.
Can be used to construct lists
Normal equality operators make heavy use of type coercion. Strict equality operators require both operands to have exactly the same data type.
?a == ?b # normal equality
?a != ?b
?a === ?b # strict equality
?a !== ?b
!?a # boolean negation
Values of comparable data types can be compared with:
?a > ?b
?a >= ?b
?a < ?b
?a <= ?b
Comparing non-comparable data types always returns False.
To match a value against a regular expression:
?a =~ ?regex
?a !~ ?regex
Support (named) capturing groups (implicit assignment), e.g.
?foo :=~ "(.+), (.+)" # assign ?1 and ?2
?foo :=~ "(?<given>.+), (?<surname>.+)" # assign ?given and ?surname
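How implicit assignment from capturing groups might behave can be illustrated in Python. This is a sketch under assumptions: the function name `match_assign` is hypothetical, the pattern is searched anywhere in the value (the draft does not say whether a match must cover the whole value), and named groups written as `(?<name>...)` are rewritten because Python's `re` module expects `(?P<name>...)`.

```python
import re


def match_assign(value, pattern):
    """Return variable bindings from capturing groups, or None without a match."""
    # Rewrite (?<name>...) to (?P<name>...); lookbehinds (?<= and (?<! stay untouched
    python_pattern = re.sub(r"\(\?<(?![=!])", "(?P<", pattern)
    match = re.search(python_pattern, value)
    if match is None:
        return None
    named = match.groupdict()
    if named:
        return named  # e.g. ?given and ?surname
    # Unnamed groups are bound to ?1, ?2, ...
    return {str(i): g for i, g in enumerate(match.groups(), start=1)}
```

Applied to the value "Adams, Douglas", the first pattern above binds ?1 and ?2, and the second binds ?given and ?surname to the same two parts.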
The operator a or its alias an can be used as a shortcut to test the data type of an element:
?s a String <=> ?s.type == String
?x an Item <=> ?x.type == Item
?id an Id|Url <=> ?id.type == Id|Url
The operator in
can be used to check whether
The assignment operator := can be used to define variables.
?douglas := Q42
Infix set operators:
... | ... | ...
... & ... & ...
... in ...
Prefix set operators any and all:
any( ... )
all( ... )
# Short names of entities (acronyms, abbreviations etc.) should also be aliases
?entity P1813"short name" ?name =>
?name in ?entity.aliases
The range operator ... defines a range.
2012-01...2013-07
Q1...Q7
?a...?b
Both parts of a range must have the same data type.
All data type keywords can be used to convert between data types.
Time("2018-12-31") === 2018-12-31
String(2018-12-31) === "2018-12-31"
=> # material implication (if...then... / ...implies...)
# alternative syntax
if ...
  ...
else
  ...
if
  ...
then
  ...
<=> # biconditional (...if and only if...)
iff # alias
Rules can also be written with keywords if, then, else, unless, case …
See also sets for OR-clauses.
An entity or variable can be followed by a single-quoted string:
?person:
  P31'instance of' Q5'human'
  P569'date of birth' ?date
Applications can choose to ignore annotations, translate annotations, and/or check whether annotations match entity labels/lemmas.
Annotations can have languages:
?person:
  P31'είναι'@el Q5'Mensch'@de
  P569'date of birth' ?date
The default annotation language can be set by a language tag, prepended with @, on its own line:
@el
?person:
  P31'είναι' Q5'άνθρωπος'
  @en
  P569'date of birth' ?date
If annotations are checked, the following should be equivalent:
?place Len 'Shangri-La'
?place'Shangri-La'@en
AnnotationString ::= "'" (StringCharacter | EscapedCharacter)* "'"
Annotation ::= AnnotationString LanguageTag?
Kukulu has been influenced by:
Formal grammar is work in progress.
Script ::= ( Expression )*
...
Unicode codepoints below U+20 are forbidden.
Grammar rules from the current (incomplete!) parser implementation: