How webpages embed RDF and how to extract structured data from the web
Thursday, May 04th, 2023
Summary
RDF can be embedded in webpages with JSON-LD, Microdata, or RDFa. In services like search engines and social networks, this can improve visibility and drive more traffic to your website. Embedded RDF in external websites can potentially enrich your own knowledge graph.
RDF
RDF, the foundation of the Semantic Web and Linked Data, is a standard for describing and exchanging data.
One of its advantages is that external RDF data can be quickly integrated into, and used by, your own RDF-based knowledge graph.
There are many publicly available datasets which can be downloaded in an RDF serialization (and some publishers also offer an endpoint which allows querying the data with SPARQL). To get an impression, the Linked Open Data Cloud lists some RDF datasets which are published under an open license.
But there is another possible source of RDF data: regular webpages which embed RDF as part of their HTML.
Why do webpages embed RDF?
There can be countless motivations for embedding RDF, but it should be safe to assume that most publishers do this to enable certain features in services like social networks and search engines.
In social networks, RDF can enable showing a preview of the webpage when the link gets shared.
In search engines, RDF (using the vocabulary Schema.org) can enable showing a richer result snippet for that page. This is relevant for SEO: rich results catch the searcher's eye, and the improved visibility can increase the click-through rate to your pages.
As an example, Google Search offers rich results for datasets, Q&As, and many more. The following screenshot shows the job postings rich result, which gets displayed at the top of the results page, even before the top-ranked regular results:
Google Search query “job postings teacher düsseldorf”
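How do webpages embed RDF?
RDF can be embedded in HTML with three syntaxes: JSON-LD, Microdata, and RDFa.
JSON-LD gets added as a JSON document inside a script element:
<script type="application/ld+json"> { "@context": "https://schema.org", "@type": "Organization", "name": "ACME" } </script>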
Microdata consists of attributes (e.g., itemprop) that get added to HTML elements:
<div itemscope itemtype="https://schema.org/Organization"> <span itemprop="name">ACME</span> </div>
RDFa, like Microdata, consists of attributes (e.g., property) that get added to HTML elements:
<div typeof="schema:Organization"> <span property="schema:name">ACME</span> </div>
While Microdata and RDFa allow reusing the content that is already part of the HTML, JSON-LD requires duplicating the content.
How many webpages embed RDF?
The project Web Data Commons regularly analyzes the corpus of the project Common Crawl to find out how many of the crawled domains / pages embed triples (which includes the three syntaxes mentioned above, and certain Microformats): https://webdatacommons.org/structureddata/
For each year between 2012 and 2022, this bar chart shows how many of the crawled pay-level domains published Microdata, JSON-LD, hCard (Microformats), and RDFa. (Screenshot taken from webdatacommons.org, 2023-03-07)
For the October 2022 crawl, almost 50 % of the crawled pages, and around 40 % of the crawled pay-level domains, contained triples.
How to notice if a webpage embeds RDF?
By default, web browsers don’t give any indication that a page contains RDF. Apart from checking the HTML source code, browser extensions could be used to detect RDF.
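Checking the HTML source code can also be automated. The following is a minimal sketch, using only Python's standard library, that looks for typical markers of the three syntaxes: a script element of type application/ld+json for JSON-LD, the attributes itemscope, itemtype, and itemprop for Microdata, and the attributes typeof, property, and vocab for RDFa. The class and function names are made up for this illustration, and the attribute lists are a simplification (for example, Open Graph meta tags also use the property attribute and would be reported as RDFa).

```python
from html.parser import HTMLParser


class EmbeddedRdfDetector(HTMLParser):
    """Hypothetical helper: records which embedding syntaxes appear in a page."""

    # Simplified marker lists; real pages may use further attributes.
    MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
    RDFA_ATTRS = {"typeof", "property", "vocab"}

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = dict(attrs)
        names = set(attr_map)
        script_type = (attr_map.get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self.found.add("json-ld")
        if names & self.MICRODATA_ATTRS:
            self.found.add("microdata")
        if names & self.RDFA_ATTRS:
            self.found.add("rdfa")


def detect_embedded_syntaxes(html):
    """Return the set of syntaxes whose markers occur in the HTML string."""
    detector = EmbeddedRdfDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

Calling detect_embedded_syntaxes on the HTML of a page returns a set such as {"json-ld", "microdata"}. An empty set only means that none of these markers were found, not that the page is guaranteed to be free of structured data.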
An example would be the Structured Data Sniffer by OpenLink Software. It displays the RDF in an overlay in the top right corner:
How to extract the embedded RDF?
The above-mentioned Structured Data Sniffer lets you view, download, and upload (e.g., to a SPARQL endpoint) the extracted RDF. It supports the serializations JSON-LD, RDF/XML, and Turtle.
Another option, suitable for a programmatic approach, is the Python library and command-line tool extruct by Zyte. It outputs everything in one JSON object, which contains JSON-LD objects for the extracted RDF.
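To illustrate the programmatic idea without any dependencies, here is a minimal sketch that uses only Python's standard library instead of extruct. It handles JSON-LD only: it collects the content of every script element of type application/ld+json and parses it as JSON. The names JsonLdExtractor and extract_json_ld are made up for this illustration; Microdata and RDFa need a real parser such as extruct.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Hypothetical helper: collects parsed JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html):
    """Return a list with one parsed object per valid JSON-LD block."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.items
```

Applied to a page that contains the ACME example from above, extract_json_ld returns a list with one dictionary whose "@type" is "Organization" and whose "name" is "ACME". The result is plain JSON-LD, so it could then be loaded into an RDF library or triple store of your choice.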
Join in!
Do you want to use embedded RDF? For example, to integrate it into your own knowledge graph?
Do you want to embed RDF in your webpages? For example, to increase the visibility of your search engine results?
Let’s get in touch to see if we can support you.
Stefan Götz
Author
Linked Data Consultant