Select syntaxes
+++++++++++++++
You can choose which syntaxes to extract by passing a list of their names as the syntaxes argument. Valid values are 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed, all syntaxes are extracted and returned::

    r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
    pp.pprint(data)

    { 'microdata': [],
      'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                      'fb': 'http://www.facebook.com/2008/fbml',
                                      'og': 'http://ogp.me/ns#'},
                       'properties': [ ('fb:app_id', '308540029359'),
                                       ('og:site_name', 'Songkick'),
                                       ('og:type', 'songkick-concerts:artist'),
                                       ('og:title', 'Elysian Fields'),
                                       ( 'og:description',
                                         'Find out when Elysian Fields is next '
                                         'playing live near you. List of all '
                                         'Elysian Fields tour dates and concerts.'),
                                       ( 'og:url',
                                         'https://www.songkick.com/artists/236156-elysian-fields'),
                                       ( 'og:image',
                                         'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
      'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                  'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                  'al:ios:app_store_id': [{'@value': '438690886'}],
                  'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                  'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                'Elysian Fields is '
                                                                'next playing live '
                                                                'near you. List of '
                                                                'all Elysian '
                                                                'Fields tour dates '
                                                                'and concerts.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                  'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                  'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                  'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Alternatively, if you have already parsed the HTML before calling extruct, you can pass the tree instead of the HTML string::

    # using the request from the previous example
    from extruct.utils import parse_html

    base_url = get_base_url(r.text, r.url)
    tree = parse_html(r.text)
    data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])

The microformat syntax doesn't support an HTML tree, so you need to pass an HTML string when extracting it.
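
If the list of syntaxes comes from user input or configuration, it can help to check the names before calling extract. The following is a minimal sketch that relies only on the valid values listed above; ``check_syntaxes`` and ``VALID_SYNTAXES`` are hypothetical names for illustration and are not part of extruct.

```python
# Names copied from the list of valid values in this section.
VALID_SYNTAXES = ('microdata', 'json-ld', 'opengraph',
                  'microformat', 'rdfa', 'dublincore')


def check_syntaxes(requested=None):
    """Return the syntaxes to extract, rejecting unknown names.

    Hypothetical helper for illustration; not part of extruct.
    """
    if requested is None:
        # Mirrors the documented default: no list means every syntax.
        return list(VALID_SYNTAXES)
    unknown = [name for name in requested if name not in VALID_SYNTAXES]
    if unknown:
        raise ValueError('unknown syntaxes: %s' % ', '.join(unknown))
    return list(requested)


print(check_syntaxes(['microdata', 'opengraph', 'rdfa']))
```

The returned list could then be passed as the syntaxes argument of extract.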
Uniform
+++++++
Another option is to make the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes uniform, so that every item has the following structure::

    {'@context': 'http://example.com',
     '@type': 'example_type',
     /* all other properties as keys here */
    }

To do so, set uniform=True when calling extract. It is False by default for backward compatibility. Here is the same example as before, but with uniform set to True::

    r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
    pp.pprint(data)

    { 'microdata': [],
      'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                     'fb': 'http://www.facebook.com/2008/fbml',
                                     'og': 'http://ogp.me/ns#'},
                       '@type': 'songkick-concerts:artist',
                       'fb:app_id': '308540029359',
                       'og:description': 'Find out when Elysian Fields is next '
                                         'playing live near you. List of all '
                                         'Elysian Fields tour dates and concerts.',
                       'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                       'og:site_name': 'Songkick',
                       'og:title': 'Elysian Fields',
                       'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
      'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                  'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                  'al:ios:app_store_id': [{'@value': '438690886'}],
                  'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                  'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                'Elysian Fields is '
                                                                'next playing live '
                                                                'near you. List of '
                                                                'all Elysian '
                                                                'Fields tour dates '
                                                                'and concerts.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                  'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                  'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                  'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB: the rdfa structure is not made uniform yet.
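
To make the relationship between the two opengraph outputs concrete, here is a small sketch that maps an item in the default layout ('namespace' plus a list of 'properties' pairs) to the uniform one. It only reproduces what the two example outputs show; ``uniform_opengraph`` is a hypothetical name, and extruct's own conversion may differ in details. In this sketch a repeated property name simply overwrites the earlier value.

```python
def uniform_opengraph(item):
    """Flatten one default-layout opengraph item into the uniform layout.

    Illustrative only; not extruct's implementation.
    """
    uniform = {'@context': item['namespace']}
    for name, value in item['properties']:
        if name == 'og:type':
            # The Open Graph type becomes the '@type' of the item.
            uniform['@type'] = value
        else:
            uniform[name] = value
    return uniform


raw = {
    'namespace': {'og': 'http://ogp.me/ns#'},
    'properties': [
        ('og:site_name', 'Songkick'),
        ('og:type', 'songkick-concerts:artist'),
        ('og:title', 'Elysian Fields'),
    ],
}
print(uniform_opengraph(raw))
```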
Returning HTML node
+++++++++++++++++++
It is also possible to get a reference to the HTML node for every extracted metadata item.
This feature is supported only by the microdata syntax.
To use it, set the return_html_node option of the extract method to True.
As a result, an additional key "htmlNode" is included in the result for every
item. Each node is of lxml.etree.Element type::

    r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
    pp.pprint(data)

    { 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                       'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                      'Not your thin sticky pad, '
                                                      'No-Muv is truly the best!',
                                       'image': ['', ''],
                                       'name': ['No-Muv', 'No-Muv'],
                                       'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                     'properties': { 'availability': 'http://schema.org/InStock',
                                                                     'price': 'Price: '
                                                                              '$45'},
                                                     'type': 'http://schema.org/Offer'},
                                                   { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                     'properties': { 'availability': 'http://schema.org/InStock',
                                                                     'price': '(Select '
                                                                              'Size/Shape '
                                                                              'for '
                                                                              'Pricing)'},
                                                     'type': 'http://schema.org/Offer'}],
                                       'ratingValue': ['5.00', '5.00']},
                       'type': 'http://schema.org/Product'}]}

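
Element objects cannot be encoded by the json module, so a result that contains "htmlNode" keys has to be cleaned before it is serialized. The sketch below removes those keys recursively; ``strip_html_nodes`` is a hypothetical helper, and a plain ``object()`` stands in for the lxml element so the example runs without extruct or lxml.

```python
import json


def strip_html_nodes(value):
    """Return a copy of an extraction result without any 'htmlNode' keys."""
    if isinstance(value, dict):
        return {key: strip_html_nodes(item)
                for key, item in value.items()
                if key != 'htmlNode'}
    if isinstance(value, list):
        return [strip_html_nodes(item) for item in value]
    return value


node = object()  # stand-in for an lxml element, which json cannot encode
data = {'microdata': [{'htmlNode': node,
                       'properties': {'name': 'No-Muv',
                                      'offers': [{'htmlNode': node,
                                                  'type': 'http://schema.org/Offer'}]},
                       'type': 'http://schema.org/Product'}]}

# The original result is left untouched; only the copy is serialized.
print(json.dumps(strip_html_nodes(data), indent=2))
```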
Single extractors
-----------------
You can also use each extractor individually. See below.
Microdata extraction
++++++++++++++++++++
::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)
    from extruct.w3cmicrodata import MicrodataExtractor

    # example from http://www.w3.org/TR/microdata/#associating-names-with-items
    html = """
    ...
    ...
    ... Photo gallery
    ...
    ...
    ...
    My photos
    ...
    ...
    ... The house I found.
    ...
    ...
    ...
    ... The mailbox.
    ...
    ...
    ...
    ... """
    mde = MicrodataExtractor()
    data = mde.extract(html)
    pp.pprint(data)

    [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                     'title': 'The house I found.',
                     'work': 'http://www.example.com/images/house.jpeg'},
      'type': 'http://n.whatwg.org/work'},
     {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                     'title': 'The mailbox.',
                     'work': 'http://www.example.com/images/mailbox.jpeg'},
      'type': 'http://n.whatwg.org/work'}]

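
The microdata result is a plain list of dicts with 'type' and 'properties' keys, so it can be post-processed with ordinary Python. As a rough sketch, the following groups the items from the output above by their type; ``group_by_type`` is a hypothetical helper and not part of extruct.

```python
from collections import defaultdict


def group_by_type(items):
    """Map each item type to the list of property dicts seen for it."""
    grouped = defaultdict(list)
    for item in items:
        # Items without a 'type' key end up under None.
        grouped[item.get('type')].append(item['properties'])
    return dict(grouped)


# Data copied from the example output above.
data = [
    {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                    'title': 'The house I found.',
                    'work': 'http://www.example.com/images/house.jpeg'},
     'type': 'http://n.whatwg.org/work'},
    {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                    'title': 'The mailbox.',
                    'work': 'http://www.example.com/images/mailbox.jpeg'},
     'type': 'http://n.whatwg.org/work'},
]

works = group_by_type(data)['http://n.whatwg.org/work']
print([work['title'] for work in works])
```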
RDFa extraction
+++++++++++++++
::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)
    from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available

    INFO:rdflib:RDFLib Version: 4.2.1
    /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
    'parsers will not be available.')

    html = """
    ...
    ... ...
    ...
    ...
    ...
    ...
    The trouble with Bob
    ... ...
    ...
    Alice
    ...
    ...
    The trouble with Bob is that he takes much better photos than I do:
    ...
    ... ...
    ...
    ...
    ...
    ... """
    rdfae = RDFaExtractor()
    pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))

    [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
      '@type': ['http://schema.org/BlogPosting'],
      'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
      'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
      'http://schema.org/articleBody': [{'@value': '\n'
                                                   ' The trouble with Bob '
                                                   'is that he takes much better '
                                                   'photos than I do:\n'
                                                   ' '}],
      'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.
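
In the expanded form shown above, each predicate maps to a list of small dicts that hold either an '@value' (a literal) or an '@id' (a reference to another node). The sketch below reads the plain values for one predicate; ``node_values`` is a hypothetical helper, and the node is a shortened copy of the example output.

```python
def node_values(node, predicate):
    """Return the plain values stored under one predicate of an expanded node."""
    values = []
    for entry in node.get(predicate, []):
        # Literals carry '@value'; references to other nodes carry '@id'.
        if '@value' in entry:
            values.append(entry['@value'])
        elif '@id' in entry:
            values.append(entry['@id'])
    return values


# Shortened copy of the node from the example output above.
node = {
    '@id': 'http://www.example.com/alice/posts/trouble_with_bob',
    '@type': ['http://schema.org/BlogPosting'],
    'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
    'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
}

print(node_values(node, 'http://purl.org/dc/terms/title'))
print(node_values(node, 'http://purl.org/dc/terms/creator'))
```

A predicate that is missing from the node yields an empty list rather than an error.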
Open Graph extraction
++++++++++++++++++++++++++++++
Command Line Tool
-----------------
extruct provides a command line tool that lets you fetch a page and
extract its metadata directly from the command line.
Dependencies
++++++++++++
The command line tool depends on requests, which is not installed by default
when you install extruct. In order to use the command line tool, you can
install extruct with the cli extra requirements::

    pip install 'extruct[cli]'

Usage
+++++
::

    extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD, RDFa, Open Graph
and Microformat metadata to stdout.
Supported Parameters
++++++++++++++++++++
By default, the command line tool will try to extract all the supported
metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph
and Microformat). If you want to restrict the output to just one or a subset of
those, you can pass their names as a list through the 'syntaxes' argument.
For example, this command extracts only Microdata and JSON-LD metadata from
"http://example.com"::