Fast HTML to text parser (article readability tool) with Python 3 support

.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability
project <http://lab.arc90.com/experiments/readability/>`__.


Installation is easy with ``pip``; just run:

.. code-block:: bash

    $ pip install readability-lxml


.. code-block:: python

    >>> import requests
    >>> from readability import Document
    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'
    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n domain in examples without prior coordination or asking for permission.</p>
    \n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""

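
As the output above shows, ``doc.summary()`` returns cleaned HTML rather than
plain text. If plain text is needed, the tags can be stripped afterwards. The
following is a minimal sketch using only the standard library. The
``TextExtractor`` class and ``html_to_text`` function are hypothetical helpers
written for this illustration, not part of ``readability``, and
``summary_html`` is a shortened stand-in for the summary string shown above.

.. code-block:: python

    from html.parser import HTMLParser


    class TextExtractor(HTMLParser):
        """Hypothetical helper (not part of readability): collects text nodes."""

        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            # Keep only non-empty text, with surrounding whitespace trimmed.
            text = data.strip()
            if text:
                self.chunks.append(text)


    def html_to_text(html):
        """Return the text content of an HTML string, one text node per line."""
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        return "\n".join(parser.chunks)


    # Shortened stand-in for the string returned by doc.summary() above.
    summary_html = (
        '<html><body><div><body id="readabilityBody">\n<div>\n'
        '<h1>Example Domain</h1>\n'
        '<p><a href="http://www.iana.org/domains/example">More information...</a></p>\n'
        '</div>\n</body>\n</div></body></html>'
    )

    print(html_to_text(summary_html))

This approach puts each text node on its own line, so inline markup such as
links inside a sentence will split that sentence across lines. Joining with a
space instead of a newline may suit running prose better.
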

This code is under the `Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.


-  `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
-  `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
-  `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml