Heuristic-based boilerplate removal tool

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

.. image:: https://api.travis-ci.org/miso-belica/jusText.png?branch=master
   :target: https://travis-ci.org/miso-belica/jusText

jusText is a tool for removing boilerplate content, such as navigation
links, headers, and footers, from HTML pages. It is
`designed <doc/algorithm.rst>`_ to preserve
mainly text containing full sentences and is therefore well suited for
creating linguistic resources such as Web corpora. You can
`try it online <http://nlp.fi.muni.cz/projects/justext/>`_.

This is a fork of the original (currently unmaintained) jusText_ code hosted on Google Code.

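To give a rough feel for why a heuristic can favour text made of full sentences, here is a minimal sketch based on stop-word density. It is only an illustration and not jusText's code: the function names, the tiny stoplist, and the ``0.3`` threshold are all assumptions made for this example, and the actual design is the one described in ``doc/algorithm.rst``.

```python
# Illustrative sketch only: NOT jusText's implementation.
# The stoplist and the threshold below are assumptions made for this example.
STOPWORDS = frozenset([
    "the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
    "for", "on", "with", "as", "are", "this", "was", "be", "by", "from",
])


def stopword_density(text, stopwords=STOPWORDS):
    """Return the fraction of whitespace-separated words that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    # Strip surrounding punctuation so that "reader," still matches "reader".
    hits = sum(1 for word in words if word.strip(".,;:!?\"'()") in stopwords)
    return float(hits) / len(words)


def looks_like_sentence_text(text, threshold=0.3):
    """Guess whether text consists of full sentences (hypothetical threshold)."""
    return stopword_density(text) >= threshold


nav = "Home | Products | Contact | Login"
body = ("This is the part of the page that is written for a reader "
        "and it is made of full sentences.")

print(looks_like_sentence_text(nav))   # False: no stop words at all
print(looks_like_sentence_text(body))  # True: 13 of 20 words are stop words
```

Navigation bars and footers tend to be short runs of nouns and links, so they score low on a measure like this, while running prose scores high.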
Adaptations of the algorithm to other languages:

- `C++ <https://github.com/endredy/jusText>`_
- `Go <https://github.com/JalfResi/justext>`_
- `Java <https://github.com/wizenoze/justext-java>`_

Some libraries using jusText:

- `chirp <https://github.com/9b/chirp>`_
- `lazynlp <https://github.com/chiphuyen/lazynlp>`_
- `off-topic-memento-toolkit <https://github.com/oduwsdl/off-topic-memento-toolkit>`_
- `pears <https://github.com/PeARSearch/PeARS-orchard>`_
- `readability calculator <https://github.com/joaopalotti/readability_calculator>`_
- `sky <https://github.com/kootenpv/sky>`_

Some currently (Jan 2020) maintained alternatives:

- `dragnet <https://github.com/dragnet-org/dragnet>`_
- `html2text <https://github.com/Alir3z4/html2text>`_
- `inscriptis <https://github.com/weblyzard/inscriptis>`_
- `newspaper <https://github.com/codelucas/newspaper>`_
- `python-readability <https://github.com/buriy/python-readability>`_
- `trafilatura <https://github.com/adbar/trafilatura>`_

Make sure you have Python_ 2.7+/3.5+ and `pip <https://pip.pypa.io/en/stable/>`_
(`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`_,
`Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`_) installed.
Run simply:

.. code-block:: bash

  $ [sudo] pip install justext

Dependencies::

  lxml (version depends on your Python version)

Usage from the command line:

.. code-block:: bash

  $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
  $ python -m justext -s English -o plain_text.txt english_page.html
  $ python -m justext --help # for more info

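Usage from Python:

.. code-block:: python

  import requests
  import justext

  # Download the page and classify its paragraphs using the English stoplist.
  response = requests.get("http://planet.python.org/")
  paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

  # Print only the paragraphs that were not classified as boilerplate.
  for paragraph in paragraphs:
      if not paragraph.is_boilerplate:
          print(paragraph.text)
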
Run tests via

.. code-block:: bash

  $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9

.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc
.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/
.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software has been developed at the `Natural Language Processing Centre`_ of
`Masaryk University in Brno`_ with financial support from PRESEMT_ and
`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.

.. :changelog:

- ``<br>`` tag.
- ``decode_html`` now respects parameter ``errors`` when falling to ``default_encoding`` `#9 <https://github.com/miso-belica/jusText/issues/9>`_.
- ``xpath`` attribute of ``<p>`` tag `#5 <https://github.com/miso-belica/jusText/pull/5>`_.
- ``justext.paragraph.Paragraph``.
- ``python -m justext``.