Project: justext

Heuristic-based boilerplate removal tool

Project Details

Latest version
3.0.0
Home Page
https://github.com/miso-belica/jusText
PyPI Page
https://pypi.org/project/justext/

Project Popularity

PageRank
0.0023341769211192697
Number of downloads
128414

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

jusText

.. image:: https://api.travis-ci.org/miso-belica/jusText.png?branch=master
   :target: https://travis-ci.org/miso-belica/jusText

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is `designed <doc/algorithm.rst>`_ to preserve mainly text containing full sentences, and it is therefore well suited for creating linguistic resources such as Web corpora. You can `try it online <http://nlp.fi.muni.cz/projects/justext/>`_.
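The idea of keeping text that reads like full sentences can be illustrated with a minimal sketch. This is not the jusText implementation: the tiny stop list, the helper names ``stopword_density`` and ``looks_like_content``, and both thresholds are assumptions chosen for illustration. The sketch relies only on the observation that running prose tends to be longer and richer in common function words than menus or footers.

```python
import re

# Tiny illustrative stop list (an assumption); a real one holds hundreds of words.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "for", "in", "is", "it",
    "of", "on", "that", "the", "to", "was", "with",
})


def stopword_density(text, stopwords=STOPWORDS):
    """Return the fraction of words in `text` that are stop words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


def looks_like_content(text, min_length=70, min_density=0.3):
    """Hypothetical rule: the block is long enough and rich in stop words."""
    return len(text) >= min_length and stopword_density(text) >= min_density


blocks = [
    "Home | About | Contact | Login",
    "The tool is designed to keep the text of full sentences, and it drops "
    "the short blocks that are typical for menus and footers on a page.",
]

for block in blocks:
    label = "content" if looks_like_content(block) else "boilerplate"
    print(label, "|", block[:40])
```

A real classifier would likely also need to look at link density and at neighbouring blocks, which this sketch deliberately leaves out.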

This is a fork of the original (currently unmaintained) jusText_ code hosted on Google Code.

Adaptations of the algorithm to other languages:

  • `C++ <https://github.com/endredy/jusText>`_
  • `Go <https://github.com/JalfResi/justext>`_
  • `Java <https://github.com/wizenoze/justext-java>`_

Some libraries using jusText:

  • `chirp <https://github.com/9b/chirp>`_
  • `lazynlp <https://github.com/chiphuyen/lazynlp>`_
  • `off-topic-memento-toolkit <https://github.com/oduwsdl/off-topic-memento-toolkit>`_
  • `pears <https://github.com/PeARSearch/PeARS-orchard>`_
  • `readability calculator <https://github.com/joaopalotti/readability_calculator>`_
  • `sky <https://github.com/kootenpv/sky>`_

Some currently (Jan 2020) maintained alternatives:

  • `dragnet <https://github.com/dragnet-org/dragnet>`_
  • `html2text <https://github.com/Alir3z4/html2text>`_
  • `inscriptis <https://github.com/weblyzard/inscriptis>`_
  • `newspaper <https://github.com/codelucas/newspaper>`_
  • `python-readability <https://github.com/buriy/python-readability>`_
  • `trafilatura <https://github.com/adbar/trafilatura>`_

Installation

Make sure you have Python_ 2.7+/3.5+ and `pip <https://pip.pypa.io/en/stable/>`_ (`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`__, `Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`__) installed. Then simply run:

.. code-block:: bash

   $ [sudo] pip install justext

Dependencies

::

   lxml (version depends on your Python version)

Usage

.. code-block:: bash

   $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
   $ python -m justext -s English -o plain_text.txt english_page.html
   $ python -m justext --help # for more info

Python API

.. code-block:: python

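   import requests
   import justext

   # Download the page and classify its paragraphs using the English stop list.
   response = requests.get("http://planet.python.org/")
   paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

   # Print only the paragraphs that were not classified as boilerplate.
   for paragraph in paragraphs:
       if not paragraph.is_boilerplate:
           print(paragraph.text)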

Testing

Run tests via

.. code-block:: bash

   $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9

Acknowledgements

.. _Natural Language Processing Centre: http://nlp.fi.muni.cz/en/nlpc
.. _Masaryk University in Brno: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _Lexical Computing Ltd.: http://lexicalcomputing.com/
.. _PhD research: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software has been developed at the `Natural Language Processing Centre`_ of `Masaryk University in Brno`_ with financial support from PRESEMT_ and `Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.

.. :changelog:

Changelog for jusText

3.0.0 (2021-10-21)

  • INCOMPATIBLE CHANGE: Dropped support for Python 3.4 and below.
  • BUG FIX: Don't join words separated only by a ``<br>`` tag.
  • BUG FIX: List available stop-lists alphabetically.

2.2.0 (2016-03-06)

  • INCOMPATIBLE CHANGE: Stop words are case insensitive.
  • INCOMPATIBLE CHANGE: Dropped support for Python 3.2.
  • BUG FIX: Preserve new lines from original text in paragraphs.

2.1.1 (2014-05-27)

  • BUG FIX: Function ``decode_html`` now respects the parameter ``errors`` when falling back to ``default_encoding`` `#9 <https://github.com/miso-belica/jusText/issues/9>`_.

2.1.0 (2014-01-25)

  • FEATURE: Added XPath selector to the paragraphs. The XPath selector is also available in the detailed output as the ``xpath`` attribute of the ``<p>`` tag `#5 <https://github.com/miso-belica/jusText/pull/5>`_.

2.0.0 (2013-08-26)

  • FEATURE: Added pluggable DOM preprocessor.
  • FEATURE: Added support for Python 3.2+.
  • INCOMPATIBLE CHANGE: Paragraphs are instances of ``justext.paragraph.Paragraph``.
  • INCOMPATIBLE CHANGE: Script ``justext`` removed in favour of the command ``python -m justext``.
  • FEATURE: It's possible to enter a URI as the input document in the CLI.
  • FEATURE: It is possible to pass unicode string directly.

1.2.0 (2011-08-08)

  • FEATURE: Character counts are used instead of word counts where possible, so that the algorithm works well in the language-independent mode (without a stoplist) for languages where counting words is not easy (Japanese, Chinese, Thai, etc.).
  • BUG FIX: More robust parsing of meta tags containing the information about used charset.
  • BUG FIX: Corrected decoding of HTML entities ``&#128;`` to ``&#159;``.

1.1.0 (2011-03-09)

  • First public release.