Fast and robust extraction of original and updated publication dates from URLs and web pages.

.. image:: https://img.shields.io/pypi/v/htmldate.svg
    :target: https://pypi.python.org/pypi/htmldate
    :alt: Python package

.. image:: https://img.shields.io/pypi/pyversions/htmldate.svg
    :target: https://pypi.python.org/pypi/htmldate
    :alt: Python versions

.. image:: https://readthedocs.org/projects/htmldate/badge/?version=latest
    :target: https://htmldate.readthedocs.org/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/codecov/c/github/adbar/htmldate.svg
    :target: https://codecov.io/gh/adbar/htmldate
    :alt: Code Coverage

.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
    :target: https://pepy.tech/project/htmldate
    :alt: Downloads

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
    :target: https://doi.org/10.21105/joss.02439
    :alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
    :target: https://github.com/psf/black
    :alt: Code style: black

|

.. image:: docs/htmldate-logo.png
    :alt: Logo as PNG image
    :align: center
    :width: 60%

|

Find original and updated publication dates of any web page. From the command line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.

|

.. image:: docs/htmldate-demo.gif
    :alt: Demo as GIF image
    :align: center
    :width: 80%
    :target: https://htmldate.readthedocs.org/

|

With Python:

.. code-block:: python

    >>> from htmldate import find_date
    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
    '2016-12-23'

On the command line:

.. code-block:: bash

    $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
    '2016-12-23'

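
The URL used in these examples carries a year and a month (``2016/12``) but no day, so a URL alone would not be enough to obtain ``2016-12-23``. As a rough illustration, here is a minimal standard-library sketch that only accepts a complete and valid date in a URL. The function name ``date_from_url`` and the regular expression are assumptions made for this example; they are not htmldate's own code.

.. code-block:: python

    import re
    from datetime import date
    from typing import Optional

    # Hypothetical pattern: year, month and day separated by "/", "_", "." or "-".
    URL_DATE = re.compile(
        r"(?<!\d)((?:19|20)\d{2})[/_.-](\d{1,2})[/_.-](\d{1,2})(?!\d)"
    )


    def date_from_url(url: str) -> Optional[str]:
        """Return an ISO 8601 date if the URL contains a complete, valid date."""
        match = URL_DATE.search(url)
        if match is None:
            return None
        year, month, day = (int(group) for group in match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:  # impossible dates such as month 13 or day 40
            return None


    # A URL with a full date yields a result
    print(date_from_url("https://example.org/2016/12/23/some-post.html"))
    # The URL from the examples has no day, so nothing is returned
    print(date_from_url("http://blog.python.org/2016/12/python-360-is-now-available.html"))

The first call prints ``2016-12-23`` and the second prints ``None``, which is the case where the page content has to be examined.
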

Dates are returned as strings, by default in `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_ format.

htmldate can examine markup and text. It provides the following ways to date an HTML document:

1. Markup in the header: common patterns identify relevant elements (e.g. ``link`` and ``meta`` elements), including `Open Graph protocol <http://ogp.me/>`_ attributes
2. HTML code: structural markers such as ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``)
3. Bare HTML content:

   - in ``fast`` mode the HTML page is cleaned and precise patterns are targeted
   - in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one

Finally, the output is validated and converted to the chosen format.

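
The cascade described above can be sketched with the standard library only. This is a simplified illustration under stated assumptions, not htmldate's implementation: the names ``DateCollector``, ``validate`` and ``sketch_find_date``, the set of ``meta`` keys, the date bounds, and the "most frequent date wins" rule are all hypothetical choices made for this example.

.. code-block:: python

    import re
    from collections import Counter
    from datetime import date, datetime
    from html.parser import HTMLParser
    from typing import List, Optional

    # Assumed attribute values for this sketch; real pages use many more variants.
    META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}
    TEXT_DATE = re.compile(r"\b((?:19|20)\d{2})-(\d{2})-(\d{2})\b")


    class DateCollector(HTMLParser):
        """Collect date strings from meta elements, time elements and text."""

        def __init__(self) -> None:
            super().__init__()
            self.header_dates: List[str] = []   # markup in the header
            self.element_dates: List[str] = []  # time elements in the HTML code
            self.text_dates: List[str] = []     # dates found in the text

        def handle_starttag(self, tag, attrs):
            attributes = dict(attrs)
            if tag == "meta":
                key = (attributes.get("property") or attributes.get("name") or "").lower()
                if key in META_KEYS and attributes.get("content"):
                    self.header_dates.append(attributes["content"])
            elif tag == "time" and attributes.get("datetime"):
                self.element_dates.append(attributes["datetime"])

        def handle_data(self, data):
            self.text_dates.extend("-".join(m.groups()) for m in TEXT_DATE.finditer(data))


    def validate(candidate: str, outputformat: str) -> Optional[str]:
        """Check that the first ten characters form a plausible date, then format it."""
        try:
            parsed = datetime.strptime(candidate[:10], "%Y-%m-%d").date()
        except ValueError:
            return None
        # Assumed plausibility bounds for this sketch
        if not date(1995, 1, 1) <= parsed <= date.today():
            return None
        return parsed.strftime(outputformat)


    def sketch_find_date(html: str, outputformat: str = "%Y-%m-%d") -> Optional[str]:
        collector = DateCollector()
        collector.feed(html)
        # Precise sources first: header markup, then time elements
        for candidate in collector.header_dates + collector.element_dates:
            result = validate(candidate, outputformat)
            if result is not None:
                return result
        # Fallback: collect every valid date in the text, keep the most frequent one
        valid = [c for c in collector.text_dates if validate(c, outputformat)]
        if not valid:
            return None
        best, _count = Counter(valid).most_common(1)[0]
        return validate(best, outputformat)


    page = """<html><head>
    <meta property="article:published_time" content="2016-12-23T10:00:00+00:00"/>
    </head><body><p>Updated 2017-01-05.</p></body></html>"""
    print(sketch_find_date(page))
    print(sketch_find_date("<p>Posted 2016-12-23, edited 2016-12-23 and 2017-01-05.</p>"))

Both calls print ``2016-12-23``: the first from the ``meta`` element, the second from the text fallback. The fallback only loosely mirrors the idea of collecting candidates and disambiguating among them; the actual heuristics are described in the documentation.
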

=============================== ========= ========= ========= ========= =======
Python Package                  Precision Recall    Accuracy  F-Score   Time
=============================== ========= ========= ========= ========= =======
articleDateExtractor 0.20       0.803     0.734     0.622     0.767     5x
date_guesser 2.1.4              0.781     0.600     0.514     0.679     18x
goose3 3.1.17                   0.869     0.532     0.493     0.660     15x
htmldate[all] 1.6.0 (fast)      0.883     0.924     0.823     0.903     1x
htmldate[all] 1.6.0 (extensive) 0.870     0.993     0.865     0.928     1.7x
newspaper3k 0.2.8               0.769     0.667     0.556     0.715     15x
news-please 1.5.35              0.801     0.768     0.645     0.784     34x
=============================== ========= ========= ========= ========= =======

For complete results and explanations see the `evaluation page <https://htmldate.readthedocs.io/en/latest/evaluation.html>`_.
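
The F-Score column appears consistent with the usual harmonic mean of precision and recall (F1). The short sketch below recomputes it for one row; the function name ``f_score`` is chosen for this example, and because the table shows rounded inputs, recomputed values can differ from the published ones in the last digit.

.. code-block:: python

    def f_score(precision: float, recall: float) -> float:
        """Harmonic mean of precision and recall (F1)."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)


    # Precision and recall as reported for htmldate 1.6.0 in fast mode
    print(round(f_score(0.883, 0.924), 3))

This prints ``0.903``, matching the table entry for the fast mode.
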

This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.6 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can be installed with ``pip`` (``pip3`` where applicable): ``pip install htmldate``, optionally ``pip install htmldate[speed]``.

For more details on installation, Python and CLI usage, please refer to the documentation: `htmldate.readthedocs.io <https://htmldate.readthedocs.io/>`_

htmldate is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/htmldate/blob/master/LICENSE>`_. If you wish to redistribute this library but feel bound by the license conditions, please try interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, `multi-licensing <https://en.wikipedia.org/wiki/Multi-licensing>`_ with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting me <https://github.com/adbar/htmldate#author>`_.

See also `GPL and free software licensing: What's in it for business? <https://www.techrepublic.com/blog/cio-insights/gpl-and-free-software-licensing-whats-in-it-for-business/>`_

This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provides a reliable way to find out when a document was published or modified. For more information:

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
    :target: https://doi.org/10.21105/joss.02439
    :alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
    :target: https://doi.org/10.5281/zenodo.3459599
    :alt: Zenodo archive DOI: 10.5281/zenodo.3459599

.. code-block:: shell

    @article{barbaresi-2020-htmldate,
      title = {{htmldate: A Python package to extract publication dates from web pages}},
      author = "Barbaresi, Adrien",
      journal = "Journal of Open Source Software",
      volume = 5,
      number = 51,
      pages = 2439,
      url = {https://doi.org/10.21105/joss.02439},
      publisher = {The Open Journal},
      year = 2020,
    }

-  Barbaresi, A. "`htmldate: A Python package to extract publication dates from web pages <https://doi.org/10.21105/joss.02439>`_", Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
-  Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
-  Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.

You can contact me via my `contact page <https://adrien.barbaresi.eu/>`_ or `GitHub <https://github.com/adbar>`_.

`Contributions <https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md>`_ are welcome!

Feel free to file issues on the `dedicated page <https://github.com/adbar/htmldate/issues>`_. Thanks to the `contributors <https://github.com/adbar/htmldate/graphs/contributors>`_ who submitted features and bugfixes!

Kudos to the following software libraries:

-  `lxml <http://lxml.de/>`_, `dateparser <https://github.com/scrapinghub/dateparser>`_
-  The `python-goose <https://github.com/grangier/python-goose>`_, `metascraper <https://github.com/ianstormtaylor/metascraper>`_, `newspaper <https://github.com/codelucas/newspaper>`_ and `articleDateExtractor <https://github.com/Webhose/article-date-extractor>`_ libraries. This module extends their coverage and robustness significantly.