Project: advertools

Productivity and analysis tools for online marketing

Project Details

Latest version
0.13.5
Home Page
https://github.com/eliasdabbas/advertools
PyPI Page
https://pypi.org/project/advertools/

Project Popularity

PageRank
0.00438467040826952
Number of downloads
46005

.. image:: https://img.shields.io/pypi/v/advertools.svg
   :target: https://pypi.python.org/pypi/advertools

.. image:: https://readthedocs.org/projects/advertools/badge/?version=latest
   :target: https://advertools.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: http://pepy.tech/badge/advertools
   :target: http://pepy.tech/project/advertools

Announcing the `Data Science with Python for SEO course <https://bit.ly/dsseo-course>`_: a cohort-based, interactive, live-coding course.

advertools: productivity & analysis tools to scale your online marketing

| A digital marketer is a data scientist.
| Your job is to manage, manipulate, visualize, communicate, understand, and make decisions based on data.

You might be doing basic stuff, like copying and pasting text on spreadsheets, you might be running large-scale automated platforms with sophisticated algorithms, or somewhere in between. In any case, your job is all about working with data.

As a data scientist, you don't spend most of your time producing cool visualizations or finding great insights. The majority of your time is spent wrangling URLs, figuring out how to stitch together two tables, hoping the dates won't silently break, or trying to generate the next 124,538 keywords for an upcoming campaign, by the end of the week!

advertools is a Python package that can hopefully make that part of your job a little easier.

Installation

.. code:: bash

python3 -m pip install advertools
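
The examples below assume the conventional import alias used in the advertools documentation:

.. code:: python

    import advertools as adv

    print(adv.__version__)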

SEM Campaigns

The most important thing to achieve in SEM is a proper mapping between the three main elements of a search campaign:

Keywords (the intention) -> Ads (your promise) -> Landing Pages (your delivery of the promise)

Once you have this mapping done, you can focus on management and analysis. More importantly, once you know you can set this up easily, you can focus on more strategic issues. In practical terms, you need two main tables to get started:

  • Keywords: You can `generate keywords <https://advertools.readthedocs.io/en/master/advertools.kw_generate.html>`_ (note I didn't say research) with the kw_generate function; a short sketch follows this list.

  • Ads: There are two approaches that you can use:

    • Bottom-up: You can create text ads for a large number of products by simply replacing product names, providing a placeholder in case your text is too long. Check out the `ad_create <https://advertools.readthedocs.io/en/master/advertools.ad_create.html>`_ function for more details.
    • Top-down: Sometimes you have a long description text that you want to split into headlines, descriptions, and whatever other slots you need. `ad_from_string <https://advertools.readthedocs.io/en/master/advertools.ad_from_string.html>`_ helps you accomplish that.
  • Tutorials and additional resources

    • `Get started with Data Science for Digital Marketing and SEO/SEM <https://www.oncrawl.com/technical-seo/data-science-seo-digital-marketing-guide-beginners/>`_
    • `Setting up a full SEM campaign <https://www.datacamp.com/community/tutorials/sem-data-science>`_ tutorial on DataCamp's website
    • Project to practice `generating SEM keywords with Python <https://www.datacamp.com/projects/400>`_ on DataCamp
    • `Setting up SEM campaigns on a large scale <https://www.semrush.com/blog/setting-up-search-engine-marketing-campaigns-on-large-scale/>`_ tutorial on SEMrush
    • `Visual tool to generate keywords <https://www.dashboardom.com/advertools>`_ online, based on the kw_generate function
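
As a minimal sketch of the bottom-up workflow (the product names, intent words, and ad template below are invented for illustration):

.. code:: python

    import advertools as adv

    # generate keyword combinations, with match types, for two hypothetical products
    kw_df = adv.kw_generate(
        products=['corolla', 'camry'],
        words=['buy', 'price', 'used'],
        match_types=['Exact', 'Phrase'],
    )

    # create ad headlines by inserting each product name into a template,
    # falling back to a generic word when the result exceeds max_len
    headlines = adv.ad_create(
        template='Great Deals on {}',
        replacements=['Corolla', 'Camry'],
        fallback='Cars',
        max_len=30,
    )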

SEO

Probably the most comprehensive online marketing area, one that is both technical (crawling, indexing, rendering, redirects, etc.) and non-technical (content creation, link building, outreach, etc.). Here are some tools that can help with your SEO (a usage sketch follows this list):

  • `SEO crawler <https://advertools.readthedocs.io/en/master/advertools.spider.html>`_: A generic SEO crawler that can be customized, built with Scrapy, with several features:

    • Standard SEO elements extracted by default (title, header tags, body text, status code, response and request headers, etc.)
    • CSS and XPath selectors: You probably have more specific needs in mind, so you can easily pass any selectors to be extracted in addition to the standard elements being extracted
    • Custom settings: full access to Scrapy's settings, allowing you to better control the crawling behavior (set custom headers or user agents, stop the spider after x pages, seconds, or megabytes, save crawl logs, run jobs at intervals where you can stop and resume your crawls, which is ideal for large crawls or for continuous monitoring, and many more options)
    • Following links: option to only crawl a set of specified pages or to follow and discover all pages through links
  • `robots.txt downloader <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html#advertools.sitemaps.robotstxt_to_df>`_: A simple downloader of robots.txt files into a DataFrame, so you can keep track of changes across crawls, if any, and check the rules, sitemaps, etc.

  • `XML Sitemaps downloader / parser <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html>`_: An essential part of any SEO analysis is to check XML sitemaps. This is a simple function with which you can download one or more sitemaps (by providing the URL of a robots.txt file, a sitemap file, or a sitemap index).

  • `SERP importer and parser for Google & YouTube <https://advertools.readthedocs.io/en/master/advertools.serp.html>`_: Connect to Google's API and get the search data you want. Multiple search parameters are supported, all in one function call, and all results are returned in a DataFrame.

  • Tutorials and additional resources

    • A visual tool built with the serp_goog function to get `SERP rankings on Google <https://www.dashboardom.com/google-serp>`_
    • A tutorial on `analyzing SERPs on a large scale with Python <https://www.semrush.com/blog/analyzing-search-engine-results-pages/>`_ on SEMrush
    • `SERP datasets on Kaggle <https://www.kaggle.com/eliasdabbas/datasets?search=engine>`_ for practicing on different industries and use cases
    • `SERP notebooks on Kaggle <https://www.kaggle.com/eliasdabbas/notebooks?sortBy=voteCount&group=everyone&pageSize=20&userId=484496&tagIds=1220>`_ with some examples of how you might tackle such data
    • `Content Analysis with XML Sitemaps and Python <https://www.semrush.com/blog/content-analysis-xml-sitemaps-python/>`_
    • XML dataset examples: `news sites <https://www.kaggle.com/eliasdabbas/news-sitemaps>`_, `Turkish news sites <https://www.kaggle.com/eliasdabbas/turk-haber-sitelerinin-site-haritalari>`_, `Bloomberg news <https://www.kaggle.com/eliasdabbas/bloomberg-business-articles-urls>`_
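
A short sketch tying these together (example.com, the output file name, and the CSE credentials are placeholders you would replace with your own):

.. code:: python

    import advertools as adv
    import pandas as pd

    # crawl a site and discover pages by following links;
    # the output file must use the .jl (jsonlines) extension
    adv.crawl('https://example.com', 'example_crawl.jl', follow_links=True)
    crawl_df = pd.read_json('example_crawl.jl', lines=True)

    # robots.txt rules, and all sitemaps referenced in it, as DataFrames
    robots_df = adv.robotstxt_to_df('https://example.com/robots.txt')
    sitemap_df = adv.sitemap_to_df('https://example.com/robots.txt')

    # multiple SERP queries in one call; cx and key are your own credentials
    serp_df = adv.serp_goog(q=['seo crawler', 'xml sitemap'],
                            cx='YOUR_CX_ID', key='YOUR_API_KEY')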

Text & Content Analysis (for SEO & Social Media)

URLs, page titles, tweets, video descriptions, comments, and hashtags are some examples of the types of text we deal with. advertools provides a few options for text analysis:

  • `Word frequency <https://advertools.readthedocs.io/en/master/advertools.word_frequency.html>`_: Counting words in a text list is one of the most basic and important tasks in text mining. What is also important is counting those words while taking into consideration their relative weights in the dataset. word_frequency does just that (a sketch follows this list).

  • `URL Analysis <https://advertools.readthedocs.io/en/master/advertools.urlytics.html>`_: We all have to handle many thousands of URLs in reports, crawls, social media extracts, XML sitemaps, and so on. url_to_df converts your URLs into easily readable DataFrames.

  • `Emoji <https://advertools.readthedocs.io/en/master/advertools.emoji.html>`_: Produced with one click, extremely expressive, highly diverse (3k+ emoji), and very popular, emoji make it important to capture what people are trying to communicate with them. You can extract emoji and get their names, groups, and sub-groups. The full emoji database is also available for convenience, as well as an emoji_search function in case you want some ideas for your next social media post or any other kind of communication.

  • `extract_ functions <https://advertools.readthedocs.io/en/master/advertools.extract.html>`_: The text that we deal with contains many elements and entities that have their own special meaning and usage. There is a group of convenience functions to help in extracting, and getting basic statistics about, structured entities in text: emoji, hashtags, mentions, currency, numbers, URLs, questions, and more. You can also provide a special regex for your own needs.

  • `Stopwords <https://advertools.readthedocs.io/en/master/advertools.stopwords.html>`_: A list of stopwords in forty different languages to help in text analysis.

  • `Tutorial on DataCamp for creating the word_frequency function and explaining the importance of the difference between absolute and weighted word frequency <https://www.datacamp.com/community/tutorials/absolute-weighted-word-frequency>`_

  • `Text Analysis for Online Marketers <https://www.semrush.com/blog/text-analysis-for-online-marketers/>`_: An introductory article on SEMrush
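
A quick sketch of these functions together (the two posts and the weights below are invented for illustration):

.. code:: python

    import advertools as adv

    posts = ['Buy our new #shoes 👟', 'New #shoes and #bags arrived 😍😍']

    # absolute and weighted word frequency (weights here are hypothetical impressions)
    freq_df = adv.word_frequency(posts, num_list=[200, 100])

    # split URLs into their components, one row per URL
    urls_df = adv.url_to_df(['https://example.com/shop/shoes?color=black'])

    # extract_ functions return a summary dict: the entities plus overview statistics
    hashtag_summary = adv.extract_hashtags(posts)
    emoji_summary = adv.extract_emoji(posts)
    print(hashtag_summary['top_hashtags'])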

Social Media

In addition to the text analysis techniques provided, you can also connect to the Twitter and YouTube data APIs. The main benefits of using advertools for this (a short sketch follows this list):

  • Handles pagination and request limits: typically, every API limits the number of results it returns per request. When you need more than that limit, which you typically do, you have to handle pagination. This is handled by default.

  • DataFrame results: APIs send back data in formats that need to be parsed and cleaned before you can start your analysis. This is also handled automatically.

  • Multiple requests: in YouTube's case you might want to request data for the same query across several countries, languages, channels, etc. You can specify them all in one request and get the product of all the requests in one response.

  • Tutorials and additional resources

  • A visual tool to check `what is trending on Twitter <https://www.dashboardom.com/trending-twitter>`_ for all available locations

  • A `Twitter data analysis dashboard <https://www.dashboardom.com/twitterdash>`_ with many options

  • How to use the `Twitter data API with Python <https://www.kaggle.com/eliasdabbas/twitter-in-a-dataframe>`_

  • `Extracting entities from social media posts <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>`_ tutorial on Kaggle

  • `Analyzing 131k tweets <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>`_ by European football clubs, tutorial on Kaggle

  • An overview of the `YouTube data API with Python <https://www.kaggle.com/eliasdabbas/youtube-data-api>`_
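
A sketch of connecting to both APIs (all credentials are placeholders; as noted in the change log, only app_key and app_secret are required for app-based Twitter authentication, and the YouTube parameter names below assume the standard YouTube Data API naming):

.. code:: python

    import advertools as adv

    # authenticate once per session
    adv.twitter.set_auth_params(app_key='YOUR_APP_KEY',
                                app_secret='YOUR_APP_SECRET')

    # pagination is handled for you: request 500 tweets, get one DataFrame
    tweets_df = adv.twitter.search(q='#python', count=500)

    # one YouTube call can combine several parameter values; the product
    # of all combinations is requested and merged into one DataFrame
    videos_df = adv.youtube.search(key='YOUR_GOOGLE_API_KEY',
                                   q='data science',
                                   regionCode=['us', 'ca'],
                                   part='snippet')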

Conventions

Function names mostly start with the object you are working on, so you can use autocomplete to discover other options:

| kw_: for keywords-related functions
| ad_: for ad-related functions
| url_: URL tracking and generation
| extract_: for extracting entities from social media posts (mentions, hashtags, emoji, etc.)
| emoji_: emoji-related functions and objects
| twitter: a module for querying the Twitter API and getting results in a DataFrame
| youtube: a module for querying the YouTube Data API and getting results in a DataFrame
| serp_: get search engine results pages in a DataFrame; currently available: Google and YouTube
| crawl: a function you will probably use a lot if you do SEO
| *_to_df: a set of convenience functions for converting to DataFrames (log files, XML sitemaps, robots.txt files, and lists of URLs)
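
Because the prefixes are consistent, a quick scan of the package namespace shows what's available; for example:

.. code:: python

    import advertools as adv

    # all extraction functions, found via the naming convention
    print([name for name in dir(adv) if name.startswith('extract_')])

    # all *_to_df converters
    print([name for name in dir(adv) if name.endswith('_to_df')])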

Change Log - advertools
=======================

0.13.5 (2023-08-22)

  • Added

    • Initial experimental functionality for crawl_images.
  • Changed

    • Enable autothrottling by default for crawl_headers.

0.13.4 (2023-07-26)

  • Fixed
    • Make img attributes consistent in length, and support all attributes.

0.13.3 (2023-06-27)

  • Changed

    • Allow optional trailing space in log files (contributed by @andypayne)
  • Fixed

    • Replace newlines with spaces while parsing JSON-LD which was causing errors in some cases.

0.13.2 (2022-09-30)

  • Added

    • Crawling recipe for how to use the DEFAULT_REQUEST_HEADERS to change the default headers.
  • Changed

    • Split long lists of URLs while crawling, regardless of the follow_links parameter
  • Fixed

    • Clarify that while authenticating for Twitter only app_key and app_secret are required, with the option to provide oauth_token and oauth_token_secret if/when needed.

0.13.1 (2022-05-11)

  • Added

    • Command line interface with most functions
    • Make documentation interactive for most pages using thebe-sphinx
  • Changed

    • Use np.nan wherever there are missing values in url_to_df
  • Fixed

    • Don't remove double quotes from etags when downloading XML sitemaps
    • Replace instances of the deprecated pd.DataFrame.append with pd.concat.
    • Replace empty values with np.nan for the size column in logs_to_df

0.13.0 (2022-02-10)

  • Added

    • New function crawl_headers: A crawler that only makes HEAD requests to a known list of URLs.
    • New function reverse_dns_lookup: A way to get host information for a large list of IP addresses concurrently.
    • New options for crawling: exclude_url_params, include_url_params, exclude_url_regex, and include_url_regex for controlling which links to follow while crawling.
  • Fixed

    • Any custom_settings options given to the crawl function that were defined using a dictionary can now be set without issues. There was an issue if those options were not strings.
  • Changed

    • The skip_url_params option was removed and replaced with the more versatile exclude_url_params, which accepts either True or a list of URL parameters to exclude while following links.

0.12.3 (2021-11-27)

  • Fixed
    • Crawler stops when provided with bad URLs in list mode.

0.12.0,1,2 (2021-11-27)

  • Added

    • New function logs_to_df: Convert a log file of any non-JSON format into a pandas DataFrame and save it to a parquet file. This also compresses the file to a much smaller size.
    • Crawler extracts all available img attributes: 'alt', 'crossorigin', 'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes', 'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes like style and draggable).
    • New parameter for the crawl function skip_url_params: Defaults to False, consistent with previous behavior, with the ability to not follow/crawl links containing any URL parameters.
    • New column for url_to_df "last_dir": Extract the value in the last directory for each of the URLs.
  • Changed

    • Query parameter columns in url_to_df DataFrame are now sorted by how full the columns are (the percentage of values that are not NA)

0.11.1 (2021-04-09)

  • Added

    • The nofollow attribute for nav, header, and footer links.
  • Fixed

    • Timeout error while downloading robots.txt files.
    • Make extracting nav, header, and footer links consistent with all links.

0.11.0 (2021-03-31)

  • Added

    • New parameter recursive for sitemap_to_df to control whether or not to get all sub sitemaps (default), or to only get the current (sitemapindex) one.
    • New columns for sitemap_to_df: sitemap_size_mb (1 MB = 1,024x1,024 bytes), and sitemap_last_modified and etag (if available).
    • Option to request multiple robots.txt files with robotstxt_to_df.
    • Option to save downloaded robots DataFrame(s) to a file with robotstxt_to_df using the new parameter output_file.
    • Two new columns for robotstxt_to_df: robotstxt_last_modified and etag (if available).
    • Raise ValueError in crawl if css_selectors or xpath_selectors contain any of the default crawl column headers
    • New XPath code recipes for custom extraction.
    • New function crawllogs_to_df which converts crawl logs to a DataFrame provided they were saved while using the crawl function.
    • New columns in crawl: viewport, charset, all h headings (whichever is available), nav, header and footer links and text, if available.
    • Crawl errors don't stop crawling anymore, and the error message is included in the output file under a new errors and/or jsonld_errors column(s).
    • In case of having JSON-LD errors, errors are reported in their respective column, and the remainder of the page is scraped.
  • Changed

    • Removed column prefix resp_meta_ from columns containing it
    • Redirect URLs and reasons are separated by '@@' for consistency with other multiple-value columns
    • Links extracted while crawling are not unique any more (all links are extracted).
    • Emoji data updated with v13.1.
    • Heading tags are scraped even if they are empty, e.g. <h2></h2>.
    • Default user agent for crawling is now advertools/VERSION.
  • Fixed

    • Handle sitemap index files that contain links to themselves, with an error message included in the final DataFrame
    • Error in robots.txt files caused by comments preceded by whitespace
    • Zipped robots.txt files causing a parsing issue
    • Crawl issues on some Linux systems when providing a long list of URLs
  • Removed

    • Columns from the crawl output: url_redirected_to, links_fragment

0.10.7 (2020-09-18)

  • Added

    • New function knowledge_graph for querying Google's API
    • Faster sitemap_to_df with threads
    • New parameter max_workers for sitemap_to_df to determine how fast it could go
    • New parameter capitalize_adgroups for kw_generate to determine whether or not to keep ad groups as is, or set them to title case (the default)
  • Fixed

    • Remove restrictions on the number of URLs provided to crawl, assuming follow_links is set to False (list mode)
    • JSON-LD issue breaking crawls when it's invalid (now skipped)
  • Removed

    • Deprecate the youtube.guide_categories_list (no longer supported by the API)

0.10.6 (2020-06-30)

  • Added
    • JSON-LD support in crawling. If available on a page, JSON-LD items will have special columns, and multiple JSON-LD snippets will be numbered for easy filtering
  • Changed
    • Stricter parsing for rel attributes, making sure they are in link elements as well
    • Date column names for robotstxt_to_df and sitemap_to_df unified as "download_date"
    • Numbering OG, Twitter, and JSON-LD where multiple elements are present in the same page, follows a unified approach: no numbering for the first element, and numbers start with "1" from the second element on. "element", "element_1", "element_2" etc.

0.10.5 (2020-06-14)

  • Added

    • New features for the crawl function:
      • Extract canonical tags if available
      • Extract alternate href and hreflang tags if available
      • Open Graph data "og:title", "og:type", "og:image", etc.
      • Twitter cards data "twitter:site", "twitter:title", etc.
  • Fixed

    • Minor fixes to robotstxt_to_df:
      • Allow whitespace in fields
      • Allow case-insensitive fields
  • Changed

    • crawl now only supports output_file with the extension ".jl"
    • word_frequency drops wtd_freq and rel_value columns if num_list is not provided

0.10.4 (2020-06-07)

  • Added
    • New function url_to_df, splitting URLs into their components and returning a DataFrame
    • Slight speed up for robotstxt_test

0.10.3 (2020-06-03)

  • Added

    • New function robotstxt_test, testing URLs and whether they can be fetched by certain user-agents
  • Changed

    • Documentation main page relayout, grouping of topics, & sidebar captions
    • Various documentation clarifications and new tests

0.10.2 (2020-05-25)

  • Added

    • User-Agent info to requests getting sitemaps and robotstxt files
    • CSS/XPath selectors support for the crawl function
    • Support for custom spider settings with a new parameter custom_settings
  • Fixed

    • Updated the supported search operators and values for CSE, which had changed

0.10.1 (2020-05-23)

  • Changed
    • Links are better handled, and new output columns are available: links_url, links_text, links_fragment, links_nofollow
    • body_text extraction is improved by containing <p>, <li>, and <span> elements

0.10.0 (2020-05-21)

  • Added
    • New function crawl for crawling and parsing websites
    • New function robotstxt_to_df for downloading robots.txt files into DataFrames

0.9.1 (2020-05-19)

  • Added

    • Ability to specify robots.txt file for sitemap_to_df
    • Ability to retrieve any kind of sitemap (news, video, or images)
    • Errors column added to the returned DataFrame if any errors occur
    • A new sitemap_downloaded column showing datetime of getting the sitemap
  • Fixed

    • Logging issue causing sitemap_to_df to log the same action twice
    • Issue preventing URLs not ending with xml or gz from being retrieved
    • Correct sitemap URL showing in the sitemap column

0.9.0 (2020-04-03)

  • Added
    • New function sitemap_to_df imports an XML sitemap into a DataFrame

0.8.1 (2020-02-08)

  • Changed
    • Column query_time is now named queryTime in the youtube functions
    • Handle json_normalize import from pandas based on pandas version

0.8.0 (2020-02-02)

  • Added

    • New module youtube connecting to all GET requests in the API
    • extract_numbers new function
    • emoji_search new function
    • emoji_df new variable containing all emoji as a DataFrame
  • Changed

    • Emoji database updated to v13.0
    • serp_goog with expanded pagemap and metadata
  • Fixed

    • serp_goog errors, some parameters not appearing in result df
    • extract_numbers issue when providing dash as a separator in the middle

0.7.3 (2019-04-17)

  • Added
    • New function extract_exclamations very similar to extract_questions
    • New function extract_urls, also counts top domains and top TLDs
    • New keys to extract_emoji; top_emoji_categories & top_emoji_sub_categories
    • Groups and sub-groups to emoji db

0.7.2 (2019-03-29)

  • Changed
    • Emoji regex updated
    • Simpler extraction of Spanish questions

0.7.1 (2019-03-26)

  • Fixed
    • Missing init imports.

0.7.0 (2019-03-26)

  • Added

    • New extract_ functions:

      • Generic extract used by all others, and takes arbitrary regex to extract text.
      • extract_questions to get question mark statistics, as well as the text of questions asked.
      • extract_currency shows text that has currency symbols in it, as well as surrounding text.
      • extract_intense_words gets statistics about, and extracts, words with any character repeated three or more times, indicating an intense feeling (positive or negative).
    • New function word_tokenize:

      • Used by word_frequency to get tokens of 1,2,3-word phrases (or more).
      • Split a list of text into tokens of a specified number of words each.
    • New stop-words from the spaCy package:

      current: Arabic, Azerbaijani, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

      new: Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian, Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu, Thai, Ukrainian, Urdu, Vietnamese

  • Changed

    • word_frequency takes new parameters:

      • regex: defaults to words, but can be changed to anything, '\S+' for example, to split words and keep punctuation.

      • sep: no longer used as an option; the above regex can be used instead.

      • num_list: now optional, and defaults to counts of 1 each if not provided. Useful for counting abs_freq only, if numeric data are not available.

      • phrase_len: the number of words in each split token. Defaults to 1 and can be set to 2 or higher. This helps in analyzing phrases as opposed to words.

    • Parameters supplied to serp_goog appear at the beginning of the result df

    • serp_youtube now contains nextPageToken to make paginating requests easier

0.6.0 (2019-02-11)

  • New function
    • extract_words to extract an arbitrary set of words
  • Minor updates
    • ad_from_string slots argument reflects new text ad lengths
    • hashtag regex improved

0.5.3 (2019-01-31)

  • Fix minor bugs
    • Handle Twitter search queries with 0 results in final request

0.5.2 (2018-12-01)

  • Fix minor bugs
    • Properly handle requests for >50 items (serp_youtube)
    • Rewrite test for _dict_product
    • Fix issue with string printing error msg

0.5.1 (2018-11-06)

  • Fix minor bugs
    • _dict_product implemented with lists
    • Missing keys in some YouTube responses

0.5.0 (2018-11-04)

  • New function serp_youtube
    • Query YouTube API for videos, channels, or playlists
    • Multiple queries (product of parameters) in one function call
    • Response looping and merging handled, one DataFrame
  • serp_goog returns Google's original error messages
  • Twitter responses with entities get the entities extracted, each in a separate column

0.4.1 (2018-10-13)

  • New function serp_goog (based on Google CSE)
    • Query Google search and get the result in a DataFrame
    • Make multiple queries / requests in one function call
    • All responses merged in one DataFrame
  • twitter.get_place_trends results are ranked by town and country

0.4.0 (2018-10-08)

  • New Twitter module based on twython
    • Wraps 20+ functions for getting Twitter API data
    • Gets data in a pandas DataFrame
    • Handles looping over requests higher than the defaults
  • Tested on Python 3.7

0.3.0 (2018-08-14)

  • Search engine marketing cheat sheet.
  • New set of extract_ functions with summary stats for each:
    • extract_hashtags
    • extract_mentions
    • extract_emoji
  • Tests and bug fixes

0.2.0 (2018-07-06)

  • New set of kw_ functions.
  • Full testing and coverage.

0.1.0 (2018-07-02)

  • First release on PyPI.
  • Functions available:
    • ad_create: create a text ad and place words in placeholders
    • ad_from_string: split a long string into shorter strings that fit into given slots
    • kw_generate: generate keywords from lists of products and words
    • url_utm_ga: generate a UTM-tagged URL for Google Analytics tracking
    • word_frequency: measure the absolute and weighted frequency of words in a collection of documents