Productivity and analysis tools for online marketing
.. image:: https://img.shields.io/pypi/v/advertools.svg
   :target: https://pypi.python.org/pypi/advertools

.. image:: https://readthedocs.org/projects/advertools/badge/?version=latest
   :target: https://advertools.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: http://pepy.tech/badge/advertools
   :target: http://pepy.tech/project/advertools
Announcing the `Data Science with Python for SEO course <https://bit.ly/dsseo-course>`_: a cohort-based, interactive, live-coding course.
advertools: productivity & analysis tools to scale your online marketing
========================================================================

| A digital marketer is a data scientist.
| Your job is to manage, manipulate, visualize, communicate, understand, and make decisions based on data.
You might be doing basic work, like copying and pasting text into spreadsheets; you might be running large-scale automated platforms with sophisticated algorithms; or you might be somewhere in between. In any case, your job is all about working with data.
As a data scientist you don't spend most of your time producing cool visualizations or finding great insights. The majority of your time is spent wrangling URLs, figuring out how to stitch two tables together, hoping the dates won't break without you knowing, or trying to generate the next 124,538 keywords for an upcoming campaign by the end of the week!
advertools is a Python package that can hopefully make that part of your job a little easier.
Installation
------------

.. code:: bash

    python3 -m pip install advertools
SEM Campaigns
-------------

The most important thing to achieve in SEM is a proper mapping between the three main elements of a search campaign:

Keywords (the intention) -> Ads (your promise) -> Landing Pages (your delivery of the promise)

Once you have this done, you can focus on management and analysis. More importantly, once you know you can set this up easily, you can focus on more strategic issues. In practical terms, you need two main tables to get started:
Keywords: You can `generate keywords <https://advertools.readthedocs.io/en/master/advertools.kw_generate.html>`_ (note I didn't say "research") with the kw_generate function.
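The core idea behind keyword generation is combinatorial: cross your products with words that signal intent. Here is a minimal sketch of that idea in plain Python — the function name and the match types are illustrative stand-ins, not the actual advertools kw_generate API:

```python
def generate_keywords(products, intent_words):
    """Pair every product with every intent word, in both word orders,
    tagging each generated keyword with an illustrative match type."""
    keywords = []
    for product in products:
        for word in intent_words:
            keywords.append((f"{product} {word}", "Exact"))
            keywords.append((f"{word} {product}", "Phrase"))
    return keywords

# One product and two intent words already yield four keyword variations;
# real product lists produce thousands of combinations the same way.
kws = generate_keywords(["running shoes"], ["buy", "best price"])
```

Scaling this to full campaigns (ad groups, campaign names, match types per keyword) is exactly the bookkeeping the real function handles for you.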
Ads: There are two approaches that you can use:

- Bottom-up: see the `ad_create <https://advertools.readthedocs.io/en/master/advertools.ad_create.html>`_ function for more details.
- Top-down: if you have a long piece of text that needs to be split into ad slots, `ad_from_string <https://advertools.readthedocs.io/en/master/advertools.ad_from_string.html>`_ helps you accomplish that.
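The top-down approach can be pictured as greedily filling fixed-length slots with whole words. This is an illustrative stand-in, not the actual ad_from_string implementation — the ad_from_text name and the default slot lengths are assumptions:

```python
def ad_from_text(text, slot_lengths=(30, 30, 90)):
    """Greedily fill fixed-length ad slots (e.g. two headlines and a
    description) with whole words, never exceeding a slot's character limit.
    Words that don't fit anywhere are returned as a final remainder slot."""
    words = text.split()
    slots, i = [], 0
    for limit in slot_lengths:
        current = ""
        # Add the next word only if it fits (plus a separating space).
        while i < len(words) and len(current) + len(words[i]) + (1 if current else 0) <= limit:
            current = f"{current} {words[i]}".strip()
            i += 1
        slots.append(current)
    slots.append(" ".join(words[i:]))  # whatever didn't fit
    return slots

slots = ad_from_text("one two three four", slot_lengths=(10, 10))
```

The greedy rule guarantees every slot respects its limit, at the cost of sometimes leaving a slot shorter than necessary.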
Tutorials and additional resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `Data Science for Digital Marketing and SEO/SEM <https://www.oncrawl.com/technical-seo/data-science-seo-digital-marketing-guide-beginners/>`_
- `Setting a full SEM campaign <https://www.datacamp.com/community/tutorials/sem-data-science>`_ for DataCamp's website tutorial
- `Generating SEM keywords with Python <https://www.datacamp.com/projects/400>`_ on DataCamp
- `Setting up SEM campaigns on a large scale <https://www.semrush.com/blog/setting-up-search-engine-marketing-campaigns-on-large-scale/>`_ tutorial on SEMrush
- A `tool to generate keywords <https://www.dashboardom.com/advertools>`_ online, based on the kw_generate function

SEO
---

Probably the most comprehensive online marketing area that is both technical (crawling, indexing, rendering, redirects, etc.) and non-technical (content creation, link building, outreach, etc.). Here are some tools that can help with your SEO:
`SEO crawler <https://advertools.readthedocs.io/en/master/advertools.spider.html>`_
    A generic SEO crawler that can be customized, built with Scrapy, and with several features.
`robots.txt downloader <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html#advertools.sitemaps.robotstxt_to_df>`_
    A simple downloader of robots.txt files into a DataFrame format, so you can keep track of changes across crawls (if any), and check the rules, sitemaps, etc.
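Conceptually, this means turning each robots.txt line into a directive/content record. A rough stand-in using only the standard library — robotstxt_to_records is a hypothetical name, and the real function returns a pandas DataFrame with extra download metadata:

```python
def robotstxt_to_records(robots_text):
    """Parse robots.txt content into a list of {'directive', 'content'}
    dicts, one per rule line — the long format a DataFrame would hold."""
    records = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue  # skip blank and malformed lines
        directive, _, content = line.partition(":")
        records.append({"directive": directive.strip(),
                        "content": content.strip()})
    return records

sample = """User-agent: *
Disallow: /search
Sitemap: https://example.com/sitemap.xml"""
records = robotstxt_to_records(sample)
```

Keeping one record per rule (rather than one blob per file) is what makes diffing rules across crawls straightforward.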
`XML Sitemaps downloader / parser <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html>`_
    An essential part of any SEO analysis is to check XML sitemaps. This is a simple function with which you can download one or more sitemaps (by providing the URL for a robots.txt file, a sitemap file, or a sitemap index file).
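The parsing step can be illustrated with the standard library's XML tools. This sketch handles a plain ``<urlset>`` sitemap only — the real sitemap_to_df additionally handles sitemap indexes, robots.txt URLs, and compressed sitemaps:

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/blog</loc><lastmod>2024-02-01</lastmod></url>
</urlset>"""

def parse_sitemap(xml_text):
    """Extract one {'loc', 'lastmod'} row per <url> entry of a urlset."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", ns):
        rows.append({
            "loc": url.findtext("sm:loc", default=None, namespaces=ns),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=ns),
        })
    return rows

rows = parse_sitemap(SITEMAP)
```

Note the namespace mapping: sitemap elements live in the sitemaps.org namespace, so un-prefixed ``findall("url")`` would find nothing.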
`SERP importer and parser for Google & YouTube <https://advertools.readthedocs.io/en/master/advertools.serp.html>`_
    Connect to Google's API and get the search data you want. Multiple search parameters are supported, all in one function call, with all results returned in a DataFrame.
Tutorials and additional resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- A tool based on the `serp_goog function to get SERP rankings on Google <https://www.dashboardom.com/google-serp>`_
- `Analyzing SERPs on a large scale with Python <https://www.semrush.com/blog/analyzing-search-engine-results-pages/>`_ on SEMrush
- `SERP datasets on Kaggle <https://www.kaggle.com/eliasdabbas/datasets?search=engine>`_ for practicing on different industries and use cases
- `SERP notebooks on Kaggle <https://www.kaggle.com/eliasdabbas/notebooks?sortBy=voteCount&group=everyone&pageSize=20&userId=484496&tagIds=1220>`_: some examples of how you might tackle such data
- `Content Analysis with XML Sitemaps and Python <https://www.semrush.com/blog/content-analysis-xml-sitemaps-python/>`_
- XML sitemap dataset examples: `news sites <https://www.kaggle.com/eliasdabbas/news-sitemaps>`_, `Turkish news sites <https://www.kaggle.com/eliasdabbas/turk-haber-sitelerinin-site-haritalari>`_, and `Bloomberg news <https://www.kaggle.com/eliasdabbas/bloomberg-business-articles-urls>`_

Text & Content Analysis (SEO & SEM)
-----------------------------------

URLs, page titles, tweets, video descriptions, comments, and hashtags are some examples of the types of text we deal with. advertools provides a few options for text analysis:
`Word frequency <https://advertools.readthedocs.io/en/master/advertools.word_frequency.html>`_
    Counting words in a text list is one of the most basic and important tasks in text mining. What is also important is counting those words while taking into consideration their relative weights in the dataset. word_frequency does just that.
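The distinction between absolute and weighted counts can be sketched like this — a simplified stand-in for the real word_frequency, which returns a DataFrame with several more metrics:

```python
from collections import Counter, defaultdict

def word_frequency(text_list, num_list=None):
    """Count words across a list of phrases.
    abs_freq counts raw occurrences; wtd_freq weights each occurrence
    by the phrase's associated number (e.g. impressions or pageviews)."""
    if num_list is None:
        num_list = [1] * len(text_list)  # unweighted fallback
    abs_freq, wtd_freq = Counter(), defaultdict(int)
    for text, num in zip(text_list, num_list):
        for word in text.lower().split():
            abs_freq[word] += 1
            wtd_freq[word] += num
    return abs_freq, dict(wtd_freq)

# "shoes" appears twice, but "buy" carries far more traffic weight.
abs_freq, wtd_freq = word_frequency(["buy shoes", "red shoes"], [100, 10])
```

A word appearing once in a heavily-trafficked phrase can matter more than a word appearing often in obscure ones — which is exactly what the weighted column surfaces.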
`URL Analysis <https://advertools.readthedocs.io/en/master/advertools.urlytics.html>`_
    We all have to handle many thousands of URLs in reports, crawls, social media extracts, XML sitemaps, and so on. url_to_df converts your URLs into easily readable DataFrames.
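The idea, sketched for a single URL with only the standard library — url_to_record is a hypothetical helper; the real url_to_df returns one row per URL with one column per component:

```python
from urllib.parse import urlsplit, parse_qs

def url_to_record(url):
    """Split a URL into scheme, netloc, path, last directory, and query
    parameters — the kind of columns a URL DataFrame would contain."""
    parts = urlsplit(url)
    dirs = [d for d in parts.path.split("/") if d]
    record = {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "last_dir": dirs[-1] if dirs else None,
    }
    # One column per query parameter, prefixed to avoid name clashes.
    for key, values in parse_qs(parts.query).items():
        record[f"query_{key}"] = values[0]
    return record

rec = url_to_record("https://example.com/blog/post-1?utm_source=x")
```

Mapping a list of URLs through such a function gives the table that makes grouping by directory or campaign parameter trivial.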
`Emoji <https://advertools.readthedocs.io/en/master/advertools.emoji.html>`_
    Emoji are produced with one click, extremely expressive, highly diverse (3k+ emoji), and very popular, so it's important to capture what people are trying to communicate with them. You can extract emoji and get their names, groups, and sub-groups. The full emoji database is also available for convenience, as well as an emoji_search function in case you want some ideas for your next social media post or any other kind of communication.
`extract_ functions <https://advertools.readthedocs.io/en/master/advertools.extract.html>`_
    The text that we deal with contains many elements and entities that have their own special meaning and usage. There is a group of convenience functions to help in extracting, and getting basic statistics about, structured entities in text: emoji, hashtags, mentions, currency, numbers, URLs, questions, and more. You can also provide a custom regex for your own needs.
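The pattern behind these functions is: apply a regex per text, then summarize. A hashtag-only sketch of that pattern — the real extract functions return richer statistics such as top items and frequencies:

```python
import re

HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(posts):
    """Extract hashtags from a list of posts and return per-post matches
    plus a simple overview summary, mirroring the extract_ output shape."""
    per_post = [HASHTAG_RE.findall(post) for post in posts]
    flat = [tag for tags in per_post for tag in tags]
    return {
        "hashtags": per_post,                       # matches, post by post
        "hashtag_counts": [len(t) for t in per_post],
        "overview": {
            "num_posts": len(posts),
            "num_hashtags": len(flat),
            "hashtags_per_post": len(flat) / len(posts) if posts else 0,
        },
    }

result = extract_hashtags(["loving this #sunset #beach", "no tags here"])
```

Swapping the regex (mentions, currency symbols, question marks) changes the entity while the summarizing scaffold stays the same — which is why a custom-regex option falls out naturally.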
`Stopwords <https://advertools.readthedocs.io/en/master/advertools.stopwords.html>`_
    A list of stopwords in forty different languages to help in text analysis.
Tutorials and additional resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- A tutorial on DataCamp for creating the word_frequency function, explaining the importance of the `difference between absolute and weighted word frequency <https://www.datacamp.com/community/tutorials/absolute-weighted-word-frequency>`_
- `Text Analysis for Online Marketers <https://www.semrush.com/blog/text-analysis-for-online-marketers/>`_: an introductory article on SEMrush
Social Media
------------

In addition to the text analysis techniques provided, you can also connect to the Twitter and YouTube data APIs. The main benefits of using advertools for this:

- Handles pagination and request limits: typically, every API limits the number of results it returns per request. When you need more than that limit (which you typically do), you have to handle pagination. This is handled by default.
- DataFrame results: APIs send back data in formats that need to be parsed and cleaned so you can more easily start your analysis. This is also handled automatically.
- Multiple requests: in YouTube's case, you might want to request data for the same query across several countries, languages, channels, etc. You can specify them all in one request and get the product of all the requests in one response.
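The pagination point boils down to looping until the API stops returning a next-page token. A generic sketch — fetch_page here is a stand-in for one API call, and real APIs differ in parameter and token naming:

```python
def fetch_all(fetch_page, max_pages=10):
    """Collect items across pages of a token-paginated API.
    fetch_page(token) performs one request and returns (items, next_token);
    a next_token of None signals the last page. max_pages is a safety cap."""
    items, token, pages = [], None, 0
    while pages < max_pages:
        batch, token = fetch_page(token)
        items.extend(batch)
        pages += 1
        if token is None:
            break
    return items

# Simulate a three-page API response with a dict keyed by page token.
pages = {None: ([1, 2], "t1"), "t1": ([3], "t2"), "t2": ([4], None)}
all_items = fetch_all(lambda token: pages[token])
```

The safety cap matters in practice: without it, a buggy token handler (or a very large result set) turns into an unbounded stream of billable API calls.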
Tutorials and additional resources
A visual tool to check what is trending on Twitter <https://www.dashboardom.com/trending-twitter>_ for all available locations
A Twitter data analysis dashboard <https://www.dashboardom.com/twitterdash>_ with many options
How to use the Twitter data API with Python <https://www.kaggle.com/eliasdabbas/twitter-in-a-dataframe>_
Extracting entities from social media posts <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>_ tutorial on Kaggle
Analyzing 131k tweets <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>_ by European Football clubs tutorial on Kaggle
An overview of the YouTube data API with Python <https://www.kaggle.com/eliasdabbas/youtube-data-api>_
Conventions
-----------

Function names mostly start with the object you are working on, so you can use autocomplete to discover other options:

| kw_: for keyword-related functions
| ad_: for ad-related functions
| url_: URL tracking and generation
| extract_: for extracting entities from social media posts (mentions, hashtags, emoji, etc.)
| emoji_: emoji-related functions and objects
| twitter: a module for querying the Twitter API and getting results in a DataFrame
| youtube: a module for querying the YouTube Data API and getting results in a DataFrame
| serp_: get search engine results pages in a DataFrame; currently available: Google and YouTube
| crawl: a function you will probably use a lot if you do SEO
| \*_to_df: a set of convenience functions for converting to DataFrames (log files, XML sitemaps, robots.txt files, and lists of URLs)
Change Log
----------

**Added**

- crawl_images

**Changed**

- crawl_headers

**Added**

- DEFAULT_REQUEST_HEADERS to change the default headers.

**Changed**

- follow_links parameter

**Fixed**

- app_key and app_secret are required, with the option to provide oauth_token and oauth_token_secret if/when needed.

**Added**

- thebe-sphinx

**Changed**

- np.nan wherever there are missing values in url_to_df

**Fixed**

- pd.DataFrame.append replaced with pd.concat, as the former is deprecated.
- logs_to_df
**Added**

- crawl_headers: A crawler that only makes HEAD requests to a known list of URLs.
- reverse_dns_lookup: A way to get host information for a large list of IP addresses concurrently.
- exclude_url_params, include_url_params, exclude_url_regex, and include_url_regex for controlling which links to follow while crawling.

**Fixed**

- custom_settings options given to the crawl function that were defined using a dictionary can now be set without issues. There was an issue if those options were not strings.

**Changed**

- The skip_url_params option was removed and replaced with the more versatile exclude_url_params, which accepts either True or a list of URL parameters to exclude while following links.

**Added**

- logs_to_df: Convert a log file of any non-JSON format into a pandas DataFrame and save it to a parquet file. This also compresses the file to a much smaller size.
- img attributes: 'alt', 'crossorigin', 'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes', 'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes like style and draggable).
- crawl function skip_url_params: Defaults to False, consistent with previous behavior, with the ability to not follow/crawl links containing any URL parameters.
- url_to_df "last_dir": Extract the value in the last directory for each of the URLs.

**Changed**

- Columns of the url_to_df DataFrame are now sorted by how full they are (the percentage of values that are not NA).

**Added**

- nofollow attribute for nav, header, and footer links.
**Added**

- recursive for sitemap_to_df to control whether or not to get all sub-sitemaps (default), or to only get the current (sitemapindex) one.
- sitemap_to_df: sitemap_size_mb (1 MB = 1,024x1,024 bytes), and sitemap_last_modified and etag (if available).
- robotstxt_to_df
- robotstxt_to_df using the new parameter output_file.
- robotstxt_to_df: robotstxt_last_modified and etag (if available).
- ValueError in crawl if css_selectors or xpath_selectors contain any of the default crawl column headers.
- crawllogs_to_df, which converts crawl logs to a DataFrame provided they were saved while using the crawl function.
- crawl: viewport, charset, all h headings (whichever is available), nav, header and footer links and text, if available.
- errors and/or jsonld_errors column(s).

**Changed**

- Removed the resp_meta_ prefix from columns containing it.

**Removed**

- From the crawl output: url_redirected_to, links_fragment.

**Added**

- knowledge_graph for querying Google's API.
- sitemap_to_df with threads.
- max_workers for sitemap_to_df to determine how fast it could go.
- capitalize_adgroups for kw_generate to determine whether or not to keep ad groups as is, or set them to title case (the default).

**Fixed**

- crawl, assuming follow_links is set to False (list mode).

**Removed**

- youtube.guide_categories_list (no longer supported by the API).
- robotstxt_to_df and sitemap_to_df date columns unified as "download_date".
**Added**

- crawl function: href and hreflang tags, if available.

**Fixed**

- robotstxt_to_df

**Changed**

- crawl now only supports output_file with the extension ".jl".
- word_frequency drops the wtd_freq and rel_value columns if num_list is not provided.

**Added**

- url_to_df, splitting URLs into their components and to a DataFrame.
- robotstxt_test, testing URLs and whether they can be fetched by certain user-agents.

**Added**

- custom_settings

**Fixed**

- links_url, links_text, links_fragment, links_nofollow.
- body_text extraction is improved.

**Added**

- crawl for crawling and parsing websites.
- robotstxt_to_df downloading robots.txt files into DataFrames.

**Added**

- sitemap_to_df
- sitemap_downloaded column showing the datetime of getting the sitemap.

**Fixed**

- sitemap_to_df logging the same action twice.
- sitemap column.
- sitemap_to_df imports an XML sitemap into a DataFrame.

**Changed**

- query_time is now named queryTime in the youtube functions.

**Added**

- youtube: connecting to all GET requests in the API.
- extract_numbers new function.
- emoji_search new function.
- emoji_df new variable containing all emoji as a DataFrame.

**Changed**

- serp_goog with expanded pagemap and metadata.

**Fixed**

- serp_goog errors, some parameters not appearing in the result df.
- extract_numbers issue when providing a dash as a separator in the middle.

**Added**

- extract_exclamations, very similar to extract_questions.
- extract_urls, which also counts top domains and top TLDs.
- extract_emoji: top_emoji_categories & top_emoji_sub_categories.
- emoji db
- questions
**Added**

- New extract_ functions:

  - extract, used by all others, which takes an arbitrary regex to extract text.
  - extract_questions, to get question mark statistics, as well as the text of the questions asked.
  - extract_currency, which shows text that has currency symbols in it, as well as the surrounding text.
  - extract_intense_words, which gets statistics about, and extracts, words with any character repeated three or more times, indicating an intense feeling (+ve or -ve).

- New function word_tokenize, used by word_frequency to get tokens of 1-, 2-, 3-word phrases (or more).
- New stop-words from the spaCy package:

  - current: Arabic, Azerbaijani, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
  - new: Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian, Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu, Thai, Ukrainian, Urdu, Vietnamese.

**Changed**

- word_frequency takes new parameters:

  - regex: defaults to words, but can be changed to anything, '\S+' for example, to split words and keep punctuation.
  - sep: no longer used as an option; the above regex can be used instead.
  - num_list: now optional, and defaults to counts of 1 each if not provided. Useful for counting abs_freq only if data are not available.
  - phrase_len: the number of words in each split token. Defaults to 1 and can be set to 2 or higher. This helps in analyzing phrases as opposed to words.

- Parameters supplied to serp_goog appear at the beginning of the result df.
- serp_youtube now contains nextPageToken to make paginating requests easier.
- extract_words to extract an arbitrary set of words.
- ad_from_string slots argument reflects new text ad lengths.
- hashtag regex improved.
- serp_youtube
- serp_goog returns Google's original error messages.
- serp_goog (based on Google CSE).