Sensible multi-core apply function for Pandas
mapply provides a sensible multi-core apply function for Pandas.
Where pandarallel relies on in-house multiprocessing and progress bars, and hard-codes 1 chunk per worker (which leaves CPUs idle when one chunk happens to be more expensive than the others), swifter relies on the heavy dask framework for multiprocessing (converting to Dask DataFrames and back). In an attempt to find the golden mean, mapply is highly customizable and remains lightweight, using tqdm for progress bars and leveraging the powerful pathos framework, which shadows Python's built-in multiprocessing module using dill for universal pickling.
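To make the idle-CPU argument concrete, here is a small standard-library simulation. It is only an idealized model, not how pandarallel or mapply schedule work internally: the helper names (`makespan`, `split`) and the row costs are made up for illustration, and workers are assumed to pick up the next chunk as soon as they are free.

```python
import heapq


def makespan(chunk_costs, n_workers):
    """Simulate workers pulling chunks from a shared queue; return the total wall time."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for cost in chunk_costs:
        start = heapq.heappop(free_at)  # the earliest free worker takes the next chunk
        heapq.heappush(free_at, start + cost)
    return max(free_at)


def split(row_costs, n_chunks):
    """Split per-row costs into contiguous chunks; return the total cost of each chunk."""
    size, extra = divmod(len(row_costs), n_chunks)
    chunk_costs, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunk_costs.append(sum(row_costs[start:end]))
        start = end
    return chunk_costs


# Hypothetical workload: 80 rows, of which the first 20 are 10x more expensive.
row_costs = [10.0] * 20 + [1.0] * 60
n_workers = 4

# One chunk per worker: the worker holding the expensive rows finishes long after the rest.
one_per_worker = makespan(split(row_costs, n_workers), n_workers)

# Eight chunks per worker: the expensive rows are spread over several smaller chunks.
eight_per_worker = makespan(split(row_costs, 8 * n_workers), n_workers)

print(one_per_worker, eight_per_worker)
```

With one chunk per worker, the whole run takes as long as the single expensive chunk (200 time units in this toy workload) while three workers sit idle for most of it. Splitting into more chunks than workers lets the idle workers pick up the remaining work, so the run finishes much closer to the ideal of 65 units.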
This pure-Python, OS independent package is available on PyPI:
$ pip install mapply
For documentation, see mapply.readthedocs.io.
import pandas as pd
import mapply
mapply.init(
    n_workers=-1,
    chunk_size=100,
    max_chunks_per_worker=8,
    progressbar=False,
)
df = pd.DataFrame({"A": list(range(100))})
# avoid unnecessary multiprocessing:
# due to chunk_size=100, this will act as regular apply.
# set chunk_size=1 to skip this check and let max_chunks_per_worker decide.
df["squared"] = df.A.mapply(lambda x: x ** 2)
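The comments above suggest how chunk_size and max_chunks_per_worker might interact. The sketch below is one possible reading of that behaviour, not mapply's actual source: `choose_n_chunks` is a hypothetical helper, and treating a falsy max_chunks_per_worker as "no cap" is an assumption of this sketch.

```python
def choose_n_chunks(n_rows, n_workers, chunk_size, max_chunks_per_worker):
    """Hypothetical helper: decide how many chunks to split n_rows into."""
    # Every chunk should hold at least chunk_size rows.
    n_chunks = n_rows // chunk_size
    # Optionally cap the total number of chunks (assumption: 0 or None means no cap).
    if max_chunks_per_worker:
        n_chunks = min(n_chunks, max_chunks_per_worker * n_workers)
    # A single chunk or a single worker leaves nothing to parallelize,
    # so the caller can fall back to a regular apply.
    if n_chunks <= 1 or n_workers == 1:
        return 1
    return n_chunks


# 100 rows with chunk_size=100 fit in one chunk: plain apply.
print(choose_n_chunks(n_rows=100, n_workers=4, chunk_size=100, max_chunks_per_worker=8))

# chunk_size=1 effectively disables the size check; the per-worker cap decides.
print(choose_n_chunks(n_rows=100, n_workers=4, chunk_size=1, max_chunks_per_worker=8))
```

Under these assumptions the first call returns 1 and the second returns 32 (8 chunks for each of the 4 workers).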
Run make help for options like installing for development, linting, testing, and building docs.