Use pyarrow with Azure Data Lake gen2
pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.
It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without the need to copy files to local storage first.
Install with pip:
pip install pyarrowfs-adlgen2
Example usage with a pandas DataFrame:
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
Example usage with arrow tables:
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
Timeouts are passed to the azure-storage-file-datalake SDK methods and are given in seconds.
import azure.identity
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20
With pyarrow version 3 or greater, you can write datasets from arrow tables (reusing the table and handler from the examples above):
import pyarrow as pa
import pyarrow.dataset
pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]), flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)
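For completeness, a hedged sketch of reading such a partitioned dataset back, assuming the same 'name.pq' location and handler from the write example above, and using pyarrow's 'hive' partitioning discovery:
import pyarrow.dataset
import pyarrow.fs

# Read the hive-partitioned dataset written above; 'name.pq' and handler are assumptions
# carried over from the previous example.
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset(
    'name.pq', format='parquet', partitioning='hive', filesystem=fs)
table = ds.to_table()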
With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, pyarrow.parquet.ParquetWriter does not support pyarrow.fs.PyFileSystem, but data can be written to open files:
with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)
Or with arrow tables:
import pyarrow.parquet
with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)
If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use pyarrowfs_adlgen2.FilesystemHandler to view a single file system within an account:
import azure.identity
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
"STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
All access is done through the file system within the storage account.
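For example, a minimal sketch of reading a dataset through such a handler, assuming that paths are then given relative to the chosen file system rather than prefixed with a container name:
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)

# Assumed: paths are relative to FS_NAME, so there is no container prefix here
ds = pyarrow.dataset.dataset('dataset.parq', filesystem=fs)
table = ds.to_table()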
You can set headers for any output files by using the metadata argument to handler.open_output_stream:
import pyarrowfs_adlgen2
fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
with fs.open_output_stream("container/data.json", metadata) as out:
out.write("{}")
Note that the spelling is different from what you might expect! For a list of valid keys, see the ContentSettings class in azure-storage-file-datalake.
You can do this with pyarrow >= 5 when using pyarrow.fs.PyFileSystem, and with any pyarrow version if you use the handlers from pyarrowfs_adlgen2 directly.
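As a hedged sketch of the latter, calling the handler's open_output_stream directly, assuming the returned stream can be used as a context manager and that the key names mirror ContentSettings attributes such as content_type and cache_control:
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    "theaccount", azure.identity.DefaultAzureCredential())
# Assumed: key names follow ContentSettings attributes (content_type, cache_control, ...)
metadata = {"content_type": "application/json", "cache_control": "no-cache"}
with handler.open_output_stream("container/data.json", metadata) as out:
    out.write(b"{}")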
To run the integration tests, you need an Azure login (run az login, or set up environment variables; see azure.identity.DefaultAzureCredential) and pytest installed (pip install pytest).
NB! All data in the storage account is deleted during testing, USE AN EMPTY ACCOUNT.
AZUREARROWFS_TEST_ACT=thestorageaccount pytest