A utility library for working with Table Schema in Python
A Python implementation of the Table Schema standard.
[Important Notice] We have released Frictionless Framework. This framework provides improved
tableschema
functionality extended to be a complete data solution. The change in not breaking for the existing software so no actions are required. Please read the Migration Guide fromtableschema
to Frictionless Framework.
- we continue to bug-fix
tableschema@1.x
in this repository as well as it's available on PyPi as it was before- please note that
frictionless@3.x
version's API, we're working on at the moment, is not stable- we will release
frictionless@4.x
by the end of 2020 to be the first SemVer/stable version
Table
to work with data tables described by Table SchemaSchema
representing Table SchemaField
representing Table Schema fieldvalidate
to validate Table Schemainfer
to infer Table Schema from dataThe package uses semantic versioning. It means that major versions could include breaking changes. It's highly recommended to specify tableschema
version range in your setup/requirements
file e.g. tableschema>=1.0,<2.0
.
$ pip install tableschema
Let's start with a simple example:
from tableschema import Table
# Create table
table = Table('path.csv', schema='schema.json')
# Print schema descriptor
print(table.schema.descriptor)
# Print cast rows in a dict form
for keyed_row in table.iter(keyed=True):
print(keyed_row)
A table is a core concept in a tabular data world. It represents data with metadata (Table Schema). Let's see how we can use it in practice.
Consider we have some local csv file. It could be inline data or from a remote link - all supported by the Table
class (except local files for in-brower usage of course). But say it's data.csv
for now:
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
Let's create and read a table instance. We use the static Table.load
method and the table.read
method with the keyed
option to get an array of keyed rows:
table = Table('data.csv')
table.headers # ['city', 'location']
table.read(keyed=True)
# [
# {city: 'london', location: '51.50,-0.11'},
# {city: 'paris', location: '48.85,2.30'},
# {city: 'rome', location: 'N/A'},
# ]
As we can see, our locations are just strings. But they should be geopoints. Also, Rome's location is not available, but it's just a string N/A
instead of None
. First we have to infer Table Schema:
table.infer()
table.schema.descriptor
# { fields:
# [ { name: 'city', type: 'string', format: 'default' },
# { name: 'location', type: 'geopoint', format: 'default' } ],
# missingValues: [ '' ] }
table.read(keyed=True)
# Fails with a data validation error
Let's fix the "not available" location. There is a missingValues
property in Table Schema specification. As a first try we set missingValues
to N/A
in table.schema.descriptor
. The schema descriptor can be changed in-place, but all changes should also be committed using table.schema.commit()
:
table.schema.descriptor['missingValues'] = 'N/A'
table.schema.commit()
table.schema.valid # false
table.schema.errors
# [<ValidationError: "'N/A' is not of type 'array'">]
As a good citizens we've decided to check our schema descriptor's validity. And it's not valid! We should use an array for the missingValues
property. Also, don't forget to include "empty string" as a valid missing value:
table.schema.descriptor['missingValues'] = ['', 'N/A']
table.schema.commit()
table.schema.valid # true
All good. It looks like we're ready to read our data again:
table.read(keyed=True)
# [
# {city: 'london', location: [51.50,-0.11]},
# {city: 'paris', location: [48.85,2.30]},
# {city: 'rome', location: null},
# ]
Now we see that:
None
And because there are no errors after reading, we can be sure that our data is valid against our schema. Let's save it:
table.schema.save('schema.json')
table.save('data.csv')
Our data.csv
looks the same because it has been stringified back to csv
format. But now we have schema.json
:
{
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}
If we decide to improve it even more we could update the schema file and then open it again. But now providing a schema path:
table = Table('data.csv', schema='schema.json')
# Continue the work
As already mentioned a given schema can be used to validate data (see the Schema section for schema specification details). In default mode invalid data rows immediately trigger an exception in the table.iter()
/table.write()
methods.
Suppose this schema-invalid local file invalid_data.csv
:
key,value
zero,0
one,not_an_integer
two,2
We're going to validate the data against the following schema:
table = Table(
'invalid_data.csv',
schema={'fields': [{'name': 'key'}, {'name': 'value', 'type': 'integer'}]})
Iterating over the data triggers an exception due to the failed cast of 'not_an_integer'
to int
:
for row in table.iter():
print(row)
# Traceback (most recent call last):
# ...
# tableschema.exceptions.CastError: There are 1 cast errors (see exception.errors) for row "3"
Hint: The row number count starts with 1 and also includes header lines.
(Note: You can optionally switch off iter()
/read()
value casting using the cast parameter, see reference below.)
By providing a custom exception handler (a callable) to those methods you can treat occurring exceptions at your own discretion, i.e. to "fail late" and e.g. gather a validation report on the whole data:
errors = []
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
errors.append((exc, row_number, row_data, error_data))
for row in table.iter(exc_handler=exc_handler):
print(row)
# ['zero', 0]
# ['one', FailedCast('not_an_integer')]
# ['two', 2]
print(errors)
# [(CastError('There are 1 cast errors (see exception.errors) for row "3"',),
# 3,
# OrderedDict([('key', 'one'), ('value', 'not_an_integer')]),
# OrderedDict([('value', 'not_an_integer')]))]
Note that
iter()
/read()
cast parameter is set to True, which is the default) are wrapped into a FailedCast
"value holder". This allows for distinguishing uncasted values from successfully casted values on the data consumer side. FailedCast
instances can only get yielded when custom exception handling is in place.iter()
/read()
sections of the Table
class API reference.A model of a schema with helpful methods for working with the schema and supported data. Schema instances can be initialized with a schema source as a url to a JSON file or a JSON object. The schema is initially validated (see validate below). By default validation errors will be stored in schema.errors
but in a strict mode it will be instantly raised.
Let's create a blank schema. It's not valid because descriptor.fields
property is required by the Table Schema specification:
schema = Schema()
schema.valid # false
schema.errors
# [<ValidationError: "'fields' is a required property">]
To avoid creating a schema descriptor by hand we will use a schema.infer
method to infer the descriptor from given data:
schema.infer([
['id', 'age', 'name'],
['1','39','Paul'],
['2','23','Jimmy'],
['3','36','Jane'],
['4','28','Judy'],
])
schema.valid # true
schema.descriptor
#{ fields:
# [ { name: 'id', type: 'integer', format: 'default' },
# { name: 'age', type: 'integer', format: 'default' },
# { name: 'name', type: 'string', format: 'default' } ],
# missingValues: [ '' ] }
Now we have an inferred schema and it's valid. We can cast data rows against our schema. We provide a string input which will be cast correspondingly:
schema.cast_row(['5', '66', 'Sam'])
# [ 5, 66, 'Sam' ]
But if we try provide some missing value to the age
field, the cast will fail because the only valid "missing" value is an empty string. Let's update our schema:
schema.cast_row(['6', 'N/A', 'Walt'])
# Cast error
schema.descriptor['missingValues'] = ['', 'N/A']
schema.commit()
schema.cast_row(['6', 'N/A', 'Walt'])
# [ 6, None, 'Walt' ]
We can save the schema to a local file, and resume work on it at any time by loading it from that file:
schema.save('schema.json')
schema = Schema('schema.json')
from tableschema import Field
# Init field
field = Field({'name': 'name', 'type': 'number'})
# Cast a value
field.cast_value('12345') # -> 12345
Data values can be cast to native Python objects with a Field instance. Type instances can be initialized with field descriptors. This allows formats and constraints to be defined.
Casting a value will check the value is of the expected type, is in the correct format, and complies with any constraints imposed by a schema. E.g. a date value (in ISO 8601 format) can be cast with a DateType instance. Values that can't be cast will raise an InvalidCastError
exception.
Casting a value that doesn't meet the constraints will raise a ConstraintError
exception.
cli
cli()
Command-line interface
Usage: tableschema [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
infer Infer a schema from data.
info Return info on this version of Table Schema
validate Validate that a supposed schema is in fact a Table Schema.
Table
Table(self,
source,
schema=None,
strict=False,
post_cast=[],
storage=None,
**options)
Table representation
Arguments
Schema
classSchema
constructorsql
or bigquery
tabulator
or storage's optionsRaises
TableSchemaException
: raises on any errortable.hash
Table's SHA256 hash if it's available.
If it's already read using e.g. table.read
, otherwise returns None
.
In the middle of an iteration it returns hash of already read contents
Returns
str/None
: SHA256 hash
table.headers
Table's headers is available
Returns
str[]
: headers
table.schema
Returns schema class instance if available
Returns
Schema
: schema
table.size
Table's size in BYTES if it's available
If it's already read using e.g. table.read
, otherwise returns None
.
In the middle of an iteration it returns size of already read contents
Returns
int/None
: size in BYTES
table.iter
table.iter(keyed=False,
extended=False,
cast=True,
integrity=False,
relations=False,
foreign_keys_values=False,
exc_handler=None)
Iterates through the table data and emits rows cast based on table schema.
Arguments
keyed (bool):
yield keyed rows in a form of `{header1: value1, header2: value2}`
(default is false; the form of rows is `[value1, value2]`)
extended (bool):
yield extended rows in a for of `[rowNumber, [header1, header2], [value1, value2]]`
(default is false; the form of rows is `[value1, value2]`)
cast (bool):
disable data casting if false
(default is true)
integrity (dict):
dictionary in a form of `{'size': <bytes>, 'hash': '<sha256>'}`
to check integrity of the table when it's read completely.
Both keys are optional.
relations (dict):
dictionary of foreign key references in a form
of `{resource1: [{field1: value1, field2: value2}, ...], ...}`.
If provided, foreign key fields will checked and resolved
to one of their references (/!\ one-to-many fk are not completely resolved).
foreign_keys_values (dict):
three-level dictionary of foreign key references optimized
to speed up validation process in a form of
`{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`.
If not provided but relations is true, it will be created
before the validation process by *index_foreign_keys_values* method
exc_handler (func):
optional custom exception handler callable.
Can be used to defer raising errors (i.e. "fail late"), e.g.
for data validation purposes. Must support the signature below
Custom exception handler
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
'''Custom exception handler (example)
# Arguments:
exc(Exception):
Deferred exception instance
row_number(int):
Data row number that triggers exception exc
row_data(OrderedDict):
Invalid data row source data
error_data(OrderedDict):
Data row source data field subset responsible for the error, if
applicable (e.g. invalid primary or foreign key fields). May be
identical to row_data.
'''
# ...
Raises
TableSchemaException
: base class of any errorCastError
: data cast errorIntegrityError
: integrity checking errorUniqueKeyError
: unique key constraint violationUnresolvedFKError
: unresolved foreign key reference errorReturns
Iterator[list]
: yields rows
table.read
table.read(keyed=False,
extended=False,
cast=True,
limit=None,
integrity=False,
relations=False,
foreign_keys_values=False,
exc_handler=None)
Read the whole table and return as array of rows
It has the same API as
table.iter
except for
Arguments
Returns
list[]
: returns rows
table.infer
table.infer(limit=100,
confidence=0.75,
missing_values=[''],
guesser_cls=None,
resolver_cls=None)
Infer a schema for the table.
It will infer and set Table Schema to table.schema
based on table data.
Arguments
['']
)Returns
dict
: Table Schema descriptor
table.save
table.save(target, storage=None, **options)
Save data source to file locally in CSV format with ,
(comma) delimiter
To save schema use
table.schema.save()
Arguments
sql
or bigquery
tabulator
or storage optionsRaises
TableSchemaException
: raises an error if there is saving problemReturns
True/Storage
: returns true or storage instance
table.index_foreign_keys_values
table.index_foreign_keys_values(relations)
Creates a three-level dictionary of foreign key references
We create them optimized to speed up validation process in a form of
{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}
.
For each foreign key of the schema it will iterate through the corresponding
relations['resource']
to create an index (i.e. a dict) of existing values
for the foreign fields and store on keyed row for each value combination.
The optimization relies on the indexation of possible values for one foreign key in a hashmap to later speed up resolution.
This method is public to allow creating the index once to apply it on multiple tables charing the same schema (typically grouped resources in datapackage)
Notes
Arguments
{resource1: [{field1: value1, field2: value2}, ...], ...}
.
It must contain all resources pointed in the foreign keys schema definition.Returns
dict
:
returns a three-level dictionary of foreign key references
optimized to speed up validation process in a form of
{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}})
Schema
Schema(self, descriptor={}, strict=False)
Schema representation
Arguments
schema.errors
- if true, validation errors are raised immediatelyRaises
TableSchemaException
: raise any error that occurs during the processschema.descriptor
Schema's descriptor
Returns
dict
: descriptor
schema.errors
Validation errors
Always empty in strict mode.
Returns
Exception[]
: validation errors
schema.field_names
Schema's field names
Returns
str[]
: an array of field names
schema.fields
Schema's fields
Returns
Field[]
: an array of field instances
schema.foreign_keys
Schema's foreign keys
Returns
dict[]
: foreign keys
schema.headers
Schema's field names
Returns
str[]
: an array of field names
schema.missing_values
Schema's missing values
Returns
str[]
: missing values
schema.primary_key
Schema's primary keys
Returns
str[]
: primary keys
schema.valid
Validation status
Always true in strict mode.
Returns
bool
: validation status
schema.get_field
schema.get_field(name)
Get schema's field by name.
Use
table.update_field
if you want to modify the field descriptor
Arguments
Returns
Field/None
: Field
instance or None
if not found
schema.get_field
schema.get_field(name)
Get schema's field by name.
Use
table.update_field
if you want to modify the field descriptor
Arguments
Returns
Field/None
: Field
instance or None
if not found
schema.add_field
schema.add_field(descriptor)
Add new field to schema.
The schema descriptor will be validated with newly added field descriptor.
Arguments
Raises
TableSchemaException
: raises any error that occurs during the processReturns
Field/None
: added Field
instance or None
if not added
schema.update_field
schema.update_field(name, update)
Update existing descriptor field by name
Arguments
Returns
bool
: true on success and false if no field is found to be modified
schema.remove_field
schema.remove_field(name)
Remove field resource by name.
The schema descriptor will be validated after field descriptor removal.
Arguments
Raises
TableSchemaException
: raises any error that occurs during the processReturns
Field/None
: removed Field
instances or None
if not found
schema.cast_row
schema.cast_row(row, fail_fast=False, row_number=None, exc_handler=None)
Cast row based on field types and formats.
Arguments
Returns
any[]
: returns cast data row
schema.infer
schema.infer(rows,
headers=1,
confidence=0.75,
guesser_cls=None,
resolver_cls=None)
Infer and set schema.descriptor
based on data sample.
Arguments
rows
should contain headers rows)
- array of headers (rows
should NOT contain headers rows)Returns
dict
: Table Schema descriptor
schema.commit
schema.commit(strict=None)
Update schema instance if there are in-place changes in the descriptor.
Example
from tableschema import Schema
descriptor = {'fields': [{'name': 'my_field', 'title': 'My Field', 'type': 'string'}]}
schema = Schema(descriptor)
print(schema.get_field('my_field').descriptor['type']) # string
# Update descriptor by field position
schema.descriptor['fields'][0]['type'] = 'number'
# Update descriptor by field name
schema.update_field('my_field', {'title': 'My Pretty Field'}) # True
# Change are not committed
print(schema.get_field('my_field').descriptor['type']) # string
print(schema.get_field('my_field').descriptor['title']) # My Field
# Commit change
schema.commit()
print(schema.get_field('my_field').descriptor['type']) # number
print(schema.get_field('my_field').descriptor['title']) # My Pretty Field
Arguments
strict
mode for further workRaises
TableSchemaException
: raises any error that occurs during the processReturns
bool
: true on success and false if not modified
schema.save
schema.save(target, ensure_ascii=True)
Save schema descriptor to target destination.
Arguments
Raises
TableSchemaException
: raises any error that occurs during the processReturns
bool
: true on success
Field
Field(self, descriptor, missing_values=[''], schema=None)
Field representaion
Arguments
Raises
TableSchemaException
: raises any error that occurs during the processfield.constraints
Field constraints
Returns
dict
: dict of field constraints
field.descriptor
Fields's descriptor
Returns
dict
: descriptor
field.format
Field format
Returns
str
: field format
field.missing_values
Field's missing values
Returns
str[]
: missing values
field.name
Field name
Returns
str
: field name
field.required
Whether field is required
Returns
bool
: true if required
field.schema
Returns a schema instance if the field belongs to some schema
Returns
Schema
: field's schema
field.type
Field type
Returns
str
: field type
field.cast_value
field.cast_value(value, constraints=True)
Cast given value according to the field type and format.
Arguments
Raises
TableSchemaException
: raises any error that occurs during the processReturns
any
: returns cast value
field.test_value
field.test_value(value, constraints=True)
Test whether value is compliant to the field.
Arguments
Returns
bool
: returns if value is compliant to the field
Storage
Storage(self, **options)
Storage factory/interface
For users
Use
Storage.connect
to instantiate a storage
For instantiation of concrete storage instances,
tableschema.Storage
provides a unified factory method connect
(which uses the plugin system under the hood):
# pip install tableschema_sql
from tableschema import Storage
storage = Storage.connect('sql', **options)
storage.create('bucket', descriptor)
storage.write('bucket', rows)
storage.read('bucket')
For integrators
The library includes interface declaration to implement tabular Storage
.
This interface allow to use different data storage systems like SQL
with tableschema.Table
class (load/save) as well as on the data package level:
An implementor must follow tableschema.Storage
interface
to write his own storage backend. Concrete storage backends
could include additional functionality specific to conrete storage system.
See plugins
below to know how to integrate custom storage plugin into your workflow.
storage.buckets
Return list of storage bucket names.
A bucket
is a special term which has almost the same meaning as table
.
You should consider bucket
as a table
stored in the storage
.
Raises
exceptions.StorageError
: raises on any errorReturns
str[]
: return list of bucket names
storage.connect
storage.connect(name, **options)
Create tabular storage
based on storage name.
This method is statis:
Storage.connect()
Arguments
sql
Raises
StorageError
: raises on any errorReturns
Storage
: returns Storage
instance
storage.create
storage.create(bucket, descriptor, force=False)
Create one/multiple buckets.
Arguments
Raises
exceptions.StorageError
: raises on any errorstorage.delete
storage.delete(bucket=None, ignore=False)
Delete one/multiple/all buckets.
Arguments
None
, all buckets will be deletedRaises
exceptions.StorageError
: raises on any errorstorage.describe
storage.describe(bucket, descriptor=None)
Get/set bucket's Table Schema descriptor
Arguments
Raises
exceptions.StorageError
: raises on any errorReturns
dict
: returns Table Schema descriptor
storage.iter
storage.iter(bucket)
Return an iterator of typed values based on the schema of this bucket.
Arguments
Raises
exceptions.StorageError
: raises on any errorReturns
list[]
: yields data rows
storage.read
storage.read(bucket)
Read typed values based on the schema of this bucket.
Arguments
exceptions.StorageError
: raises on any error
Returnslist[]
: returns data rows
storage.write
storage.write(bucket, rows)
This method writes data rows into storage
.
It should store values of unsupported types as strings internally (like csv does).
Arguments
Raises
exceptions.StorageError
: raises on any errorvalidate
validate(descriptor)
Validate descriptor
Arguments
Raises
ValidationError
: on validation errorsReturns
bool
: True
infer
infer(source,
headers=1,
limit=100,
confidence=0.75,
missing_values=[''],
guesser_cls=None,
resolver_cls=None,
**options)
Infer source schema.
Arguments
['']
)Raises
TableSchemaException
: raises any error that occurs during the processReturns
dict
: returns schema descriptor
FailedCast
FailedCast(self, value)
Wrap an original data field value that failed to be properly casted.
FailedCast allows for further processing/yielding values but still be able to distinguish uncasted values on the consuming side.
Delegates attribute access and the basic rich comparison methods to the underlying object. Supports default user-defined classes hashability i.e. is hashable based on object identity (not based on the wrapped value).
Arguments
DataPackageException
DataPackageException(self, message, errors=[])
Base class for all DataPackage/TableSchema exceptions.
If there are multiple errors, they can be read from the exception object:
try:
# lib action
except DataPackageException as exception:
if exception.multiple:
for error in exception.errors:
# handle error
datapackageexception.errors
List of nested errors
Returns
DataPackageException[]
: list of nested errors
datapackageexception.multiple
Whether it's a nested exception
Returns
bool
: whether it's a nested exception
TableSchemaException
TableSchemaException(self, message, errors=[])
Base class for all TableSchema exceptions.
LoadError
LoadError(self, message, errors=[])
All loading errors.
ValidationError
ValidationError(self, message, errors=[])
All validation errors.
CastError
CastError(self, message, errors=[])
All value cast errors.
IntegrityError
IntegrityError(self, message, errors=[])
All integrity errors.
UniqueKeyError
UniqueKeyError(self, message, errors=[])
Unique key constraint violation (CastError subclass)
RelationError
RelationError(self, message, errors=[])
All relations errors.
UnresolvedFKError
UnresolvedFKError(self, message, errors=[])
Unresolved foreign key reference error (RelationError subclass).
StorageError
StorageError(self, message, errors=[])
All storage errors.
This API is experimental and can be changed/removed in the future
There is an experimental environment variable TABLESCHEMA_PRESERVE_MISSING_VALUES
which, if it is set, affects how data casting works.
By default, missing values are resolved to None
values. When this flag is set, missing values are passed through as it is. For example:
missing_values.py
from tableschema import Field
field = Field({'type': 'number'}, missing_values=['-'])
print(field.cast_value('3'))
print(field.cast_value('-'))
Running this script in different modes:
$ python missing_values.py
3
None
$ TABLESCHEMA_PRESERVE_MISSING_VALUES=1 python missing_values.py
3
-
The flags affects all the library's APIs and software built on top of tableschema
. For example, Data Package Pipelines:
$ TABLESCHEMA_PRESERVE_MISSING_VALUES=1 dpp run ./my_pipeline
The project follows the Open Knowledge International coding standards.
Recommended way to get started is to create and activate a project virtual environment. To install package and development dependencies into active environment:
$ make install
To run tests with linting and coverage:
$ make test
Here described only breaking and the most important changes. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.
field.ERROR/cast_function/check_functions
schema.missing_values
and field.missing_values
geopoint
:
TABLESCHEMA_PRESERVE_MISSING_VALUES
environment variable flagtable.infer
and infer
missing_values
argument to the infer
function (#269)preserve_missing_values
parameter to field.cast_value
table.size
and table.hash
propertiestable.index_foreign_keys_values
and improved foreign key checks performancefield.schema
propertystrict
mode raise an exception if there are problems in field constructionschema.update_field
methodconfidence
parameter to infer