Reversible Data Transforms
RDT (Reversible Data Transforms) is a Python library that transforms raw data into fully numerical data, ready for data science. The transforms are reversible, allowing you to convert from numerical data back into your original format.
Install RDT using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
pip install rdt
conda install -c conda-forge rdt
For more information about using reversible data transformations, visit the RDT Documentation.
This short series of tutorials guides you through the steps needed to get started using RDT to transform columns, tables and datasets.
After you have installed RDT, you can get started using the demo dataset.
from rdt import get_demo
customers = get_demo()
This dataset contains some randomly generated values that describe the customers of an online marketplace.
last_login email_optin credit_card age dollars_spent
0 2021-06-26 False VISA 29 99.99
1 2021-02-10 False VISA 18 NaN
2 NaT False AMEX 21 2.50
3 2020-09-26 True NaN 45 25.00
4 2020-12-22 NaN DISCOVER 32 19.99
Let's transform this data so that each column is converted to fully numerical data, ready for data science.
The HyperTransformer is capable of transforming multi-column datasets.
from rdt import HyperTransformer
ht = HyperTransformer()
The HyperTransformer needs to know about the columns in your dataset and which transformers to apply to each. These are described by a config. We can ask the HyperTransformer to automatically detect it based on the data we plan to use.
ht.detect_initial_config(data=customers)
This will create and set the config.
Config:
{
"sdtypes": {
"last_login": "datetime",
"email_optin": "boolean",
"credit_card": "categorical",
"age": "numerical",
"dollars_spent": "numerical"
},
"transformers": {
"last_login": "UnixTimestampEncoder()",
"email_optin": "BinaryEncoder()",
"credit_card": "FrequencyEncoder()",
"age": "FloatFormatter()",
"dollars_spent": "FloatFormatter()"
}
}
The sdtypes dictionary describes the semantic data types of each of your columns and the transformers dictionary describes which transformer to use for each column. You can customize the transformers and their settings. (See the Transformers Glossary for more information.)
The HyperTransformer references the config while learning the data during the fit stage.
ht.fit(customers)
Once the transformer is fit, it's ready to use. Use the transform method to transform all columns of your dataset at once.
transformed_data = ht.transform(customers)
last_login.value email_optin.value credit_card.value age.value dollars_spent.value
0 1.624666e+18 0.0 0.2 29 99.99
1 1.612915e+18 0.0 0.2 18 36.87
2 1.611814e+18 0.0 0.5 21 2.50
3 1.601078e+18 1.0 0.7 45 25.00
4 1.608595e+18 0.0 0.9 32 19.99
The HyperTransformer applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!
When you're done with your project, you can also transform the data back to the original format using the reverse_transform method.
original_format_data = ht.reverse_transform(transformed_data)
last_login email_optin credit_card age dollars_spent
0 NaT False VISA 29 99.99
1 2021-02-10 False VISA 18 NaN
2 NaT False AMEX 21 NaN
3 2020-09-26 True NaN 45 25.00
4 2020-12-22 False DISCOVER 32 19.99
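To make the round trip concrete, here is a minimal sketch in plain Python of the idea behind encoding a date column such as last_login as a number and decoding it again. It is not RDT's implementation: the helper names to_unix_nanos and from_unix_nanos are hypothetical, and the sketch assumes dates in YYYY-MM-DD format interpreted as UTC.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def to_unix_nanos(date_string):
    """Encode a YYYY-MM-DD date (assumed UTC) as integer nanoseconds since the epoch."""
    moment = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp()) * NANOS_PER_SECOND


def from_unix_nanos(nanos):
    """Decode integer nanoseconds since the epoch back into a YYYY-MM-DD date."""
    moment = datetime.fromtimestamp(nanos // NANOS_PER_SECOND, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


encoded = to_unix_nanos("2021-06-26")   # forward direction: date -> number
decoded = from_unix_nanos(encoded)      # reverse direction: number -> date
print(encoded, decoded)
```

The encoded value is consistent with the first row of the transformed last_login column shown earlier (about 1.624666e+18).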
To learn more about reversible data transformations, visit the RDT Documentation.
The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:
Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
This release adds a parameter to the UnixTimestampEncoder and OptimizedTimestampEncoder, called enforce_min_max_values. When this is set to True, it clips all values in the reverse transformed data to the min and max datetimes seen in the fitted data.
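The clipping behavior described above can be pictured with a small sketch in plain Python. This is only an illustration under stated assumptions, not the library's code: fit_bounds and clip_to_bounds are hypothetical helpers, and timestamps are represented as plain integers.

```python
def fit_bounds(timestamps):
    """Remember the smallest and largest timestamp seen while fitting."""
    return min(timestamps), max(timestamps)


def clip_to_bounds(values, bounds):
    """Pull every value back into the fitted [low, high] range."""
    low, high = bounds
    return [min(max(value, low), high) for value in values]


# Timestamps (in seconds) seen during fitting.
fitted = [1_601_078_400, 1_612_915_200, 1_624_665_600]
bounds = fit_bounds(fitted)

# Values arriving at reverse-transform time; the first and last are out of range.
clipped = clip_to_bounds([1_590_000_000, 1_610_000_000, 1_700_000_000], bounds)
print(clipped)
```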
This release also internally adds support for multi-column transformers!
This release adds the 'random' missing value replacement strategy, which uses random values of the dataset to fill in missing values.
Additionally, users are now able to use the UniformUnivariate distribution within the GaussianNormalizer with this update.
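As a rough illustration of a 'random' replacement strategy, the sketch below fills each missing entry with a value drawn at random from the observed entries. This is an assumption about the general idea only; the library may draw its replacement values differently, and fill_missing_randomly is a hypothetical helper.

```python
import random


def fill_missing_randomly(values, rng):
    """Replace each None with a randomly chosen observed (non-missing) value."""
    observed = [value for value in values if value is not None]
    if not observed:
        # Nothing to sample from, so return the data unchanged.
        return list(values)
    return [rng.choice(observed) if value is None else value for value in values]


rng = random.Random(0)  # seeded so the sketch is repeatable
filled = fill_missing_randomly([99.99, None, 2.50, 25.00, 19.99], rng)
print(filled)
```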
This release contains fixes for the ClusterBasedNormalizer, which crashed in the reverse transform when values were out of bounds, and a patch for a randomization issue that produced different values after applying reset_randomization.
Anonymization has been moved into the RDT library from SDV, as it was found to be a self-contained module for RDT and moving it reduces the dependencies needed in SDV.
The FrequencyEncoder transformer will no longer be supported in future versions of RDT. Please use the UniformEncoder transformer instead.
GaussianNormalizer distribution option names have been updated to be consistent with scipy: gaussian -> norm, student_t -> t, and truncated_gaussian -> truncnorm.
This release adds 3 new transformers:
UniformEncoder - A categorical and boolean transformer that converts the column into a uniform distribution.
OrderedUniformEncoder - The same as above, but the order for the categories can be specified, changing which range in the uniform distribution each category belongs to.
IDGenerator - A text transformer that drops the input column during transform and returns IDs during reverse transform. The IDs all take the form <prefix><number><suffix> and can be configured with a custom prefix, suffix and starting point.
Additionally, the AnonymizedFaker is enhanced to support the text sdtype.
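One way to picture how a category can be mapped into a uniform distribution, and how a specified order changes the range each category receives, is the sketch below. It is written in plain Python as an illustration of the general technique, not as the library's implementation; fit_intervals, encode and decode are hypothetical names.

```python
import random
from collections import Counter


def fit_intervals(values, order=None):
    """Give each category a slice of [0, 1] proportional to its frequency."""
    counts = Counter(values)
    categories = list(order) if order is not None else sorted(counts)
    intervals, start = {}, 0.0
    for category in categories:
        end = start + counts[category] / len(values)
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(values, intervals, rng):
    """Replace each category with a random number inside its slice."""
    return [rng.uniform(*intervals[value]) for value in values]


def decode(numbers, intervals):
    """Map each number back to the category whose slice contains it."""
    decoded = []
    for number in numbers:
        match = None
        for category, (start, end) in intervals.items():
            if start <= number <= end:
                match = category
                break
        decoded.append(match)
    return decoded


cards = ["VISA", "VISA", "AMEX", "DISCOVER"]
intervals = fit_intervals(cards, order=["AMEX", "DISCOVER", "VISA"])
numbers = encode(cards, intervals, random.Random(0))
print(intervals)
print(decode(numbers, intervals))
```

Passing a different order only changes which slice each category occupies; the round trip back to the original categories works the same way.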
The get_input_sdtype method is being deprecated in favor of get_supported_sdtypes.
This release updates the default transformers used for certain sdtypes. It also enables the AnonymizedFaker and PseudoAnonymizedFaker to work with any sdtype besides boolean, categorical, datetime, numerical or text.
This release adds the ability to generate missing values to the AnonymizedFaker. Users can now provide the missing_value_generation parameter during initialization. They can set it to None to not generate any missing values, or 'random' to generate random missing values in the same proportion as the fitted data.
Additionally, this release improves the NullTransformer by allowing nulls to be replaced on the forward transform even if missing_value_generation is set to None. It also fixes a bug that was causing the UnixTimestampEncoder to return a different dtype than the input on reverse_transform. This was particularly problematic when datetime columns are represented as ints.
This release adds a new parameter called missing_value_generation to the initialization of certain transformers to specify how missing values should be created. The parameter can be used in the FloatFormatter, BinaryEncoder, UnixTimestampEncoder, OptimizedTimestampEncoder, GaussianNormalizer and ClusterBasedNormalizer. Additionally, it fixes a bug that was causing every column that had nulls to generate them in the same place.
The model_missing_values parameter is being deprecated in favor of the new missing_value_generation parameter.
This release fixes a bug that caused datetime and numerical transformers to crash if a column was all NaNs. Additionally, it adds support for Pandas 2.0!
This release patches an issue that prevented the RegexGenerator from working with regexes that had a very large number of possible combinations.
This release adds a couple of new features including adding the OrderedLabelEncoder and deprecating the CustomLabelEncoder. It also adds a change that makes all generator type transformers in the HyperTransformer use a different random seed.
Additionally, bugs were patched in the RegexGenerator that caused it to crash or take too long in certain cases. Finally, this release improved the detection of Faker functions in the AnonymizedFaker.
This release makes changes to the way that individual transformers are stored in the HyperTransformer. When accessing the config via HyperTransformer.get_config(), the transformers listed in the config are now the actual transformer instances used during fitting and transforming. These instances can now be accessed and used to examine their properties post fitting. For example, you can now view the mapping for a PseudoAnonymizedFaker instance using PseudoAnonymizedFaker.get_mapping() on the instance retrieved from the config.
Additionally, the output of reverse_transform no longer appends the .value suffix to every unnamed output column. Only output columns that are created from context extracted from the input columns will have suffixes (e.g. .normalized in the ClusterBasedNormalizer).
The AnonymizedFaker and RegexGenerator now have an enforce_uniqueness parameter, which controls whether the data returned by reverse_transform should be unique. The HyperTransformer now has a method called create_anonymized_columns that can be used to generate columns that are matched with anonymizing transformers like AnonymizedFaker and RegexGenerator. The method can be used as follows:
HyperTransformer.create_anonymized_columns(num_rows=5, column_names=['email_optin', 'credit_card'])
Another major change in this release is the ability to control randomization. Every time a HyperTransformer is initialized, its randomness will be reset to the same seed, and it will yield the same results for reverse_transform if given the same input. Every subsequent call to reverse_transform yields a different result. If a user desires to reset the seed, they can call HyperTransformer.reset_randomization.
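The seeding behavior described above can be mimicked with Python's standard random module. The class below is a hypothetical stand-in used only to show the pattern (same seed on creation, a different draw on each subsequent call, and a reset that restores the initial state); it does not reflect how the HyperTransformer is implemented.

```python
import random


class SeededGenerator:
    """Toy generator that always starts from the same seed."""

    def __init__(self, seed=42):
        self._seed = seed
        self._rng = random.Random(seed)

    def reset_randomization(self):
        """Return the generator to the state it had right after creation."""
        self._rng = random.Random(self._seed)

    def generate(self, num_rows):
        """Each call advances the internal state, so repeated calls differ."""
        return [self._rng.randint(0, 9999) for _ in range(num_rows)]


first = SeededGenerator()
second = SeededGenerator()

first_call = first.generate(3)
second_call = first.generate(3)      # continues the sequence, so it will almost certainly differ
first.reset_randomization()
after_reset = first.generate(3)      # same as first_call again

print(first_call == second.generate(3), first_call == after_reset)
```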
Finally, this release adds support for Python 3.10 and drops support for 3.6.
This release fixes a bug that caused the UnixTimestampEncoder to return data with the incorrect datetime format. It also fixes a bug that caused the null column not to be reverse transformed when using the UnixTimestampEncoder when the missing_value_replacement was not set.
This release adds a new transformer called the PseudoAnonymizedFaker. This transformer enables the pseudo-anonymization of your data by mapping all of a column's original values to fake values that get returned during the reverse transformation process. Each original value is always mapped to the same fake value.
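The key property of pseudo-anonymization, that each original value always maps to the same fake value, can be sketched with a dictionary. The class below is a hypothetical illustration in plain Python; it generates simple counter-based tokens rather than realistic fake values, and its method names are assumptions made for this sketch.

```python
import itertools


class PseudoAnonymizer:
    """Toy mapper that assigns one stable fake token per original value."""

    def __init__(self, prefix="user_"):
        self._prefix = prefix
        self._counter = itertools.count()
        self._mapping = {}

    def fit(self, values):
        """Create a fake token for every distinct original value."""
        for value in values:
            if value not in self._mapping:
                self._mapping[value] = f"{self._prefix}{next(self._counter)}"

    def anonymize(self, values):
        """Look up the fake token for each original value."""
        return [self._mapping[value] for value in values]

    def get_mapping(self):
        """Return a copy of the original -> fake mapping."""
        return dict(self._mapping)


emails = ["ana@example.com", "bob@example.com", "ana@example.com"]
anonymizer = PseudoAnonymizer()
anonymizer.fit(emails)
print(anonymizer.anonymize(emails))
```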
Additionally, this release enables the HyperTransformer to use categorical transformers on boolean columns. It also introduces a new parameter called computer_representation to the FloatFormatter that will allow for values to be clipped to certain bounds based on the computer type used for a numerical column.
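To illustrate what clipping to the bounds of a computer type means, the sketch below keeps values inside the range of a fixed-width integer type. The representation names used as dictionary keys ("Int8", "UInt8" and so on) and the helper clip_to_representation are assumptions made for this example, not values confirmed by the text above.

```python
# Bounds of some fixed-width integer types (key names are illustrative).
INTEGER_BOUNDS = {
    "Int8": (-2**7, 2**7 - 1),
    "Int16": (-2**15, 2**15 - 1),
    "Int32": (-2**31, 2**31 - 1),
    "Int64": (-2**63, 2**63 - 1),
    "UInt8": (0, 2**8 - 1),
    "UInt16": (0, 2**16 - 1),
}


def clip_to_representation(values, representation):
    """Clip each value into the range the chosen integer type can store."""
    low, high = INTEGER_BOUNDS[representation]
    return [min(max(value, low), high) for value in values]


print(clip_to_representation([-500, 29, 300], "Int8"))
print(clip_to_representation([-500, 29, 300], "UInt8"))
```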
Finally, this release patches a bug that caused unpredictable results from the reverse_transform method of the FrequencyEncoder when add_noise is enabled.
This release adds multiple new transformers: the CustomLabelEncoder and the RegexGenerator. The CustomLabelEncoder works similarly to the LabelEncoder, except it allows users to provide the order of the categories. The RegexGenerator allows users to specify a regex pattern and will generate values that match that pattern.
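Label encoding with a user-provided category order can be sketched in a few lines of plain Python. The helpers fit_label_mapping, encode_labels and decode_labels are hypothetical and only illustrate the idea of assigning integer codes in a chosen order; they are not the library's API.

```python
def fit_label_mapping(order):
    """Assign integer codes to categories following the given order."""
    return {category: code for code, category in enumerate(order)}


def encode_labels(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[value] for value in values]


def decode_labels(codes, mapping):
    """Replace each integer code with its original category."""
    inverse = {code: category for category, code in mapping.items()}
    return [inverse[code] for code in codes]


mapping = fit_label_mapping(["small", "medium", "large"])
codes = encode_labels(["large", "small", "medium", "large"], mapping)
print(codes)
print(decode_labels(codes, mapping))
```

Because the codes follow the supplied order, numeric comparisons on the encoded column agree with the intended ranking of the categories.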
This release also improves current transformers. The LabelEncoder now has a parameter called order_by that allows users to specify the ordering scheme for their data (e.g. order numerically or alphabetically). The LabelEncoder also now has a parameter called add_noise that allows users to specify whether or not uniform noise should be added to the transformed data. Performance enhancements were made for the GaussianNormalizer by removing an unnecessary distribution search, and the FloatFormatter will no longer round values to any place higher than the ones place by default.
The main update of this release is the introduction of a config, which describes the sdtypes and transformers that will be used by the HyperTransformer for each column of the data, where sdtype stands for the semantic or statistical meaning of a datatype. The user can interact with this config through the newly created methods update_sdtypes, get_config, set_config, update_transformers, update_transformers_by_sdtype and remove_transformer_by_sdtype.
This release also included various new features and updates, including:
transform_subset and reverse_transform_subset.
transform, reverse_transform, update_sdtypes, update_transformers, set_config.
GaussianNormalizer.fit and FrequencyEncoder.transform.
model_missing_values = False in a transformer was updated to keep track of the percentage of missing values, instead of producing data containing NaNs.
HyperTransformer.
get_demo was improved to be more intuitive.
Finally, a number of transformers were redesigned to be more user friendly. Among them, the following transformers have also been renamed:
BayesGMMTransformer -> ClusterBasedNormalizer
GaussianCopulaTransformer -> GaussianNormalizer
DateTimeRoundedTransformer -> OptimizedTimestampEncoder
DateTimeTransformer -> UnixTimestampEncoder
NumericalTransformer -> FloatFormatter
LabelEncodingTransformer -> LabelEncoder
OneHotEncodingTransformer -> OneHotEncoder
CategoricalTransformer -> FrequencyEncoder
BooleanTransformer -> BinaryEncoder
PIIAnonymizer -> AnonymizedFaker
This release fixes multiple bugs concerning the HyperTransformer. One is that the get_transformer_tree_yaml method no longer crashes on every call. Another is that calling the update_field_data_types and update_default_data_type_transformers after fitting no longer breaks the transform method.
The HyperTransformer now sorts its outputs for both transform and reverse_transform based on the order of the input's columns. It is also now possible to create transformers that simply drop columns during transform and don't return any new columns.
This release adds a new module to the RDT library called performance. This module can be used to evaluate the speed and peak memory usage of any transformer in RDT. This release also increases the maximum acceptable version of scikit-learn to make it more compatible with other libraries in the SDV ecosystem. On top of that, it fixes a bug related to a new version of pandas.
This release adds a new BayesGMMTransformer. This transformer can be used to convert a numerical column into two columns: a discrete column indicating the selected component of the GMM for each row, and a continuous column containing the normalized value of each row based on the mean and std of the selected component. It is useful when the column being transformed came from multiple distributions.
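The two output columns described above can be illustrated with a simplified sketch. It assumes the mixture components (mean and std pairs) are already known, picks the component with the highest Gaussian density for each value, and normalizes with a plain (x - mean) / std. The actual transformer fits the mixture from data and may select components and scale values differently; the function names here are hypothetical.

```python
import math


def gaussian_density(x, mean, std):
    """Probability density of x under a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def transform_value(x, components):
    """Return (component index, value normalized by that component)."""
    index = max(
        range(len(components)),
        key=lambda i: gaussian_density(x, *components[i]),
    )
    mean, std = components[index]
    return index, (x - mean) / std


def reverse_value(index, normalized, components):
    """Undo the normalization using the selected component."""
    mean, std = components[index]
    return normalized * std + mean


# Two assumed components given as (mean, std): small and large purchases.
components = [(20.0, 2.0), (100.0, 10.0)]

index, normalized = transform_value(110.0, components)
print(index, normalized, reverse_value(index, normalized, components))
```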
This release also adds multiple new methods to the HyperTransformer API. These allow for users to access the specific transformers used on each input field, as well as view the entire tree of transformers that are used when running transform.
The exact methods are:
BaseTransformer.get_input_columns() - Return list of input columns for a transformer.
BaseTransformer.get_output_columns() - Return list of output columns for a transformer.
HyperTransformer.get_transformer(field) - Return the transformer instance used for a field.
HyperTransformer.get_output_transformers(field) - Return dictionary mapping output columns of a field to the transformers used on them.
HyperTransformer.get_final_output_columns(field) - Return list of all final output columns related to a field.
HyperTransformer.get_transformer_tree_yaml() - Return YAML representation of transformers tree.
Additionally, this release fixes a bug where the HyperTransformer was incorrectly raising a NotFittedError. It also improved the DatetimeTransformer by autonomously detecting if a column needs to be converted from dtype object to dtype datetime.
This release adds support for Python 3.9! It also removes unused document files.
This release makes major changes to the underlying code for RDT as well as the API for both the HyperTransformer and BaseTransformer.
The changes enable the following functionality:
HyperTransformer can now apply a sequence of transformers to a column.
pandas.dtypes.
HyperTransformer.transform.
HyperTransformer will continuously apply transformations to the input fields until only acceptable data types are in the output.
To take advantage of this functionality, the following API changes were made:
HyperTransformer has new initialization parameters that allow users to specify data types for any field in their data as well as specify which transformer to use for a field or data type. The parameters are:
field_transformers - A dictionary allowing users to specify which transformer to use for a field or derived field. Derived fields are fields created by running transform on the input data.
field_data_types - A dictionary allowing users to specify the data type of a field.
default_data_type_transformers - A dictionary allowing users to specify the default transformer to use for a data type.
transform_output_types - A dictionary allowing users to specify which data types are acceptable for the output of transform. This is a result of the fact that transformers can now be applied in a sequence, and not every transformer will return numeric data.
HyperTransformer to allow these parameters to be modified. These include get_field_data_types, update_field_data_types, get_default_data_type_transformers, update_default_data_type_transformers and set_first_transformers_for_fields.
BaseTransformer now requires the column names it will transform to be provided to fit, transform and reverse_transform.
BaseTransformer added the following method to allow for users to see its output fields and output types: get_output_types.
BaseTransformer added the following method to allow for users to see the next suggested transformer for each output field: get_next_transformers.
On top of the changes to the API and the capabilities of RDT, many automated checks and tests were also added to ensure that contributions to the library abide by the current code style, stay performant and result in data of a high quality. These tests run on every push to the repository. They can also be run locally via the following functions:
validate_transformer_code_style - Checks that new code follows the code style.
validate_transformer_quality - Tests that new transformers yield data that maintains relationships between columns.
validate_transformer_performance - Tests that new transformers don't take too much time or memory.
validate_transformer_unit_tests - Checks that the unit tests cover all new code, follow naming conventions and pass.
validate_transformer_integration - Checks that the integration tests follow naming conventions and pass.
This release fixes a bug with learning rounding digits in the NumericalTransformer, and includes a few housekeeping improvements.
This release fixes a couple of bugs introduced by the previous release regarding the OneHotEncodingTransformer and the BooleanTransformer.
This release improves the overall performance of the library, both in terms of memory and time consumption. More specifically, it makes the following modules more efficient: NullTransformer, DatetimeTransformer, LabelEncodingTransformer, NumericalTransformer, CategoricalTransformer, BooleanTransformer and OneHotEncodingTransformer.
It also adds performance-based testing and a script for profiling the performance.
This release updates the NumericalTransformer by adding a new rounding argument. Users can now obtain numerical values with precision, either pre-specified or automatically computed from the given data.
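As an illustration of how a rounding precision could be computed automatically from data, the sketch below finds the largest number of decimal places among the finite input values and rounds new values to that precision. It is a simplified example in plain Python under stated assumptions, with learn_rounding_digits as a hypothetical helper, and it is not the library's algorithm.

```python
import math
from decimal import Decimal


def learn_rounding_digits(values):
    """Return the largest number of decimal places used by any finite value."""
    digits = 0
    for value in values:
        # Skip NaN and infinity, which have no meaningful decimal places.
        if isinstance(value, float) and not math.isfinite(value):
            continue
        exponent = Decimal(str(value)).as_tuple().exponent
        digits = max(digits, -exponent)
    return digits


learned = learn_rounding_digits([99.99, 2.5, 25.0, 19.99, float("nan")])
rounded = round(36.8666, learned)
print(learned, rounded)
```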
rounding argument to NumericalTransformer - Issue #166 by @amontanez24 and @csala
NumericalTransformer rounding error with infinity - Issue #169 by @amontanez24
This release adds a new method to the CategoricalTransformer to solve a bug where the transformer becomes unusable after being pickled and unpickled if it had NaN values in the data which it was fit on.
It also fixes some grammar mistakes in the documentation.
This release improves the HyperTransformer memory usage when working with a high number of columns or a high number of categorical values when using one hot encoding.
Boolean, Datetime and LabelEncoding transformers fail with 2D ndarray - Issue #160 by @pvk-developer
HyperTransformer: Memory usage increase when reverse_transform is called - Issue #156 by @pvk-developer and @AnupamaGangadhar
In this release a change in the HyperTransformer allows using it to transform and reverse transform a subset of the columns seen during training.
The anonymization functionality which was deprecated and not being used has also been removed along with the Faker dependency.
This release changes the behavior of the HyperTransformer to prevent it from modifying any column in the given DataFrame if the transformers dictionary is passed empty.
This release adds a new argument to the HyperTransformer which gives control over which transformers to use by default for each dtype if no specific transformer has been specified for the field.
This is also the first version to be officially released on conda.
dtype_transformers argument to HyperTransformer - Issue #148 by @csala
This release fixes a bug that prevented the CategoricalTransformer from working properly when being passed data that contained numerical data only, without any strings, but also contained None or NaN values.
This release fixes a few minor bugs, including some which prevented RDT from fully working on Windows systems.
Thanks to these fixes, as well as a new testing infrastructure that has been set up, RDT is now officially supported on Windows systems, in addition to the Linux and macOS systems which were previously supported.
In this release we drop support for the now officially dead Python 3.5 and introduce a new feature in the DatetimeTransformer. The feature reduces the dimensionality of the generated numerical values while ensuring that the reverted datetimes maintain the same level of time unit precision as the original ones.
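One way to shrink timestamp values without losing precision is to divide them by the largest power of ten that divides all of them evenly, and multiply back when reversing. The sketch below illustrates that idea in plain Python; it is an assumption about the general approach rather than a description of the library's code, and find_divider is a hypothetical helper.

```python
def find_divider(timestamps):
    """Return the largest power of ten that divides every timestamp exactly."""
    divider = 1
    if not any(timestamps):
        # All zeros (or no data): nothing to reduce.
        return divider
    while all(t % (divider * 10) == 0 for t in timestamps):
        divider *= 10
    return divider


# Nanosecond timestamps for 2021-06-26 and 2021-02-10 (midnight UTC).
timestamps = [1_624_665_600_000_000_000, 1_612_915_200_000_000_000]

divider = find_divider(timestamps)
reduced = [t // divider for t in timestamps]      # smaller numbers, same information
restored = [value * divider for value in reduced]  # exact original timestamps

print(divider, reduced, restored == timestamps)
```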
Minor bugfix release.
column_name in hypertransformer - Issue #110 by @csala
This version comes with a brand new API and internal implementation, removing the old metadata JSON from the user provided arguments, and making each transformer work only with pandas.Series of their corresponding data type.
As part of this change, several transformer names have been changed and a new BooleanTransformer and a feature to automatically decide which transformers to use based on dtypes have been added.
Unit test coverage has also been increased to 100%.
Special thanks to @JDTheRipperPC and @csala for the big efforts put in making this release possible.
col_meta argument from method-level to class-level.
HyperTransformer.
NullTransformer.
NumberTransformer to set default value to 0 when the column is null.
get_types and impute_table from HyperTransformer.
rdt.utils into HyperTransformer class.