Phi_K correlation analyzer library
Publication: `[official] <https://www.sciencedirect.com/science/article/abs/pii/S0167947320301341>`_, `[arXiv pre-print] <https://arxiv.org/abs/1811.11440>`_

Phi_K is a practical correlation coefficient that works consistently between categorical, ordinal and interval variables. It is based on several refinements to Pearson's hypothesis test of independence of two variables. Essentially, the contingency test statistic of two variables is treated as coming from a rotated bi-variate normal distribution, where the tilt is interpreted as Phi_K.
The combined features of Phi_K form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables. Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution. These are useful features when studying the correlation matrix of variables with mixed types.
For details on the methodology behind the calculations, please see our publication. It places emphasis on the proper evaluation of the statistical significance of correlations and on the interpretation of variable relationships in a contingency table, in particular for low-statistics samples. The presented algorithms are easy to use and available through this public Python library.
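
The starting point mentioned above, Pearson's test of independence, can be illustrated with a minimal sketch. It computes only the plain chi-square statistic and the Pearson residuals of a contingency table; it is not Phi_K itself and does not reproduce the library's significance calculations. The function name ``chi2_and_residuals`` and the counts are hypothetical and chosen purely for illustration, and the sketch assumes every expected count is non-zero.

.. code-block:: python

   from math import sqrt


   def chi2_and_residuals(observed):
       """Return Pearson's chi-square statistic and the Pearson residuals
       for a contingency table given as a list of rows of counts."""
       row_totals = [sum(row) for row in observed]
       col_totals = [sum(col) for col in zip(*observed)]
       n = sum(row_totals)

       chi2 = 0.0
       residuals = []
       for i, row in enumerate(observed):
           res_row = []
           for j, obs in enumerate(row):
               # expected count if the two variables were independent
               expected = row_totals[i] * col_totals[j] / n
               res_row.append((obs - expected) / sqrt(expected))
               chi2 += (obs - expected) ** 2 / expected
           residuals.append(res_row)
       return chi2, residuals


   # Hypothetical counts for two binned variables (2 x 2 table)
   table = [[30, 10],
            [10, 30]]
   chi2, residuals = chi2_and_residuals(table)
   print(f"chi2 = {chi2:.1f}")  # prints: chi2 = 20.0

A table whose rows are proportional to each other, such as ``[[10, 20], [20, 40]]``, gives a statistic of zero, which corresponds to no observed dependency. The refinements that turn this statistic into Phi_K are described in the publication.
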
.. list-table::
   :widths: 60 40
   :header-rows: 1

   * - Tutorial notebook
     - Google Colab
   * - `basic tutorial <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_basic.ipynb>`_
     - `basic on colab <https://colab.research.google.com/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_basic.ipynb>`_
   * - `advanced tutorial (detailed configuration) <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_advanced.ipynb>`_
     - `advanced on colab <https://colab.research.google.com/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_advanced.ipynb>`_
   * - `spark tutorial <https://nbviewer.jupyter.org/github/KaveIO/PhiK/blob/master/phik/notebooks/phik_tutorial_spark.ipynb>`_
     -

The entire Phi_K documentation, including tutorials, can be found at `read-the-docs <https://phik.readthedocs.io>`_.

See the tutorials for detailed examples on how to run the code with pandas. We also have one example on how to
calculate the Phi_K correlation matrix for a Spark dataframe.
The Phi_K library requires Python >= 3.7 and is pip friendly. To get started, simply do:

.. code-block:: bash

   $ pip install phik

or check out the code from our GitHub repository:

.. code-block:: bash

   $ git clone https://github.com/KaveIO/PhiK.git
   $ pip install -e PhiK/

where in this example the code is installed in editable mode (option -e).
You can now use the package in Python with:

.. code-block:: python

   import phik

Congratulations, you are now ready to use the PhiK correlation analyzer library!
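As a quick example, you can do:

.. code-block:: python

   import pandas as pd
   import phik
   from phik import resources, report

   # open the fake car insurance data shipped with the package
   df = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))
   df.head()

   # Pearson correlation matrix of the numeric columns (plain pandas);
   # selecting numeric columns first avoids errors on non-numeric data
   df.select_dtypes('number').corr()

   # Phi_K correlation matrix between all variables
   df.phik_matrix()

   # global correlations derived from the Phi_K correlation matrix
   df.global_phik()

   # significance matrix of the dependency test for each variable pair
   df.significance_matrix()

   # contingency table of two columns
   cols = ['mileage', 'car_size']
   df[cols].hist2d()

   # outlier significances of the contingency test applied to cols
   df[cols].outlier_significance_matrix()

   # outlier significances for every variable pair
   df.outlier_significance_matrices()

   # generate a Phi_K correlation report and save it as test.pdf
   report.correlation_report(df, pdf_file_name='test.pdf')
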
For all available examples, please see the `tutorials <https://phik.readthedocs.io/en/latest/tutorials.html>`_ at read-the-docs.
Please note that support is (only) provided on a best-effort basis.