scikit_na
- scikit_na.correlate(data: pandas.core.frame.DataFrame, columns: Optional[Sequence] = None, drop: bool = True, **kwargs) → pandas.core.frame.DataFrame
Calculate correlations between columns in terms of NA values.
- Parameters
data (DataFrame) – Input data.
columns (Optional[List, ndarray, Index] = None) – Column names.
drop (bool = True, optional) – Drop columns without NA values.
kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame.corr() method.
- Returns
Correlation values.
- Return type
DataFrame
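To illustrate what "correlations in terms of NA values" means, here is a minimal sketch that uses plain pandas rather than scikit_na itself. It assumes the result is comparable to correlating the boolean missingness indicators of each column; the toy column names `a`, `b`, and `c` are hypothetical, and the library's actual implementation may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' and 'b' are missing in the same rows, 'c' is complete.
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [5.0, np.nan, 7.0, np.nan],
    "c": [1, 2, 3, 4],
})

# Missingness indicators: True where a value is NA.
na_flags = data.isna()

# Mirror the idea of drop=True: keep only columns with at least one NA value.
na_flags = na_flags.loc[:, na_flags.any()]

# Correlate the 0/1 indicators; extra keyword arguments would go to corr().
na_corr = na_flags.astype(int).corr()
print(na_corr)
```

A value close to 1 suggests that two columns tend to be missing in the same rows.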
- scikit_na.describe(data: pandas.core.frame.DataFrame, col_na: str, columns: Optional[Sequence] = None, na_mapping: Optional[dict] = None) → pandas.core.frame.DataFrame
Describe data grouped by a column with NA values.
- Parameters
data (DataFrame) – Input data.
col_na (str) – Column with NA values to group the other data by.
columns (Optional[Sequence]) – Columns to calculate descriptive statistics on.
na_mapping (dict, optional) – Dictionary with NA mappings. Defaults to {True: "NA", False: "Filled"}.
- Returns
Descriptive statistics (mean, median, etc.).
- Return type
DataFrame
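The grouping idea behind this function can be sketched with plain pandas. This is an illustration under assumptions, not the library's code: the columns `income` and `age` are hypothetical, and the layout of the table returned by scikit_na.describe() may differ from the one produced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' has NA values, 'age' is complete.
data = pd.DataFrame({
    "income": [np.nan, 50.0, np.nan, 70.0],
    "age": [20, 30, 40, 50],
})

# Label each row by whether 'income' is missing, using the default mapping.
na_mapping = {True: "NA", False: "Filled"}
groups = data["income"].isna().map(na_mapping)

# Descriptive statistics of 'age' within each missingness group.
stats = data.groupby(groups)["age"].describe()
print(stats)
```

Comparing the two rows shows whether `age` is distributed differently when `income` is missing.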
- scikit_na.model(data: pandas.core.frame.DataFrame, col_na: str, columns: Optional[Sequence] = None, intercept: bool = True, fit_kws: Optional[dict] = None, logit_kws: Optional[dict] = None)
Logistic regression modeling.
Fit a logistic regression model to the missingness of column col_na, encoded as 0 (non-missing) and 1 (NA), using the predictors passed in the columns argument. The statsmodels package is used as the backend for model fitting.
- Parameters
data (DataFrame) – Input data.
col_na (str) – Column with NA values to use as a dependent variable.
columns (Optional[Sequence]) – Columns to use as independent variables.
intercept (bool, optional) – Fit intercept.
fit_kws (dict, optional) – Keyword arguments passed to the fit() method of the model.
logit_kws (dict, optional) – Keyword arguments passed to the statsmodels.discrete.discrete_model.Logit() class.
- Returns
Fitted model (the result of calling the fit() method).
- Return type
statsmodels.discrete.discrete_model.BinaryResultsWrapper
Example
>>> import scikit_na as na
>>> model = na.model(
...     data,
...     col_na='column_with_NAs',
...     columns=['age', 'height', 'weight'])
>>> model.summary()
- scikit_na.report(data: pandas.core.frame.DataFrame, columns: Optional[Sequence[str]] = None, layout: Optional[ipywidgets.widgets.widget_layout.Layout] = None, round_dec: int = 2, corr_kws: Optional[dict] = None, heat_kws: Optional[dict] = None, dist_kws: Optional[dict] = None)
Interactive report.
- Parameters
data (DataFrame) – Input data.
columns (Optional[Sequence[str]], optional) – Column names.
layout (widgets.Layout, optional) – Layout object for use in GridBox.
round_dec (int, optional) – Number of decimals for rounding.
corr_kws (dict, optional) – Keyword arguments passed to scikit_na.altair.plot_corr().
heat_kws (dict, optional) – Keyword arguments passed to scikit_na.altair.plot_heatmap().
dist_kws (dict, optional) – Keyword arguments passed to scikit_na.altair.plot_hist().
- Returns
Interactive report with multiple tabs.
- Return type
widgets.Tab
- scikit_na.stairs(data: pandas.core.frame.DataFrame, columns: Optional[Sequence] = None, xlabel: str = 'Columns', ylabel: str = 'Instances', tooltip_label: str = 'Size difference', dataset_label: str = '(Whole dataset)')
DataFrame shrinkage on cumulative pandas.DataFrame.dropna().
- Parameters
data (DataFrame) – Input data.
columns (Optional[Sequence], optional) – Column names.
xlabel (str, optional) – X axis label.
ylabel (str, optional) – Y axis label.
tooltip_label (str, optional) – Tooltip label.
dataset_label (str, optional) – Label for a whole dataset.
- Returns
Dataset shrinkage results after cumulative pandas.DataFrame.dropna().
- Return type
DataFrame
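"Cumulative dropna" can be read as dropping rows with NA values one column at a time and recording how many rows remain after each step. The sketch below shows that idea with plain pandas. It is an assumption about the general technique, not the library's implementation: the toy columns are hypothetical, the column order is fixed by hand, and the output column names simply reuse the default labels from the signature.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NA values spread over three columns.
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# Start with the whole dataset, then drop NA rows column by column.
rows = [("(Whole dataset)", len(data))]
remaining = data
for col in ["a", "b", "c"]:
    remaining = remaining.dropna(subset=[col])
    rows.append((col, len(remaining)))

shrinkage = pd.DataFrame(rows, columns=["Columns", "Instances"])

# Row-count change introduced by each additional column.
shrinkage["Size difference"] = shrinkage["Instances"].diff().fillna(0).astype(int)
print(shrinkage)
```

Each step shows how many more rows are lost once the next column is also required to be non-missing.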
- scikit_na.summary(data: pandas.core.frame.DataFrame, columns: Optional[Sequence] = None, per_column: bool = True, round_dec: int = 2) → pandas.core.frame.DataFrame
Summary statistics on NA values.
- Parameters
data (DataFrame) – Data object.
columns (Optional[Sequence]) – Columns or indices to observe.
per_column (bool = True, optional) – Show statistics for each selected column.
round_dec (int = 2, optional) – Number of decimals for rounding.
- Returns
Summary on NA values in the input data.
- Return type
DataFrame
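As a rough illustration of per-column NA statistics, the following sketch computes NA counts and percentages with plain pandas. The statistic names `na_count` and `na_pct` are hypothetical; the set of statistics and the table layout returned by scikit_na.summary() are not specified here and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
})

round_dec = 2  # same role as the round_dec argument

# Number of NA values per column.
na_counts = data.isna().sum()

# One row per column: NA count and NA share of all rows, in percent.
na_summary = pd.DataFrame({
    "na_count": na_counts,
    "na_pct": (na_counts / len(data) * 100).round(round_dec),
})
print(na_summary)
```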
- scikit_na.test_hypothesis(data: pandas.core.frame.DataFrame, col_na: str, test_fn: callable, test_kws: Optional[dict] = None, columns: Optional[Union[Sequence[str], Dict[str, callable]]] = None, dropna: bool = True) → Dict[str, object]
Test a statistical hypothesis.
This function can be used to find evidence against the missing completely at random (MCAR) mechanism by comparing two samples of a column, grouped by missingness in another column.
- Parameters
data (DataFrame) – Input data.
col_na (str) – Column to group values by. The pandas.Series.isna() method is applied before grouping.
columns (Optional[Union[Sequence[str], Dict[str, callable]]]) – Columns to test hypotheses on.
test_fn (callable, optional) – Function to test hypothesis on NA/non-NA data. Must be a two-sample test function that accepts two arrays and (optionally) keyword arguments, such as scipy.stats.mannwhitneyu().
test_kws (dict, optional) – Keyword arguments passed to the test_fn function.
dropna (bool = True, optional) – Drop NA values in two samples before running a hypothesis test.
- Returns
Dictionary with test results as column => test function output.
- Return type
Dict[str, object]
Example
>>> import scikit_na as na
>>> import pandas as pd
>>> import scipy.stats as ss
>>> data = pd.read_csv('some_dataset.csv')
>>> # Simple example
>>> na.test_hypothesis(
...     data,
...     col_na='some_column_with_NAs',
...     columns=['age', 'height', 'weight'],
...     test_fn=ss.mannwhitneyu)
>>> # Example with `columns` as a dictionary of column => function pairs
>>> from functools import partial
>>> import scipy.stats as st
>>> # Passing keyword arguments to functions
>>> kstest_mod = partial(st.kstest, N=100)
>>> mannwhitney_mod = partial(st.mannwhitneyu, use_continuity=False)
>>> # Running tests
>>> results = na.test_hypothesis(
...     data,
...     col_na='some_column_with_NAs',
...     columns={
...         'age': kstest_mod,
...         'height': mannwhitney_mod,
...         'weight': mannwhitney_mod})
>>> pd.DataFrame(results, index=['statistic', 'p-value'])
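The examples above depend on an external CSV file. For a self-contained view of the underlying idea, the sketch below splits one column by missingness in another and runs a two-sample test by hand, using pandas and SciPy directly. It is an assumption about the general approach, not the library's code, and the columns `income` and `age` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 'income' is missing for the younger respondents.
data = pd.DataFrame({
    "income": [np.nan, 50.0, np.nan, 70.0, np.nan, 65.0],
    "age": [20, 31, 22, 45, 25, 52],
})

# Group rows by missingness in 'income'.
is_na = data["income"].isna()

results = {}
for col in ["age"]:
    # Two samples of the tested column; NA values are dropped (as with dropna=True).
    sample_na = data.loc[is_na, col].dropna()
    sample_filled = data.loc[~is_na, col].dropna()
    results[col] = stats.mannwhitneyu(
        sample_na, sample_filled, alternative="two-sided")

print(results)
```

A small p-value for a column would count as evidence against MCAR, since the column's values differ depending on whether `income` is missing.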