Statistics
In missing data analysis, an important step is to calculate simple descriptive and aggregating statistics of missing and non-missing data for each column and for the whole dataset. Scikit-na attempts to provide useful functions for such operations.
Summary
We will use Titanic dataset that contains NA values in three columns: Age, Cabin, and Embarked.
Per column
To get a simple summary per each column, we will load a dataset using pandas
and pass it to summary()
function. The latter supports subsetting a dataset
with columns
argument. And we will make use of it to cut the width of the
results table.
import scikit_na as na
import pandas as pd
data = pd.read_csv('titanic_dataset.csv')
# Excluding three columns without NA to fit the table here
na.summary(data, columns=data.columns.difference(['SibSp', 'Parch', 'Ticket']))
Age |
Cabin |
Embarked |
Fare |
Name |
PassengerId |
Pclass |
Sex |
Survived |
|
---|---|---|---|---|---|---|---|---|---|
NA count |
177 |
687 |
2 |
0 |
0 |
0 |
0 |
0 |
0 |
NA, % (per column) |
19.87 |
77.1 |
0.22 |
0 |
0 |
0 |
0 |
0 |
0 |
NA, % (of all NAs) |
20.44 |
79.33 |
0.23 |
0 |
0 |
0 |
0 |
0 |
0 |
NA unique (per column) |
19 |
529 |
2 |
0 |
0 |
0 |
0 |
0 |
0 |
NA unique, % (per column) |
10.73 |
77 |
100 |
0 |
0 |
0 |
0 |
0 |
0 |
Rows left after dropna() |
714 |
204 |
889 |
891 |
891 |
891 |
891 |
891 |
891 |
Rows left after dropna(), % |
80.13 |
22.9 |
99.78 |
100 |
100 |
100 |
100 |
100 |
100 |
Those measures were meant to be self-explanatory:
NA count is the number of NA values in each column.
NA unique is the number of NA values in each column that are unique for it, i.e. do not intersect with NA values in the other columns (or that will remain in dataset if we drop NA values in the other columns).
Rows left after dropna() shows how many rows will be left in a dataset if we apply
pandas.Series.dropna()
method to each column.
Whole dataset
By default, summary()
function returns the results for each column. To get
the summary of missing data for the whole DataFrame, we should set per_column
argument to
False
.
na.summary(data, per_column=False)
dataset |
|
---|---|
Total columns |
12 |
Total rows |
891 |
Rows with NA |
708 |
Rows without NA |
183 |
Total cells |
10692 |
Cells with NA |
866 |
Cells with NA, % |
8.1 |
Cells with non-missing data |
9826 |
Cells with non-missing data, % |
91.9 |
Descriptive statistics
The next step is to calculate descriptive statistics for columns with quantitative and qualitative data. First, let’s filter the columns by data types:
# Presumably, qualitative data, needs checking
cols_nominal = data.columns[data.dtypes == object]
# Quantitative data
cols_numeric = data.columns[(data.dtypes == float) | (data.dtypes == int)]
We should also specify a column with missing values (NAs) that will be used to split the data in the selected columns into two groups: NA (missing) and Filled (non-missing).
Qualitative data
na.describe(data, columns=cols_nominal)
Embarked |
Name |
Sex |
Ticket |
|||||
---|---|---|---|---|---|---|---|---|
Cabin |
Filled |
NA |
Filled |
NA |
Filled |
NA |
Filled |
NA |
count |
202 |
687 |
204 |
687 |
204 |
687 |
204 |
687 |
unique |
3 |
3 |
204 |
687 |
2 |
2 |
142 |
549 |
top |
S |
S |
Levy, Mr. Rene Jacques |
Nasser, Mr. Nicholas |
male |
male |
113760 |
347082 |
freq |
129 |
515 |
1 |
1 |
107 |
470 |
4 |
7 |
Let’s check the results by hand:
data.groupby(
data['Cabin'].isna().replace({False: 'Filled', True: 'NA'}))['Sex']\
.value_counts()
Cabin |
Sex |
Count |
---|---|---|
Filled |
male |
107 |
female |
97 |
|
NA |
male |
470 |
female |
217 |
Here we take Cabin column, encode missing/non-missing data as Filled/NA, and then use it to group and count values in Sex column: among the passengers with missing cabin data, 470 were males, while 217 were females.
Quantitative data
Now, let’s look at the statistics calculated for the numeric data:
# Selecting just two columns
na.describe(data, columns=['Age', 'Fare'], col_na='Cabin')
Age |
Fare |
|||
---|---|---|---|---|
Cabin |
Filled |
NA |
Filled |
NA |
count |
185 |
529 |
204 |
687 |
mean |
35.8293 |
27.5553 |
76.1415 |
19.1573 |
std |
15.6794 |
13.4726 |
74.3917 |
28.6633 |
min |
0.92 |
0.42 |
0 |
0 |
25% |
24 |
19 |
29.4531 |
7.8771 |
50% |
36 |
26 |
55.2208 |
10.5 |
75% |
48 |
35 |
89.3282 |
23 |
max |
80 |
74 |
512.329 |
512.329 |
The mean age of passengers with missing cabin data was 27.6 years.