Filling NAs

Most Machine Learning algorithms cannot work with missing features. To handle them, we have three options:

Get rid of the rows that have an NA in the particular attribute
Get rid of the whole attribute
Set the missing values to some value (zero, the mean, the median, etc.)

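# These are alternatives, not sequential steps. Each call returns a new
# object and leaves dataset_df unchanged unless the result is assigned.
rows_dropped = dataset_df.dropna(subset=["COLUMN_WITH_NA"])  # option 1
column_dropped = dataset_df.drop("COLUMN_WITH_NA", axis=1)  # option 2
median = dataset_df["COLUMN_WITH_NA"].median()
filled = dataset_df["COLUMN_WITH_NA"].fillna(median)  # option 3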
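
To make the three options concrete, here is a minimal sketch on a small made-up DataFrame. The names toy_df, rooms, and price are hypothetical and only serve the illustration; the point is that none of the three calls modifies the original data unless the result is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the "rooms" column has one missing value.
toy_df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100, 150, 200, 175],
})

# Option 1: drop the rows where "rooms" is missing.
dropped_rows = toy_df.dropna(subset=["rooms"])

# Option 2: drop the whole "rooms" attribute.
dropped_col = toy_df.drop("rooms", axis=1)

# Option 3: fill the missing values with the median (NaN is skipped
# when the median is computed).
rooms_median = toy_df["rooms"].median()
filled = toy_df.assign(rooms=toy_df["rooms"].fillna(rooms_median))
```

Option 3 keeps every row and every attribute, which is usually why it is preferred when the data is scarce.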

Scikit-Learn provides a handy class to take care of missing values: SimpleImputer. (It replaces the older Imputer class from sklearn.preprocessing, which was deprecated in version 0.20 and has since been removed.) Here is how to use it. First, create a SimpleImputer instance, specifying that you want to replace each attribute’s missing values with the median of that attribute:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

# The median can only be computed on numerical attributes, so make sure
# dataset_df contains only numerical columns before calling the fit()
# method on it.
imputer.fit(dataset_df)

# The imputer stores the median of each attribute in its statistics_ attribute
imputer.statistics_

# Transform the dataset by replacing missing values with the learned medians
X = imputer.transform(dataset_df)

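# The result is a plain NumPy array; put it back into a DataFrame, keeping
# the original column names and row index (assumes pandas is imported as pd)
dataset_df_transformed = pd.DataFrame(X, columns=dataset_df.columns, index=dataset_df.index)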
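
The steps above can be put together in a short end-to-end sketch. It assumes a recent Scikit-Learn release, where the imputer class is SimpleImputer in sklearn.impute, and it uses a made-up DataFrame (example_df, with hypothetical columns rooms, age, and city) that mixes numerical and text attributes. Treat it as one possible way to wire the pieces together rather than the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with two numerical columns and one text column.
example_df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, np.nan, 40.0],
    "city": ["A", "B", "C", "D"],
})

# Keep only the numerical attributes, since the median is undefined for text.
numeric_df = example_df.select_dtypes(include="number")

# Learn the per-column medians and fill the missing values in one call.
median_imputer = SimpleImputer(strategy="median")
X_filled = median_imputer.fit_transform(numeric_df)

# Rebuild a DataFrame, preserving the column names and the original index.
numeric_filled = pd.DataFrame(
    X_filled, columns=numeric_df.columns, index=numeric_df.index
)

# Reattach the non-numerical columns that were set aside.
result_df = numeric_filled.join(example_df.drop(columns=numeric_df.columns))
```

Passing index=numeric_df.index matters when the DataFrame does not have a default 0..n-1 index (for example after a train/test split); without it, the join at the end would misalign rows.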