Cleaning Data

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, duplicated, or improperly formatted. Left unhandled, such data can bias results or cause errors in later steps, so it is dealt with before the analysis begins.
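As a minimal sketch of what fixing improperly formatted and duplicated records can look like, the snippet below normalizes text, coerces a numeric column, and then drops exact duplicates. The `raw` table and its `city` and `sales` columns are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical raw records: stray whitespace, inconsistent casing,
# a duplicated row, and a non-numeric entry in a numeric column
raw = pd.DataFrame({
    "city": [" Paris", "paris", "London", "London "],
    "sales": ["100", "100", "250", "n/a"],
})

cleaned = raw.copy()
# Fix formatting: trim whitespace and use consistent capitalization
cleaned["city"] = cleaned["city"].str.strip().str.title()
# Convert to numbers; values that cannot be parsed become NaN (missing)
cleaned["sales"] = pd.to_numeric(cleaned["sales"], errors="coerce")
# Remove rows that are exact duplicates after normalization
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
```

Normalizing before deduplicating matters here: the two Paris rows only become identical once whitespace and casing are fixed.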

There are several ways to handle missing data, including imputation, deletion, and interpolation.

# Given a sample dataset
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from scipy import interpolate

# Sample DataFrame
data = {
    'Feature1': [1, 2, np.nan, 4, 5],
    'Feature2': [6, np.nan, 8, 9, 10],
    'Categorical': ['A', 'B', np.nan, 'B', 'A']
}
df = pd.DataFrame(data)
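Before choosing between imputation, deletion, and interpolation, it usually helps to see how much is actually missing. The sketch below rebuilds the same sample values under the assumed name `df_demo` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the sample DataFrame, rebuilt so this snippet is self-contained
df_demo = pd.DataFrame({
    "Feature1": [1, 2, np.nan, 4, 5],
    "Feature2": [6, np.nan, 8, 9, 10],
    "Categorical": ["A", "B", np.nan, "B", "A"],
})

# Number of missing values per column
missing_counts = df_demo.isna().sum()
# Fraction of missing values per column
missing_share = df_demo.isna().mean()
# Number of rows that contain at least one missing value
rows_with_any_missing = int(df_demo.isna().any(axis=1).sum())
```

A column that is mostly empty is often a candidate for deletion, while a few scattered gaps are usually better imputed.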

Deletion: drop every row that contains a missing value, or drop the affected columns entirely

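# Row Deletion: keep only the rows that have no missing values
df_row_deleted = df.dropna()
# Column Deletion: keep only the columns that have no missing values
# (every column in the sample has one, so no columns survive here)
df_column_deleted = df.dropna(axis=1)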
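Dropping every incomplete row or column can throw away a lot of data. A gentler variant, sketched here on a rebuilt copy of the sample (`df_demo` is an assumed name), uses the `subset` and `thresh` parameters of `dropna` to delete more selectively.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Feature1": [1, 2, np.nan, 4, 5],
    "Feature2": [6, np.nan, 8, 9, 10],
    "Categorical": ["A", "B", np.nan, "B", "A"],
})

# Drop a row only when Feature2 is missing; gaps in other columns are tolerated
subset_deleted = df_demo.dropna(subset=["Feature2"])

# Keep rows that have at least 2 non-missing values
thresh_deleted = df_demo.dropna(thresh=2)

# For comparison: plain column deletion leaves nothing, since every column has a gap
no_columns_left = df_demo.dropna(axis=1)
```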

Imputation: fill the missing values of a feature with its mean, median, or mode

# Mean Imputation
# fit_transform returns a 2D array of shape (n_samples, 1); ravel() flattens it into a column
mean_imputer = SimpleImputer(strategy='mean')
df['Feature1'] = mean_imputer.fit_transform(df[['Feature1']]).ravel()

# Median Imputation
median_imputer = SimpleImputer(strategy='median')
df['Feature2'] = median_imputer.fit_transform(df[['Feature2']]).ravel()

# Mode Imputation (the only one of the three that works for non-numeric data)
mode_imputer = SimpleImputer(strategy='most_frequent')
df['Categorical'] = mode_imputer.fit_transform(df[['Categorical']]).ravel()
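When the data is later split for modeling, a common precaution is to learn the fill value from the training portion only and reuse it on new data. This is a sketch under that assumption; `X_train` and `X_test` are made-up arrays.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split: the imputer should only ever "see" the training values
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
# fit_transform learns the mean of the observed training values (1, 2, 4, 5)
X_train_filled = imputer.fit_transform(X_train)
# transform reuses that stored mean instead of recomputing it on the test data
X_test_filled = imputer.transform(X_test)

learned_mean = imputer.statistics_[0]
```

Calling `fit_transform` on the test data instead would let test values influence the fill value, which leaks information into the evaluation.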

You can also assign a special placeholder. This is less common but remains an option.

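# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df
df['Categorical'] = df['Categorical'].fillna('Unknown')
# For numeric columns, a sentinel value such as -999 is more typical
df = df.fillna(-999)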

For time series data, you can use interpolation to fill in the missing values.

ts_data = pd.Series([1, np.nan, np.nan, 4, 5])
# Linear interpolation fills the two gaps with 2.0 and 3.0
ts_data = ts_data.interpolate(method='linear')
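When observations are unevenly spaced in time, `method='linear'` can mislead because it treats the values as equally spaced. A small sketch of the difference, using hypothetical dates:

```python
import numpy as np
import pandas as pd

# Hypothetical series with uneven gaps between timestamps
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-08"])
ts = pd.Series([1.0, np.nan, np.nan, 8.0], index=idx)

# 'linear' ignores the index and spaces the filled values evenly
by_position = ts.interpolate(method="linear")
# 'time' weights each filled value by the elapsed time between observations
by_time = ts.interpolate(method="time")
```

Here the series rises by 7 over 7 days, so the time-aware version fills in 2.0 and 4.0, while the positional version fills in roughly 3.33 and 5.67.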

Hot-deck imputation fills each missing value with a randomly chosen observed value from the same dataset

for column in df.columns:
    non_nulls = df[column].dropna()
    df[column] = df[column].apply(lambda x: np.random.choice(non_nulls) if pd.isnull(x) else x)
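Because hot-deck imputation is random, results change from run to run unless the random generator is seeded. The following is one possible reproducible variant, not the only way to do it; `df_demo`, `filled`, and the seed value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the fill is repeatable

df_demo = pd.DataFrame({
    "Feature1": [1, 2, np.nan, 4, 5],
    "Feature2": [6, np.nan, 8, 9, 10],
    "Categorical": ["A", "B", np.nan, "B", "A"],
})

filled = df_demo.copy()
for column in filled.columns:
    missing = filled[column].isna()
    # Donors are the observed (non-missing) values of the same column
    donors = filled.loc[~missing, column].to_numpy()
    # Skip columns with nothing to fill or nothing to draw from
    if missing.any() and len(donors) > 0:
        filled.loc[missing, column] = rng.choice(donors, size=int(missing.sum()))
```

Drawing all replacements for a column in one call also avoids invoking the random generator once per row.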

Do you know what cold-deck imputation is?

Feature Engineering

Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work more effectively. Proper feature engineering increases the predictive power of machine learning algorithms by creating features from raw data that facilitate the learning process.

Feature Scaling

Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
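To make the transformation concrete, min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The sketch below recomputes it by hand on the same values and compares the result with `MinMaxScaler`; the names `X`, `manual`, and `from_sklearn` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-column minimum and range
col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min

# Manual min-max scaling: every column ends up between 0 and 1
manual = (X - col_min) / col_range

from_sklearn = MinMaxScaler().fit_transform(X)
```

Both columns come out as 0, 0.25, 0.5, 1, which shows that the scaling removes the difference in units while preserving relative spacing.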

Standardization (Z-Score Normalization)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
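Standardization subtracts each column's mean and divides by its standard deviation, giving columns with mean 0 and standard deviation 1. A short check by hand, using illustrative names; note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Manual z-score: (x - mean) / std, computed per column
manual = (X - X.mean(axis=0)) / X.std(axis=0)

from_sklearn = StandardScaler().fit_transform(X)
```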

Feature Encoding

One-Hot Encoding

import pandas as pd
df = pd.DataFrame({'color': ['red', 'green', 'blue']})
one_hot_encoded_data = pd.get_dummies(df, columns=['color'])
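`pd.get_dummies` builds columns from whatever categories are present, so new data with unseen categories can yield a different set of columns. One way to keep the columns stable, sketched here with hypothetical `train` and `new` frames, is scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "blue"]})
new = pd.DataFrame({"color": ["green", "purple"]})  # 'purple' was never seen

# Categories are learned once from the training data (sorted alphabetically)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The output is sparse by default; toarray() converts it to a dense array
encoded_new = encoder.transform(new).toarray()
learned_categories = list(encoder.categories_[0])
```

The unseen value is encoded as a row of zeros rather than raising an error or adding a column.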

Label Encoding

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
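`LabelEncoder` assigns integers in sorted order of the labels and is intended for target values. For an input feature whose categories have a natural order, `OrdinalEncoder` with an explicit category list is a possible alternative. The `size` column below is a hypothetical example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder sorts labels alphabetically: blue=0, green=1, red=2
label_codes = LabelEncoder().fit_transform(["red", "green", "blue"])

# Hypothetical ordered feature: the order is stated explicitly
sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes["size_encoded"] = encoder.fit_transform(sizes[["size"]]).ravel()
```

With the explicit list, small < medium < large is preserved in the codes; alphabetical order would have put large first.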

Discretization (Binning)

data = [0.2, 0.4, 1.5, 1.7, 2.0]
bins = [0, 1, 2, 3]
discretized_data = pd.cut(data, bins)
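By default `pd.cut` produces right-closed intervals such as (1, 2], so a value sitting exactly on a bin edge falls into the lower bin. The sketch below adds readable labels and shows how `right=False` changes the edge behavior; the label names are arbitrary.

```python
import pandas as pd

values = [0.2, 0.4, 1.5, 1.7, 2.0]
bins = [0, 1, 2, 3]
names = ["low", "medium", "high"]

# Default: intervals are (0, 1], (1, 2], (2, 3], so 2.0 lands in 'medium'
labeled = pd.cut(values, bins=bins, labels=names)

# right=False: intervals are [0, 1), [1, 2), [2, 3), so 2.0 lands in 'high'
left_closed = pd.cut(values, bins=bins, labels=names, right=False)
```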

Polynomial Features

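from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
# PolynomialFeatures expects a 2D array of shape (n_samples, n_features),
# so the 1D list is reshaped into a single-feature column first
polynomial_data = poly.fit_transform(np.array(data).reshape(-1, 1))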
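To see which columns a degree-2 expansion actually produces, it helps to run it on a single row with two features. This sketch uses made-up values a = 2 and b = 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3

# Degree 2 yields the columns: 1, a, b, a^2, a*b, b^2
expanded = PolynomialFeatures(degree=2).fit_transform(X)

# include_bias=False drops the leading constant column
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
```

The number of columns grows quickly with the degree and the number of input features, so higher degrees are usually combined with feature selection or regularization.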

Root Transformation

sqrt_transformed_data = np.sqrt(data)

cube_root_transformed_data = np.cbrt(data)
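Root transformations compress large values more than small ones, which can reduce right skew. The two functions differ on negative inputs, as this small sketch with illustrative arrays shows.

```python
import numpy as np

# Right-skewed values: the square root pulls the large ones in
skewed = np.array([1.0, 4.0, 9.0, 100.0, 10000.0])
sqrt_vals = np.sqrt(skewed)

# The cube root is defined for negative numbers and keeps their sign
signed = np.array([-8.0, 0.0, 27.0])
cbrt_vals = np.cbrt(signed)

# The square root of a negative number is NaN (the warning is silenced here)
with np.errstate(invalid="ignore"):
    sqrt_of_negative = np.sqrt(np.array([-8.0]))
```

The original range of 1 to 10000 shrinks to 1 to 100 after the square root.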

More advanced techniques include splines, which model nonlinear relationships.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

# Sample data (could be any real-world data)
x = np.linspace(0, 10, 30)
y = np.sin(x) + np.random.normal(size=x.shape) * 0.2  # Adding some noise

# Smoothing factor (s) controls the number of knots by specifying a condition on the total squared error.
# A smaller value of 's' means more knots (i.e., more flexibility); s=0 interpolates every point.
# If s is omitted, SciPy defaults to s = len(x), which suits unit-variance noise and
# can oversmooth data this clean, so it is set here to n_points * noise_std**2.
spline = UnivariateSpline(x, y, s=len(x) * 0.2**2)

# Generate points to evaluate the spline fit
x_smooth = np.linspace(0, 10, 200)
y_smooth = spline(x_smooth)

# Plotting the original data and the spline
plt.scatter(x, y, label='Original Data')
plt.plot(x_smooth, y_smooth, color='red', label='Spline Fit')
plt.legend()
plt.show()
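The effect of the smoothing factor can be checked directly by counting knots. The sketch below fits the same kind of noisy sine data twice, once with `s=0` and once with an arbitrarily chosen `s=5.0`; the seed and both values are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)  # seeded so the comparison is repeatable
x = np.linspace(0, 10, 30)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# s=0 forces the spline through every data point (pure interpolation)
interpolating = UnivariateSpline(x, y, s=0)
# A larger s tolerates more squared error, so fewer knots are needed
smooth = UnivariateSpline(x, y, s=5.0)

n_knots_interpolating = len(interpolating.get_knots())
n_knots_smooth = len(smooth.get_knots())
```

The interpolating spline reproduces the noise exactly, while the smoother one uses fewer knots and follows the underlying trend more loosely.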

Can you put your skills to the test with an actual dataset for interviewing data scientists?