Cleaning Data
Data cleaning is the process of preparing data for analysis by removing or correcting values that are incorrect, incomplete, duplicated, or improperly formatted. Left unhandled, such data can skew results or break downstream analysis.
There are several ways to handle missing data, including imputation, deletion, and interpolation.
# Given a sample dataset
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Sample DataFrame
data = {
    'Feature1': [1, 2, np.nan, 4, 5],
    'Feature2': [6, np.nan, 8, 9, 10],
    'Categorical': ['A', 'B', np.nan, 'B', 'A']
}
df = pd.DataFrame(data)
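Before choosing a strategy, it helps to quantify what is missing; pandas can count the NaN values per column:
# Count missing values in each column
print(df.isnull().sum())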
Drop rows that contain missing values, or delete the affected columns entirely.
# Row Deletion
df_row_deleted = df.dropna()
# Column Deletion
df_column_deleted = df.dropna(axis=1)
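dropna also supports partial deletion; for example, the subset and thresh parameters restrict which rows are dropped:
# Drop rows only when 'Feature1' is missing
df_subset_deleted = df.dropna(subset=['Feature1'])
# Keep rows that have at least 2 non-null values
df_thresh_deleted = df.dropna(thresh=2)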
Impute each feature with its mean, median, or mode.
# Mean Imputation
mean_imputer = SimpleImputer(strategy='mean')
df['Feature1'] = mean_imputer.fit_transform(df[['Feature1']])
# Median Imputation
median_imputer = SimpleImputer(strategy='median')
df['Feature2'] = median_imputer.fit_transform(df[['Feature2']])
# Mode Imputation
mode_imputer = SimpleImputer(strategy='most_frequent')
df['Categorical'] = mode_imputer.fit_transform(df[['Categorical']])
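Beyond these simple strategies, scikit-learn also offers model-based imputation. Here is a minimal sketch with KNNImputer, assuming the numeric columns still contain NaNs; it fills each gap from the nearest rows:
from sklearn.impute import KNNImputer
# Fill each missing value with the mean of the 2 nearest rows (numeric columns only)
knn_imputer = KNNImputer(n_neighbors=2)
df[['Feature1', 'Feature2']] = knn_imputer.fit_transform(df[['Feature1', 'Feature2']])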
You can also assign a special placeholder value. This is less common but remains an option.
df['Categorical'] = df['Categorical'].fillna('Unknown')
# More commonly, a numeric sentinel for the whole DataFrame
df = df.fillna(-999)
For time series data, you can use interpolation to fill in the missing values.
ts_data = pd.Series([1, np.nan, np.nan, 4, 5])
ts_data = ts_data.interpolate(method='linear')
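Linear is just the default; interpolate also accepts other methods, most of which delegate to SciPy. A sketch using a second-order polynomial:
ts_data2 = pd.Series([1, np.nan, np.nan, 4, 5])
# Polynomial interpolation of order 2 (requires SciPy)
ts_data2 = ts_data2.interpolate(method='polynomial', order=2)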
Hot-deck imputation fills each missing value with a randomly chosen observed value from the same dataset.
for column in df.columns:
    # Pool of observed (non-missing) values in this column
    non_nulls = df[column].dropna().values
    # Replace each missing entry with a random draw from the observed pool
    df[column] = df[column].apply(lambda x: np.random.choice(non_nulls) if pd.isnull(x) else x)
Do you know what cold-deck imputation is?
Feature Engineering
Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work more effectively. Proper feature engineering increases the predictive power of machine learning algorithms by creating features from raw data that facilitate the learning process.
Feature Scaling
Min-Max Scaling
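Min-max scaling rescales each feature to a fixed range, usually [0, 1], via x' = (x - min) / (max - min).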
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
Standardization (Z-Score Normalization)
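Standardization rescales each feature to zero mean and unit variance: z = (x - mean) / std.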
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
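One practical caveat for any scaler: fit it on the training split only, then reuse the fitted parameters on the test split to avoid data leakage. A sketch, where X_train and X_test are hypothetical splits not defined above:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuses those parameters on the test split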
Feature Encoding
One-Hot Encoding
import pandas as pd
df = pd.DataFrame({'color': ['red', 'green', 'blue']})
one_hot_encoded_data = pd.get_dummies(df, columns=['color'])
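To avoid perfectly collinear dummy columns, get_dummies can drop the first category:
# Encode with one fewer column per categorical feature
one_hot_dropped = pd.get_dummies(df, columns=['color'], drop_first=True)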
Label Encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
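Note that label encoding imposes an arbitrary numeric order on the categories, so it is best suited to truly ordinal data or to tree-based models that do not treat the codes as magnitudes.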
Discretization (Binning)
data = [0.2, 0.4, 1.5, 1.7, 2.0]
bins = [0, 1, 2, 3]
discretized_data = pd.cut(data, bins)
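pd.cut can also attach readable labels to the bins:
# One label per interval: (0, 1], (1, 2], (2, 3]
labeled_data = pd.cut(data, bins, labels=['low', 'medium', 'high'])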
Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
# PolynomialFeatures expects a 2D array, so reshape the 1D list first
polynomial_data = poly.fit_transform(np.array(data).reshape(-1, 1))
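For a single feature x, a degree-2 expansion produces the columns [1, x, x^2]; pass include_bias=False to PolynomialFeatures to drop the constant column.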
Root Transformation
sqrt_transformed_data = np.sqrt(data)
cube_root_transformed_data = np.cbrt(data)
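Root transforms compress large values; for heavily right-skewed, non-negative data, a log transform is a common alternative (np.log1p handles zeros safely):
# log(1 + x) transform for skewed, non-negative data
log_transformed_data = np.log1p(data)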
More advanced techniques include splines, which model nonlinear relationships.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline
# Sample data (could be any real-world data)
x = np.linspace(0, 10, 30)
y = np.sin(x) + np.random.normal(size=x.shape) * 0.2 # Adding some noise
# Fit a spline (UnivariateSpline in this case)
spline = UnivariateSpline(x, y)
# Smoothing factor (s) controls the number of knots by specifying a condition on the total squared error.
# A smaller value of 's' means more knots (i.e., more flexibility).
spline.set_smoothing_factor(0.5)
# Generate points to evaluate the spline fit
x_smooth = np.linspace(0, 10, 200)
y_smooth = spline(x_smooth)
# Plotting the original data and the spline
plt.scatter(x, y, label='Original Data')
plt.plot(x_smooth, y_smooth, color='red', label='Spline Fit')
plt.legend()
plt.show()
Can you put your skills to the test on a real dataset used for interviewing data scientists?