Tabular Playground Series — August 2022

The Tabular Playground Series is a monthly competition held by Kaggle.

The data represents the results of a large product-testing study. For each product_code you are given a set of product attributes (fixed for the code) as well as several measurement values for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment and absorbs a certain amount of fluid (loading) to see whether or not it fails.

The ultimate goal is to use the data to predict individual product failures for new product codes, based on their lab test results.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("../input/tabular-playground-series-aug-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-aug-2022/test.csv")
Getting familiar with available columns
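The original post shows this inspection as a screenshot; an equivalent check is simply:

print(train.columns.tolist())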

Next, we can inspect the missing values in the data:

def get_missing_value(df_name, df):
    data_shape = df.shape
    # Percentage of missing cells across the whole frame (df.size = rows * columns);
    # the original hardcoded 25 columns, which is only correct for the test set.
    missing_val = 100 * df.isna().sum().sum() / df.size
    print("{df_name} data shape is: {data_shape}".format(
        df_name=df_name,
        data_shape=data_shape
    ))
    print("{df_name} data missing value is: {missing_val}".format(
        df_name=df_name,
        missing_val=missing_val
    ))
    print("-------")

get_missing_value("Train", train)
get_missing_value("Test", test)
Missing values from both Train and Test data.

In the next step, we would like to see the distribution of the data.

Data distribution
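The distribution plots appear as images in the original; a minimal sketch of how the numerical distributions could be reproduced (the grid layout and column selection here are assumptions):

import math
num_cols = [c for c in train.columns if train[c].dtype != "O" and c not in ('id', 'failure')]
n_rows = math.ceil(len(num_cols) / 5)
fig, axes = plt.subplots(n_rows, 5, figsize=(22, 3 * n_rows))
for ax, col in zip(axes.ravel(), num_cols):
    sns.histplot(train[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()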

After understanding the distribution of the data, we take a look at the product codes in both the train and test data.

Product code for Train and Test data

We notice that the train and test data contain entirely different product codes, so the model must generalize to unseen codes.
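A quick check, not shown in the original, makes this explicit:

print(sorted(train['product_code'].unique()))  # expected: ['A', 'B', 'C', 'D', 'E']
print(sorted(test['product_code'].unique()))   # expected: ['F', 'G', 'H', 'I']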

categorical_features = [col for col in train.columns if col not in ['failure', 'id'] and train[col].dtype=="O"]
numerical_features = [col for col in train.columns if col not in ['failure', 'id'] and train[col].dtype!="O"]
def plot_original_and_transformed(df):
    plt.figure(figsize=(22, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(df['loading'], kde=True, color='coral')
    plt.title("Original")
    plt.subplot(1, 2, 2)
    sns.histplot(np.log(df["loading"]), kde=True)
    plt.title("Log transformed")
    sns.despine()
Getting original and transformed plots for Train and Test data.
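The plots referenced in the caption come from calling the helper on each frame:

plot_original_and_transformed(train)
plot_original_and_transformed(test)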

In the following step, we concatenate the train and test data so that feature engineering and imputation can be applied to both at once.

target = train.pop('failure')
target_mean = np.mean(target)
print(f"target mean --> {target_mean}")
# target mean --> 0.21260820474219044
data = pd.concat([train, test])
train.shape, test.shape

Next, we engineer a few features: missing-value indicators for measurement_3 and measurement_5, an area feature from the attributes, a log-transformed loading, and a per-row count of null values:

data['m3_missing'] = data['measurement_3'].isnull().astype(np.int8)
data['m5_missing'] = data['measurement_5'].isnull().astype(np.int8)
data['area'] = data['attribute_2'] * data['attribute_3']
data['loading'] = np.log(data['loading'])
data['count_null'] = data.isnull().sum(axis=1)
features = [f for f in test.columns if f.startswith('measurement') or f=='loading']

Next, for each product code we list the measurement columns most correlated with measurement_17 (used later for imputation), and we rank measurement_3 through measurement_16 by the summed strength of their top three correlations.

filled_in = dict()
filled_in['measurement_17'] = dict(
    A=['measurement_5', 'measurement_6', 'measurement_8', 'measurement_7'],
    B=['measurement_4', 'measurement_5', 'measurement_7', 'measurement_9'],
    C=['measurement_5', 'measurement_7', 'measurement_8', 'measurement_9'],
    D=['measurement_5', 'measurement_6', 'measurement_7', 'measurement_8'],
    E=['measurement_4', 'measurement_5', 'measurement_6', 'measurement_8'],
    F=['measurement_4', 'measurement_5', 'measurement_6', 'measurement_7'],
    G=['measurement_4', 'measurement_6', 'measurement_8', 'measurement_9'],
    H=['measurement_4', 'measurement_5', 'measurement_7', 'measurement_8', 'measurement_9'],
    I=['measurement_3', 'measurement_7', 'measurement_8', 'measurement_9']
)
# Columns to exclude when computing measurement-to-measurement correlations.
col = [col for col in test.columns if 'measurement' not in col] + ['loading', 'm3_missing', 'm5_missing']
a = list()
b = list()
for x in range(3, 17):
    corr = np.absolute(
        data.drop(col, axis=1)
        .corr()[f'measurement_{x}']).sort_values(ascending=False)
    # Sum the three strongest correlations (index 0 is the column's correlation with itself).
    a.append(np.round(np.sum(corr[1:4]), 3))
    b.append(f'measurement_{x}')
c = pd.DataFrame()
c['Selected columns'] = b
c['correlation total'] = a

Then we rank the columns by the sum of their three strongest correlations:

c = c.sort_values(
    by='correlation total',
    ascending=False
).reset_index(drop=True)
c.head()
Selected columns

Next, we fill in the NA values using HuberRegressor and KNNImputer. For each product code and each target measurement column, HuberRegressor is used on the rows where all of that column's correlated predictor columns are non-null: rows with a known target train the regressor, and rows with a missing target are predicted from it. KNNImputer handles the remaining missing values.
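The imputation code itself is not shown in the original; below is a minimal sketch of the scheme for measurement_17, where fill_target_column and fill_remaining are hypothetical helper names:

from sklearn.linear_model import HuberRegressor
from sklearn.impute import KNNImputer

def fill_target_column(sub, target, predictors):
    # Hypothetical helper: fill `target` within one product code from its correlated `predictors`.
    sub = sub.copy()
    # Rows with all predictors present and a known target form the regression training set.
    train_mask = sub[predictors].notna().all(axis=1) & sub[target].notna()
    # Rows with all predictors present but a missing target get predicted.
    fill_mask = sub[predictors].notna().all(axis=1) & sub[target].isna()
    if train_mask.any() and fill_mask.any():
        model = HuberRegressor()  # default settings; the original's hyperparameters are unknown
        model.fit(sub.loc[train_mask, predictors], sub.loc[train_mask, target])
        sub.loc[fill_mask, target] = model.predict(sub.loc[fill_mask, predictors])
    return sub

def fill_remaining(sub, feature_cols):
    # Hypothetical helper: any gaps left (a predictor was itself missing) fall back to KNNImputer.
    sub = sub.copy()
    sub[feature_cols] = KNNImputer(n_neighbors=5).fit_transform(sub[feature_cols])
    return sub

for code in data['product_code'].unique():
    mask = data['product_code'] == code
    sub = data.loc[mask].copy()
    sub = fill_target_column(sub, 'measurement_17', filled_in['measurement_17'][code])
    sub = fill_remaining(sub, features)
    # .values avoids index alignment issues from the earlier concat.
    data.loc[mask, features] = sub[features].values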

Once the data is imputed, we standardize it with StandardScaler and train with StratifiedKFold cross-validation, oversampling the minority class with SMOTE inside each fold.
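The scale helper called inside the loop below is never defined in the original post; a minimal sketch, assuming it fits a StandardScaler on the training fold only and applies it to the validation and test sets:

from sklearn.preprocessing import StandardScaler

def scale(train_data, val_data, test_data, feats):
    # Sketch of the undefined helper: fit on the training fold, transform the rest.
    scaler = StandardScaler().fit(train_data[feats])
    new_train, new_val, new_test = train_data.copy(), val_data.copy(), test_data.copy()
    new_train[feats] = scaler.transform(train_data[feats])
    new_val[feats] = scaler.transform(val_data[feats])
    new_test[feats] = scaler.transform(test_data[feats])
    return new_train, new_val, new_test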

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from colorama import Fore, Style

# df_train and df_test are the imputed train/test frames split back out of
# `data` (the split-back step is omitted in the original post).
N_FOLDS = 5
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=22)
y_oof = np.zeros(df_train[features].shape[0])
y_test = np.zeros(df_test[features].shape[0])
logistic_auc = 0
ix = 0
feature_importance = []
lg_model = []
sm = SMOTE(random_state=42, n_jobs=-1)

for train_ind, val_ind in skf.split(df_train[features], df_train[['failure']]):
print(f"******* Fold {ix} ******* ")
tr_x, val_x = (
df_train[features].iloc[train_ind].reset_index(drop=True),
df_train[features].iloc[val_ind].reset_index(drop=True),
)
tr_y, val_y = (
df_train['failure'].iloc[train_ind].reset_index(drop=True),
df_train['failure'].iloc[val_ind].reset_index(drop=True),
)

tr_x,val_x,test_x = scale(tr_x, val_x, df_test[features], features)

tr_x, tr_y = sm.fit_resample(tr_x, tr_y)
clf = LogisticRegression(max_iter=500, C=0.0001, penalty='l2',solver='newton-cg')

clf.fit(tr_x, tr_y)

feature_importance.append(clf.coef_.ravel())
preds = clf.predict_proba(val_x)[:,1]

roc_score = roc_auc_score(val_y, preds)

logistic_auc += roc_score/N_FOLDS
print('VAL_ROC-AUC:', round(roc_score, 5))

y_oof[val_ind] = y_oof[val_ind] + preds
preds_test = clf.predict_proba(test_x)[:,1]
lg_model.append(preds_test)
y_test = y_test + preds_test / N_FOLDS
ix = ix + 1

print(f"{Fore.GREEN}{Style.BRIGHT}Average auc = {round(logistic_auc, 5)}{Style.RESET_ALL}")
print(f"{Fore.BLUE}{Style.BRIGHT}OOF auc = {round(roc_auc_score(df_train[['failure']], y_oof), 5)}{Style.RESET_ALL}")
feature_importance.append(clf.coef_.ravel())
importance_df = pd.DataFrame(np.array(feature_importance).T, index=df_train[features].columns)
importance_df['mean'] = importance_df.mean(axis=1).abs()
importance_df['feature'] = df_train[features].columns
importance_df = importance_df.sort_values('mean', ascending=True).reset_index()
fig, ax = plt.subplots(figsize=(12, 8), facecolor='#EAECEE')
plt.barh(importance_df.index, importance_df['mean'], color='lightseagreen')
plt.yticks(ticks=importance_df.index, labels=importance_df['feature'])
plt.title('LogisticRegression feature importances', fontsize=20, y= 1.05)
plt.show()
