Alexander Petrov

November 10, 2025

What Changed? Pinpointing a Behavior Shift

Let’s say you know that your metrics changed at a certain point in time. How can you leverage this to learn what exactly changed?

Sometimes it's possible to reduce a practical problem to a well-studied one.
In this case, “What Changed?” can be turned into a classification problem.

  1. Label data  
    • Everything before the suspected change → control = 0  
    • Everything after the suspected change → treatment = 1
  2. Train a classifier to predict 0 vs 1 using only impression-level features (geo, publisher, app, creative, campaign, etc.), but never use time as a feature, since it’s a target leak
  3. Look at feature importances → the top ones are exactly what changed.
That’s it: the classifier is forced to find the patterns that best separate the “old world” from the “new world”.
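
Concretely, the labeling in step 1 is a one-liner in pandas. A minimal sketch, assuming a dataframe df with a timestamp column t and the same change date used in the full example below:

# step 1: rows after the suspected change get label 1, earlier rows get 0
df['y'] = (df['t'] > '2025-09-09').astype(int)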

Why this works so well

  • Gradient-boosted trees (LightGBM, XGBoost, CatBoost) are perfect here: they handle mixed data types, missing values, and give you reliable feature importances.
  • Because there is no time in the dataset, the model can’t cheat with ‘it’s later → label 1’.
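
For illustration, here is a minimal sketch of the gradient-boosted version, assuming the lightgbm package is installed and the same data.pq file, timestamp column t, and change date as in the full example below:

import pandas as pd
import lightgbm as lgb

df = pd.read_parquet('data.pq')
df['y'] = (df['t'] > '2025-09-09').astype(int)   # 0 before the change, 1 after
X = df.drop(columns=['t', 'y'])                  # drop time so the model can't cheat
y = df['y']

# LightGBM consumes pandas 'category' dtypes directly, so no one-hot encoding is needed
cat_cols = X.select_dtypes(include='object').columns
X[cat_cols] = X[cat_cols].astype('category')

clf = lgb.LGBMClassifier(n_estimators=200, random_state=42)
clf.fit(X, y)

# gain-based importances: the top rows are the candidates for 'what changed'
print(pd.Series(clf.booster_.feature_importance(importance_type='gain'),
                index=X.columns).sort_values(ascending=False).head(10))

The full example below does the same thing with a scikit-learn pipeline and a RandomForestClassifier: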

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score


df = pd.read_parquet('data.pq').assign(
    # label: 0 for 'before' the suspected change, 1 for 'after'
    y=lambda x: x.t > '2025-09-09'
)

numerical_features = ['p'] # specify your numeric columns
categorical_features = df.drop(columns=['t','y'] + numerical_features).columns.tolist()
target = 'y' # the before/after label defined above

X = df[categorical_features + numerical_features]
y = df[target]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
        ('num', StandardScaler(), numerical_features)
    ])

# Create pipeline with preprocessor and RandomForestClassifier
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])

# Fit the model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]


# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")

# Top-10 features that best separate 'before' from 'after'
pd.DataFrame({
    'feature': (model.named_steps['preprocessor']
                .named_transformers_['cat']
                .get_feature_names_out(categorical_features)
                .tolist() + numerical_features),
    'importance': model.named_steps['classifier'].feature_importances_
}).sort_values(by='importance', ascending=False).head(10)

[Image: 0.85.jpg]

The classifier could clearly separate ‘before’ and ‘after’, which means the features responsible for the change can be identified:
[Image: model_name_champion_general.jpg (top feature importances)]
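
One practical note on reading that chart: OneHotEncoder splits every categorical column into many dummy columns (e.g. all publisher_* dummies come from the single publisher column), so it can be useful to sum the dummies back into one score per original column. A small hypothetical helper, reusing the fitted model and the feature lists from the pipeline above:

feature_names = (model.named_steps['preprocessor']
                 .named_transformers_['cat']
                 .get_feature_names_out(categorical_features)
                 .tolist() + numerical_features)

def original_column(name, columns=sorted(categorical_features + numerical_features, key=len, reverse=True)):
    # map e.g. 'publisher_abc' back to 'publisher'; numeric features pass through unchanged
    return next((c for c in columns if name == c or name.startswith(c + '_')), name)

(pd.DataFrame({'feature': feature_names,
               'importance': model.named_steps['classifier'].feature_importances_})
 .assign(column=lambda d: d.feature.map(original_column))
 .groupby('column')['importance'].sum()
 .sort_values(ascending=False)
 .head(10))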

About Alexander Petrov


I build products for fun and profit.
web page