A machine learning approach to classify time-clock punches as IN / OUT / ERROR and support shift inference in complex rotating schedules. The public version uses a synthetic dataset to preserve confidentiality.
In rotating-shift environments, raw punch logs can be noisy: duplicates, missing punches, out-of-order events, and role-dependent patterns. Misclassification creates payroll friction and forces manual audits.
Given a sequence of time-clock events per employee, classify each event as: IN (start of work), OUT (end of work), or ERROR (invalid / inconsistent punch).
Secondary objective: from cleaned sequences, derive interpretable indicators about shift type (day/night/rotating) and anomaly flags.
The public dataset is generated to emulate real-world patterns:
Key features operate at event level, using local sequence context:
hour_of_day, day_of_week, is_weekend
delta_prev_minutes, delta_next_minutes
punch_index_in_day, punch_count_day
I used a Random Forest for its ability to capture non-linear interactions and provide practical interpretability through feature importance.
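The event-level features named above could be computed along these lines. This is a minimal standard-library sketch, not the project's actual feature code: the input format (a list of `(employee_id, datetime)` tuples), the use of `None` for a missing neighbour, the zero-based `punch_index_in_day`, and grouping by calendar day are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def build_features(punches):
    """Build one feature dict per punch.

    punches: list of (employee_id, datetime) tuples (assumed input format).
    Rows are returned grouped by employee and sorted by time within each employee.
    """
    by_employee = defaultdict(list)
    for emp, ts in punches:
        by_employee[emp].append(ts)

    rows = []
    for emp, stamps in by_employee.items():
        stamps.sort()  # deltas only make sense on time-ordered events

        # Number of punches per calendar day (assumption: a "day" is a calendar date).
        per_day = defaultdict(int)
        for ts in stamps:
            per_day[ts.date()] += 1

        seen_today = defaultdict(int)
        for i, ts in enumerate(stamps):
            prev_gap = (ts - stamps[i - 1]).total_seconds() / 60 if i > 0 else None
            next_gap = (
                (stamps[i + 1] - ts).total_seconds() / 60
                if i < len(stamps) - 1
                else None
            )
            rows.append({
                "employee_id": emp,
                "timestamp": ts,
                "hour_of_day": ts.hour,
                "day_of_week": ts.weekday(),      # Monday = 0
                "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
                "delta_prev_minutes": prev_gap,
                "delta_next_minutes": next_gap,
                "punch_index_in_day": seen_today[ts.date()],  # zero-based
                "punch_count_day": per_day[ts.date()],
            })
            seen_today[ts.date()] += 1
    return rows
```

Grouping by calendar date splits a night shift that crosses midnight across two days, so a real pipeline might prefer a shifted "work day" boundary. The `None` gaps would also need imputing before being passed to a Random Forest.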
Punch data is highly employee-specific. To avoid “memorizing employees”, evaluation uses GroupKFold by employee (train/test splits never share employees).
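To make the "never share employees" property concrete, here is a simplified, standard-library stand-in for a group-aware split. It is not scikit-learn's `GroupKFold` (which also tries to balance fold sizes); the helper name `group_folds` and the round-robin assignment are assumptions chosen to keep the idea visible.

```python
def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) pairs where each group appears in exactly one test fold.

    groups: one group label (e.g. employee_id) per sample.
    Hypothetical helper: assigns distinct groups to folds round-robin.
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Every sample of a group inherits the group's fold, so a group is never split.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}

    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


employee_id = ["a", "a", "b", "c", "c", "c", "d"]
for train_idx, test_idx in group_folds(employee_id, n_splits=2):
    train_groups = {employee_id[i] for i in train_idx}
    test_groups = {employee_id[i] for i in test_idx}
    # No employee contributes events to both sides of the split.
    assert not (train_groups & test_groups)
```

With a plain shuffled K-fold, events from the same employee would land on both sides of the split, and the score would partly reflect how well the model recognises that employee's habits rather than how it generalises to new staff.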
In production contexts, the priority is to minimize false “valid” classifications for errors, and to surface uncertain cases for review.
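One way to act on that priority is to auto-accept an IN or OUT label only when the model is confident and the ERROR probability is low, and to send everything else to manual review. The sketch below assumes per-class probabilities are available as a dict (with scikit-learn they could be built from `predict_proba` and `classes_`); the function name and both thresholds are hypothetical and would need tuning on validation data.

```python
def route_prediction(proba, accept_threshold=0.8, error_floor=0.2):
    """Decide whether a predicted punch label can be accepted automatically.

    proba: dict mapping "IN" / "OUT" / "ERROR" to a probability.
    Returns (label, needs_review). Threshold values are illustrative assumptions.
    """
    label = max(proba, key=proba.get)
    confident = proba[label] >= accept_threshold
    error_risk = proba.get("ERROR", 0.0) >= error_floor

    # A "valid" label (IN/OUT) is accepted only when ERROR is unlikely,
    # which biases the system against false "valid" classifications.
    if label != "ERROR" and confident and not error_risk:
        return label, False

    # A confident ERROR needs no second opinion on its label.
    if label == "ERROR" and confident:
        return "ERROR", False

    # Everything else is uncertain and goes to a human.
    return label, True
```

For example, `route_prediction({"IN": 0.6, "OUT": 0.1, "ERROR": 0.3})` keeps the tentative label `"IN"` but flags it for review, because the top probability is below the acceptance threshold.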
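# Pseudocode outline (build_features, punch_events, labels and employee_id are placeholders)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_validate

X = build_features(punch_events)
y = labels  # IN / OUT / ERROR

cv = GroupKFold(n_splits=5)  # folds are formed from the groups passed below
model = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,
    class_weight="balanced",  # compensates for the rarity of ERROR punches
    random_state=42,
)
scores = cross_validate(
    model, X, y, cv=cv, groups=employee_id,  # one employee_id per row of X
    scoring=["f1_macro", "precision_macro", "recall_macro"],
)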
Once events are classified and cleaned, shift indicators can be derived:
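As an illustration only, the sketch below pairs cleaned IN/OUT events into shifts, records unmatched punches as anomaly flags, and assigns a coarse day/night/rotating label from the shift start hours. The function names, the anomaly messages, the 18:00 to 06:00 night window, and the extra `"unknown"` label are assumptions, not the project's actual rules.

```python
from datetime import datetime


def pair_shifts(events):
    """Pair IN/OUT events into shifts for one employee.

    events: time-ordered list of (timestamp, label), label in {"IN", "OUT"}
            (ERROR punches are assumed to have been removed already).
    Returns (shifts, anomalies): shifts are (start, end) tuples,
    anomalies are (timestamp, reason) tuples.
    """
    shifts, anomalies = [], []
    open_in = None
    for ts, label in events:
        if label == "IN":
            if open_in is not None:
                anomalies.append((open_in, "IN without OUT"))
            open_in = ts
        elif label == "OUT":
            if open_in is None:
                anomalies.append((ts, "OUT without IN"))
            else:
                shifts.append((open_in, ts))
                open_in = None
    if open_in is not None:
        anomalies.append((open_in, "IN without OUT"))
    return shifts, anomalies


def shift_type(shifts, night_start=18, night_end=6):
    """Label an employee's schedule from shift start hours (cutoffs are hypothetical)."""
    kinds = set()
    for start, _end in shifts:
        is_night = start.hour >= night_start or start.hour < night_end
        kinds.add("night" if is_night else "day")
    if not kinds:
        return "unknown"
    if len(kinds) == 2:
        return "rotating"
    return kinds.pop()


events = [
    (datetime(2024, 1, 8, 8, 0), "IN"),
    (datetime(2024, 1, 8, 16, 0), "OUT"),
    (datetime(2024, 1, 9, 22, 0), "IN"),
    (datetime(2024, 1, 10, 6, 0), "OUT"),
    (datetime(2024, 1, 10, 7, 0), "OUT"),  # stray punch, becomes an anomaly flag
]
shifts, anomalies = pair_shifts(events)
schedule = shift_type(shifts)
```

Both outputs stay interpretable: each shift is a plain start/end pair, and each anomaly carries the timestamp and a short reason that an auditor can check against the raw log.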