Project 03 - Zero-Dose Prediction Using ML

1) What We Did

Built zero-dose risk models using 10 predictors.
Trained Logistic Regression (L1, class-weighted) and Neural Network (focal loss + class weights).
Modeled regions separately: North, South, East, West, Northeast.
Used stratified splits and PR-curve thresholding for the minority class.
Generated feature importance using LR coefficients, LOFO, and permutation analysis.

Operational Definition

Zero-dose means the child has not received a first dose of a DTP-containing vaccine (no-DTP1; IA2030-aligned proxy).

2) Predictor Set

x_i = [Age, FemaleChild, Rural, state, BirthOrder, Deprived, FamilySize, NoAntenatalCare, UnassistedBirth, Maternal_Illiteracy]

Outcome:

y_i = 1 if no-DTP1, else 0

3) Class Prevalence by Region

Region	ZD=1 / Total	Prevalence
North	1015 / 14303	7.10%
South	268 / 5662	4.73%
East	515 / 10069	5.11%
West	463 / 6883	6.73%
Northeast	715 / 6134	11.66%
Overall	2976 / 43051	6.91%
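As a quick arithmetic check, the prevalence column can be recomputed from the counts in the table. The sketch below also prints the number of negatives per positive in each region, which gives a rough sense of how much the minority class needs up-weighting. The helper names are illustrative and not taken from the project code.

```python
# Counts copied from the table above: region -> (zero-dose cases, total children).
counts = {
    "North": (1015, 14303),
    "South": (268, 5662),
    "East": (515, 10069),
    "West": (463, 6883),
    "Northeast": (715, 6134),
}

def prevalence_pct(cases, total):
    """Share of zero-dose children, in percent."""
    return 100.0 * cases / total

def negatives_per_positive(cases, total):
    """Imbalance ratio: how many class-0 rows there are for each class-1 row."""
    return (total - cases) / cases

overall_cases = sum(cases for cases, _ in counts.values())
overall_total = sum(total for _, total in counts.values())

for region, (cases, total) in counts.items():
    print(f"{region:<10} {prevalence_pct(cases, total):6.2f}%  "
          f"{negatives_per_positive(cases, total):5.1f} negatives per positive")
print(f"{'Overall':<10} {prevalence_pct(overall_cases, overall_total):6.2f}%")
```

The regional counts sum to the Overall row (2976 / 43051), so the table is internally consistent.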

4) Logistic Baseline

p_i = sigma(w^T x_i + b)

L_LR = -sum_i alpha_{y_i}[y_i log p_i + (1-y_i) log(1-p_i)] + lambda ||w||_1

where alpha_{y_i} is the class weight, set with class_weight='balanced' so that the minority class is up-weighted.
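A minimal sketch of this baseline, assuming scikit-learn (suggested by the class_weight='balanced' argument but not confirmed by this summary). The data are synthetic stand-ins, and the solver, C value, and split ratio are illustrative choices, not the project's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 2000 rows, 10 numeric predictors, rare positive class.
X = rng.normal(size=(2000, 10))
logits = -3.0 + 1.2 * X[:, 0] + 0.8 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Stratified split keeps class-1 prevalence similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The L1 penalty needs a solver that supports it (liblinear or saga).
# C is the inverse of the regularization strength lambda; 0.1 is illustrative.
model = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.1, class_weight="balanced"
)
model.fit(X_train, y_train)

# In scikit-learn, 'balanced' sets alpha_c = n / (n_classes * n_c).
n = len(y_train)
alpha = n / (2 * np.bincount(y_train))

p_test = model.predict_proba(X_test)[:, 1]
n_zero = int(np.sum(model.coef_ == 0))
print(f"alpha_0={alpha[0]:.2f}, alpha_1={alpha[1]:.2f}, zeroed coefficients={n_zero}")
```

Because the L1 penalty can shrink coefficients exactly to zero, the count of zeroed coefficients shows how many predictors the fitted model effectively drops.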

5) Neural Comparator

h^(1) = phi(W_1 x_i + b_1), ..., p_i = sigma(w_o^T h^(L) + b_o)

Training uses focal-loss emphasis and class weights to prioritize hard minority examples.

L_focal = -sum_i alpha_{y_i} (1-p_{t,i})^gamma log(p_{t,i})

where p_{t,i} = p_i if y_i = 1, else 1 - p_i (the predicted probability of the true class).
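The focal loss above can be sketched in NumPy as follows. Here p_t is the predicted probability of the true class (p for y = 1, 1 - p for y = 0). The gamma and alpha values are placeholders, since this summary does not state the ones used in training.

```python
import numpy as np

def focal_loss(p, y, alpha=(1.0, 1.0), gamma=2.0, eps=1e-12):
    """Summed class-weighted focal loss for binary labels.

    p: predicted P(y=1); y: 0/1 labels; alpha: (weight for class 0, class 1).
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)         # probability of the true class
    a = np.where(y == 1, alpha[1], alpha[0])   # alpha_{y_i}
    return float(np.sum(-a * (1.0 - p_t) ** gamma * np.log(p_t)))

y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.2, 0.1, 0.6])   # the 2nd and 4th examples are poorly predicted

ce = focal_loss(p, y, gamma=0.0)     # gamma = 0 reduces to weighted cross-entropy
fl = focal_loss(p, y, gamma=2.0)     # gamma > 0 down-weights the easy examples
print(f"cross-entropy={ce:.3f}, focal={fl:.3f}")
```

With gamma = 2, a well-classified example (p_t = 0.9) is scaled by 0.01, so the total is dominated by the hard examples.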

6) Threshold Selection and Validation

The threshold is chosen on the training PR curve by maximizing an F-beta objective, then applied unchanged to the held-out test set.

y_hat_i(tau)=1[p_i>=tau]

F_beta(tau)=(1+beta^2) * Precision(tau) * Recall(tau) / (beta^2 * Precision(tau) + Recall(tau))

Implemented betas: LR about 2.0, NN about 1.85. Both exceed 1, so class-1 recall is weighted more heavily than precision.
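The selection rule can be sketched as below, assuming scikit-learn's precision_recall_curve supplies the candidate thresholds. The function name best_fbeta_threshold and the toy scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_fbeta_threshold(y_true, scores, beta=2.0):
    """Return (tau, F_beta) maximizing F_beta over the PR-curve thresholds."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = beta**2 * precision + recall
    fbeta = np.divide(
        (1 + beta**2) * precision * recall, denom,
        out=np.zeros_like(denom), where=denom > 0,
    )
    best = int(np.argmax(fbeta))
    return float(thresholds[best]), float(fbeta[best])

# Toy training labels and scores (illustrative only).
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_train = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90])

tau, score = best_fbeta_threshold(y_train, p_train, beta=2.0)
y_hat = (p_train >= tau).astype(int)   # y_hat_i(tau) = 1[p_i >= tau]
print(f"tau={tau:.2f}, F2={score:.4f}, flagged={int(y_hat.sum())}")
```

On this toy data, beta = 2 selects tau = 0.40, which captures all three positives at the cost of one false positive. A precision-oriented objective would pick a higher threshold.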

7) Metrics Driving Model Choice

Accuracy = (TP+TN)/(TP+TN+FP+FN)

Precision_1 = TP/(TP+FP)

Recall_1 = TP/(TP+FN)

F1_1 = 2*Precision_1*Recall_1/(Precision_1+Recall_1)

Primary emphasis: minority class precision-recall tradeoff, not plain accuracy.
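A short worked example shows why plain accuracy is not the deciding metric. Using the Overall counts from the prevalence table, a model that never flags anyone still reaches about 93% accuracy while finding no zero-dose children. The function name is illustrative.

```python
def class1_metrics(tp, fp, fn, tn):
    """Accuracy plus class-1 precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision_1": precision,
        "recall_1": recall,
        "f1_1": f1,
    }

# All-negative baseline on the overall counts: 2976 zero-dose out of 43051.
baseline = class1_metrics(tp=0, fp=0, fn=2976, tn=43051 - 2976)
print({name: round(value, 4) for name, value in baseline.items()})
```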

8) Explainability

Global explainability:

LR coefficient direction/magnitude.
LOFO importance.
Permutation importance.

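Imp_j = E[M(f,X,y) - M(f,X_perm(j),y)]

where M is the evaluation metric, f is the fitted model, and X_perm(j) is X with column j randomly permuted.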
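The permutation-importance formula can be sketched in NumPy as follows, with the expectation approximated by averaging over repeated shuffles. The toy model, the metric, and the number of repeats are assumptions for illustration and are not the project's configuration.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in metric when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops.append(base - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def predict(X):
    # Toy "model" that uses only the first feature.
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

imp = permutation_importance(predict, X, y, accuracy)
print(np.round(imp, 3))
```

Only the first feature shows a non-zero importance here, because the toy model ignores the other two columns.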

9) Region-Specific Generalization and Practical Insights

Separate regional models prevent a single pooled model from masking heterogeneity. Stable high-signal factors include Rural, Deprived, Maternal_Illiteracy, NoAntenatalCare, and UnassistedBirth.

When class-1 support is low, the main failure modes are threshold instability and loss of precision.

10) Current Gaps to Acknowledge

No formal fairness audit has been completed yet (e.g., equal opportunity gaps across protected subgroups).
Ongoing calibration and drift monitoring are needed for repeated deployment.

Drift Monitoring

Track prevalence drift, feature drift, PR/F1 drift, calibration drift, and subgroup gap drift.
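One way to track several of these quantities is to compute the same snapshot for each scoring batch and compare it against a reference period. The sketch below is a hypothetical helper, not part of the project. It reports prevalence, a calibration-in-the-large gap (mean predicted risk minus observed rate), class-1 recall, and the recall gap across subgroups (an equal-opportunity-style gap). The rural/urban grouping is only an example.

```python
import numpy as np

def recall_1(y_true, y_pred):
    """Class-1 recall; NaN when the slice has no positives."""
    positives = y_true == 1
    return float(np.mean(y_pred[positives] == 1)) if positives.any() else float("nan")

def monitoring_snapshot(y_true, p, group, tau):
    """Summary statistics for one batch, to be compared across batches."""
    y_pred = (p >= tau).astype(int)
    recalls = {
        g: recall_1(y_true[group == g], y_pred[group == g])
        for g in np.unique(group)
    }
    return {
        "prevalence": float(y_true.mean()),
        "calibration_gap": float(p.mean() - y_true.mean()),
        "recall_1": recall_1(y_true, y_pred),
        "subgroup_recall_gap": max(recalls.values()) - min(recalls.values()),
    }

# Toy batch (illustrative values only).
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
p = np.array([0.8, 0.3, 0.2, 0.1, 0.7, 0.6, 0.4, 0.1])
group = np.array(["rural"] * 4 + ["urban"] * 4)

snap = monitoring_snapshot(y_true, p, group, tau=0.5)
print(snap)
```

Drift then shows up as a change in these values between the reference snapshot and later batches. Alert thresholds would need to be set per program.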

11) Intervention Workflow After High-Risk Flag

Generate district/block high-risk list.
ASHA/ANM verification (card + recall).
Household outreach and catch-up scheduling.
Reminder/defaulter follow-up.
Closure logging and feedback loop to model retraining.

12) Why This Is Operationally Useful

The model identifies children who never entered the immunization pathway, allowing programs to prioritize first-contact outreach before schedule-completion interventions.

Extension project: this work is expanded in Insights into NN Across National and Regional Immunisation Outcomes.