Project 03 · Completed · Zero-Dose Risk Modelling

Zero-Dose Prediction Using ML

Region-wise NFHS-5 zero-dose risk modeling for children 12-23 months, combining L1 Logistic Regression and Neural Networks with class-1-focused thresholding and explainability.

NFHS-5 (12-23 months) · L1 Logistic + Neural Network · PR-Based Thresholding · Regional Models

1) What We Did

  • Built zero-dose risk models using 10 predictors.
  • Trained a Logistic Regression (L1, class-weighted) and a Neural Network (focal loss + class weights).
  • Modeled regions separately: North, South, East, West, Northeast.
  • Used stratified splits and PR-curve thresholding for the minority class.
  • Generated feature importance using LR coefficients, LOFO, and permutation analysis.
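
The stratified splitting mentioned in the list above could look roughly like the sketch below, applied once per region. It uses only the standard library. The function name `stratified_split`, the 80/20 split, and the seed are illustrative assumptions, not the project's actual settings.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example per class when the class has >1 member.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy region with 7% prevalence: 7 zero-dose children out of 100 (made-up data).
y = [1] * 7 + [0] * 93
train_idx, test_idx = stratified_split(y)
```

Splitting within each class keeps the rare class-1 cases represented in both partitions, which matters when a region has only a few hundred zero-dose children.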

Operational Definition

A child is classed as zero-dose if they have not received the first dose of a DTP-containing vaccine (no-DTP1; IA2030-aligned proxy).

2) Predictor Set

x_i = [Age, FemaleChild, Rural, state, BirthOrder, Deprived, FamilySize, NoAntenatalCare, UnassistedBirth, Maternal_Illiteracy]

Outcome:

y_i = 1 if no-DTP1, else 0

3) Class Prevalence by Region

Region       ZD=1 / Total     Prevalence
North        1015 / 14303      7.10%
South         268 / 5662       4.73%
East          515 / 10069      5.11%
West          463 / 6883       6.73%
Northeast     715 / 6134      11.66%
Overall      2976 / 43051      6.91%
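
As a quick arithmetic check, the prevalence figures can be recomputed from the counts in the table. The sketch below is only that check; the dictionary and function names are illustrative.

```python
# (zero_dose, total) counts per region, copied from the table above.
counts = {
    "North": (1015, 14303),
    "South": (268, 5662),
    "East": (515, 10069),
    "West": (463, 6883),
    "Northeast": (715, 6134),
}

def prevalence_pct(zero_dose, total):
    """Zero-dose prevalence as a percentage."""
    return 100.0 * zero_dose / total

# The overall row is the sum of the regional rows.
overall_zd = sum(zd for zd, _ in counts.values())
overall_n = sum(n for _, n in counts.values())

for region, (zd, n) in counts.items():
    print(f"{region:<10} {zd:>5} / {n:>6}  {prevalence_pct(zd, n):5.2f}%")
print(f"{'Overall':<10} {overall_zd:>5} / {overall_n:>6}  "
      f"{prevalence_pct(overall_zd, overall_n):5.2f}%")
```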

4) Logistic Baseline

p_i = sigma(w^T x_i + b)
L_LR = -sum_i alpha_{y_i}[y_i log p_i + (1-y_i) log(1-p_i)] + lambda ||w||_1

Here alpha_{y_i} is the class weight, set with class_weight='balanced' (minority up-weighting), and lambda controls the strength of the L1 penalty.
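
To make the objective concrete, the sketch below evaluates L_LR directly in plain Python. It assumes that `class_weight='balanced'` refers to the scikit-learn heuristic n_samples / (n_classes * count_c); the toy data, `lam=0.1`, and all function names are illustrative assumptions rather than the project's code.

```python
import math

def balanced_class_weights(labels):
    """Weights n / (2 * n_c) per class: the 'balanced' heuristic for two classes."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l1_logistic_loss(w, b, X, y, alpha, lam):
    """Class-weighted log loss plus an L1 penalty on w (the bias is not penalised)."""
    eps = 1e-12
    loss = 0.0
    for x_i, y_i in zip(X, y):
        p_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        p_i = min(max(p_i, eps), 1.0 - eps)  # guard against log(0)
        loss -= alpha[y_i] * (y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i))
    return loss + lam * sum(abs(w_j) for w_j in w)

# Toy data: three negatives, one positive (made up for illustration).
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [0, 0, 0, 1]
alpha = balanced_class_weights(y)
loss_at_zero = l1_logistic_loss([0.0, 0.0], 0.0, X, y, alpha, lam=0.1)
```

With these weights the single positive counts as much as the three negatives combined, which is the up-weighting effect the balanced setting is meant to provide.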

5) Neural Comparator

h^(1) = phi(W_1 x_i + b_1), ..., p_i = sigma(w_o^T h^(L) + b_o)

Training uses focal-loss emphasis and class weights to prioritize hard minority examples.

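L_focal = -sum_i alpha_{y_i}(1-p_{t,i})^gamma log(p_{t,i})

where p_{t,i} = p_i if y_i = 1 and 1 - p_i otherwise, and gamma >= 0 controls how strongly easy, well-classified examples are down-weighted.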
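
A minimal sketch of the focal loss follows, with p_t taken as the predicted probability of the true class. The value `gamma=2.0` and the unit class weights are placeholders chosen for illustration; the settings actually used in training are not stated here.

```python
import math

def focal_loss(probs, labels, alpha, gamma=2.0):
    """Sum of -alpha_y * (1 - p_t)^gamma * log(p_t) over all examples."""
    eps = 1e-12
    total = 0.0
    for p_i, y_i in zip(probs, labels):
        # p_t is the probability the model assigns to the true class.
        p_t = p_i if y_i == 1 else 1.0 - p_i
        p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
        total -= alpha[y_i] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total

alpha = {0: 1.0, 1: 1.0}  # placeholder class weights
easy = focal_loss([0.95], [1], alpha)  # confident, correct positive
hard = focal_loss([0.20], [1], alpha)  # missed positive
```

The confidently correct example contributes almost nothing while the missed positive dominates. With `gamma=0.0` the expression reduces to ordinary class-weighted cross-entropy.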

6) Threshold Selection and Validation

The threshold tau is chosen on the training PR curve by maximizing the F-beta objective, then applied unchanged to the held-out test data.

y_hat_i(tau)=1[p_i>=tau]
F_beta(tau)=(1+beta^2) * Precision(tau) * Recall(tau) / (beta^2 * Precision(tau) + Recall(tau))

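Implemented values: beta of about 2.0 for LR and about 1.85 for the NN. Because beta > 1 weights recall above precision, both settings favor catching class-1 cases.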
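
The selection rule can be sketched as a scan over candidate thresholds. The version below tries each distinct predicted probability as tau, which is one simple way to walk the PR curve; the toy scores and the function names are illustrative assumptions.

```python
def f_beta(precision, recall, beta):
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

def best_threshold(probs, labels, beta=2.0):
    """Try each distinct predicted probability as tau; keep the F-beta maximiser."""
    best_tau, best_score = None, -1.0
    for tau in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= tau and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= tau and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < tau and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best_score:  # ties keep the lower threshold
            best_tau, best_score = tau, score
    return best_tau, best_score

# Made-up training scores; positives sit at 0.30, 0.70 and 0.90.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
train_labels = [0, 0, 0, 1, 0, 0, 1, 1]
tau, score = best_threshold(train_probs, train_labels, beta=2.0)
tau_precise, _ = best_threshold(train_probs, train_labels, beta=0.5)
```

On this toy data the recall-weighted setting (beta = 2.0) selects tau = 0.30 and captures all three positives, whereas a precision-weighted beta = 0.5 would select tau = 0.70 and miss one.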

7) Metrics Driving Model Choice

Accuracy = (TP+TN)/(TP+TN+FP+FN)
Precision_1 = TP/(TP+FP)
Recall_1 = TP/(TP+FN)
F1_1 = 2*Precision_1*Recall_1/(Precision_1+Recall_1)

Primary emphasis: minority class precision-recall tradeoff, not plain accuracy.
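
A short worked example shows why plain accuracy is a poor guide at this prevalence. The sketch applies the formulas above to a hypothetical model that never flags any child, using the overall counts from Class Prevalence by Region (2976 zero-dose children out of 43051) purely for illustration.

```python
def class1_metrics(tp, tn, fp, fn):
    """Accuracy plus class-1 precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision_1": precision,
        "recall_1": recall,
        "f1_1": f1,
    }

# Hypothetical model that predicts class 0 for every child.
never_flag = class1_metrics(tp=0, tn=43051 - 2976, fp=0, fn=2976)
print(never_flag)
```

Such a model scores about 93% accuracy while finding no zero-dose children at all, so its class-1 recall and F1 are both zero.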

8) Explainability

Global explainability:

  • LR coefficient direction/magnitude.
  • LOFO importance.
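  • Permutation importance.

Permutation importance for feature j is the expected drop in metric M when column j of X is randomly permuted:

Imp_j = E[M(f,X,y)-M(f,X_perm(j),y)]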
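
The Imp_j estimate can be approximated by averaging over a few random shuffles. The sketch below is a hand-rolled version using accuracy as M and a toy model; `n_repeats`, the seed, and the data are illustrative assumptions, and the project may well have used a library routine instead.

```python
import random

def accuracy(predict, X, y):
    return sum(1 for x_i, y_i in zip(X, y) if predict(x_i) == y_i) / len(y)

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = metric(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild X with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - metric(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model that only reads feature 0; feature 1 has no effect on it.
X = [[i % 2, (i // 2) % 2] for i in range(20)]
y = [row[0] for row in X]
predict = lambda x: x[0]
imp = permutation_importance(predict, X, y, accuracy)
```

Shuffling the feature the model depends on lowers accuracy, while shuffling the ignored feature leaves it unchanged, so its importance is exactly zero.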

9) Region-Specific Generalization and Practical Insights

Fitting separate regional models avoids the masking of regional heterogeneity that a single pooled model would introduce. Factors that carry stable, high signal include Rural, Deprived, Maternal_Illiteracy, NoAntenatalCare, and UnassistedBirth.

In regions with few class-1 cases (South, for example, has only 268 zero-dose children), the main failure modes are threshold instability and loss of precision.

10) Current Gaps to Acknowledge

  • No full formal fairness audit has been run yet (e.g., equal opportunity gaps across protected subgroups).
  • Ongoing calibration checks and drift monitoring are still needed for repeated deployment.
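
As an indication of what the missing audit might compute, the sketch below measures an equal opportunity gap as the spread in class-1 recall across subgroups. The grouping by child sex, the toy labels, and the function name are hypothetical; no such audit results exist for this project yet.

```python
def equal_opportunity_gap(y_true, y_pred, groups):
    """Per-group class-1 recall and the max-min gap between groups."""
    hits, positives = {}, {}
    for y, y_hat, g in zip(y_true, y_pred, groups):
        if y == 1:  # only true zero-dose children enter the recall calculation
            positives[g] = positives.get(g, 0) + 1
            hits[g] = hits.get(g, 0) + (1 if y_hat == 1 else 0)
    recalls = {g: hits[g] / positives[g] for g in positives}
    return recalls, max(recalls.values()) - min(recalls.values())

# Made-up labels and predictions for two hypothetical subgroups.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
groups = ["girl"] * 6 + ["boy"] * 4
recalls, gap = equal_opportunity_gap(y_true, y_pred, groups)
```

A gap near zero means the model finds zero-dose children at similar rates in each subgroup; a large gap means one subgroup's missed cases are concentrated.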

Drift Monitoring

Track prevalence drift, feature drift, PR/F1 drift, calibration drift, and subgroup gap drift.
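
For the feature-drift item, one common choice is the Population Stability Index (PSI) over binned feature shares. The document does not say which statistic is intended, so PSI is an assumption here, and the shares below are invented for illustration.

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical shares of Rural = 0 / Rural = 1 at training time vs. a later round.
reference = [0.30, 0.70]
current = [0.40, 0.60]
drift = psi(reference, current)
```

PSI is zero when the two distributions match and grows as they diverge. Any alert threshold would need to be set by the programme rather than taken from this sketch.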

11) Intervention Workflow After High-Risk Flag

  1. Generate district/block high-risk list.
  2. ASHA/ANM verification (card + recall).
  3. Household outreach and catch-up scheduling.
  4. Reminder/defaulter follow-up.
  5. Closure logging and feedback loop to model retraining.

12) Why This Is Operationally Useful

The model identifies children who never entered the immunization pathway, allowing programs to prioritize first-contact outreach before schedule-completion interventions.

Extension project: this work is expanded in Insights into NN Across National and Regional Immunisation Outcomes.