
Project 05 · Extension of Project 03

Insights into Neural Networks Across National and Regional Immunisation Outcomes

A fresh retraining framework comparing Neural Networks and L1 Logistic Regression for ZD, MCV dropout, and DPT dropout, with explicit national-versus-regional interpretation and mathematical traceability.

Tags: Fresh Retrain · NN vs LR · Class-1 Optimization · National vs Regional Analysis

1) Scope and Relation to Previous Work

This project extends Zero-Dose Prediction Using ML from a single-outcome setup to a unified three-outcome framework:

  • ZD (zero dose)
  • MCV dropout (BCG to MCV)
  • DPT dropout (DPT1 to DPT3)

Only fresh outputs from 202_final/retrain_outputs/*_fresh are used.

2) Data Selection and Sample Definition

The analytical framing follows the NFHS-style 12-23 month cohort logic established in Project 03.

Sample Filtering (Notation)

S0 = all surveyed children
S1 = {i in S0 : b5_i = 1 (alive)}
S2 = {i in S1 : 12 <= b19_i <= 23}

Working sample is S2 after complete-case filtering on model variables.
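The S0 → S1 → S2 filtering above can be sketched in pandas; the column names b5 and b19 mirror the notation used here, and the toy rows are purely illustrative:

```python
import pandas as pd

# Minimal sketch of the S0 -> S1 -> S2 filtering; rows are illustrative.
df = pd.DataFrame({
    "b5":  [1, 1, 0, 1, 1],       # 1 = child alive
    "b19": [14, 30, 15, 23, 11],  # age in months
})

s1 = df[df["b5"] == 1]              # S1: alive children
s2 = s1[s1["b19"].between(12, 23)]  # S2: 12-23 month cohort
working = s2.dropna()               # complete-case filtering on model variables
print(len(working))
```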

Data Sources

202_final/data/merged_data.csv (ZD)
202_final/data/biswasNew_BCG_MCV.xls (MCV dropout)
202_final/data/biswasNew_BDPT.xls (DPT dropout)

3) Predictor Definitions

Shared feature vector for each record:

x_i = [Age_i, FemaleChild_i, Rural_i, state_i, BirthOrder_i, Deprived_i, FamilySize_i, NoAntenatalCare_i, UnassistedBirth_i, MaternalIlliteracy_i]

Binary/categorical encodings follow the operational definitions used in retraining scripts.

4) Regional Partition

Let R = {india, north, east, west, south, northeast}.

D_{o,r} = {(x_i, y_i^o): outcome=o, region=r}

Each outcome o is trained and evaluated separately in each region r.

5) Train/Eval Protocol

Per outcome and region, splits are generated using stratified shuffling:

(Train_{o,r}^{(s)}, Test_{o,r}^{(s)}), s = 1,2

Class proportions are preserved in each split to stabilize minority-class evaluation.
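Assuming scikit-learn's StratifiedShuffleSplit (the 25% test fraction here is an assumption, not the project's actual setting), the two stratified splits per (outcome, region) can be sketched as:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Illustrative imbalanced outcome: 16 class-0 and 4 class-1 records.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# Two stratified splits, s = 1, 2; test_size is an assumed value.
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.25, random_state=0)
splits = list(sss.split(X, y))
for s, (tr, te) in enumerate(splits, start=1):
    # class-1 proportion is preserved in both train and test
    print(f"split {s}: train rate {y[tr].mean():.2f}, test rate {y[te].mean():.2f}")
```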

Scaling

Continuous predictors are standardized on train statistics:

x'_{ij} = (x_{ij} - mu_j^{train}) / sigma_j^{train}
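A minimal sketch of the standardization, emphasizing that test data must be scaled with train statistics only:

```python
import numpy as np

# Illustrative values for one continuous predictor.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test  = np.array([[25.0]])

mu = X_train.mean(axis=0)           # mu_j^train
sigma = X_train.std(axis=0)         # sigma_j^train
X_train_std = (X_train - mu) / sigma
X_test_std  = (X_test - mu) / sigma  # test reuses train statistics
print(X_test_std)
```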

6) Logistic Regression Baseline

LR with L1 regularization and balanced class weighting:

p_i = sigma(w^T x_i + b)
L_LR = -sum_i alpha_{y_i}[y_i log p_i + (1-y_i) log(1-p_i)] + lambda ||w||_1

with class-balance weights:

alpha_c proportional to 1 / N_c, where N_c is the number of training samples in class c
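The baseline maps directly onto scikit-learn's LogisticRegression; a minimal sketch on synthetic data, where C = 1/lambda is an assumed value rather than the tuned one from the retraining scripts:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only feature 0 drives the (rare) positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 1.0).astype(int)

lr = LogisticRegression(
    penalty="l1",
    solver="liblinear",       # liblinear supports the L1 penalty
    class_weight="balanced",  # alpha_c proportional to 1 / N_c
    C=1.0,                    # assumed value; C = 1 / lambda
)
lr.fit(X, y)
print(lr.coef_)
```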

7) Neural Network Comparator

Tuned MLP classifier:

h^(1) = phi(W_1 x + b_1), ..., h^(L) = phi(W_L h^(L-1) + b_L)
p_i = sigma(w_o^T h^(L) + b_o)

Hyperparameter search spans depth, width, regularization, learning rate, batch-size, and training iterations.
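A hedged sketch of such a search using scikit-learn's MLPClassifier and RandomizedSearchCV; the grid values below are illustrative assumptions, and the actual search spaces live in the retraining scripts:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space over depth, width, regularization,
# learning rate, batch size, and iterations.
param_dist = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3, 1e-2],
    "batch_size": [32, 64],
    "max_iter": [200, 400],
}

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 0] > 0.8).astype(int)

search = RandomizedSearchCV(
    MLPClassifier(random_state=0), param_dist, n_iter=3, cv=3, random_state=0
)
search.fit(X, y)
print(search.best_params_)
```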

8) Class-1-Focused Threshold Optimization

For each trained model candidate, probabilities are converted to labels using threshold tau:

y_hat_i(tau) = 1[p_i >= tau]

Class-1 objective:

F1_1(tau) = 2 * Prec_1(tau) * Rec_1(tau) / (Prec_1(tau) + Rec_1(tau))
tau^* = argmax_tau F1_1(tau), with guardrails tau_min <= tau <= tau_max

This ensures optimization is aligned to missed-child detection rather than majority-class accuracy.
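The threshold sweep can be sketched as a grid search over tau; the guardrail values and step size below are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(p, y, tau_min=0.1, tau_max=0.9, step=0.01):
    """Return (tau*, F1_1(tau*)) from a grid sweep with guardrails."""
    taus = np.arange(tau_min, tau_max + step, step)
    f1s = [f1_score(y, (p >= t).astype(int), pos_label=1, zero_division=0)
           for t in taus]
    k = int(np.argmax(f1s))
    return taus[k], f1s[k]

# Illustrative probabilities and labels.
y = np.array([0, 0, 0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.3, 0.6, 0.55, 0.8])
tau_star, f1_star = best_threshold(p, y)
print(round(tau_star, 2), round(f1_star, 3))
```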

9) NN Candidate Selection Score

NN candidates are ranked with minority-class emphasis and LR comparison:

Score_NN = F1_1_NN + eta * max(0, F1_1_NN - F1_1_LR) - rho * DegeneracyPenalty

The LR-margin bonus prioritizes NN candidates that genuinely exceed LR on class-1 performance.
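A minimal sketch of the candidate score; the values of eta and rho and the degeneracy-penalty definition are illustrative assumptions, not the tuned settings:

```python
def nn_score(f1_nn, f1_lr, degeneracy_penalty, eta=0.5, rho=1.0):
    """Score_NN = F1_1_NN + eta * max(0, F1_1_NN - F1_1_LR) - rho * penalty."""
    lr_margin_bonus = max(0.0, f1_nn - f1_lr)
    return f1_nn + eta * lr_margin_bonus - rho * degeneracy_penalty

# An NN that beats LR on class-1 F1 earns a margin bonus;
# one that collapses to a single predicted class would be penalized.
print(nn_score(f1_nn=0.62, f1_lr=0.55, degeneracy_penalty=0.0))
```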

10) Evaluation Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision_1 = TP / (TP + FP)
Recall_1 = TP / (TP + FN)
F1_1 = 2 * Precision_1 * Recall_1 / (Precision_1 + Recall_1)
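These four formulas translate directly from confusion-matrix counts; a small sketch with illustrative counts:

```python
def class1_metrics(tp, tn, fp, fn):
    """Compute accuracy and class-1 precision, recall, F1 from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(class1_metrics(tp=30, tn=50, fp=10, fn=10))
```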

11) National vs Regional Aggregation

Split-level metric:

M_{o,r,m}^{(s)} for outcome o, region r, model m, split s

Region summary:

M_bar_{o,r,m} = (1/S) * sum_s M_{o,r,m}^{(s)}

National summary over regions (macro):

M_nat_{o,m} = (1/|R*|) * sum_{r in R*} M_bar_{o,r,m}, where R* denotes the subset of regions included in the macro average
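The two-stage aggregation (mean over splits, then macro mean over regions) can be sketched with pandas; the metric values below are illustrative:

```python
import pandas as pd

# Split-level metrics M_{o,r,m}^{(s)} as tidy rows (illustrative numbers).
rows = [
    ("zd", "north", "nn", 1, 0.60), ("zd", "north", "nn", 2, 0.64),
    ("zd", "south", "nn", 1, 0.70), ("zd", "south", "nn", 2, 0.74),
]
df = pd.DataFrame(rows, columns=["outcome", "region", "model", "split", "M"])

# Region summary: mean over splits s.
region_means = df.groupby(["outcome", "region", "model"])["M"].mean()

# National summary: unweighted (macro) mean over regions.
national = region_means.groupby(level=["outcome", "model"]).mean()
print(national)
```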

12) NN Feature Importance Mathematics

Permutation AP-drop for feature j:

Imp_j = AP(f, X, y) - AP(f, X_perm(j), y)

where AP is average precision on held-out data and X_perm(j) shuffles feature j to destroy its signal.
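A self-contained sketch of the permutation AP-drop on synthetic data, where only one feature carries signal so its drop should dominate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

# Synthetic data: only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0.5).astype(int)

model = LogisticRegression().fit(X, y)
ap_base = average_precision_score(y, model.predict_proba(X)[:, 1])

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's signal
    ap_perm = average_precision_score(y, model.predict_proba(Xp)[:, 1])
    importances.append(ap_base - ap_perm)  # Imp_j = AP - AP_perm(j)

print(int(np.argmax(importances)))
```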

Cross-Outcome Alignment

V_{o,r} = normalized importance vector for outcome o and region r
Similarity_{r}(o1, o2) = cos(V_{o1,r}, V_{o2,r})
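The cosine similarity between two importance vectors is a one-liner in NumPy; the importance profiles below are illustrative, not estimated values:

```python
import numpy as np

def cosine(v1, v2):
    """cos(v1, v2) = <v1, v2> / (||v1|| * ||v2||)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Illustrative normalized importance vectors for two outcomes in one region.
v_zd  = np.array([0.5, 0.3, 0.2])
v_mcv = np.array([0.4, 0.4, 0.2])
print(round(cosine(v_zd, v_mcv), 3))
```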

13) Required Comparison Views and Generated Charts

  1. National LR vs NN comparison for all outcomes.
  2. Inter-regional NN feature-importance heatmaps.
  3. Region-wise cross-outcome NN importance comparison.
  4. National cross-outcome NN importance comparison.
  5. Outcome-similarity by region.

Generated charts:

retrain_outputs/cross_outcome_analysis/chart1_national_lr_vs_nn.png
retrain_outputs/cross_outcome_analysis/chart2_interregional_nn_importance_heatmaps.png
retrain_outputs/cross_outcome_analysis/chart3_regionwise_outcome_nn_compare.png
retrain_outputs/cross_outcome_analysis/chart4_national_nn_feature_importance_across_outcomes.png
retrain_outputs/cross_outcome_analysis/chart5_outcome_similarity_by_region.png

14) Feature-Importance Story Pack Outputs

Generated under retrain_outputs/feature_importance_story/:

chartA_national_importance_composition.png
chartB_national_top_drivers_lollipop.png
chartC_regional_deviation_heatmaps.png
chartD_region_top_driver_matrix.png
chartE_strength_vs_consistency.png
noteworthy_findings.md

15) Reproducibility and Change Log

New/Updated Scripts (2026-02-25)

  • 202_final/retrain_zd.py
  • 202_final/retrain_mcv.py
  • 202_final/retrain_dpt.py
  • 202_final/retrain_core.py
  • 202_final/build_cross_outcome_visuals.py
  • 202_final/build_feature_importance_story.py

Run Commands

python 202_final/retrain_zd.py
python 202_final/retrain_mcv.py
python 202_final/retrain_dpt.py
python 202_final/build_cross_outcome_visuals.py
python 202_final/build_feature_importance_story.py

This page is intentionally math-explicit so the project logic is transparent from data definition through modeling, metric design, and interpretation.