1) Project Overview
- Measure early childhood development (ECD) levels among children aged 24-59 months using the ECDI2030.
- Identify socioeconomic, demographic, and geographic determinants of being developmentally on-track.
- Produce evidence for targeted policy interventions.
Sample Snapshot
- Raw sample: about 4,999 children.
- Complete-case ML sample: about 4,869 children.
- On-track: about 69.3%.
- Not on-track: about 30.7%.
2) End-to-End Workflow
- Clean and recode ECDI2030 items and key covariates (maternal education, caste, religion, antenatal care (ANC), place of delivery, wealth).
- Construct age-specific on-track outcome.
- Build MCA wealth index and stable grouped social variables.
- Run descriptive and stratified inequality analysis.
- Run first-learner item-level logistic models.
- Train logistic regression, random forest (RF), XGBoost, and neural network (NN) models with repeated cross-validation (CV) and SMOTE.
- Compare performance and feature importance.
- Prepare outputs for reporting and stakeholder discussion.
3) ECDI2030 Scoring Mathematics
Age-specific threshold T(a_i):
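The scoring rule can be written out as follows. This is a sketch under stated assumptions: m_{iq} is 1 if child i has achieved milestone q of the 20 ECDI2030 items and 0 otherwise, a_i is age in completed months, and the cut-offs are the standard UNICEF values as commonly reported. None of these are stated in this document, so they should be checked against the ECDI2030 technical manual.

```latex
\[
S_i = \sum_{q=1}^{20} m_{iq},
\qquad
Y_i = \mathbb{1}\left[\, S_i \ge T(a_i) \,\right]
\]
\[
T(a_i) =
\begin{cases}
7  & 24 \le a_i \le 29 \\
9  & 30 \le a_i \le 35 \\
11 & 36 \le a_i \le 41 \\
13 & 42 \le a_i \le 47 \\
15 & 48 \le a_i \le 59
\end{cases}
\]
```

Here Y_i = 1 marks a child as developmentally on-track.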
4) MCA Wealth Construction
With disjunctive asset vector z_i and first MCA axis loading v_1:
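One way to write the wealth score, assuming z_i is the 0/1 indicator coding of household i's asset categories (J categories in total) and v_1 holds the first-axis category loadings. The symbol W_i is introduced here for illustration, and any rescaling constant used in the project is not stated in the source.

```latex
\[
W_i = z_i^{\top} v_1 = \sum_{j=1}^{J} z_{ij}\, v_{1j}
\]
```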
Quintile assignment:
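A sketch of the assignment rule, assuming W_i denotes the MCA wealth score and c_k the k/5 sample quantile of W. Whether the cut points use survey weights is not stated in the source.

```latex
\[
Q_i = k \quad \text{if } c_{k-1} < W_i \le c_k,
\qquad k = 1, \dots, 5,
\qquad c_0 = -\infty,\; c_5 = +\infty
\]
```

Under this convention Q_i = 1 is the poorest quintile and Q_i = 5 the richest.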
5) Determinant Logistic Model
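The determinant model is assumed here to be a standard binary logistic regression, with Y_i = 1 for an on-track child and x_{ik} the value of predictor k (wealth quintile, maternal education, and so on, dummy-coded where categorical). The exact covariate set is not given in this section.

```latex
\[
\operatorname{logit} P(Y_i = 1 \mid x_i)
= \log \frac{P(Y_i = 1 \mid x_i)}{1 - P(Y_i = 1 \mid x_i)}
= \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik}
\]
```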
Odds ratio for predictor k:
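For a logistic coefficient beta_k, the odds ratio is OR_k = exp(beta_k), and a Wald 95% interval is exp(beta_k ± 1.96 · SE_k). The sketch below shows the arithmetic; the function name and the coefficient values are hypothetical and are not results from this project.

```python
import math


def odds_ratio(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic coefficient and its standard error."""
    point = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return point, lower, upper


# Hypothetical coefficient and standard error, for illustration only.
or_k, ci_low, ci_high = odds_ratio(beta=-0.40, se=0.10)
print(f"OR = {or_k:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An odds ratio below 1 means the predictor is associated with lower odds of being on-track, holding the other covariates fixed.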
6) Item-Level First-Learner Models
For ECD item q:
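A plausible form of the item-level model, assuming m_{iq} is the 0/1 response of child i on item q, FL_i is a first-learner indicator, and x_{ik} are adjustment covariates. The symbols and the adjustment set are assumptions, since the source does not spell them out.

```latex
\[
\operatorname{logit} P(m_{iq} = 1)
= \alpha_q + \gamma_q \,\mathrm{FL}_i + \sum_{k=1}^{K} \delta_{qk}\, x_{ik}
\]
```

One model is fitted per item, so exp(gamma_q) is the item-specific odds ratio for first-learner status.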
This isolates item-wise differences tied to first-learner status.
7) Imbalance and Learners
SMOTE
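SMOTE creates a synthetic minority-class point by interpolating between a minority observation x and one of its k nearest minority neighbours x_nn: x_new = x + lambda · (x_nn - x), with lambda drawn uniformly from [0, 1). The following is a minimal standard-library sketch of that idea, not the implementation used in the project; the function name and toy data are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (excluding x itself), Euclidean distance
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_nn = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, x_nn)])
    return synthetic


# Toy minority class: four points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=3, k=2)
```

The sketch treats every feature as continuous. Categorical covariates such as caste or religion would need prior encoding or a categorical-aware variant, which the source does not specify. To avoid leakage, oversampling is normally applied to the training folds only, never to the held-out fold.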
Random Forest
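A random forest averages B trees, each grown on a bootstrap sample with a random subset of features considered at every split. In the sketch below, h_b(x) is the class-1 probability from tree b and tau is the classification threshold; B and tau are illustrative symbols, not values reported in the source.

```latex
\[
\hat{p}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x),
\qquad
\hat{y}(x) = \mathbb{1}\left[\, \hat{p}(x) \ge \tau \,\right]
\]
```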
8) Boosting Objective
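The regularized objective minimized by XGBoost, in its usual form, is shown below. The prediction is the sum of M trees f_m, l is the logistic loss for a binary outcome, T is the number of leaves in a tree, and w is its vector of leaf weights. The hyperparameters gamma and lambda used in the project are not given in the source.

```latex
\[
\mathcal{L} = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_i\right) + \sum_{m=1}^{M} \Omega(f_m),
\qquad
\hat{y}_i = \sum_{m=1}^{M} f_m(x_i),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
\]
```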
XGBoost can capture nonlinear effects and interactions among predictors, which tends to improve screening sensitivity (recall) relative to logistic regression.
9) Evaluation Metrics
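The source does not list the metrics, so the sketch below assumes the usual confusion-matrix measures: accuracy, precision, recall (sensitivity), specificity, and F1. Which class counts as "positive" for screening is also an assumption, so it is left as a parameter. The labels in the example are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # sensitivity
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

For screening, recall on the class to be flagged matters most, because a missed child is a more costly error than a false alarm.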
Model selection balances interpretability (logistic) and screening sensitivity (tree-based learners).
10) Planned Spatial/Multilevel Extension
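A sketch of the planned model, assuming child i is nested in area j (for example, a block), x_{ij} is the covariate vector from the determinant model, and the logit link is kept. The exact specification is not given in the source.

```latex
\[
\operatorname{logit} P(Y_{ij} = 1) = \beta_0 + x_{ij}^{\top} \beta + u_j + s_j
\]
```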
where u_j ~ N(0, sigma_u^2) is an unstructured random effect for area j and s_j is a spatially structured effect (CAR/BYM-style).
Exploratory spatial autocorrelation metric:
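The source does not name the metric. Assuming it is global Moran's I, the statistic for area-level values x_j with spatial weights w_jk is I = (n / S0) · sum_j sum_k w_jk (x_j - mean)(x_k - mean) / sum_j (x_j - mean)^2, where S0 is the sum of all weights. A minimal sketch with hypothetical values and a simple adjacency matrix:

```python
def morans_i(values, weights):
    """Global Moran's I for area-level values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total weight
    cross = sum(
        weights[j][k] * dev[j] * dev[k] for j in range(n) for k in range(n)
    )
    total_sq = sum(d * d for d in dev)
    return (n / s0) * (cross / total_sq)


# Hypothetical on-track proportions for four areas arranged in a line (1-2-3-4).
values = [0.60, 0.65, 0.75, 0.80]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
moran = morans_i(values, weights)  # positive: neighbouring areas have similar values
```

Values near zero suggest no spatial clustering, while clearly positive values would support adding the spatially structured term.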
11) Key Empirical Findings
- Strongest determinants: wealth, maternal education, ANC checkups, block-level location, and delivery place.
- Lower wealth quintiles and lower maternal education are associated with substantially lower odds of being on-track.
- First-learner analysis shows pronounced deficits in core learning items (letters, numbers, counting).
- Tree models typically provide higher recall, while logistic remains most interpretable for determinants.
This project combines social-epidemiological interpretation with predictive modeling so that the findings are both statistically defensible and actionable for policy.