
Project 02 · Completed · Early Childhood Development

Levels, Determinants, and Distribution of Early Childhood Development in Rural Uttar Pradesh (Hardoi)

ECDI2030-based analytic and ML pipeline to estimate developmental status, identify determinants, and generate policy-ready evidence for targeted interventions.

ECDI2030 · MCA Wealth Index · Logistic + RF + XGBoost + NN · Repeated CV + SMOTE

1) Project Overview

  1. Measure early childhood development (ECD) levels using ECDI2030 for children aged 24-59 months.
  2. Identify socioeconomic, demographic, and geographic determinants of being developmentally on-track.
  3. Produce evidence for targeted policy interventions.

Sample Snapshot

  • Raw sample: about 4,999 children.
  • Complete-case ML sample: about 4,869 children.
  • On-track: about 69.3%.
  • Not on-track: about 30.7%.

2) End-to-End Workflow

  1. Clean/recode ECDI2030 items and key covariates (education, caste, religion, ANC, delivery place, wealth).
  2. Construct age-specific on-track outcome.
  3. Build MCA wealth index and stable grouped social variables.
  4. Run descriptive and stratified inequality analysis.
  5. Run first-learner item-level logistic models.
  6. Train Logistic, RF, XGBoost, and NN with repeated CV + SMOTE.
  7. Compare performance and feature importance.
  8. Prepare outputs for reporting and stakeholder discussion.

3) ECDI2030 Scoring Mathematics

S_i = sum_{j=1}^{20} x_{ij}, x_{ij} in {0,1}

Age-specific threshold T(a_i), where a_i is the child's age in months:

T(a_i) = 7 (24-29 months), 9 (30-35), 11 (36-41), 13 (42-47), 15 (48-59)
y_i = 1[S_i >= T(a_i)]
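The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names `THRESHOLDS`, `threshold`, and `on_track` are hypothetical, and the example child is made up.

```python
# Age bands (months, inclusive) and the minimum item count for "on-track",
# copied from the threshold definition above.
THRESHOLDS = [
    ((24, 29), 7),
    ((30, 35), 9),
    ((36, 41), 11),
    ((42, 47), 13),
    ((48, 59), 15),
]


def threshold(age_months: int) -> int:
    """Return T(a) for a child aged 24-59 months."""
    for (lo, hi), cutoff in THRESHOLDS:
        if lo <= age_months <= hi:
            return cutoff
    raise ValueError(f"age {age_months} is outside the 24-59 month range")


def on_track(items: list[int], age_months: int) -> int:
    """Return y = 1[S >= T(a)] for one child's 20 binary item responses."""
    if len(items) != 20 or any(x not in (0, 1) for x in items):
        raise ValueError("expected exactly 20 items coded 0/1")
    score = sum(items)  # S_i
    return int(score >= threshold(age_months))


# Hypothetical child with 11 of 20 items achieved.
example = [1] * 11 + [0] * 9
print(on_track(example, 40))  # 1: the 36-41 band requires 11
print(on_track(example, 44))  # 0: the 42-47 band requires 13
```

The same total score can be on-track at one age and not at another, which is why the outcome is built per age band and not from a single cutoff.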

4) MCA Wealth Construction

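With z_i the disjunctive (one-hot) asset indicator vector for child i's household and v_1 the loading vector of the first multiple correspondence analysis (MCA) axis: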

w_i = z_i^T v_1

Quintile assignment:

Q_i in {1,...,5} from empirical quantiles of w_i
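As a rough sketch of the two steps above, the snippet below scores households with a given loading vector and then assigns quintiles from empirical cut points. It assumes the loadings have already been obtained from a fitted MCA. The functions `wealth_score` and `assign_quintiles` and all numbers are hypothetical, and the cut points use the default method of `statistics.quantiles`, which may differ from the project's choice.

```python
from statistics import quantiles


def wealth_score(z: list[int], v1: list[float]) -> float:
    """w_i = z_i^T v_1 for one household's indicator (disjunctive) vector."""
    if len(z) != len(v1):
        raise ValueError("indicator and loading vectors must have equal length")
    return sum(zk * vk for zk, vk in zip(z, v1))


def assign_quintiles(scores: list[float]) -> list[int]:
    """Map each score to Q_i in {1,...,5} using empirical quintile cut points."""
    cuts = quantiles(scores, n=5)  # four cut points
    # A score's quintile is 1 plus the number of cut points strictly below it.
    return [1 + sum(w > c for c in cuts) for w in scores]


# Hypothetical loadings for three asset categories (not fitted values).
v1 = [0.5, -0.2, 0.3]
print(wealth_score([1, 0, 1], v1))

# Ten hypothetical scores with no ties split evenly into five groups.
print(assign_quintiles([float(i) for i in range(1, 11)]))
# [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Asset-based scores often contain many ties, so real quintiles are rarely exactly equal in size.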

5) Determinant Logistic Model

Pr(y_i=1 | X_i) = sigma(beta_0 + X_i^T beta), sigma(t)=1/(1+e^{-t})
log(p_i/(1-p_i)) = beta_0 + X_i^T beta

Odds ratio for predictor k:

OR_k = e^{beta_k}
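A small numerical check can make the odds-ratio interpretation concrete: raising predictor k by one unit, with the others held fixed, multiplies the odds by e^{beta_k}. The coefficients below are hypothetical and are not estimates from this study.

```python
import math


def sigmoid(t: float) -> float:
    """sigma(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))


def probability(beta0: float, beta: list[float], x: list[float]) -> float:
    """Pr(y = 1 | x) = sigma(beta_0 + x^T beta)."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))


def odds(p: float) -> float:
    """Convert a probability to odds p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical intercept and coefficients for two predictors.
beta0, beta = -0.4, [0.7, -0.3]

# Raise predictor 0 by one unit while holding predictor 1 fixed.
p_before = probability(beta0, beta, [0.0, 1.0])
p_after = probability(beta0, beta, [1.0, 1.0])

odds_ratio = odds(p_after) / odds(p_before)
print(round(odds_ratio, 4), round(math.exp(beta[0]), 4))  # the two values agree
```

The ratio does not depend on the value at which the other predictor is held, which is what makes OR_k a convenient summary for determinants.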

6) Item-Level First-Learner Models

For ECD item q:

log(Pr(q_i=1)/(1-Pr(q_i=1))) = alpha_0 + alpha_1 * FirstLearner_i
Item-OR = e^{alpha_1}

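Each model contains first-learner status as its only predictor, so the item OR describes the crude (unadjusted) item-wise difference associated with first-learner status.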
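With a single binary predictor, the maximum-likelihood fit of the item model reproduces the observed 2x2 table, so the item OR equals the cross-product ratio of the four cells. The sketch below shows this equivalence. The function names and the counts are hypothetical and are not data from this project.

```python
import math


def item_odds_ratio(fl_pass: int, fl_fail: int,
                    other_pass: int, other_fail: int) -> float:
    """Cross-product odds ratio for one item: (a * d) / (b * c)."""
    if min(fl_pass, fl_fail, other_pass, other_fail) <= 0:
        raise ValueError("all four cells must be positive")
    return (fl_pass * other_fail) / (fl_fail * other_pass)


def item_logit_coefficients(fl_pass: int, fl_fail: int,
                            other_pass: int, other_fail: int) -> tuple[float, float]:
    """Closed-form MLE (alpha_0, alpha_1) for the one-predictor item model."""
    # Log-odds of passing the item when FirstLearner = 0.
    alpha0 = math.log(other_pass / other_fail)
    # Change in log-odds when FirstLearner = 1.
    alpha1 = math.log(fl_pass / fl_fail) - alpha0
    return alpha0, alpha1


# Hypothetical counts for a single item.
counts = dict(fl_pass=30, fl_fail=70, other_pass=60, other_fail=40)
alpha0, alpha1 = item_logit_coefficients(**counts)
print(round(item_odds_ratio(**counts), 4), round(math.exp(alpha1), 4))
```

An item OR below 1 in this setup would indicate lower odds of achieving the item among first learners.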

7) Imbalance and Learners

SMOTE

x_new = x_i + lambda(x_nn - x_i), lambda in (0,1)
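The interpolation step above can be sketched without any external library. This is a simplified illustration of one synthetic draw, not the implementation used in the pipeline: `smote_point` is a hypothetical helper, neighbours are found by brute-force Euclidean distance, and `random()` draws lambda from [0, 1), so the lower endpoint is possible although very unlikely.

```python
import random


def smote_point(minority: list[list[float]], k: int,
                rng: random.Random) -> list[float]:
    """Create one synthetic minority sample by interpolating toward a neighbour."""
    if not 1 <= k < len(minority):
        raise ValueError("k must be between 1 and len(minority) - 1")
    i = rng.randrange(len(minority))
    x_i = minority[i]
    # Rank the other minority points by squared Euclidean distance to x_i.
    others = [p for j, p in enumerate(minority) if j != i]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x_i, p)))
    x_nn = rng.choice(others[:k])  # one of the k nearest neighbours
    lam = rng.random()             # interpolation weight lambda
    # x_new = x_i + lambda * (x_nn - x_i)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]


# Hypothetical minority-class points in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [2.0, 2.0]]
rng = random.Random(0)
print(smote_point(minority, k=2, rng=rng))
```

Each synthetic point lies on the segment between two real minority points. When SMOTE is combined with cross-validation, a common safeguard is to oversample only inside each training fold so that validation folds contain real observations only.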

Random Forest

y_hat = mode{T_b(x)}_{b=1}^B

8) Boosting Objective

y_hat_i^{(t)} = y_hat_i^{(t-1)} + f_t(x_i)
Obj = sum_i l(y_i, y_hat_i) + sum_t Omega(f_t)

XGBoost can capture nonlinear effects and interactions among predictors, which here is used to improve screening sensitivity relative to the logistic baseline.
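To illustrate the additive update and the regularized objective, the sketch below performs one boosting step with the simplest possible f_t: a single leaf, that is, one constant added to every prediction. It assumes logistic loss and the standard XGBoost penalty Omega(f) = gamma * T + (1/2) * lambda * sum_j w_j^2, under which the second-order approximation gives the leaf weight w = -G / (H + lambda), with G and H the summed gradients and Hessians. The function names and the toy data are hypothetical.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def log_loss(y: list[int], raw: list[float]) -> float:
    """Sum of logistic losses l(y_i, y_hat_i), with y_hat on the log-odds scale."""
    total = 0.0
    for yi, s in zip(y, raw):
        p = sigmoid(s)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def leaf_weight(y: list[int], raw: list[float], reg_lambda: float = 1.0) -> float:
    """Leaf weight -G / (H + lambda) from the second-order approximation."""
    g = sum(sigmoid(s) - yi for yi, s in zip(y, raw))      # summed gradients
    h = sum(sigmoid(s) * (1.0 - sigmoid(s)) for s in raw)  # summed Hessians
    return -g / (h + reg_lambda)


# Toy labels and previous-round predictions (all zeros, i.e. p = 0.5).
y = [1, 1, 0]
raw_prev = [0.0, 0.0, 0.0]

w = leaf_weight(y, raw_prev)          # f_t(x_i) is the same constant for all i
raw_next = [s + w for s in raw_prev]  # y_hat^(t) = y_hat^(t-1) + f_t(x_i)

print(log_loss(y, raw_prev) > log_loss(y, raw_next))  # True: the loss decreased
```

A real tree partitions the data into several leaves and applies this formula within each leaf. Larger values of lambda shrink the leaf weights toward zero.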

9) Evaluation Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F1 = 2 * Precision * Recall / (Precision + Recall), with Recall = Sensitivity

Model selection balances interpretability (logistic) and screening sensitivity (tree-based learners).
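The metrics above all follow from the four confusion-matrix counts, as in this minimal sketch. The function `screening_metrics` and the counts are hypothetical. Which class is treated as positive is an assumption to fix before reading the numbers: for screening, it would typically be the group the intervention aims to find.

```python
def screening_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute accuracy, sensitivity, specificity, precision, and F1."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("a denominator is zero; metrics are undefined")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical confusion-matrix counts, not results from this project.
for name, value in screening_metrics(tp=40, tn=30, fp=20, fn=10).items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, accuracy alone can look acceptable while sensitivity is poor, which is why the comparison reports several metrics side by side.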

10) Planned Spatial/Multilevel Extension

logit(p_{ij}) = beta_0 + X_{ij}^T beta + u_j + s_j

where p_{ij} is the on-track probability for child i in area unit j, u_j ~ N(0, sigma_u^2) is an unstructured random effect, and s_j is spatially structured (CAR/BYM-style).

Exploratory spatial autocorrelation metric:

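I = (n/S_0) * [sum_i sum_j w_ij(z_i-z_bar)(z_j-z_bar)] / [sum_i(z_i-z_bar)^2]

where w_ij are spatial weights, S_0 = sum_i sum_j w_ij, and z_bar is the mean of z (this is Moran's I).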
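The autocorrelation statistic above (Moran's I) can be computed directly from a value vector and a weight matrix. The sketch below is a plain-Python illustration: `morans_i` is a hypothetical helper, and the four-unit chain adjacency is a toy weight matrix, not the project's spatial structure.

```python
def morans_i(z: list[float], w: list[list[float]]) -> float:
    """Moran's I for values z and an n x n spatial weight matrix w."""
    n = len(z)
    if len(w) != n or any(len(row) != n for row in w):
        raise ValueError("w must be an n x n matrix matching len(z)")
    z_bar = sum(z) / n
    dev = [zi - z_bar for zi in z]           # z_i - z_bar
    s0 = sum(sum(row) for row in w)          # S_0: sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Four units on a line; each unit neighbours the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

print(morans_i([1.0, 2.0, 3.0, 4.0], chain))    # positive: neighbours are similar
print(morans_i([1.0, -1.0, 1.0, -1.0], chain))  # -1.0: neighbours alternate
```

Values well above zero suggest that neighbouring units have similar outcomes, which would motivate the spatially structured term s_j.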

11) Key Empirical Findings

  1. Strongest determinants: wealth, maternal education, antenatal care (ANC) checkups, block-level location, and delivery place.
  2. Lower wealth quintiles and lower maternal education are associated with substantially lower odds of being on-track.
  3. First-learner analysis shows pronounced deficits in core learning items (letters, numbers, counting).
  4. Tree models typically provide higher recall, while logistic remains most interpretable for determinants.

This project combines social-epidemiological interpretation with predictive modeling so that findings are both statistically defensible and actionable for policy.