Project 02 - Early Childhood Development in Rural Uttar Pradesh

1) Project Overview

Measure early childhood development (ECD) using the Early Childhood Development Index 2030 (ECDI2030) for children aged 24-59 months.
Identify socioeconomic, demographic, and geographic determinants of being developmentally on-track.
Produce evidence for targeted policy interventions.

Sample Snapshot

Raw sample: about 4,999 children.
Complete-case ML sample: about 4,869 children.
On-track: about 69.3%.
Not on-track: about 30.7%.

2) End-to-End Workflow

Clean/recode ECDI2030 items and key covariates (education, caste, religion, antenatal care (ANC), delivery place, wealth).
Construct age-specific on-track outcome.
Build multiple correspondence analysis (MCA) wealth index and stable grouped social variables.
Run descriptive and stratified inequality analysis.
Run first-learner item-level logistic models.
Train logistic regression, random forest (RF), XGBoost, and neural network (NN) models with repeated cross-validation (CV) and SMOTE.
Compare performance and feature importance.
Prepare outputs for reporting and stakeholder discussion.

3) ECDI2030 Scoring Mathematics

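S_i = sum_{j=1}^{20} x_{ij}, x_{ij} in {0,1}

where x_{ij} = 1 if child i's response to item j is scored as achieved, and 0 otherwise.

Age-specific threshold T(a_i), with a_i the child's age in months:

T(a_i) = 7 (24-29 months), 9 (30-35), 11 (36-41), 13 (42-47), 15 (48-59)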

y_i = 1[S_i >= T(a_i)]
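
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names are hypothetical, and it assumes each child has complete responses to all 20 items already recoded to 0/1.

```python
# Age bands (in months) and minimum item totals, copied from T(a_i) above.
THRESHOLDS = [
    (24, 29, 7),
    (30, 35, 9),
    (36, 41, 11),
    (42, 47, 13),
    (48, 59, 15),
]


def threshold(age_months):
    """Return T(a) for a child aged `age_months`; raise if outside 24-59."""
    for low, high, cutoff in THRESHOLDS:
        if low <= age_months <= high:
            return cutoff
    raise ValueError(f"age {age_months} months is outside the 24-59 range")


def on_track(items, age_months):
    """Return (S_i, y_i) for one child from 20 binary item responses."""
    if len(items) != 20 or any(x not in (0, 1) for x in items):
        raise ValueError("expected exactly 20 responses coded 0 or 1")
    score = sum(items)
    return score, int(score >= threshold(age_months))


# Hypothetical child: 38 months old, 12 of 20 items achieved.
example_items = [1] * 12 + [0] * 8
print(on_track(example_items, 38))  # (12, 1): 12 >= 11, so on-track
```

The same 12 items would leave a 44-month-old below the threshold of 13, which is why the outcome has to be built per age band rather than from a single cutoff.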

4) MCA Wealth Construction

With z_i the disjunctive (0/1 indicator-coded) asset vector for child i and v_1 the loading vector of the first MCA axis:

w_i = z_i^T v_1

Quintile assignment:

Q_i in {1,...,5}, with cut points at the 20th, 40th, 60th, and 80th empirical percentiles of w_i
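
As a rough sketch of these two steps, the snippet below computes w_i as a dot product and assigns quintiles by rank. The loadings and scores are made-up numbers, the function names are hypothetical, and the MCA fit that would produce v_1 is assumed to have been done elsewhere.

```python
def wealth_score(z, v1):
    """w_i = z_i^T v_1 for one 0/1 indicator vector z."""
    if len(z) != len(v1):
        raise ValueError("indicator vector and loadings differ in length")
    return sum(zi * vi for zi, vi in zip(z, v1))


def quintiles(scores):
    """Assign Q_i in 1..5 by rank, so each group holds about n/5 records.

    Ties are broken by input order, which is a simplification.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    q = [0] * n
    for rank, i in enumerate(order):
        q[i] = rank * 5 // n + 1
    return q


# Hypothetical loadings for four indicator columns:
# electricity yes/no, improved toilet yes/no.
v1 = [0.5, -0.5, 0.75, -0.75]
print(wealth_score([1, 0, 1, 0], v1))  # 1.25

# Ten hypothetical wealth scores, already computed.
scores = [-1.25, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25]
print(quintiles(scores))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Rank-based assignment keeps group sizes nearly equal even when many households share the same score; cutting on percentile values instead can give uneven groups when ties are common.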

5) Determinant Logistic Model

Pr(y_i=1 | X_i) = sigma(beta_0 + X_i^T beta), sigma(t)=1/(1+e^{-t})

log(p_i/(1-p_i)) = beta_0 + X_i^T beta

Odds ratio for predictor k:

OR_k = e^{beta_k}
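
To see why e^{beta_k} is an odds ratio, the sketch below compares it with the ratio of odds computed directly from the model's probabilities. The coefficients are hypothetical values chosen for illustration, not estimates from this project.

```python
import math


def sigma(t):
    """Logistic function sigma(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))


def odds(p):
    """Odds p / (1 - p) for a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients, NOT estimates from this project.
beta_0 = 0.2
beta_k = 0.5  # coefficient on a binary predictor x_k

p_without = sigma(beta_0)        # Pr(y=1 | x_k = 0)
p_with = sigma(beta_0 + beta_k)  # Pr(y=1 | x_k = 1)

or_from_beta = math.exp(beta_k)
or_from_probs = odds(p_with) / odds(p_without)

print(round(or_from_beta, 3), round(or_from_probs, 3))  # 1.649 1.649
```

The two numbers agree because beta_0 cancels in the ratio of odds. The odds ratio is therefore the same at every baseline, even though the change in probability is not.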

6) Item-Level First-Learner Models

For ECD item q:

log(Pr(q_i=1)/(1-Pr(q_i=1))) = alpha_0 + alpha_1 * FirstLearner_i

Item-OR = e^{alpha_1}

This isolates item-wise differences tied to first-learner status.
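
With a single binary predictor, the fitted model reproduces the 2x2 table of item outcome by first-learner status, so the Item-OR can be checked by hand. The sketch below assumes hypothetical counts and a hypothetical helper name, and it does not handle zero cells.

```python
import math


def item_or(pass_fl, fail_fl, pass_other, fail_other):
    """Return (alpha_0, alpha_1, Item-OR) from the 2x2 counts for one item.

    With one binary predictor the logistic model is saturated, so alpha_0
    is the log-odds among non-first-learners and alpha_1 is the difference
    in log-odds between first-learners and non-first-learners.
    """
    alpha_0 = math.log(pass_other / fail_other)
    alpha_1 = math.log(pass_fl / fail_fl) - alpha_0
    return alpha_0, alpha_1, math.exp(alpha_1)


# Hypothetical counts for one item (not project data).
a0, a1, item_odds_ratio = item_or(pass_fl=30, fail_fl=70, pass_other=60, fail_other=40)
print(round(item_odds_ratio, 3))  # 0.286
```

An Item-OR below 1, as in this made-up table, would mean first-learners have lower odds of achieving the item.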

7) Imbalance and Learners

SMOTE

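x_new = x_i + lambda(x_nn - x_i), lambda in (0,1)

where x_i is a minority-class observation, x_nn is one of its k nearest minority-class neighbours, and lambda is drawn uniformly at random.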
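
The interpolation step can be sketched as follows. This is an illustration of the formula only, using made-up numeric rows and hypothetical helper names; it is not a replacement for a tested SMOTE implementation.

```python
import random


def smote_point(x_i, x_nn, rng):
    """Return x_new = x_i + lambda * (x_nn - x_i), lambda drawn uniformly."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]


def nearest_neighbors(x, pool, k):
    """The k rows in `pool` closest to x (squared Euclidean), excluding x."""
    others = [p for p in pool if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical minority-class rows with two numeric features.
minority = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [2.0, 2.0]]

x_i = minority[0]
x_nn = rng.choice(nearest_neighbors(x_i, minority, k=2))
x_new = smote_point(x_i, x_nn, rng)
print(x_new)
```

Each synthetic row lies on the line segment between a real minority-class row and one of its neighbours, so it never falls outside the range spanned by that pair.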

Random Forest

y_hat = mode{T_b(x)}_{b=1}^B
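
The voting rule can be illustrated with three stand-in "trees". The splitting rules and feature names below are invented for the example, and real trees would be fitted on bootstrap samples.

```python
from collections import Counter


def forest_predict(trees, x):
    """Majority vote over B trees: y_hat = mode{T_b(x)}."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical one-split "trees" on a feature dict (illustration only).
trees = [
    lambda x: int(x["wealth_quintile"] >= 3),
    lambda x: int(x["mother_secondary_edu"] == 1),
    lambda x: int(x["anc_visits"] >= 4),
]

child = {"wealth_quintile": 2, "mother_secondary_edu": 1, "anc_visits": 4}
print(forest_predict(trees, child))  # 1: two of three trees vote on-track
```

Some implementations average the trees' predicted class probabilities instead of counting hard votes, which can give a different label when the trees are close to evenly split.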

8) Boosting Objective

y_hat_i^{(t)} = y_hat_i^{(t-1)} + f_t(x_i)

Obj = sum_i l(y_i, y_hat_i) + sum_t Omega(f_t)

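XGBoost can capture nonlinear effects and interactions among predictors, which the logistic model above represents only if they are added explicitly; in this project it is used to improve screening sensitivity (see Section 11).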
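
The additive update can be made concrete with a deliberately simplified sketch. It assumes logistic loss, an L2 penalty `lam` on leaf weights as the only regularization term, and a "tree" f_t with a single leaf, so every child receives the same increment. Real boosted trees partition the feature space, and all names and labels here are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def boosting_round(y, raw_scores, lam=1.0, eta=1.0):
    """One update y_hat^(t) = y_hat^(t-1) + f_t with a single-leaf f_t.

    For logistic loss the gradient and Hessian with respect to the raw
    score are g_i = p_i - y_i and h_i = p_i * (1 - p_i). With an L2 penalty
    lam, the second-order optimal leaf weight is -sum(g) / (sum(h) + lam).
    """
    p = [sigmoid(s) for s in raw_scores]
    g = sum(pi - yi for pi, yi in zip(p, y))
    h = sum(pi * (1.0 - pi) for pi in p)
    w = -g / (h + lam)
    return [s + eta * w for s in raw_scores]


def log_loss(y, raw_scores):
    """Summed logistic loss l(y_i, y_hat_i) over all observations."""
    return -sum(
        yi * math.log(sigmoid(s)) + (1 - yi) * math.log(1.0 - sigmoid(s))
        for yi, s in zip(y, raw_scores)
    )


y = [1, 1, 1, 0]               # hypothetical labels
scores = [0.0, 0.0, 0.0, 0.0]  # y_hat^(0): raw scores on the log-odds scale
for t in range(3):
    scores = boosting_round(y, scores)
    print(t + 1, round(log_loss(y, scores), 4))
```

The printed loss falls with each round, and the penalty `lam` shrinks each step, which is the role Omega(f_t) plays in the objective.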

9) Evaluation Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Precision = TP / (TP + FP)

F1 = 2 * Precision * Recall / (Precision + Recall), where Recall = Sensitivity
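
A small helper makes the relationships between these metrics easy to check. The counts are invented, the function name is hypothetical, and the source does not state which class is treated as positive, so that choice has to be fixed before reading the numbers.

```python
def screening_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and F1 from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical confusion counts (not project results).
m = screening_metrics(tp=60, tn=80, fp=20, fn=40)
for name, value in m.items():
    print(name, round(value, 3))
```

In this made-up example accuracy is 0.7 while sensitivity is only 0.6, which shows how accuracy alone can hide missed cases in a screening setting.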

Model selection balances interpretability (logistic) and screening sensitivity (tree-based learners).

10) Planned Spatial/Multilevel Extension

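logit(p_{ij}) = beta_0 + X_{ij}^T beta + u_j + s_j

where p_{ij} is the on-track probability for child i in area j, u_j ~ N(0, sigma_u^2) is an unstructured area-level random effect, and s_j is a spatially structured effect (conditional autoregressive, CAR, or Besag-York-Mollié, BYM, style).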

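Exploratory spatial autocorrelation metric (Moran's I):

I = (n/S_0) * [sum_i sum_j w_ij(z_i-z_bar)(z_j-z_bar)] / [sum_i(z_i-z_bar)^2]

where, in this formula only, i and j index the n areas, z_i is the area-level value being tested (for example, the on-track proportion), w_ij is the spatial weight between areas i and j, and S_0 = sum_i sum_j w_ij. This z_i is unrelated to the asset vector in Section 4.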
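
The statistic above (Moran's I) can be computed directly from its definition. The sketch assumes a small made-up example: four areas on a line with binary adjacency weights and invented area-level values. S_0 is taken as the sum of all weights.

```python
def morans_i(z, w):
    """Moran's I for values z (length n) and an n x n weight matrix w."""
    n = len(z)
    z_bar = sum(z) / n
    dev = [zi - z_bar for zi in z]
    s0 = sum(sum(row) for row in w)  # S_0: sum of all spatial weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Four hypothetical areas on a line (1-2-3-4); neighbours share a border.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1, 2, 8, 9]    # similar values sit next to each other
alternating = [1, 9, 1, 9]  # neighbours differ sharply

print(round(morans_i(clustered, w), 3))    # 0.4
print(round(morans_i(alternating, w), 3))  # -1.0
```

Positive values indicate that neighbouring areas tend to have similar values, and negative values indicate that neighbours tend to differ.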

11) Key Empirical Findings

Strongest determinants: wealth, maternal education, ANC checkups, block-level location, and delivery place.
Lower wealth quintiles and lower maternal education are associated with substantially lower odds of being on-track.
First-learner analysis shows pronounced deficits in core learning items (letters, numbers, counting).
Tree models typically provide higher recall, while logistic remains most interpretable for determinants.

This project combines social-epidemiological interpretation with predictive modeling so findings are both statistically defensible and policy actionable.