A Quantitative Data-Driven Evaluation of Cost Efficiency in Cloud and Distributed Computing for Machine Learning Pipelines
DOI: https://doi.org/10.63125/7tkcs525

Keywords: Cloud Computing, Distributed Systems, Machine Learning, Cost Efficiency, Telemetry

Abstract
This study presents a quantitative evaluation of cost efficiency, performance behavior, and resource utilization in machine learning pipelines executed under cloud-based and distributed computing environments. Using a consolidated dataset that integrated pipeline execution logs, fine-grained telemetry, and infrastructure billing records, the analysis examined 120 replicated pipeline runs spanning training-dominant and inference-dominant workloads. Each run was decomposed into five pipeline stages—ETL, preprocessing, training, evaluation, and serving—enabling both run-level and stage-level assessment of cost and runtime behavior. Descriptive results indicated that cloud executions exhibited higher median total cost per run (USD 42.80, IQR 31.20–66.50) compared to distributed executions (USD 38.10, IQR 29.40–52.30), alongside greater cost dispersion driven by autoscaling, orchestration overhead, and network egress. Distributed runs demonstrated higher average compute utilization (median 74.8% vs. 61.2%) and lower idle time shares (11.6% vs. 18.9%), contributing to more stable cost behavior. Multivariate regression models explained a substantial proportion of cost variability (R² = 0.68 for total cost; R² = 0.54 for normalized cost efficiency). Job completion time (β = 0.018, p < 0.001) and network egress (β = 0.031, p = 0.001) emerged as the strongest positive predictors of total cost, while higher average compute utilization was associated with lower cost after controlling for runtime (β = −0.007, p < 0.001). In the cost-efficiency model, higher training throughput (β = 0.00042, p < 0.001) and higher utilization intensity improved efficiency, whereas orchestration overhead (β = −0.091, p = 0.015), higher inference concurrency (β = −0.033, p = 0.026), and increased tail latency reduced efficiency. 
Hypothesis testing confirmed statistically significant differences favoring distributed environments for normalized cost efficiency (Cohen’s d = 0.44) and training completion time, while cloud environments achieved lower inference tail latency (p95 difference ≈ −65 ms). Overall, the findings demonstrate that cost efficiency in machine learning pipelines is driven less by median throughput differences than by utilization stability, orchestration behavior, and network-driven variability. By empirically linking telemetry-derived predictors to monetary outcomes, this study provides robust quantitative evidence for infrastructure-aware cost optimization in modern machine learning systems.
