AI-READY DATA ENGINEERING PIPELINES: A REVIEW OF MEDALLION ARCHITECTURE AND CLOUD-BASED INTEGRATION MODELS
DOI:
https://doi.org/10.63125/51kxtf08
Keywords:
Medallion Architecture, AI-Ready Data Pipelines, Cloud Integration Models, Data Lakehouse, MLOps Enablement
Abstract
This systematic review investigates AI-ready data engineering pipelines by analyzing 106 studies published between 2010 and 2022, focusing on Medallion Architecture, cloud-native integration models, metadata management, and lakehouse infrastructure. Following PRISMA guidelines, sources were retrieved from IEEE Xplore, Scopus, Web of Science, ScienceDirect, and Google Scholar. The review examines key architectural strategies, integration patterns, and governance mechanisms that support scalable and explainable AI workflows. Medallion Architecture was discussed in 42 studies, highlighting its tiered bronze-silver-gold design that supports modular transformations and data traceability. Case studies demonstrated reduced redundancy, enhanced reproducibility, and compatibility with MLOps practices, making it well-suited for use cases in fintech, retail, and predictive maintenance. Cloud-native tools such as AWS Glue, Azure Data Factory, and GCP Dataflow appeared in 58 articles. These platforms support real-time orchestration, autoscaling, and serverless execution. Studies reported a 30% reduction in deployment time when pipelines leveraged containerization, low-code orchestration, and cloud-native storage systems. Multi-cloud and hybrid models were noted for addressing data sovereignty, latency, and vendor lock-in concerns. Metadata and data lineage were central to 39 studies, which emphasized the importance of schema versioning, transformation tracking, and audit readiness. Tools like Apache Atlas, Amundsen, and Microsoft Purview were shown to enhance model explainability and reproducibility, reducing audit time and enabling ethical AI deployment. Thirty-six studies focused on lakehouse platforms such as Delta Lake and Apache Hudi. These systems combined the scalability of data lakes with the reliability of warehouses, enabling schema-on-read, real-time feature updates, and versioned data snapshots across training and serving pipelines. 
However, 31 studies noted challenges including metadata inconsistency in multi-region setups, storage overhead from versioning, and organizational gaps in MLOps responsibilities. These findings underscore the need for integrated governance, standardized roles, and cross-functional collaboration.
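As an illustration of the tiered bronze-silver-gold design described in the abstract, the following is a minimal in-memory sketch in plain Python. It is not drawn from any of the reviewed studies: the function names (to_bronze, to_silver, to_gold), the order record fields (order_id, customer, amount), and the metadata keys (_source, _ingested_at) are all hypothetical, and a production pipeline would typically run on a lakehouse engine rather than on Python lists.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_bronze(raw_records, source):
    """Bronze: keep raw records unchanged and attach ingestion metadata only."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {"payload": rec, "_source": source, "_ingested_at": ingested_at}
        for rec in raw_records
    ]


def to_silver(bronze_rows):
    """Silver: validate, cast types, normalize text, and deduplicate by order_id."""
    seen = set()
    silver = []
    for row in bronze_rows:
        payload = row["payload"]
        try:
            order_id = str(payload["order_id"])
            customer = payload["customer"].strip().lower()
            amount = float(payload["amount"])
        except (KeyError, TypeError, ValueError, AttributeError):
            continue  # malformed rows stay in bronze but are not promoted
        if order_id in seen:
            continue  # keep the first occurrence of each order
        seen.add(order_id)
        silver.append(
            {
                "order_id": order_id,
                "customer": customer,
                "amount": amount,
                "_source": row["_source"],  # carried forward for traceability
            }
        )
    return silver


def to_gold(silver_rows):
    """Gold: aggregate cleaned rows into a consumer-facing table (spend per customer)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    raw = [
        {"order_id": 1, "customer": " Alice ", "amount": "10.5"},
        {"order_id": 1, "customer": " Alice ", "amount": "10.5"},  # duplicate
        {"order_id": 2, "customer": "Bob", "amount": "oops"},      # bad amount
        {"order_id": 3, "customer": "alice", "amount": 4.5},
    ]
    bronze = to_bronze(raw, source="orders_api")
    silver = to_silver(bronze)
    gold = to_gold(silver)
    print(len(bronze), len(silver), gold)
```

Each layer reads only from the layer before it, so a transformation can be rerun or audited in isolation, and the _source field carried from bronze into silver is a small stand-in for the lineage tracking the abstract attributes to this design.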