Abstract
The rapid ascent of big data technologies has fundamentally reshaped the computational landscape, creating a significant demand for a workforce proficient in distributed data processing. Traditional pedagogical methods in computer science, which often emphasize discrete algorithmic problems and localized execution environments, are increasingly misaligned with the practical, systems-oriented challenges inherent in this domain. This article proposes a comprehensive project-based learning framework designed specifically for teaching distributed data processing. The framework moves beyond theoretical exposition and simple syntax tutorials, instead situating learning within the context of a sustained, complex, and authentic project that mirrors the realities of data engineering in industry and research. We argue that this approach is not merely beneficial but essential for cultivating a deep, integrated understanding of concepts such as parallelization, fault tolerance, and cluster resource management. The article details the core principles of the framework, outlines a phased implementation strategy, discusses the challenges of managing a distributed systems classroom, and presents a qualitative analysis of the competencies developed. The primary thesis is that by grappling with the entire data lifecycle, from ingestion and storage to processing and analysis, within a project-based paradigm, students develop the robust technical skills and, more critically, the systemic problem-solving mindset required to navigate the complexities of modern data infrastructure.
