Abstract
The rapid ascent of big data technologies has fundamentally reshaped the computational landscape, creating a significant demand for a workforce proficient in distributed data processing. Traditional pedagogical methods in computer science, which often emphasize discrete algorithmic problems and localized execution environments, are increasingly misaligned with the practical, systems-oriented challenges inherent in this domain. This article proposes a comprehensive project-based learning framework designed specifically for teaching distributed data processing. The framework moves beyond theoretical exposition and simple syntax tutorials, instead situating learning within the context of a sustained, complex, and authentic project that mirrors the realities of data engineering in industry and research. We argue that this approach is not merely beneficial but essential for cultivating a deep, integrated understanding of concepts such as parallelization, fault tolerance, and cluster resource management. The article details the core principles of the framework, outlines a phased implementation strategy, discusses the challenges of managing a distributed systems classroom, and presents a qualitative analysis of the competencies developed. The primary thesis is that by grappling with the entire data lifecycle, from ingestion and storage to processing and analysis, within a project-based paradigm, students develop the robust technical skills and, more critically, the systemic problem-solving mindset required to navigate the complexities of modern data infrastructure.
