TU Berlin, Department of Telecommunication Systems
Distributed dataflow systems process large volume of data in parallel on multiple machines. In production, multiple dataflow applications are scheduled for execution in virtual containers on a per-job basis. Furthermore, they access datasets partitioned into datablocks across the cluster machines´ disks. Runtime performance is important for many of these jobs, as their users expect fast results. However, optimizing performance is difficult, because dataflow jobs are very diverse and used in a wide variety of domains such as relational processing, machine learning, and graph processing. Container and datablock placement decisions impact a job´s runtime performance significantly. Furthermore, changing placements affects runtime performance without modifying the application´s code, and thus can be applied to many jobs without much configuration effort from the user´s side. However, jobs<br> benefit differently from placement decisions, because their resource demands differ from job to job. Hence, there is not a single placement strategy that is optimal for all possible jobs. Besides that, users require a secure long-term data retention for their documents and datasets. This thesis presents container and datablock placement strategies to optimize the runtime performance of distributed dataflow applications running on shared data analytics platforms. It contributes two placement methods for this. The first method improves the efficiency of a job´s dataflow operations and the degree of data locality by colocating its input datablocks and containers on a selected set of nodes.<br> The second method places a job´s containers based on network distances between containers and its input datablocks as well as container interference. In addition, this thesis explores the problem of data retention in shared data analytics platforms. Therefore, it contributes a method of storing and accessing lineage metadata through smart-contracts executed on a decentralized blockchain network.<br> The methods presented in this thesis have been implemented in a research prototype that has been integrated with Hadoop and Ethereum. For evaluation, we used a 64 nodes commodity cluster and workloads consisting of applications implemented in Flink from the domains of relational processing, machine learning, and graph processing. We compared the runtime performance of workloads scheduled with our methods with Hadoop´s default placement method. For our blockchain-based<br> data retention method, we measured overhead in terms of additional response time and reported costs using it on Ethereum´s blockchain network.
Information and Communication Technology