High-Performance Framework for Dataset Generation
I'm researching geometry processing with deep learning. For this, I often need to create large synthetic datasets.
The dataset creation can take multiple days. Removing redundant computations and using multi-processing efficiently would yield a large speed-up.
The framework must be abstract enough to be valuable for many applications. Therefore, I think that a graph-based approach is the way to go.
Users will only need to define program calls as edges with command-line arguments (e.g. constants, input and output directories).
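To make the idea of "program calls as edges" more concrete, here is a minimal sketch of what an edge definition could look like in Python. The class name `Edge`, its fields and the argument order are assumptions made for illustration only; the actual schema (e.g. an XML format) is still to be designed.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Edge:
    """One program execution that turns input nodes into output nodes.

    Hypothetical schema: all field names are illustrative assumptions.
    """
    name: str
    program: str                     # executable to call
    inputs: tuple[Path, ...]         # nodes (files or directories) consumed
    outputs: tuple[Path, ...]        # nodes (files or directories) produced
    constants: tuple[str, ...] = ()  # fixed command-line arguments

    def command(self) -> list[str]:
        """Build the argument list: program, constants, inputs, outputs."""
        return [
            self.program,
            *self.constants,
            *(str(p) for p in self.inputs),
            *(str(p) for p in self.outputs),
        ]
```

An edge for a made-up program `simplify_mesh` could then be declared as `Edge("simplify", "simplify_mesh", (Path("in/a.obj"),), (Path("out/a.obj"),), ("--ratio", "0.5"))`, and `command()` would return the list that is handed to the process launcher.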
- *Graph-based pipeline definition
- View input, intermediate and final results (e.g. files) as nodes
- View program executions as edges
- *Check if edge executions are necessary (e.g. compare file timestamps, check if the program used by the edge has changed)
- *Simple yet powerful definition of edges (e.g. command-line parameters for processing programs in XML)
- *Handle failed and non-terminating edge executions, e.g. by terminating processes and removing broken input files
- *Efficient multi-threading or multi-processing
- Display progress (for each edge)
- Log the command-line output of edges so that it can be searched (e.g. with text files in a log directory or a small database)
- Stop conditions for edges, e.g. max memory and max CPU time
- Thorough testing to ensure that the code is free of bugs (e.g. with some unit tests)
- Depth-first approach, so that we quickly get some output files
- Use clusters like the TU Wien Hadoop cluster and maybe even the Vienna Scientific Cluster
- Let it run as a service / daemon in the background and react to new files in the input directories (e.g. via filesystem hooks on Windows)
- Complexity estimation of edges (e.g. log required time for an edge to process a file of a certain size to better predict the progress)
- Execute fast edges on-the-fly and delete their results to minimize disk usage
Tasks with * are necessary, other tasks are chosen by preference and time constraints. This topic can be scaled for bachelor and master theses, as well as for student projects.
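The timestamp comparison mentioned in the task list could be sketched as follows. This is only one possible approach, similar in spirit to what build tools like `make` do; the function name `needs_run` is an assumption. Passing the program file itself as one of the inputs would also cover the case where the program used by the edge has changed.

```python
from pathlib import Path
from typing import Iterable


def needs_run(inputs: Iterable[Path], outputs: Iterable[Path]) -> bool:
    """Return True if an edge has to be (re-)executed.

    Assumption: an edge is up to date when all outputs exist and none of
    them is older than the newest input. A missing input raises
    FileNotFoundError, which the caller would have to handle.
    """
    outputs = list(outputs)
    # No declared outputs or a missing output: the edge must run.
    if not outputs or any(not p.exists() for p in outputs):
        return True
    newest_input = max((p.stat().st_mtime_ns for p in inputs), default=0)
    oldest_output = min(p.stat().st_mtime_ns for p in outputs)
    return newest_input > oldest_output
```

Timestamps are cheap to check but can be fooled by copied or restored files; comparing content hashes would be a more robust, although slower, alternative.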
- English (code and report must be in English)
- Basic experience in Windows (WSL) and Linux
- Basic understanding of multi-threading and multi-processing
- Basic understanding of graph theory
- Nice to have:
- Experience with Python (the framework should be in Python)
- Experience with multi-threading and multi-processing (in Python)
- Experience with (Hadoop) clusters
- Basic understanding of deep learning, computer graphics and geometry processing
The project should be implemented as a Python (command-line) application working on both Windows and Linux.
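Because the application should run on both Windows and Linux, the handling of failed and non-terminating edge executions could rely on the Python standard library alone. The following is a minimal sketch under that assumption; the function names `run_edge` and `run_all` and the status strings are illustrative, not a prescribed design.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_edge(command, timeout_s):
    """Run one command and return 'ok', 'failed' or 'timeout'."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def run_all(commands, timeout_s=60.0, workers=4):
    """Run several commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: run_edge(c, timeout_s), commands))


if __name__ == "__main__":
    demo = [
        [sys.executable, "-c", "pass"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    ]
    print(run_all(demo, timeout_s=10.0))
```

Threads are sufficient here because the actual work happens in the child processes, so the Python global interpreter lock is not a bottleneck. Limits on memory or CPU time, cleanup of partially written files, and scheduling along the graph are not covered by this sketch.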