High-Performance Framework for Dataset Generation
I'm researching geometry processing with deep learning. For this, I need to create large datasets on a regular basis.
Creating a dataset often takes multiple days. Removing redundant work and using multi-processing efficiently would yield a large speed-up.
The framework must be abstract enough to be valuable for many applications. Therefore, I think that a graph-based approach is the way to go.
Users will only need to define program calls as edges with command-line arguments (e.g. constants, input and output directories).
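As a rough illustration of what such an edge definition might look like, here is a minimal sketch. The class name `Edge`, its fields, the `{input}`/`{output}` placeholders and the `remesh` program are all hypothetical and only serve the example; the actual interface is up to the project.

```python
import shlex
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Edge:
    """One program execution that turns an input file into an output file."""
    name: str
    command: str            # template with {input} and {output} placeholders
    input_dir: Path         # node holding the input files
    output_dir: Path        # node holding the results
    output_suffix: str = ".out"

    def output_for(self, input_file: Path) -> Path:
        # Same base name as the input, placed in the output directory.
        return self.output_dir / (input_file.stem + self.output_suffix)

    def argv_for(self, input_file: Path) -> list:
        # Split the template first, then fill in the placeholders, so that
        # paths containing spaces stay a single argument.
        values = {"input": str(input_file),
                  "output": str(self.output_for(input_file))}
        return [token.format(**values) for token in shlex.split(self.command)]


# Hypothetical edge: "remesh" is a made-up program name.
remesh = Edge(
    name="remesh",
    command="remesh --target-edge-length 0.01 {input} {output}",
    input_dir=Path("meshes"),
    output_dir=Path("remeshed"),
    output_suffix=".ply",
)
print(remesh.argv_for(Path("meshes/bunny.obj")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues on both Linux and Windows.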
- *Graph-based pipeline definition
- View input, intermediate and final results (e.g. files) as nodes
- View program executions as edges
- *Check if edge executions are necessary (e.g. compare file timestamps, check if the program used by the edge has changed)
- *Simple yet powerful definition of edges (e.g. command-line parameters for processing programs)
- *Handle failed and non-terminating edge executions by e.g. terminating processes and removing broken input files
- *Efficient multi-threading or multi-processing
- *Display progress (for each edge)
- *Log the command-line output of edges so that it can be searched (e.g. as text files in a log directory or in a small database)
- *Stop conditions for edges, e.g. max memory and max CPU time
- *Thorough testing to ensure that the code is free of bugs (e.g. with some unit tests)
- *Depth-first approach, so that we quickly get some output files
- Use clusters like the TU Wien Hadoop cluster and maybe even the Vienna Scientific Cluster
- Let it run as a service / daemon in the background and react to new files in the input directories (e.g. via filesystem hooks on Windows)
- Complexity estimation of edges (e.g. log required time for an edge to process a file of a certain size to better predict the progress)
- Execute fast edges on-the-fly and delete their results to minimize the dataset size
Tasks marked with * are required; the other tasks are chosen according to preference and time constraints.
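Two of the required tasks, checking whether an edge execution is necessary and handling failed or non-terminating executions, can be sketched with the standard library alone. The function names `needs_run` and `run_edge` are hypothetical. The sketch assumes a timestamp-only staleness check (it does not detect a changed program) and, as one possible cleanup policy, deletes output files that a failed edge may have left behind.

```python
import subprocess
from pathlib import Path


def needs_run(inputs, outputs):
    """Return True if an output is missing or older than the newest input."""
    inputs, outputs = list(inputs), list(outputs)
    if not outputs or any(not p.exists() for p in outputs):
        return True
    newest_input = max((p.stat().st_mtime for p in inputs), default=0.0)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_input > oldest_output


def run_edge(argv, outputs, timeout_s):
    """Run one edge execution and return (ok, log).

    subprocess.run kills the child process when the timeout expires.
    On failure or timeout, possibly broken outputs are removed so that
    the next run does not mistake them for valid results.
    """
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=timeout_s)
        ok = result.returncode == 0
        log = result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        ok, log = False, f"timed out after {timeout_s} s"
    if not ok:
        for path in outputs:
            Path(path).unlink(missing_ok=True)  # requires Python 3.8+
    return ok, log
```

A scheduler would call `needs_run` for every input file of an edge and hand only the stale ones to worker processes, writing the returned `log` text to the log directory. Limits on memory and CPU time would need additional, platform-specific handling that is not shown here.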
- English (code and report must be in English)
- Basic C++ skills (to port code to Linux and modify it if necessary)
- Nice to have:
- Experience with Python (the framework should be in Python)
- Experience with multi-threading and multi-processing (in Python)
- Experience with (Hadoop) clusters
- Building C++ code on Linux and Windows (the clusters run on Linux; the framework should run on both operating systems, possibly via the Windows Subsystem for Linux)
- Basic understanding of deep learning, computer graphics and geometry processing
The project should be implemented as a Python command-line framework.
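A possible starting point for the command-line entry point is sketched below with `argparse`. The program name `pipeline` and the options `--jobs` and `--log-dir` are assumptions made for illustration, not part of the task description.

```python
import argparse
import os


def build_parser():
    """Build the argument parser for a hypothetical 'pipeline' command."""
    parser = argparse.ArgumentParser(
        prog="pipeline",
        description="Run a graph-based dataset generation pipeline.",
    )
    parser.add_argument("graph",
                        help="path to the pipeline definition file")
    parser.add_argument("--jobs", type=int, default=os.cpu_count() or 1,
                        help="number of worker processes")
    parser.add_argument("--log-dir", default="logs",
                        help="directory for per-edge log files")
    return parser


# Example invocation with explicit arguments instead of sys.argv.
args = build_parser().parse_args(["graph.json", "--jobs", "4"])
print(args.graph, args.jobs, args.log_dir)
```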