High-Performance Framework for Dataset Generation



I'm researching geometry processing with deep learning. For this, I need to create large datasets on a regular basis.
Creating a dataset often takes multiple days. Skipping redundant computations and using multi-processing efficiently would yield a large speed-up.

The framework must be abstract enough to be valuable for many applications. Therefore, I think that a graph-based approach is the way to go.
Users will only need to define program calls as edges with command-line arguments (e.g. constants, input and output directories).
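To illustrate what such an edge definition could look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Edge` class, its fields, the `{input}`/`{output}` placeholders, and the program name `sampler` in the usage note are hypothetical and not part of any existing framework.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Sequence


@dataclass(frozen=True)
class Edge:
    """One program execution in the pipeline graph (hypothetical design)."""
    name: str
    program: str            # executable to call
    args: Sequence[str]     # argument template; may contain {input} and {output}
    input_dir: Path         # node holding the input files
    output_dir: Path        # node holding the result files
    output_suffix: str = ".out"

    def command_for(self, input_file: Path) -> List[str]:
        """Build the concrete command line for a single input file."""
        output_file = self.output_dir / (input_file.stem + self.output_suffix)
        return [self.program] + [
            arg.format(input=str(input_file), output=str(output_file))
            for arg in self.args
        ]
```

With this sketch, an edge from a `meshes` directory to a `points` directory would be declared as `Edge("sample", "sampler", ("--in", "{input}", "--out", "{output}", "--n", "1000"), Path("meshes"), Path("points"), ".xyz")`, and `command_for(Path("meshes/bunny.ply"))` would return the argument list to pass to the operating system.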


  • *Graph-based pipeline definition
    • View input, intermediate and final results (e.g. files) as nodes
    • View program executions as edges
  • *Check if edge executions are necessary (e.g. compare file timestamps, check if the program used by the edge has changed)
  • *Simple yet powerful definition of edges (e.g. command-line parameters for processing programs)
  • *Handle failed and non-terminating edge executions, e.g. by terminating processes and removing broken input files
  • *Efficient multi-threading or multi-processing
  • *Display progress (for each edge)
  • *Log the console output of edges so that it can be searched (e.g. with text files in a log directory or a small database)
  • *Stop conditions for edges, e.g. max memory and max CPU time
  • *Thorough testing to catch bugs early (e.g. with unit tests)
  • *Depth-first approach, so that we quickly get some output files
  • Use clusters like the TU Wien Hadoop cluster and maybe even the Vienna Scientific Cluster
  • Let it run as a service / daemon in the background and react to new files in the input directories (e.g. via filesystem hooks on Windows)
  • Complexity estimation of edges (e.g. log required time for an edge to process a file of a certain size to better predict the progress)
  • Execute fast edges on-the-fly and delete their results to minimize the dataset size

Tasks with * are necessary, other tasks are chosen by preference and time constraints.
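Two of the required tasks, the timestamp-based necessity check and the handling of failed or non-terminating executions, could be sketched as follows. This is only one possible approach under stated assumptions: the function names `needs_run` and `run_edge` are hypothetical, the check compares modification times only (it does not detect a changed program), and on failure the sketch deletes the edge's partial output files, which is one interpretation of the clean-up step.

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def needs_run(inputs: Sequence[Path], outputs: Sequence[Path]) -> bool:
    """Return True if an output is missing or older than the newest input."""
    if not outputs or not all(p.exists() for p in outputs):
        return True
    newest_input = max((p.stat().st_mtime for p in inputs), default=0.0)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_input > oldest_output


def run_edge(cmd: Sequence[str], outputs: Sequence[Path],
             log_file: Path, timeout_s: float) -> bool:
    """Run one edge, log its console output, and clean up on failure."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    try:
        with open(log_file, "w") as log:
            # subprocess.run kills the child process if the timeout expires.
            result = subprocess.run(list(cmd), stdout=log,
                                    stderr=subprocess.STDOUT,
                                    timeout=timeout_s)
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    if not ok:
        # Assumption: partial results are worthless, so remove them.
        for path in outputs:
            path.unlink(missing_ok=True)  # requires Python 3.8+
    return ok
```

A scheduler would call `needs_run` before each edge execution and skip the edge when it returns `False`. A wall-clock timeout is only a rough stand-in for the max-CPU-time and max-memory stop conditions listed above, which would need additional, platform-specific handling.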


  • English (code and report must be in English)
  • Basic C++ skills (to port code to Linux and modify it if necessary)
  • Nice to have:
    • Experience with Python (the framework should be in Python)
    • Experience with multi-threading and multi-processing (in Python)
    • Experience with (Hadoop) clusters
    • Building C++ code on Linux and Windows (the clusters run on Linux, the framework should run on both operating systems, maybe via Windows Subsystem for Linux)
    • Basic understanding of deep learning, computer graphics and geometry processing


The project should be implemented as a Python command-line framework.


For more information please contact Philipp Erler (perler@cg.tuwien.ac.at).