High-Performance Framework for Dataset Generation

Type: 
BA/PR/DA
Persons: 
1-2
Workgroup: 

Description

I'm researching geometry processing with deep learning. For this, I often need to create large synthetic datasets.
Creating such a dataset can take multiple days. Skipping redundant computations and using multi-processing efficiently would yield a large speed-up.

The framework must be abstract enough to be valuable for many applications. Therefore, I think that a graph-based approach is the way to go.
Users will only need to define program calls as edges with command-line arguments (e.g. constants, input and output directories).
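
To make the idea more concrete, here is a minimal sketch of how such a pipeline could be described in Python. It is only an illustration under assumptions: the `Edge` class, its fields and the `execution_order` helper are hypothetical names, not part of an existing framework. Nodes are file or directory paths, edges are program calls, and the standard-library `graphlib` module (Python 3.9+) orders the edges so that each one runs after the edges that produce its inputs.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from pathlib import Path


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge: one program call that turns input nodes into output nodes."""
    name: str
    program: str
    args: tuple      # constant command-line arguments
    inputs: tuple    # input nodes (paths)
    outputs: tuple   # output nodes (paths)

    def argv(self):
        # Assumed convention: constants first, then input paths, then output paths.
        return [self.program, *self.args,
                *map(str, self.inputs), *map(str, self.outputs)]


def execution_order(edges):
    """Return edge names so that every edge comes after the producers of its inputs."""
    producer = {out: e.name for e in edges for out in e.outputs}
    sorter = TopologicalSorter()
    for e in edges:
        deps = {producer[i] for i in e.inputs if i in producer}
        sorter.add(e.name, *deps)
    # static_order() raises graphlib.CycleError if the pipeline contains a cycle.
    return list(sorter.static_order())
```

An XML or other declarative front end, as mentioned in the tasks, could be parsed into the same kind of objects; that part is left open here.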

Tasks

  • *Graph-based pipeline definition
    • View input, intermediate and final results (e.g. files) as nodes
    • View program executions as edges
  • *Check if edge executions are necessary (e.g. compare file timestamps, check if the program used by the edge has changed)
  • *Simple yet powerful definition of edges (e.g. command-line parameters for processing programs in XML)
  • *Handle failed and non-terminating edge executions, e.g. by terminating processes and removing broken input files
  • *Efficient multi-threading or multi-processing
  • Display progress (for each edge)
  • Log the console output of edges so that it can be searched (e.g. with text files in a log directory or a small database)
  • Stop conditions for edges, e.g. max memory and max CPU time
  • Thorough testing to ensure that the code is free of bugs (e.g. with some unit tests)
  • Depth-first approach, so that we quickly get some output files
  • Use clusters like the TU Wien Hadoop cluster and maybe even the Vienna Scientific Cluster
  • Let it run as a service / daemon in the background and react to new files in the input directories (e.g. via filesystem hooks on Windows)
  • Complexity estimation of edges (e.g. log required time for an edge to process a file of a certain size to better predict the progress)
  • Execute fast edges on-the-fly and delete their results to minimize disk usage

Tasks marked with * are required; the other tasks are chosen according to preference and time constraints. This topic can be scaled for bachelor and master theses, as well as for student projects.
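
Two of the required tasks, checking whether an edge needs to run and handling failed or non-terminating executions, could look roughly like the following sketch. The function names `is_stale` and `run_edge` are hypothetical. The sketch only compares file timestamps (it does not detect a changed program), and it cleans up by deleting the edge's partial output files, which is one possible interpretation of the cleanup step.

```python
import subprocess
import sys
from pathlib import Path


def is_stale(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    outputs = [Path(p) for p in outputs]
    if not outputs or any(not p.exists() for p in outputs):
        return True
    newest_input = max((Path(p).stat().st_mtime for p in inputs), default=0.0)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_input > oldest_output


def run_edge(argv, outputs, timeout_s):
    """Run one edge; on failure or timeout, delete its (possibly broken) outputs."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=timeout_s)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        ok = False
    if not ok:
        for path in map(Path, outputs):
            path.unlink(missing_ok=True)  # requires Python 3.8+
    return ok


if __name__ == "__main__":
    # Usage example: a trivial "program" that exits successfully.
    print(run_edge([sys.executable, "-c", "pass"], [], timeout_s=10))
```

Note that the timeout only terminates the direct child process; programs that spawn their own subprocesses would need process groups or a similar mechanism.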

Requirements

  • English (code and report must be in English)
  • Basic experience in Windows (WSL) and Linux
  • Basic understanding of multi-threading and multi-processing
  • Basic understanding of graph theory
  • Nice to have:
    • Experience with Python (the framework should be in Python)
    • Experience with multi-threading and multi-processing (in Python)
    • Experience with (Hadoop) clusters
    • Basic understanding of deep learning, computer graphics and geometry processing

Environment

The project should be implemented as a Python (command-line) application working on both Windows and Linux.

Contact

For more information please contact Philipp Erler (perler@cg.tuwien.ac.at).