Information

Abstract

Statistical modeling is a key technology for generating business value from data. While the number of available algorithms and the need for them are growing, the number of people with the skills to effectively use such methods lags behind. Many application domain experts find it hard to use and trust algorithms that come as black boxes with insufficient interfaces to adapt. The field of Visual Analytics aims to solve this problem by a human-oriented approach that puts users in control of algorithms through interactive visual interfaces. However, designing accessible solutions for a broad set of users while re-using existing, proven algorithms poses significant challenges for the design of analytical infrastructures, visualizations, and interactions.

This thesis provides multiple contributions towards a more human-oriented modeling process: As a theoretical basis, it investigates how user involvement during the execution of algorithms can be realized from a technical perspective. Based on a characterization of needs regarding intermediate feedback and control, a set of formal strategies to realize user involvement in algorithms with different characteristics is presented. Guidelines for the design of algorithmic APIs are identified, and requirements for the re-use of algorithms are discussed. From a survey of frequently used algorithms within R, the thesis concludes that a range of pragmatic options for enabling user involvement in new and existing algorithms exist and should be used.

After these conceptual considerations, the thesis presents two methodological contributions that demonstrate how even inexperienced modelers can be effectively involved in the modeling process. First, a new technique called TreePOD guides the selection of decision trees along trade-offs between accuracy and other objectives, such as interpretability. Users can interactively explore a diverse set of candidate models generated by sampling the parameters of tree construction algorithms. Visualizations provide an overview of possible tree characteristics and guide model selection, while details on the underlying machine learning process are only exposed on demand. Real-world evaluation with domain experts in the energy sector suggests that TreePOD enables users with and without a statistical background to confidently identify suitable decision trees.

As the second methodological contribution, the thesis presents a framework for interactive building and validation of regression models. The framework addresses limitations of automated regression algorithms regarding the incorporation of domain knowledge, identifying local dependencies, and building trust in the models. Candidate variables for model refinement are ranked, and their relationship with the target variable is visualized to support an interactive workflow of building regression models. A real-world case study and feedback from domain experts in the energy sector indicate a significant effort reduction and increased transparency of the modeling process.

All methodological contributions of this work were implemented as part of a commercially distributed Visual Analytics software called Visplore. As the last contribution, this thesis reflects upon years of experience in deploying Visplore for modeling-related tasks in the energy sector. Dissemination and adoption are important aspects of making statistical models more accessible for domain experts, making this work relevant for practitioners and application-oriented researchers alike.

BibTeX

@phdthesis{Muehlbacher_diss_2018,
  title =      "Human-Oriented Statistical Modeling: Making Algorithms
               Accessible through Interactive Visualization",
  author =     "Thomas M\"{u}hlbacher",
  year =       "2018",
  abstract =   "Statistical modeling is a key technology for generating
               business value from data. While the number of available
               algorithms and the need for them are growing, the number of
               people with the skills to effectively use such methods lags
               behind. Many application domain experts find it hard to use
               and trust algorithms that come as black boxes with
               insufficient interfaces to adapt. The field of Visual
               Analytics aims to solve this problem by a human-oriented
               approach that puts users in control of algorithms through
               interactive visual interfaces. However, designing accessible
               solutions for a broad set of users while re-using existing,
               proven algorithms poses significant challenges for the
               design of analytical infrastructures, visualizations, and
               interactions. This thesis provides multiple contributions
               towards a more human-oriented modeling process: As a
               theoretical basis, it investigates how user involvement
               during the execution of algorithms can be realized from a
               technical perspective. Based on a characterization of needs
               regarding intermediate feedback and control, a set of formal
               strategies to realize user involvement in algorithms with
               different characteristics is presented. Guidelines for the
               design of algorithmic APIs are identified, and requirements
               for the re-use of algorithms are discussed. From a survey of
               frequently used algorithms within R, the thesis concludes
               that a range of pragmatic options for enabling user
               involvement in new and existing algorithms exist and should
               be used. After these conceptual considerations, the thesis
               presents two methodological contributions that demonstrate
               how even inexperienced modelers can be effectively involved
               in the modeling process. First, a new technique called
               TreePOD guides the selection of decision trees along
               trade-offs between accuracy and other objectives, such as
               interpretability. Users can interactively explore a diverse
               set of candidate models generated by sampling the parameters
               of tree construction algorithms. Visualizations provide an
               overview of possible tree characteristics and guide model
               selection, while details on the underlying machine learning
               process are only exposed on demand. Real-world evaluation
               with domain experts in the energy sector suggests that
               TreePOD enables users with and without a statistical
               background to confidently identify suitable decision
               trees. As the second methodological contribution, the thesis
               presents a framework for interactive building and validation
               of regression models. The framework addresses limitations of
               automated regression algorithms regarding the incorporation
               of domain knowledge, identifying local dependencies, and
               building trust in the models. Candidate variables for model
               refinement are ranked, and their relationship with the
               target variable is visualized to support an interactive
               workflow of building regression models. A real-world case
               study and feedback from domain experts in the energy sector
               indicate a significant effort reduction and increased
               transparency of the modeling process. All methodological
               contributions of this work were implemented as part of a
               commercially distributed Visual Analytics software called
               Visplore. As the last contribution, this thesis reflects
               upon years of experience in deploying Visplore for
               modeling-related tasks in the energy sector. Dissemination
               and adoption are important aspects of making statistical
               models more accessible for domain experts, making this work
               relevant for practitioners and application-oriented
               researchers alike.",
  month =      aug,
  address =    "Favoritenstrasse 9-11/E193-02, A-1040 Vienna, Austria",
  school =     "Institute of Computer Graphics and Algorithms, Vienna
               University of Technology",
  URL =        "https://www.cg.tuwien.ac.at/research/publications/2018/Muehlbacher_diss_2018/",
}