# Statistical analysis of machine learning algorithms

## What is "Machine Learning"?

In many applications and domains, massive amounts of data are collected and processed every day. To be able to make efficient use of such data, there is an urgent need for tools to extract important pieces of information from the flood of unimportant detail.Machine learning tries to deal with this problem by constructing algorithms that are able to analyze large amounts of complex data in a principled way. Machine learning algorithms are used in very diverse contexts: to recognize hand-written text, to extract information from images, to build automatic language-translation systems, to predict the behavior of customers in an online shop, to find genes that might be related to a particular disease, and so on. Generally speaking, machine learning algorithms can always be used when we want to extract "patterns" from complex data.

In order to ensure that the patterns we extract are "really there" and not only arise due to noise in the data, it is important that the behavior of machine learning algorithms is well understood, that its results can be controlled, reproduced, and come with strict performance guarantees. This is exactly what machine learning theory is focused on. Its goal is to analyze machine learning algorithms from a theoretical point of view and to answer questions such as: Can we characterize the behavior of a machine learning algorithm a priori? Can we make fundamental statements regarding when a certain algorithm should or should not be applied, or when it is superior to some other algorithm? How can we give guarantees of the quality of its results and the confidence we have?

## Ordinal data analysis

Often we do not have access to exact distance values or similarity values between objects. Instead, we only have vague information such as "Object A is more similar to object B than to object C". We call such comparison-based data "ordinal" (as opposed to "cardinal"). Ordinal data arises in particular in the context of crowdsourcing or human computing, or in the context of extremely unreliable measurement devices. The fundamental question is whether machine learning algorithms can still be successful if they just get to see ordinal data. From a theoretical point of view, many basic questions in this context are widely open.

## Statistics of complex networks

Many modern data sets have a natural graph structure: social networks (like facebook), the world wide web, interactions between neurons in the brain, sensor networks, etc. Often, such networks arise by random processes. The field of complex network science tries to describe, analyze and model such networks. However, due to the inherent randomness in networks it is not obvious how to distinguish ``random artifacts'' from ``true structure''. We need statistical tools to be able to make this distinction. Unfortunately, the classical methods from statistics cannot be applied to network data (the data is discrete and has too little mathematical structure). One of our research directions is to develop such statistical procedures. Statistical analysis of algorithms on graphs

In machine learning, there exist many algorithms that try to explicitly make use of graph structures in the data: graph partitioning for clustering, label propagation for semi-supervised learning, manifold methods for dimensionality reduction, graph kernels to describe either the similarity of objects in a graph or the similarity between different graphs, etc. In this area it is important to understand how particular algorithms interact with (random) graphs. This raises interesting questions between graph theory, statistics, and geometry. Exactly which graphs do we want to use to model our data? Which properties of graphs are attractive for machine learning? Which ones are misleading? Do algorithms behave differently on different kinds of graphs? How do we know that our algorithms model true underlying structure and not just fit random noise?