
SS 2017

Online platform for the intelligent and structured acquisition of high-quality biomedical labels

In the life sciences, vast amounts of raw data are being accumulated. At the same time, advanced data science techniques such as deep learning are becoming more accessible. In general, supervised learning methods are capable of handling, processing and ultimately extracting meaningful information from these large-scale datasets. However, the available biomedical data is mainly unstructured, and large-scale, high-quality, correctly labeled training data is lacking. This issue affects nearly all applications of supervised machine learning in the life sciences, from automatic image interpretation to electronic health records, genomics, and objective diagnostic or monitoring tools. While the challenges of training complex, high-parameter deep learning models from a limited number of samples are obvious, uncertainty in the labels of those samples can be just as problematic. In this project, students developed a structured, formalized framework for a closed-loop machine learning system, which includes expert label collection and the computation of label meta-information (such as quality measures). Different (semi-)supervised machine learning approaches that leverage those labels were compared. Finally, the system incorporates an active-learning feedback loop which partly steers the label collection component. The work was evaluated on a clinical research use case in the field of Parkinson's disease.
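The active-learning feedback loop can be illustrated with uncertainty sampling, one common selection strategy. The following is only a minimal sketch under stated assumptions: the project description does not say which strategy was used, and the sample IDs, probabilities and function names are hypothetical.

```python
def uncertainty(p):
    """Uncertainty of a binary prediction: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def select_for_labeling(probabilities, budget):
    """Return the IDs of the `budget` samples the model is least sure about."""
    ranked = sorted(
        probabilities,
        key=lambda sample_id: uncertainty(probabilities[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs: sample ID -> predicted probability of the positive class.
predicted = {"s1": 0.95, "s2": 0.52, "s3": 0.10, "s4": 0.40}

# The selected samples would be sent to the experts for labeling next.
to_label = select_for_labeling(predicted, budget=2)
print(to_label)
```

In such a loop, the newly collected expert labels would be added to the training set and the model retrained before the next selection round.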

Parallel best prediction model search

The need for parallel architectures is often caused by huge datasets that do not fit into the memory of a single machine. However, because of the variety of available algorithms and the need for parameter search and cross-validation, the search for the best solution can also take a long time on one machine. Therefore, "small data" projects can take advantage of parallel computation as well to accelerate the search for the best solution. In this project, students developed a web-based system for distributed machine learning. Using the web user interface, a user uploads a dataset and formulates the problem. The system trains different models with different parameters in parallel on a cluster of computers and reports the best score, model and parameter set to the user. The students evaluated the system on a real-life task: the prediction of travel time.
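The idea of scoring many parameter settings in parallel and reporting the best one can be sketched as follows. This is an illustrative sketch, not the students' system: a local thread pool stands in for the cluster workers, the model is a toy k-nearest-neighbour regressor, and the data and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical toy data: (distance_km, travel_time_min) pairs.
DATA = [(float(d), 2.0 * d + (1.0 if d % 2 else -1.0)) for d in range(1, 21)]


def knn_predict(train, x, k):
    """Predict the target for x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)


def cv_error(k, folds=4):
    """Mean absolute error of k-NN under a simple k-fold cross-validation."""
    errors = []
    for fold in range(folds):
        test = DATA[fold::folds]
        train = [p for i, p in enumerate(DATA) if i % folds != fold]
        errors.extend(abs(knn_predict(train, x, k) - y) for x, y in test)
    return mean(errors)


def search(candidates, workers=4):
    """Score every candidate in parallel and return (best_k, best_error)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(cv_error, candidates))
    return min(zip(candidates, scores), key=lambda pair: pair[1])


CANDIDATES = [1, 2, 3, 5, 8]
best_k, best_error = search(CANDIDATES)
print(best_k, best_error)
```

Each candidate is evaluated independently, so the same pattern carries over to workers on separate machines; only the executor would need to be replaced by a distributed one.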

Website similarity search engine

Graphs are a common way to describe interactions in real life. The successes of machine learning in image and natural language processing have recently motivated the application of these approaches to graphs. However, new research methods are evaluated on only a few benchmark datasets. Therefore, the scalability and the usefulness of these methods in real-life settings are questionable. In this project, students developed a web-based system for website similarity search. The system performs similarity search on website representations learned with state-of-the-art machine learning techniques. The most expensive step, the learning of the website representations, is parallelized and implemented on the MapReduce-style system Flink.
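Once a vector representation has been learned for every website, similarity search reduces to comparing vectors. A minimal sketch, assuming cosine similarity as the measure (the description does not name one) and using made-up site names and embeddings:

```python
import math

# Hypothetical embeddings; in the project these would be learned from the web graph.
EMBEDDINGS = {
    "news-a.example": [0.9, 0.1, 0.0],
    "news-b.example": [0.8, 0.2, 0.1],
    "shop.example": [0.1, 0.9, 0.2],
    "blog.example": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(site, top_n=2):
    """Return the top_n (site, similarity) pairs most similar to the given site."""
    query = EMBEDDINGS[site]
    scored = [
        (other, cosine(query, vector))
        for other, vector in EMBEDDINGS.items()
        if other != site
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


results = most_similar("news-a.example")
print(results)
```

This brute-force scan is linear in the number of websites; a system at web scale would typically use an approximate nearest-neighbour index instead.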

MapReduce based search engine

In this project, students developed a web-based system for text search. The search engine uses state-of-the-art algorithms for indexing, importance ranking and synonym matching. To accelerate search and preprocessing, the students designed parallel versions of these approaches and implemented them on a MapReduce-based system.
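Indexing fits the MapReduce model naturally: mappers emit (term, document) pairs and reducers group them into postings lists. The single-process sketch below illustrates this together with a simple TF-IDF ranking. It is an assumption-laden illustration, not the students' implementation: the documents, the scoring formula and all function names are hypothetical, and synonym matching is omitted.

```python
import math
from collections import defaultdict

# Hypothetical document collection.
DOCS = {
    "d1": "parallel search engine on mapreduce",
    "d2": "subspace clustering of large datasets",
    "d3": "text search with an inverted index",
}


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per token, as a mapper would."""
    return [(term, doc_id) for term in text.lower().split()]


def reduce_phase(pairs):
    """Group pairs by term into postings: term -> {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for term, doc_id in pairs:
        index[term][doc_id] += 1
    return index


def build_index(docs):
    """Run the map and reduce phases sequentially over all documents."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return reduce_phase(pairs)


def search(index, query, n_docs):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


index = build_index(DOCS)
print(search(index, "search index", len(DOCS)))
```

On a real MapReduce system, the framework's shuffle step would perform the grouping by term between the two phases, so mappers and reducers can run on different machines.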

Parallel subspace clustering framework

Fast-growing datasets with a very high number of attributes are becoming common in social, industrial and scientific domains. A meaningful analysis of these datasets requires sophisticated data mining techniques, such as projected clustering, that are able to deal with such complex data. In this project, students developed a framework for implementing MapReduce-based algorithms for subspace clustering of large datasets. Using this framework, they implemented the state-of-the-art projected clustering algorithm P3C+MR, which is tailored to MapReduce systems. The algorithm was evaluated on various use cases.