SNN Symposium, 17 March 2015

Intelligent Machines 2015

Machine learning and artificial intelligence are becoming increasingly important in business and society. On Tuesday, March 17, SNN organizes a one-day symposium entitled Intelligent Machines, where we present an overview of recent developments in this fast-evolving field. The meeting aims to establish a dialogue and to build connections between academic research, industry and public institutions in the Netherlands.

The program of the day consists of talks by a number of international invited speakers, with plenty of time for networking during poster and exhibition sessions. The aim is to present a comprehensive overview of recent academic and industrial research. The symposium takes place at Concertgebouw De Vereeniging in Nijmegen.

Program & Speakers

09:30-11:00 Zoubin Ghahramani, Cambridge University: The automated statistician
We live in an era of abundant data and there is an increasing need for methods to automate data analysis and statistics. I will describe the "Automatic Statistician", a project which aims to automate the exploratory analysis and modelling of data. Our approach starts by defining a large space of related probabilistic models via a grammar over models, and then uses Bayesian marginal likelihood computations to search over this space for one or a few good models of the data. The aim is to find models which have good predictive performance and are also somewhat interpretable. Our initial work has focused on the learning of unknown nonparametric regression functions, and on learning models of time series data, both using Gaussian processes. Once a good model has been found, the Automatic Statistician generates a natural language summary of the analysis, producing a 10-15 page report with plots and tables describing the analysis. I will discuss challenges such as: how to trade off predictive performance and interpretability, how to translate complex statistical concepts into natural language text that is understandable by a numerate non-statistician, and how to integrate model checking.
11:30-12:15 Daan Wierstra, Google DeepMind: General Learning Algorithms at Google DeepMind
At Google DeepMind, we focus on developing new general learning algorithms alongside applications for them. This talk will provide an overview of our research areas, which range from reinforcement learning and deep learning to generative models, variational inference methods and recurrent neural networks. I will highlight recent research in each topic and present results, including challenging high dimensional control problems, sequence processing tasks and semi-supervised learning. Lastly, I will touch upon our view on the future of artificial intelligence research and its ramifications for industry and academia.
13:30-14:30 The Perils of Artificial Intelligence and Big Data (pdf). A panel discussion led by Max Welling
15:00-15:45 Sethu Vijayakumar, Edinburgh University: Robots that Learn: Old Dreams and New Tools!
Getting robots to learn autonomously has been a long standing dream. With the advent of compact, more reliable sensors and high fidelity actuation systems, we are getting to a stage where, in theory, one can achieve remarkable dexterity and responsiveness. However, there are several open challenges due to the complexity introduced by these high dimensional systems. Machine Learning and data driven techniques can contribute significantly to alleviate this. I will look at novel approaches to relational representations for planning, optimal control and variable impedance optimisation as exemplars of where this is making a difference.
15:45-16:30 Ralf Herbrich, Amazon Research Berlin: Machine Learning at Amazon
In this talk, I will give an overview of the Machine Learning efforts at Amazon - ranging from Forecasting, Recommendation, Search to linking digital media and computer vision. I will highlight the technical difficulties in each of these problem areas and discuss some initial efforts. In particular, I will talk in detail about our approaches to learning to rank and learning time-series in a distributed system.
16:30-17:45 Posters and Reception

Based on our experience with earlier, similar events, we expect many Dutch academic researchers to present their research at the symposium. In addition, the meeting tends to be well attended by industry. Companies are invited to present themselves with a stand. All academic researchers, as well as researchers from industry and business, are invited to present their latest application-oriented research with a poster.

About the organizers

Intelligent Machines is organized by SNN, a non-profit organization that aims to promote research and applications on machine learning and artificial intelligence in the Netherlands. SNN hosts the Dutch Machine Learning Platform. The program has been compiled by Bert Kappen (Radboud University, SNN) with help from Max Welling (University of Amsterdam), Tom Heskes (Radboud University) and John Shawe-Taylor (University College London).

Practical information

Poster Abstracts

Robert Babuska 1, Sander Bohté 2, Guido de Croon 1, Pieter Roelfsema 3, Paul Vogt 4 and Max Welling 5, (1) Delft University of Technology, (2) CWI, (3) Netherlands Institute for Neuroscience, (4) University of Tilburg, (5) University of Amsterdam.
Natural Artificial Intelligence
In the NWO research programme Natural Artificial Intelligence, natural and artificial intelligence come together. The six projects combine insights from neuroscience, psychology and social interaction with the latest developments in the fields of machine learning, neural networks and robotics. Intelligent algorithms are increasingly pervasive. They observe search engine users, listen to voice commands, register our license plates, and vacuum our rooms. They perform specific filtering and detection tasks with superhuman speed and accuracy. Humans, on the other hand, are almost unreasonably good at tasks that intelligent algorithms still fail to do accurately: resolving ambiguities in text, recognizing video and sound, and controlling flexible and variable actuators. All of these tasks are on the agenda of artificial intelligence. Precisely these tasks, where people still outperform computers, are an important source of inspiration for improving computer algorithms.

Tim van Erven, Wojciech Kotlowski and Manfred Warmuth. Leiden University
Sequential Prediction with Dropout Perturbations
We consider sequential prediction with expert advice. This is an online learning setting, in which the data arrive in rounds, one example at a time. Our job in each round is to predict the label for the example that will arrive in that round, just before it arrives. For example, an electricity company might be interested in predicting, every day, the amount of electricity that it needs to produce for the next day. To aid us in our predictions, we assume that K domain experts have already constructed K expert algorithms that provide us with advice: at the start of each round, each expert algorithm gives us its prediction for that round. For example, an electricity company might already have a number of algorithms available; they just do not know which one to use. Our goal is to make predictions that are, on average, essentially as good as the predictions of the expert algorithm that turns out to be the best in hindsight, after having predicted all the data. So, implicitly, we are learning which expert algorithm is the best one. Formally, we measure prediction error by a loss function, and our goal is to minimize the regret, which is the difference between our cumulative loss and the cumulative loss of the best expert. The two most popular algorithms for this setting are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss ("the leader"). To achieve the optimal worst-case regret as a function of the loss L* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader.
By relating our setting to a simple neural network with only a single neuron, we show that our perturbations are an instance of dropout, which is a popular technique to train neural networks. There are very few theoretical results that provide formal performance guarantees for dropout, but in our setting we are able to show that our simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(sqrt(L* ln K) + ln K) regret for any losses, and optimal O(ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. These results have direct applications, for example for electricity companies that want to predict how much electricity to produce every day, and indirect applications to so-called online convex optimization problems, which may be reduced to our setting.
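The dropout idea described above can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation: it assumes binary losses, a dropout probability of 0.5, and that each observed loss is zeroed independently before being added to the running totals. The function name `dropout_follow_the_leader` and its parameters are hypothetical.

```python
import random


def dropout_follow_the_leader(loss_rounds, dropout_prob=0.5, seed=0):
    """Follow the leader on dropout-perturbed cumulative losses.

    loss_rounds: list of rounds, each a list with one binary loss per expert.
    Returns (cumulative loss of the learner, regret w.r.t. the best expert).
    Assumption: every observed loss is kept with probability 1 - dropout_prob
    and set to 0 otherwise; no learning rate is tuned.
    """
    rng = random.Random(seed)
    n_experts = len(loss_rounds[0])
    perturbed_totals = [0.0] * n_experts  # totals the learner acts on
    true_totals = [0.0] * n_experts       # totals used only to measure regret
    learner_loss = 0.0

    for losses in loss_rounds:
        # Pick the leader on perturbed losses, breaking ties at random.
        best = min(perturbed_totals)
        leaders = [k for k, total in enumerate(perturbed_totals) if total == best]
        choice = rng.choice(leaders)
        learner_loss += losses[choice]

        # Reveal the losses and apply dropout before accumulating them.
        for k, loss in enumerate(losses):
            true_totals[k] += loss
            if rng.random() >= dropout_prob:
                perturbed_totals[k] += loss

    return learner_loss, learner_loss - min(true_totals)


# Made-up data: expert 0 never errs, experts 1 and 2 always do.
rounds = [[0, 1, 1]] * 50
print(dropout_follow_the_leader(rounds))
```

The only randomness is in the dropout itself, which is why the sketch has no noise magnitude to tune.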

Wouter Kouw, Marco Loog and Laurens van der Maaten, Delft University of Technology.
Modeling transfer in domain adaptation
Domain adaptation is a pattern recognition setting in which we train a classifier on a labeled dataset in one domain and adapt it to classify an unlabeled set in a different domain. Domains are different probability distributions over a common feature space. For instance, in natural language processing, the probability of using a particular word in a book review is different from the probability of using that word in a movie review. Therefore, book reviews and movie reviews can be considered different domains. Adapting a classifier to perform well in a different domain is in general infeasible, because there is no knowledge of how the labels are distributed in the new domain. Therefore, we assume that there exists some dependency between two domains, i.e. a stochastic mapping from source samples to target samples. We study this dependency and model it with a conditional distribution over target samples given source samples. For cases where our model is correct, we show that our adapted classifier converges to the true target classifier. Furthermore, we investigate how well we can model dependencies by evaluating our adapted classifier on a number of natural datasets and comparing with existing methods.

Marco Cox, Eindhoven University of Technology
Online LDA Model Inference for Collaborative Filtering
Background: We describe and evaluate an online learning algorithm for latent Dirichlet allocation (LDA) in the context of collaborative filtering recommendation systems. LDA is a well known hidden topic model for text document modelling. However, it is also very well suited for modelling users and items in a recommendation system, e.g. to recommend Netflix movies to users based on their past behaviour. The hidden topics are shared among all users and define probability distributions over the item set, which might be interpreted as encoding abstract concepts like 'romance' or 'comedy'. Users are modelled as a mixture of topics, where the mixing weights encode the user's interest in the corresponding topics. Model inference amounts to finding the optimal topics as well as the user-specific mixing weights. The LDA model assigns Dirichlet priors to both the topic definitions and the user interest distributions, which results in Dirichlet posteriors under multinomial likelihood. The posteriors are usually approximated using variational inference or Monte Carlo sampling. Predictions are generated by integrating out the user interest distribution and the hidden topics, which can be done analytically in the case of Dirichlet posteriors.
Contribution: We derive a streaming inference algorithm for LDA that processes mini-batches of user-item co-occurrences (user X viewed/liked item Y) by performing recursive variational Bayesian (VB) updates. In every update, the posterior from the previous step is used as prior, and the new posterior is approximated by iteratively tightening a variational lower bound on the log-likelihood of the current mini-batch until convergence. This approach allows the processing of huge data sets as well as fast model updates as new data comes in, two aspects that are crucial in recommendation systems.
Results: Evaluation on a data set containing 750k movie ratings shows that our streaming VB algorithm gets close to the predictive performance resulting from variational inference on the whole dataset at once, which takes 10-30 times as long depending on the mini-batch size. The impact of the mini-batch size and the number of topics on the predictive performance of the final model is evaluated. The proposed algorithm is also compared to stochastic gradient descent (SGD). Where the streaming VB algorithm incrementally improves the variational approximation to the true posterior, SGD only performs a single step in the direction of the natural gradient. After careful tuning of the SGD step size (learning rate), both algorithms result in very similar predictive performance. In contrast to SGD, the streaming VB algorithm does not have any parameters that have to be tuned, which makes it more robust at the cost of a higher computational load. However, this higher computational load is largely compensated by the need to run SGD multiple times to find an appropriate learning rate. Moreover, SGD cannot be applied to streaming data directly since the size of the data set has to be constant and (approximately) known. The streaming VB algorithm allows the user to define the trade-off between the approximation error and the computational load by setting the stopping condition for the convergence.
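The recursive principle behind the streaming updates (the posterior from the previous mini-batch serves as the prior for the next one) can be sketched in Python for a deliberately simplified case. The sketch assumes a single Dirichlet-multinomial model without hidden topics, so each update is exact rather than variational; it is not the LDA algorithm from the poster. The item indices, mini-batches and function names are made up for illustration.

```python
def dirichlet_update(prior, minibatch):
    """Return Dirichlet posterior parameters after observing item indices."""
    posterior = list(prior)
    for item in minibatch:
        posterior[item] += 1.0  # conjugate update: add one count per observation
    return posterior


def predictive(params):
    """Posterior predictive probability of each item."""
    total = sum(params)
    return [p / total for p in params]


prior = [1.0, 1.0, 1.0]                  # symmetric prior over three hypothetical items
stream = [[0, 0, 2], [1, 0], [2, 2, 2]]  # made-up mini-batches of viewed items

belief = prior
for batch in stream:
    # Yesterday's posterior is today's prior.
    belief = dirichlet_update(belief, batch)

# Processing everything at once gives the same posterior in this conjugate case.
all_at_once = dirichlet_update(prior, [item for batch in stream for item in batch])
print(belief, all_at_once, predictive(belief))
```

In this toy setting the streaming and batch results coincide exactly; with hidden topics, as in LDA, each step is only an approximation, which is where the variational bound in the abstract comes in.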

Peter Grünwald and Thijs van Ommen, CWI
Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It
We empirically show that Bayesian inference can be inconsistent under misspecification in simple linear regression problems, both in a model averaging/selection and in a Bayesian ridge regression setting. We use the standard linear model, which assumes homoskedasticity, whereas the data are heteroskedastic, and observe that the posterior puts its mass on ever more high-dimensional models as the sample size increases. To remedy the problem, we equip the likelihood in Bayes' theorem with an exponent called the learning rate, and we propose the Safe Bayesian method to learn the learning rate from the data. SafeBayes tends to select small learning rates as soon as the standard posterior is not 'cumulatively concentrated', and its results on our data are quite encouraging.
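The effect of raising the likelihood to a learning rate can be seen in a small Python sketch. It assumes the simplest conjugate case, a Gaussian mean with known noise variance, rather than the linear regression models studied in the abstract, and it takes the learning rate `eta` as given: the data-driven choice of the learning rate made by SafeBayes is not shown. The function name and default values are hypothetical.

```python
def tempered_gaussian_posterior(ys, eta, noise_var=1.0, prior_mean=0.0, prior_var=1.0):
    """Posterior over a Gaussian mean when the likelihood is raised to the power eta.

    Raising N(y | theta, noise_var) to the power eta is equivalent to
    inflating the noise variance to noise_var / eta, so the usual
    conjugate formulas apply with the data term scaled by eta.
    Returns (posterior mean, posterior variance).
    """
    precision = 1.0 / prior_var + eta * len(ys) / noise_var
    mean = (prior_mean / prior_var + eta * sum(ys) / noise_var) / precision
    return mean, 1.0 / precision


observations = [2.0, 2.0]  # made-up data
for eta in (1.0, 0.5, 0.0):
    print(eta, tempered_gaussian_posterior(observations, eta))
```

With `eta = 1` this is standard Bayesian updating, and with `eta = 0` the data are ignored and the prior is returned; smaller learning rates therefore make the posterior move more cautiously.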

Ali Hürriyetoglu, Mustafa Erkan Başar, Florian Kunneman, Antal van den Bosch, Radboud University Nijmegen
LAMA Events: Twitter Based Social Event Calendar
Lama Events is a calendar application listing events in the near future. The events are detected and selected by a fully automatic procedure in the Dutch Twitter stream. Lama Events is open-domain. No event keywords are incorporated that might bias the detection to certain types of events. Events that people anticipate, typically social events, are favoured by taking the variety of event references into account. Lama Events is different from other future calendar applications because of its fully automatic operation, and because of its broad focus on any social events. It does not only list big events such as sports games, concerts, or iPhone releases, but it will also catch events that no other calendar lists: smaller-scale regional events. It is a clear example of a useful application based on language technology, and could be easily provided as a service to journalists, companies and the general public. In the digital age, journalists and companies are relying more and more on digital sources for their enquiries. They would be helped a lot if they were provided with an overview of future social events, of any type, that are announced and discussed on Twitter. In addition, it is very amusing for the general public to browse through the diverse set of future events that are presented on the web site.

Carsten Hansen 1, Melanie Tosik 2, Gerard Goossen 1, Chao Li 1, Lena Bayeva 1, Florence Berbain 1, Mihai Rotaru 1, (1) Textkernel, (2) Universität Potsdam.
How to get the best word vectors for resume parsing
Information extraction from CVs (resumes) is one of the success stories of applying NLP in industry. Word type (i.e. word itself) based sequence labeling models (e.g. HMM) are typically used for this task. However, one disadvantage of this approach is its poor generalization to CVs from new sectors due to unknown words. For example, a typical training set contains about 3000 annotated CVs. When parsing CVs from other sectors (e.g. off-shore oil industry), many job titles or company words will be unknown. To solve this, one approach is to annotate more CVs from these sectors but this is an expensive solution. Our solution to the unknown word problem is to replace word types with continuous vector representations of words. For generation of vector representations we use the word2vec package which learns the representations from large amounts of unlabelled data. Previous work has shown that these representations can capture semantic similarities between words (e.g. “director” and “CEO” will be close in the vector space). In this work we study the effect of three parameters that influence the vector generation process: data source, amount of data and vector size. These factors have important practical implications in terms of performance, resource consumption and feasibility of this approach. For data source, we experiment with two different sources: Wikipedia and/or domain specific data (i.e. large collections of unlabelled CVs). We find that the domain specific data yields better performance and this can be improved further by combining all the available data. In the absence of domain specific data, a common situation when starting on a new language, using non-domain data such as Wikipedia provides a good fallback. For data size, our experiments show that including more data results in better performance but this tends to level off. One negative impact of using more data is that the memory footprint grows linearly with the dictionary size. 
Lastly, for vector size, we find an efficient trade-off between dimension size and model performance at 150 dimensions. This vector size seems to be specific to our domain, as larger vector sizes have been found desirable in other, previous applications.
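The linear growth of the memory footprint with dictionary size can be made concrete with a back-of-the-envelope Python sketch. It assumes a dense table that stores one 4-byte float per dimension for every word; the vocabulary sizes below are hypothetical, and only the 150 dimensions come from the abstract.

```python
def embedding_memory_bytes(vocabulary_size, dimensions, bytes_per_value=4):
    """Size of a dense word-vector table, ignoring any per-word overhead."""
    return vocabulary_size * dimensions * bytes_per_value


# Hypothetical dictionary sizes; 150 is the vector size reported above.
for vocabulary_size in (100_000, 1_000_000):
    mebibytes = embedding_memory_bytes(vocabulary_size, 150) / 2**20
    print(f"{vocabulary_size:>9,} words x 150 dimensions: {mebibytes:.1f} MiB")
```

Doubling either the dictionary or the vector size doubles the table, which is why both parameters matter for resource consumption.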

Sep Thijssen, Vicenc Gomez, Andrew Symington, Bert Kappen, Radboud University Nijmegen
Real-Time Stochastic Optimal Control for Multi-agent Quadrotor Swarms
We present a novel method for controlling swarms of unmanned aerial vehicles using Stochastic Optimal Control (SOC) theory. The approach consists of a centralized high-level controller that computes optimal state trajectories as velocity sequences, and a platform-specific low-level controller which ensures that these velocity sequences are met. The high-level control task is expressed as a centralized Path Integral control problem, for which optimal control computation corresponds to a probabilistic inference problem that can be solved by efficient sampling methods. Through simulation we show that our SOC approach (a) has significant benefits compared to deterministic control and other SOC methods in multimodal problems with noise-dependent optimal solutions, (b) is capable of controlling a large number of platforms in real-time, and (c) yields collective emergent behavior in the form of flight formations. Finally, we show that our approach works for real platforms, by controlling a swarm of three quadrotors.
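The link between control computation and sampling can be sketched in Python. In a common formulation of path integral control, a set of noisy rollouts is simulated, each rollout receives a weight proportional to exp(-cost / temperature), and the control is estimated as the weighted average of the sampled noise. Whether the poster uses exactly this estimator is an assumption; the function names, the `temperature` parameter and the numbers are illustrative.

```python
import math


def path_integral_weights(costs, temperature):
    """Normalised importance weights exp(-cost / temperature) for sampled rollouts."""
    lowest = min(costs)  # subtract the minimum cost for numerical stability
    unnormalised = [math.exp(-(cost - lowest) / temperature) for cost in costs]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]


def control_estimate(first_step_noise, costs, temperature):
    """Cost-weighted average of the noise applied at the first step of each rollout."""
    weights = path_integral_weights(costs, temperature)
    return sum(w * noise for w, noise in zip(weights, first_step_noise))


# Made-up rollouts: the noise applied at the first step and the resulting path cost.
noise = [0.8, -0.3, 0.1, -1.2]
costs = [1.0, 4.0, 2.5, 6.0]
print(path_integral_weights(costs, temperature=1.0))
print(control_estimate(noise, costs, temperature=1.0))
```

Low-cost rollouts dominate the average, so the estimated control is pulled towards the noise realisations that happened to work well.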

Decebal Constantin Mocanu 1, Haitham Bou Ammar 2, Dietwig Lowet 3, Kurt Driessens 4, Antonio Liotta 1, Gerhard Weiss 4, Karl Tuyls 5,
(1) Eindhoven University of Technology, (2) University of Pennsylvania, (3) Philips Research, (4) Maastricht University, (5) University of Liverpool.
Factored Four Way Conditional Restricted Boltzmann Machines for Activity Recognition
This paper introduces a new learning algorithm for human activity recognition capable of simultaneous regression and classification. Building upon Conditional Restricted Boltzmann Machines (CRBMs), Factored Four Way Conditional Restricted Boltzmann Machines (FFW-CRBMs) incorporate a new label layer and four-way interactions among the neurons from the different layers. The additional layer gives the classification nodes a multiplicative effect as strong as that of the other layers, and prevents the classification neurons from being overwhelmed by the (much larger set of) other neurons. This makes FFW-CRBMs capable of performing activity recognition, prediction and self auto-evaluation of classification within one unified framework. As a second contribution, Sequential Markov chain Contrastive Divergence (SMcCD) is introduced. SMcCD modifies Contrastive Divergence to compensate for the extra complexity of FFW-CRBMs during training. Two sets of experiments, one on benchmark datasets and one on a robotic platform for smart companions, show the effectiveness of FFW-CRBMs.

Wenjie Pei, Hamdi Dibeklioglu and Laurens v.d. Maaten, Delft University of Technology
Time Series Classification using the Hidden-Unit Logistic Model
Time series classification is the problem of assigning a single label to a sequence of observations (i.e., to a time series). Time series classification has a wide range of applications in computer vision. A state-of-the-art model for the time series classification problem is the hidden-state conditional random field (HCRF), which models latent structure in the data using a chain of k-nomial latent variables. The HCRF has been successfully used in, amongst others, gesture recognition, object recognition, and action recognition. An important limitation of the HCRF is that the number of model parameters grows linearly with the number of latent states in the model. This implies that the training of complex models with a large number of latent states is very prone to overfitting, whilst models with smaller numbers of parameters may be too simple to represent a good classification function. We propose to circumvent this problem of the HCRF by replacing each of the k-nomial latent variables by a collection of H binary stochastic hidden units. To keep inference tractable, the hidden-unit chains are conditionally independent given the time series and the label. Similar ideas have been explored before in discriminative RBMs for standard classification problems and in hidden-unit CRFs for sequence labeling. The binary stochastic hidden units allow the resulting model, which we call the hidden-unit logistic model (HULM), to represent 2^H latent states using only O(H) parameters. This substantially reduces the amount of data needed to successfully train models without overfitting, whilst maintaining the ability to learn complex models with exponentially many latent states. Exact inference in our proposed model is tractable, which makes parameter learning via (stochastic) gradient descent very efficient.
We show the merits of our hidden-unit logistic model in experiments on computer-vision tasks ranging from online character recognition to activity recognition and facial expression analysis. Moreover, we present a system for facial action unit detection that, with the help of the hidden-unit logistic model, achieves state-of-the-art performance on a commonly used benchmark for facial analysis.

Sultan Imangaliyev, Academisch Centrum Tandheelkunde Amsterdam
Deep Learning of Human Microbiome in Health and Disease
Research on the human microbiome has seen dramatic growth over the past decade in terms of new available data as well as new computational approaches for studying microbial compositions and their associations with health or disease status. Frequently, we need to apply and develop novel methods to analyze diverse and high-dimensional metagenomic datasets because standard techniques do not lead to satisfactory results. In this work, we turn to modern statistical machine learning algorithms, namely deep neural networks, to build state-of-the-art predictive models of microbial composition and its association with health or environmental factors. Deep learning, which involves training artificial neural networks with many layers, has become one of the most significant recent developments in machine learning. Deep learning has recently been demonstrated to work particularly well on complex, high-dimensional datasets in a variety of domains such as natural language processing and computer vision. To demonstrate the efficacy of our deep learning algorithm on metagenome data, we use several biomedical datasets, including the one from the National Institutes of Health Human Microbiome Project (NIH HMP) study, which is publicly available. We show that our method is well suited for learning on complex metagenome datasets and that it notably outperforms standard statistical methods in modelling microbial composition profiles.

Taygun Kekeç, Laurens van der Maaten and David Tax, Delft University of Technology
MPLBL Language Model
Word embedding models learn a distributed vectorial representation for words, which can be used as the basis for (deep) learning models to solve a variety of natural language processing tasks. One of the main disadvantages of current word embedding models is that they learn a single representation for each word in a metric space, as a result of which they cannot appropriately model polysemous words. In this work, we develop a new word embedding model that can accurately represent polysemous words by automatically learning multiple representations for each word, whilst remaining computationally efficient. Without any supervision, our model learns multiple, complementary embeddings that all capture different semantic structure. We demonstrate the potential merits of our model by training it on large text corpora, and evaluating it in a word prediction and a word similarity task.

Max Hinne, Ronald J. Janssen, Tom Heskes and Marcel A.J. van Gerven, Radboud University Nijmegen
Bayesian data fusion for brain connectivity
In the last decade, neuroscience has devoted much attention to the study of brain connectivity. Typically, a distinction is made between functional connectivity, which is concerned with the correlated activity between neuronal populations in spatially segregated regions of the brain, and structural connectivity, which attempts to identify the anatomical fiber bundles that wire these regions together. The former, functional connectivity, may be studied using functional magnetic resonance imaging (fMRI). Coupled activity between areas is conveniently expressed using covariance, but this measure fails to distinguish between direct and indirect effects. A popular alternative that addresses this issue is partial correlation, which regresses out the signal of potentially confounding variables, resulting in a measure that reveals only direct connections. Partial correlation provides an important link with conditional independence, as the independence between two variables (i.e. brain regions), conditioned on all other variables, implies that their partial correlation is zero. The latter, structural connectivity, is studied through diffusion weighted MRI (dwMRI) and a mapping of the fiber distributions, a process called tractography. In this paper, we propose a Bayesian generative model that integrates the estimation of functional and structural connectivity, by equating the conditional independence structure with anatomical connectivity. The intuition behind this idea is simple: two regions can only be conditionally dependent if they are directly connected through a fiber bundle. Because the two sources of data, fMRI and dwMRI, are integrated in the generative model, they inform our estimates better than if we were to estimate functional connectivity using fMRI and structural connectivity using dwMRI separately. Additional benefits follow from our Bayesian approach.
Instead of obtaining point estimates, we now have access to the posterior distribution, allowing us to quantify the uncertainty associated with our results. This reveals that while we are able to infer a clear backbone of connectivity, the data is not accurately described by simply looking at the mode of the distribution. The implication of this is that deterministic alternatives to estimating connectivity may misjudge results by drawing conclusions from noisy and limited data.
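The difference between covariance and partial correlation mentioned above can be illustrated with a minimal Python sketch. It assumes three hypothetical regions A, B and C arranged in a chain, where A and C interact only through B, and uses the standard first-order partial correlation formula; the covariance values are made up and this is not the Bayesian model from the poster.

```python
import math


def correlation(cov, i, j):
    """Pearson correlation between variables i and j from a covariance matrix."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])


def partial_correlation(cov, i, j, k):
    """Correlation between i and j after regressing out the single variable k."""
    r_ij = correlation(cov, i, j)
    r_ik = correlation(cov, i, k)
    r_jk = correlation(cov, j, k)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))


# Hypothetical covariance of a chain A - B - C (no direct A - C connection).
cov = [
    [0.75, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 0.75],
]

print("correlation A-C:            ", correlation(cov, 0, 2))
print("partial correlation A-C | B:", partial_correlation(cov, 0, 2, 1))
print("partial correlation A-B | C:", partial_correlation(cov, 0, 1, 2))
```

A and C are clearly correlated, yet their partial correlation given B vanishes (up to rounding), while the directly connected pair A and B keeps a non-zero partial correlation.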

Elena Mocanu, Phuong H. Nguyen, Madeleine Gibescu, Wil Kling, Eindhoven University of Technology
Deep Learning to estimate building energy demands in the smart grid context
Prediction of temporal energy consumption plays an essential role in the current transition to future energy systems. Quantification of the uncertainty introduced with the advent of new renewable energy sources only strengthens the role of accurate prediction methods, so that they can later be included in more complex decision-making processes able to control and plan energy consumption. At the same time, these methods should be easily expandable to higher levels of aggregation such as neighborhoods and the power distribution grid. Many approaches have been proposed aiming at accurate and robust prediction of energy consumption. For the purpose of this presentation, two different Deep Learning methods are detailed. Additionally, these complex neural network methods are compared with a widely used method, namely the Artificial Neural Network, which is able to faithfully reproduce the energy consumption of buildings under various time horizons.

Davide Zambrano 1, Pieter R. Roelfsema 2, Sander M. Bohte 1, (1) Centrum Wiskunde & Informatica (CWI), Amsterdam, (2) Netherlands Institute for Neuroscience (KNAW), Amsterdam.
Continuous-time neural reinforcement learning of working memory tasks
A self-driving car travels along the way when suddenly a man crosses the street. The car has to stop immediately to avoid the impact. This is an example of the complexity of the environment that we live in, where we need to respond effectively to many novel and unexpected events. To achieve this, we are able to learn to recognize an event or a sequence of events and also learn to respond properly: the right actions in the right order at the right time, in many cases just from very limited information about success or failure. Despite advances in machine learning, current cognitive robotic systems are not able to learn from trial-and-error to rapidly and efficiently respond in the real world: the challenge is to learn to recognize both what is important, and also when to act. In this work, starting from machine learning principles as well as neurobiology, we present a neural network model that is capable of learning both what and when. For an animal, or agent, to learn to respond quickly (the "when"), it has to sample the environment often enough. After each sample, the agent has to decide whether to act, that is, to select an action. To sample often “enough”, a programmer has to decide on the step-size as a time-representation, choosing between a fine-grained representation of time and a coarse temporal resolution. The former corresponds to having to learn long sequences, which is difficult in trial-and-error learning, while for the latter, action sequences are shorter and thus easier to learn, at the expense of precise timing. For a learning self-driving car, this means it will either be very difficult to learn to avoid unexpected obstacles, or it will respond too late and hit the man. We developed a continuous-time version of on-policy trial-and-error learning, formally known as Reinforcement Learning (RL), in a working-memory neural network model, AuGMEnT.
On-policy RL methods take the action selection into account and learn to minimize the risk of incurring large penalties. AuGMEnT uses working memory to learn useful sensory representations, and can efficiently solve non-linear RL problems, including those that require past information to be remembered. We demonstrate how we can decouple action duration from the internal time-steps in the neural RL model using an action selection system. In this way, actions that are being executed can be interrupted if another action is more important or urgent. The continuous-time framework changes the way standard RL problems are defined: it defines time as an intrinsic property of the task and it considers unavoidable delays in action selection and execution. The resultant CT-AuGMEnT neural network model successfully learns to react to the events of a continuous-time task, without any pre-imposed specifications about the duration of the events or the delays between them. Effectively, CT-AuGMEnT allows autonomous agents like a self-driving car to learn to recognize important events given only very limited feedback (like braking or accelerating), while doing so rapidly and effectively.
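As a point of reference for the on-policy learning mentioned above, the sketch below shows a plain tabular SARSA update, the textbook discrete-time on-policy rule. It is not the CT-AuGMEnT network: the Q-table layout, the state and action names, and the values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict

def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One tabular SARSA step: move Q(s, a) toward r + gamma * Q(s', a')."""
    # On-policy: the target uses the action the agent actually selected next,
    # not the greedy maximum over next actions (as Q-learning would).
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)          # hypothetical Q-table, all values start at 0
q[("obstacle", "brake")] = 1.0  # made-up value for the follow-up state
new_value = sarsa_update(q, "driving", "wait", 0.0, "obstacle", "brake")
```

Because the target depends on the action actually taken next, exploratory mistakes feed back into the value estimates, which is why on-policy methods tend to steer away from states where a slip is costly.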

Fabian Gieseke, Radboud University Nijmegen
Big Data Analytics in Astronomy
Many scientific and industrial fields are nowadays faced with huge amounts of data. A prominent example is astronomy: current projects such as the Sloan Digital Sky Survey (SDSS) gather terabytes of data every month. Upcoming ones such as the Large Synoptic Survey Telescope (LSST) or the Square Kilometre Array (SKA) will produce such data volumes per night or even per hour, and the final databases will contain data volumes in the peta- and exabyte range. The field of data analytics aims at constructing models that can retrieve useful information in an automatic manner. A typical example is the classification of objects, where the goal is to generate models that can automatically assign classes to new data items (e.g., star/galaxy classification). Usually, taking more data into account improves a model's performance. However, in most cases, this also leads to a significant increase in the runtime for generating or applying such models. A recent trend in data analytics is to make use of graphics processing units (GPUs) to speed up the computations involved. Such devices, previously only used in the context of computer graphics, offer massive parallelism and can nowadays also be used to accelerate general-purpose computations. This poster presents some recent work related to massively parallel implementations that are adapted to the specific needs of today’s GPUs. In particular, a variant of k-d trees is described, which can be used to significantly reduce the runtime needed for the application of models that aim at detecting the most distant galaxies that can be observed from Earth.
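To make the k-d tree idea concrete, here is a minimal sequential sketch of a classic k-d tree with nearest-neighbour search in plain Python. It is not the GPU variant presented on the poster; the tuple-based node layout and the small made-up `catalogue` are assumptions chosen for readability.

```python
def build_kdtree(points, depth=0):
    """Recursively build a k-d tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # the median splits the set in half
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, depth=0, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(point, query) < sq_dist(best, query):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only descend into the far side if the splitting plane is closer
    # than the best match found so far.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

catalogue = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(catalogue)
match = nearest(tree, (9.0, 2.0))
```

The pruning test in `nearest` is what saves time: whole subtrees are skipped whenever they cannot contain a closer point, so a query typically visits far fewer points than a brute-force scan.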

Umut Güçlü and Marcel A.J. van Gerven, Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour
Neural encoding and decoding with deep convolutional neural networks
Human beings are extremely adept at recognizing complex objects based on elementary visual sensations. Object recognition appears to be solved in the primate brain via a cascade of neural computations along the visual ventral stream that represents increasingly complex stimulus features, which derive from the retinal input. That is, neurons in early visual areas have small receptive fields and respond to simple features such as edge orientation, whereas neurons further along the ventral pathway have larger receptive fields, are more invariant to transformations and can be selective for complex shapes. Despite converging evidence concerning the steady progression in feature complexity along the ventral stream, this progression has never been properly quantified across multiple regions in the human ventral stream. Furthermore, while the receptive fields in early visual area V1 have been properly characterized in terms of preferred orientation, location and spatial frequency, exactly what stimulus features are represented in downstream areas is more heavily debated. In order to isolate how stimulus features at different representational complexities are represented across the cortical sheet, we made use of a deep convolutional neural network (CNN). Deep CNNs consist of multiple layers where deeper layers can be shown to respond to increasingly complex stimulus features and provide state-of-the-art object recognition performance in computer vision. We used the representations that emerge after training a deep CNN in order to predict blood-oxygen-level dependent (BOLD) hemodynamic responses to complex naturalistic stimuli in progressively downstream areas of the ventral stream, moving from striate area V1 along extrastriate areas V2 and V4, all the way up to area LO in posterior inferior temporal (IT) cortex. We used individual layers of the neural network to predict single voxel responses to natural images. 
This allowed us to isolate different voxel groups, whose responses are best predicted by a particular layer in the neural network. Using this approach, we determined how layer depth correlates with the position of voxels in the visual hierarchy. Furthermore, by testing to what extent individual features in the neural network can predict voxel responses, we mapped how individual low-, mid- and high-level stimulus features are represented across the ventral stream. This provides a unique and fully automated approach to determine how stimulus features of increasing complexity are represented across the visual stream. Finally, we showed that the predictions of neural responses afforded by our framework give rise to state-of-the-art decoding performance, allowing identification of perceived stimuli from observed BOLD responses.
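The voxel-to-layer assignment described above can be illustrated with a small sketch. It assumes that each layer's encoding model has already produced predicted responses for a set of held-out images, and that Pearson correlation is the measure of prediction quality; both the measure and all numbers below are illustrative assumptions, not the authors' pipeline.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_layer(measured, predictions_per_layer):
    """Index of the layer whose predictions correlate best with the voxel."""
    scores = [pearson(pred, measured) for pred in predictions_per_layer]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up responses of one voxel to five held-out images, and made-up
# predictions from encoding models built on three different layers.
voxel = [0.1, 0.9, 0.4, 0.8, 0.2]
predictions = [
    [0.5, 0.4, 0.6, 0.5, 0.5],   # layer 0
    [0.2, 0.8, 0.5, 0.7, 0.1],   # layer 1
    [0.9, 0.1, 0.6, 0.3, 0.8],   # layer 2
]
assigned = best_layer(voxel, predictions)
```

Repeating this for every voxel gives a map of preferred layer depth across the cortex, which can then be compared with each voxel's position in the visual hierarchy.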

Jesse Krijthe and Marco Loog, Delft University of Technology
Implicitly Constrained Semi-Supervised Linear Discriminant Analysis
In many machine learning tasks, apart from a set of labeled data, a large amount of unlabeled observations is often available. The goal of semi-supervised learning is to use this unlabeled data to improve the supervised classification or regression model that was learned based on the labeled data alone. For classification using linear discriminant analysis (LDA) specifically, several semi-supervised variants have been proposed. Using any one of these methods is, however, not guaranteed to outperform the supervised classifier which does not take the additional unlabeled data into account. They may, in fact, reduce performance. To counter this problem, [2] introduced moment constrained LDA, which offers a more robust type of semi-supervised LDA. This approach required the identification of specific constraints that link parameter estimates that rely on the labeled data to parameters that do not rely on the labels. Ideally, we would like these constraints to emerge implicitly from the choice of the supervised learning model and a given set of unlabeled objects. Implicitly constrained semi-supervised learning, introduced in [3], attempts to do just that. The underlying intuition is that if we could enumerate all possible labelings of the unlabeled data, and train the corresponding classifiers, the classifier based on the true but unknown labels would be in this set. This classifier would generally outperform the supervised classifier. In practice, however, we cannot enumerate all possible labelings, nor do we know which one corresponds to the true labeling. One way to judge how well any of these classifiers is going to perform is to estimate its performance using the supervised objective function evaluated on the labeled objects alone.
Based on this objective, it turns out one can efficiently find the optimal classifier in this set of possible classifiers by allowing for soft label assignments to the unlabeled objects. This leads to a convex optimization problem that can be solved using a simple bounded gradient descent procedure. We compare the constraint-based approaches to other semi-supervised methods, in particular expectation maximization and self-learning. We also consider the question of whether, and in what sense, we can expect improvement in performance over the supervised procedure. The main conclusion from these analyses is that the constraint-based approaches are more robust to misspecification of the original supervised model, and may outperform alternatives that make more assumptions about the data, in particular when performance is measured in terms of the log-likelihood of unseen objects. This work was presented in [1], while the idea of implicitly constrained learning, applied to the least squares classifier, is described in [3].
[1] Krijthe, J. H., and Loog, M. "Implicitly Constrained Semi-Supervised Linear Discriminant Analysis." 22nd International Conference on Pattern Recognition (ICPR). IEEE, 2014.
[2] Loog, M. "Semi-supervised linear discriminant analysis through moment-constraint parameter estimation." Pattern Recognition Letters 37 (2014): 24-31.
[3] Krijthe, J. H., and Loog, M. "Implicitly Constrained Semi-Supervised Least Squares Classification." Tech. Rep., 2013.
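The soft-label search can be sketched in a deliberately small setting. The example below uses a one-dimensional least-squares classifier without intercept (in the spirit of [3], not the LDA variant of [1]) and projected gradient descent on the soft labels; the function name `icls_1d`, the 0/1 label encoding, the step size and the data are all assumptions made for illustration.

```python
def icls_1d(x_lab, y_lab, x_unl, steps=200, lr=1.0):
    """Soft-label search for a 1-D least-squares classifier without intercept."""
    s_all = sum(x * x for x in x_lab) + sum(x * x for x in x_unl)
    q = [0.5] * len(x_unl)                      # soft labels in [0, 1]

    def weight(q):
        # Least-squares fit on labeled data plus soft-labeled unlabeled data.
        return (sum(x * y for x, y in zip(x_lab, y_lab))
                + sum(x * qj for x, qj in zip(x_unl, q))) / s_all

    def labeled_loss(w):
        # Supervised objective: squared error on the labeled objects only.
        return sum((x * w - y) ** 2 for x, y in zip(x_lab, y_lab))

    history = [labeled_loss(weight(q))]
    for _ in range(steps):
        w = weight(q)
        dloss_dw = 2.0 * sum((x * w - y) * x for x, y in zip(x_lab, y_lab))
        # Chain rule: dw/dq_j = x_j / s_all; then project back onto [0, 1].
        q = [min(1.0, max(0.0, qj - lr * dloss_dw * xj / s_all))
             for qj, xj in zip(q, x_unl)]
        history.append(labeled_loss(weight(q)))
    return weight(q), q, history

w, soft_labels, history = icls_1d(
    x_lab=[-1.0, 1.5], y_lab=[0.0, 1.0], x_unl=[-2.0, -0.5, 1.0, 2.5])
```

The clipping step is the "bounded" part of the procedure: the soft labels never leave the interval [0, 1], so the resulting classifier always corresponds to some (soft) labeling of the unlabeled objects.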

Klamer Schutte, Henri Bouma, John Schavemaker, Laura Daniele, Maya Sappelli, Gijs Koot, Pieter Eendebak, George Azzopardi, Martijn Spitters, Maaike de Boer, Maarten Kruithof, Paul Brandt, TNO.
Interactive detection of incrementally learned concepts in images with ranking and semantic query interpretation
The number of networked cameras is growing exponentially. Multiple applications in different domains result in an increasing need to search semantically over video sensor data. In this paper, we present the GOOSE demonstrator, which is a real-time general-purpose search engine that allows users to pose natural language queries to retrieve corresponding images. Top-down, this demonstrator interprets queries, which are presented as an intuitive graph to collect user feedback. Bottom-up, the system automatically recognizes and localizes concepts in images and it can incrementally learn novel concepts. A smart ranking combines both and allows effective retrieval of relevant images.

Nanne van Noord, Tilburg University
Towards discovery of the artist’s style
Author attribution through the recognition of visual characteristics is a commonly used approach by art experts. By studying a vast number of artworks, art experts acquire the ability to recognise the unique characteristics of artists. In this paper we present an approach that uses the same principles in order to discover the characteristic features that determine an artist’s touch. By training a Convolutional Neural Network (PigeoNET) on a large collection of digitised artworks to perform the task of automatic artist attribution, the network is encouraged to discover artist-specific visual features. The trained network is shown to be capable of attributing previously unseen artworks to the actual artists with an accuracy of more than 70%. In addition, the trained network provides fine-grained information about the artist-specific characteristics of spatial regions within the artworks. We demonstrate this ability by means of a single artwork that combines characteristics of two closely collaborating artists. PigeoNET generates a visualisation that indicates, for each location on the artwork, which artist is most likely to have contributed to the visual characteristics at that location. We conclude that PigeoNET represents a fruitful approach for the future of computer-supported examination of artworks.

Ruud Mattheij, Tilburg University
Improved Body-Part Detection with Microsoft Kinect
In the last few years, the automatic detection of objects from digital video and image sources has gained considerable attention within the field of image analysis and understanding. Many object-detection approaches focus on two-dimensional visual features in order to segregate objects from their backgrounds. Despite the widespread and successful use of two-dimensional (2D) visual features in visual detection tasks, they have some limitations. Their main limitation is that they typically respond to local visual transitions without taking the larger spatial context into account. As a consequence, they are sensitive to local changes in scene properties, such as illumination conditions. Employing depth cues may help to overcome the limitations of local 2D features. Depth cues can provide contextual information for a scene, thereby facilitating image segmentation. Indeed, visual objects such as faces or persons are much easier to distinguish in a 3D space than from a 2D image. In recent years, the use of depth cues has become feasible through the development of affordable depth sensors, such as the Microsoft Kinect. For instance, researchers from Microsoft Research proposed a state-of-the-art depth-based detector that is able to detect body parts in depth images generated by a Kinect device. Their algorithm relies on the comparison of depth values of pixel pairs, which makes it fast and computationally efficient. However, the detection speed of the algorithm comes at the cost of accuracy. The classification accuracy is hampered by two limitations: (1) the limited quality of the depth images generated by the Kinect device, and (2) the limited resolution of the depth images. These two limitations result in noisy measurements, particularly when detecting body parts at larger distances. Our poster presents an improvement of the object detection algorithm used in the Microsoft Kinect device in the form of a new feature computation method, which is both fast and accurate.
It is able to deal efficiently with the background noise in the depth images that are generated by the Kinect device. Our improvement relies on the comparison of regions, rather than on the comparison of individual pairs of pixel values. Inspired by the work of Viola and Jones, Haar-like region features are used to detect transitions in adjacent regions of depth images. The resulting region comparison features are well suited to dealing with the noise in depth images because they average over large groups of pixels. In a comparative evaluation, our detector and an implementation of the Kinect detection algorithm (Microsoft Research) are trained and evaluated on three challenging object detection experiments: two face detection tasks and a person detection task. The results of our evaluation show that our approach outperforms the Kinect algorithm in both detection accuracy and prediction time, especially when processing noisy depth images.
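A region comparison of this kind can be computed cheaply with an integral image (summed-area table), as in the Viola and Jones detector. The following is a minimal sketch of a two-rectangle Haar-like feature on a depth map; the helper names and the made-up 4x4 depth values are assumptions, and the sketch is not the detector evaluated on the poster.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def region_mean(ii, top, left, height, width):
    """Mean of a rectangle using four table lookups."""
    total = (ii[top + height][left + width] - ii[top][left + width]
             - ii[top + height][left] + ii[top][left])
    return total / (height * width)

def two_rect_feature(ii, top, left, height, width):
    """Haar-like feature: mean depth of the left half minus the right half."""
    half = width // 2
    return (region_mean(ii, top, left, height, half)
            - region_mean(ii, top, left + half, height, half))

# Made-up 4x4 depth map (metres): a near object on the left, a far wall on
# the right, and one noisy pixel in the near region.
depth = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.4, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
]
ii = integral_image(depth)
feature = two_rect_feature(ii, 0, 0, 4, 4)
```

In this example the single noisy pixel shifts the left-region mean by only 0.05 m, whereas a comparison of two individual pixels that happened to include it would be off by the full 0.4 m.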

Dominik Thalmeier 1, Marvin Uhlmann, Hilbert J. Kappen, Raoul-Martin Memmesheimer, (1) Radboud University Nijmegen
Learning universal computations with spikes
Providing the neurobiological basis of information processing in higher animals, spiking neural networks must be able to learn a variety of complicated computations, including the generation of appropriate, possibly delayed reactions to inputs and the self-sustained generation of complex activity patterns, e.g. for locomotion. Many such computations require that intrinsic world models be built beforehand. Here we show how spiking neural networks may solve these different tasks. First, we derive constraints under which classes of spiking neural networks lend themselves as substrates for powerful general-purpose computing. The networks contain dendritic or synaptic nonlinearities and have a constrained connectivity. We then combine such networks with learning rules for outputs or recurrent connections. We show that this allows them to learn even difficult benchmark tasks such as the self-sustained generation of desired low-dimensional chaotic dynamics or memory-dependent computations. Furthermore, we show how spiking networks can build models of external world systems and use the acquired knowledge to control them.

Thomas Mensink, Efstratios Gavves, and Cees G.M. Snoek, ISLA, Informatics Institute, University of Amsterdam
COSTA: Co-Occurrence Statistics for Zero-Shot Classification
In this paper we aim for zero-shot classification, that is, visual recognition of an unseen class by using knowledge transfer from known classes. Our main contribution is COSTA, which exploits co-occurrences of visual concepts in images for knowledge transfer. These inter-dependencies arise naturally between concepts, and are easy to obtain from existing annotations or web-search hit counts. We estimate a classifier for a new label as a weighted combination of related classes, using the co-occurrences to define the weights. We propose various metrics to leverage these co-occurrences, and a regression model for learning a weight for each related class. We also show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three multi-labeled datasets reveal that our proposed zero-shot methods approach, and occasionally outperform, fully supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification.
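The weighted-combination idea can be sketched as follows. The example assumes linear classifiers for the known labels and uses normalised raw co-occurrence counts as weights; that normalisation is only one simple choice made here for illustration (the abstract mentions several metrics and a learned regression), and all labels, weight vectors and counts are made up.

```python
def zero_shot_classifier(cooccurrence, known_classifiers):
    """Weighted combination of known-class weight vectors for an unseen label."""
    total = sum(cooccurrence.values())
    dim = len(next(iter(known_classifiers.values())))
    combined = [0.0] * dim
    for label, count in cooccurrence.items():
        weight = count / total          # assumed: normalised co-occurrence count
        for i, value in enumerate(known_classifiers[label]):
            combined[i] += weight * value
    return combined

def score(classifier, features):
    """Linear classifier score for one feature vector."""
    return sum(w * x for w, x in zip(classifier, features))

# Made-up linear classifiers for known labels, and made-up counts of how
# often each known label co-occurs with the unseen label "car".
known = {"road": [1.0, 0.0, 0.5], "wheel": [0.0, 2.0, 0.0], "tree": [0.0, 0.0, 1.0]}
counts_with_car = {"road": 60, "wheel": 30, "tree": 10}
car = zero_shot_classifier(counts_with_car, known)
car_score = score(car, [1.0, 1.0, 0.0])
```

No image labeled "car" is needed at any point: the new classifier is assembled entirely from classifiers that already exist and from statistics about label co-occurrence.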

Tameem Adel, Institute for Computing and Information Science (iCIS) of the Faculty of Sciences, Radboud University Nijmegen
A Probabilistic Covariate Shift Assumption for Domain Adaptation
The aim of domain adaptation algorithms is to establish a learner, trained on labeled data from a source domain, that can classify samples from a target domain, in which few or no labeled data are available for training. Covariate shift, a primary assumption in several works on domain adaptation, assumes that the labeling functions of source and target domains are identical. We present a domain adaptation algorithm that assumes a relaxed version of covariate shift where the assumption that the labeling functions of the source and target domains are identical holds with a certain probability. Assuming a source deterministic large margin binary classifier, the farther a target instance is from the source decision boundary, the higher the probability that covariate shift holds. In this context, given a target unlabeled sample and no target labeled data, we develop a domain adaptation algorithm that bases its labeling decisions both on the source learner and on the similarities between the target unlabeled instances. The source labeling function decisions associated with probabilistic covariate shift, along with the target similarities are concurrently expressed on a similarity graph. We evaluate our proposed algorithm on a benchmark sentiment analysis (and domain adaptation) dataset, where state-of-the-art adaptation results are achieved. We also derive a lower bound on the performance of the algorithm.

The meeting is sponsored by SNN Adaptive Intelligence, SMART Research BV, the NWO program on Natural Artificial Intelligence and SIKS.