Not-so-supervised learning of algorithms

About a month ago I gave a talk at UC Dublin, titled “Not-so-supervised learning of algorithms and academics”. I talked both about my research and about things I’ve learned through Twitter and my blog. The slides are available here, but to give some context to everything, I thought I would write a few things about it. In this post I discuss the first part of the talk – not-so-supervised learning of algorithms. All links below are to publishers’ versions, but on my publications page you can find PDFs.

Not-so-supervised learning of algorithms

Machine learning algorithms need examples to learn how to transform inputs (such as images) into outputs (such as categories). For example, an algorithm for detecting abnormalities in a scan would typically need scans where such abnormalities have been annotated. Such annotated datasets are difficult to obtain. As a result, algorithms can learn input-output patterns that hold only for the training examples, but not for future test data (overfitting).

There are various strategies to address this problem. I have worked on three such strategies:

  • multiple instance learning
  • transfer learning
  • crowdsourcing

Multiple instance learning

The idea in multiple instance learning is to learn from groups of examples. If you see two photos and I tell you “Person A is in both photos”, you should be able to figure out who that person is, even if you don’t know who the other people are. Similarly, we can have scans which have abnormalities somewhere (but we are not sure where), and we can figure out what things they have in common, which we cannot find in healthy scans. During my PhD I worked on such algorithms and applied them to detecting COPD.
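To make the "groups of examples" idea concrete, here is a minimal sketch of the standard MIL assumption: a bag (e.g. a scan) is positive if at least one of its instances (e.g. image regions) is positive. The one-dimensional "feature" and the threshold search are purely illustrative, not the algorithms from my papers.

```python
# Standard MIL assumption: a bag is positive iff some instance is positive.
# We "learn" an instance-level threshold using only bag-level labels.

def bag_score(bag, threshold):
    """A bag is labeled positive if any instance exceeds the threshold."""
    return int(any(x > threshold for x in bag))

def fit_threshold(bags, labels):
    """Pick the instance-level threshold that best explains the bag labels."""
    candidates = sorted(x for bag in bags for x in bag)
    best, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = sum(bag_score(b, t) == y for b, y in zip(bags, labels)) / len(bags)
        if acc > best_acc:
            best, best_acc = t, acc
    return best

# Toy data: positive bags contain at least one large instance value.
bags = [[0.1, 0.9, 0.2], [0.8, 0.1], [0.2, 0.3], [0.1, 0.2, 0.15]]
labels = [1, 1, 0, 0]
t = fit_threshold(bags, labels)
print([bag_score(b, t) for b in bags])
```

Note that we never needed instance-level labels: the bag labels alone were enough to recover which instances matter, which is exactly the appeal of MIL for scans with unannotated abnormalities.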

Transfer learning

Another strategy is called transfer learning, where the idea is to learn from a related task. If you are learning to play hockey, perhaps other things you already know, such as playing football, will help you learn. Similarly, we can first train an algorithm on a larger source dataset, like scans from a different hospital, and then further train it on our target problem. Even seemingly unrelated tasks, like recognizing cats, can be a good source task for medical data.
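One simple form of this "train first on a source, then on the target" idea is warm starting: reuse the source model's weights as the starting point for the target task. The tiny perceptron and 2-D toy data below are my own illustration, not a real imaging pipeline.

```python
# Transfer learning as "warm starting": train a tiny perceptron on a
# source task with plenty of data, then continue training from those
# weights on a related target task with only a few examples.

def train_perceptron(data, labels, w=None, epochs=20):
    """Online perceptron; `w` lets us start from pretrained weights."""
    w = list(w) if w is not None else [0.0, 0.0, 0.0]  # bias + 2 weights
    for _ in range(epochs):
        for (x1, x2), y in zip(data, labels):
            pred = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
            err = y - pred
            w[0] += err
            w[1] += err * x1
            w[2] += err * x2
    return w

def predict(w, point):
    x1, x2 = point
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0

# Source task: plenty of data; target task: same concept, few examples.
source = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1), (0.1, 0.2), (1.1, 1.0)]
source_y = [0, 0, 1, 1, 0, 1]
target = [(0.1, 0.0), (1.0, 1.0)]
target_y = [0, 1]

w_src = train_perceptron(source, source_y)
w_tgt = train_perceptron(target, target_y, w=w_src, epochs=5)  # fine-tune
print([predict(w_tgt, p) for p in target])
```

With only two target examples, training from scratch would be fragile; starting from the source weights, the fine-tuning step barely has to change anything.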

There are several relationships between multiple instance learning and transfer learning. To me, it feels like they both constrain what our algorithm can learn, preventing it from overfitting. Multiple instance learning is itself also a type of transfer learning, because we are transferring from a task with global information to a task with local information. You can read more about these connections here.


Crowdsourcing

A different strategy is to gather more labels using crowdsourcing, where people without specific expertise label images. This has been successful in computer vision for recognizing cats, but there are also some promising results in medical imaging. I have had good results with outlining airways in lung scans and describing visual characteristics of skin lesions. Currently we are looking into whether such visual characteristics can improve the performance of machine learning algorithms.


This was an outline of the first part of the talk – stay tuned for the not-so-supervised learning of academics next week!

Firsts: designing an undergraduate project course

One of the new things I had to do during my first year on the tenure track was to design a course. I had not designed an entire course before, but it was a great experience that I learned a lot from, and I even managed to integrate my research with my teaching. As I am writing about this in my teaching portfolio, I thought I would share some of the insights in a blog post as well, as they could be helpful to others in a similar situation.


The goal was to design a project course on image analysis for first-year students. Like most courses at my university, this would be a 5 ECTS course and run for 8 weeks (+2 weeks for evaluation/exams). A project course meant that many of the learning goals focused on (already defined) project skills; my job was to create an assignment on which students could work together. A first-year course meant that I could not assume a lot of prior knowledge. I also had to align my course with the other existing project courses, connecting theory, modeling and experiments. And of course, I wanted to create a course that was fun.


I started designing the course early on – I started my job in February 2017, while the course would only start in November 2017. I used the other project courses in our department, and other courses I could find online, as inspiration. I also searched for information about how to organize group projects, and what aspects of projects students like or dislike. I saved all of this in Evernote and later used these notes during brainstorming.

During brainstorming, it became clear I wanted to add real-world components to the course, such as having a client for the assignment and gathering data. I also wanted to design the course in such a way that success was not too dependent on programming skills. So, I needed an idea which had all these components, and somehow involved analyzing medical images.

The project I settled on was extracting visual properties, like “asymmetry”, from images of skin lesions. Dermatologists look at such properties when making a diagnosis, and by automatically measuring such properties, we can design machine learning algorithms (which students will come across later in their studies).
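As a taste of what "measuring asymmetry" could mean, here is a toy sketch of one such measure: mirror a binary lesion mask left-to-right and count the pixels that do not overlap, normalized by the lesion area. A real implementation (e.g. features in the spirit of the ABCD rule) would first center and rotate the lesion; this only illustrates the idea.

```python
# Toy asymmetry measure on a binary lesion mask (list of 0/1 rows):
# mirror the mask horizontally and count the non-overlapping pixels.

def asymmetry(mask):
    """Fraction of mismatching pixels after a left-right flip, per unit area."""
    flipped = [row[::-1] for row in mask]
    diff = sum(a != b for r1, r2 in zip(mask, flipped) for a, b in zip(r1, r2))
    area = sum(sum(row) for row in mask)
    return diff / area if area else 0.0

symmetric = [[0, 1, 1, 0],
             [1, 1, 1, 1],
             [0, 1, 1, 0]]
asymmetric = [[1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 0, 0, 0]]
print(asymmetry(symmetric), asymmetry(asymmetric))
```

A perfectly mirror-symmetric mask scores 0, while a lopsided one scores higher, which is the kind of number a machine learning algorithm could later use as a feature.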

Real-world components

I found a client for the assignment – the developers of the app Oddspot, which asks the user questions about a lesion and then calculates a risk score. The developers could be interested in extending the app with imaging, and the students’ assignment was to investigate the possibilities. This way I had the basics for the theory (which features to measure in images), the model (an algorithm that actually measures them), and the experiment (testing whether the measurements were effective).

I thought that another real-life component would be for the students to gather data themselves. My first idea was to gather images of skin lesions with smartphones, but my own phone could not produce images of good enough quality, so I doubted this would work. Instead, I decided to use a public dataset from the ISIC melanoma detection challenge.

To still have a data gathering process, I asked each group of students to visually assess the features they were planning to measure with the algorithms. This way, even if a group would not be able to get their algorithms to work, they could still perform experiments – for example, by looking at interobserver agreement.
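One interobserver-agreement statistic the groups could compute is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The binary ratings below are invented for illustration (think "lesion is asymmetric: yes/no" from two students rating the same images).

```python
# Cohen's kappa for two raters giving binary ratings to the same items:
# (observed agreement - chance agreement) / (1 - chance agreement).

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

a = [1, 1, 0, 0, 1, 0, 1, 0]  # ratings by annotator A
b = [1, 0, 0, 0, 1, 0, 1, 1]  # ratings by annotator B
print(round(cohens_kappa(a, b), 3))
```

Kappa of 1 means perfect agreement, 0 means no better than chance, so even a group whose algorithm failed could still report a meaningful experiment.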

The project courses are assessed with a final report and a presentation. I decided to replace the traditional presentation with a Youtube video aimed at a more general audience, such as prospective students. I thought this would allow for more creativity than a traditional presentation, but also build in some accountability, since the video could in fact be watched by other people.

As I was thinking of all these things, I was writing the guide that the students would get from me, to try to understand if any important information was missing, and filling in required documents related to the design of the course – for example, which learning goal would be assessed in which assignment.

During the course

Since this was a project course, I actually gave only one lecture to the students, where I talked about image analysis, measuring features in images, and of course explained the assignment. After that, the students met in groups, together with teaching assistants (TAs). The role of the TAs was to oversee that the project-skills part was going well – they were not required to have any background in image analysis and were not supposed to help the students with the content of the assignment. During these weeks, I would meet with the TAs and the study coordinator, who took care of all the logistics of these project courses, to discuss the progress of the groups. I made notes during these meetings, to take into account when updating the course next year.

After having read lots of “advice for tenure trackers” types of blog posts, I was afraid that teaching a large course would leave me overwhelmed with email. So, both in materials I gave to students and during the lecture, I asked the students to ask all questions related to the course content via discussions on Canvas. Of course I still got emails, but I redirected those students to Canvas and then answered their questions there, so that the answers would be visible to everyone.

What this system achieved was that (i) I didn’t get any repeat questions, (ii) all students had the same information, so it was fairer, and (iii) students could learn from each other’s questions and answers. Another advantage for me was that I would get a digest of all new questions in Canvas at the end of the day, so I could schedule times to go through them, rather than multi-tasking during the day, as happens with email.

Another thing I have to emphasize is that a lot of the logistics were handled by the study coordinator, who found the TAs, checked which students were absent too often, etc. Meanwhile, I could just focus on the content of the course, which took a lot of stress away from the experience of teaching for the first time. So, hats off to my department for setting it up this way.

After the course

The course grade consisted of project skills, which were assessed by the TAs, and the content part, based on the report and the Youtube videos. Although I gave general criteria for how I was going to assess these and made an “assessment matrix”, in the process I decided I needed a more detailed rubric to keep grades fair, so grading and then re-grading took quite a bit of time.

The students really surprised me with their Youtube videos (in Dutch), which were all very well done. I even tweeted about it.

At the end I had a short meeting with each of the groups to give them feedback on their assignments and get more input for the course. For example, I asked them what they found the most surprising and the most difficult (of course, recording this in Evernote). I also brainstormed a bit with them how to update the assignment next year.

Next year

I’m happy to report that overall I got good feedback about the project. The students said they particularly enjoyed that it was a real assignment and not something that was already done many times. This is great, but of course also means that I will need to update the assignment each year, so that the students are building upon each other’s work, and not doing exactly the same things.

I worried about the programming part being too difficult. During the course, the students did find programming challenging, but at the same time it was clear they were figuring things out. And all groups did submit code which was of sufficient quality. Most students indeed complained about the level of difficulty in the course evaluations, although a few students commented that they liked having to figure it all out themselves. This is definitely something I will address next year.

Finally, I of course also received course ratings. I know I should take these with a grain of salt – group projects probably get higher ratings overall, and student evaluations are not correlated with learning – but it still feels pretty great to have a success in the middle of all my rejected grants and unfinished papers.

Integrating research and teaching

Remember all those visual ratings the students had to do? I’m not sure I realized it at the time, but in fact, I had just crowdsourced a lot of annotations for medical images. I am now using these results in my current research, and recently I submitted a paper about it, where I acknowledge the students who took the course. Another real-world component?


My take-aways from this experience would be:

  • Take a lot of time for brainstorming
  • Find examples of other courses
  • Evernote is great for keeping track of ideas, feedback, etc.
  • Use the learning environment to reduce your email load
  • Think how large classes of undergraduates can still participate in your research
  • Having teaching support is absolutely the best and made this potentially stressful experience very enjoyable


I’d like to thank Josien Pluim for brainstorming about the course, Chris Snijders for participating with Oddspot, Rob van der Heijden for coordinating a LOT of things, Nicole Garcia, Jose Janssen and Nilam Khalil for administrative support, Maite van der Knaap, Femke Vaassen, Nienke Bakx and Tim van Loon for supervising the student groups, and last but not least, students who followed 8QA01 in 2017-2018.

Choosing machine learning datasets for your experiments

I have another website, which I started during my PhD on multiple instance learning and where I collected datasets – hence the name “Multiple Instance problems”. I don’t plan to maintain the website anymore, so I’m in the process of migrating all the datasets to Figshare. But I also saw that I had posted there about “Choosing datasets for your experiments”, so here’s an (updated) version of that post.

Choose diverse datasets?

A lot of papers I read during my PhD used several datasets to test their classifiers. How do you choose which datasets to test on? I have several times seen (and used) the reasoning that datasets are chosen to be “diverse” based on characteristics like sample size or dimensionality. However, it turns out that datasets we consider similar may behave very differently in terms of classifier performances, and vice versa. This is what we showed in our SIMBAD 2015 paper “Characterizing Multiple Instance Datasets” [PDF | Publisher].

Measuring similarity

First we took many of the datasets I have collected, which were typically used in papers about multiple instance learning. Several of these were one-against-all datasets generated from multi-class datasets. In other words, an image dataset with 20 categories like “cat”, “dog” etc becomes 20 datasets, the first of which is labeled as “cat” or “not a cat”, etc. These datasets are identical in terms of their size and dimensionality.
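The one-against-all construction above can be sketched in a few lines: each class of a multi-class dataset yields one binary dataset of identical size and dimensionality, differing only in the labeling. The sample and class names below are placeholders.

```python
# Turn a multi-class dataset into one binary (one-against-all) dataset
# per class: the samples stay the same, only the labels change.

def one_against_all(samples, labels, positive_class):
    """Relabel samples as positive_class (1) vs everything else (0)."""
    return [(x, int(y == positive_class)) for x, y in zip(samples, labels)]

samples = ["img0", "img1", "img2", "img3"]
labels = ["cat", "dog", "cat", "bird"]
cat_dataset = one_against_all(samples, labels, "cat")
print(cat_dataset)
```

Because every such dataset shares the same samples and features, meta-features like size and dimensionality cannot distinguish them at all, which is exactly what makes them interesting for this comparison.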

Then we compared several representations for datasets – describing them by “meta-features” like size, and describing them by classifier performances directly. Based on such a representation, we can create an embedding showing the similarities and differences of the datasets. I even made a basic, non-user-friendly Shiny app where you can select a few datasets and classifiers (from the MIL toolbox) and see how the embedding changes. (Please get in touch if you have more experience with Shiny and want to help out!)
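A hedged sketch of the “classifier features” representation: describe each dataset by a vector of classifier performances (the accuracy numbers and dataset names below are made up), then compare datasets by distances between those vectors. An embedding like the one in the paper would start from exactly such pairwise dissimilarities.

```python
import math

# Each dataset is represented by the accuracies of several classifiers
# on it (here three hypothetical classifiers, made-up numbers).
performance = {
    "corel_cat":  [0.91, 0.88, 0.85],
    "corel_dog":  [0.90, 0.87, 0.86],
    "artificial": [0.99, 0.55, 0.97],
}

def distance(u, v):
    """Euclidean distance between two performance vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

names = sorted(performance)
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        print(n1, n2, round(distance(performance[n1], performance[n2]), 3))
```

In this toy setup the two Corel-style datasets behave similarly for every classifier and end up close together, while the artificial dataset is far from both, mirroring the kind of structure the embeddings in the paper reveal.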

This is one of the figures from the paper where we show the dataset embeddings – with meta-features on the left, and “classifier features” on the right. Each point is a dataset, and it is colored by which “group” of dataset it is from. For example, the shaded black circles (Corel) and empty gray triangles (SIVAL) are in fact several datasets generated from multi-class problems. We also used several artificial datasets, illustrating some toy problems in multiple instance learning.

Take-home message

There were two surprising results here. First of all, datasets we expected to have similar behavior, like SIVAL and Corel, are not that similar in the embedding on the right. Secondly, the artificial datasets were not similar to any of the real datasets they were supposed to illustrate.

The take-away is that, if you are choosing (small) datasets for your machine learning experiments based on your intuition about them, your results might not be as generalizable as you think.


Did you like reading about some of my older research? Let me know and I could write more about this in the future!



Valleywatch: a game for medical image annotation

Last year I had the pleasure of supervising Dylan Dophemont, then a student of game & media technology and now a game developer at Organiq. During his project, Dylan created a game for medical image annotation. In this blog post I summarize this project.

Annotating airways

The game addresses the problem of annotating airways in chest CT images. The measurements of airways are important for diagnosis and monitoring of different lung diseases. Currently this is done manually, by looking at 2D slices of the 3D chest image, and outlining the airway – a dark ellipse with a lighter ring around it. The dark part is the inside of the airway, and the lighter ring is the airway wall.

Left: slice through a lung image. Image by Dylan Dophemont. Right: annotation of an airway, screenshot by Wieying Kuo.

This is very time-consuming – annotating one CT image can cost an expert a whole day. Machine learning algorithms for this purpose exist too, but they are not yet robust enough, because there are not enough annotated images available for training.


Crowdsourcing and games

To address the problem, prior to this project we experimented with using crowdsourcing on Amazon Mechanical Turk to annotate the airways. This was challenging, as it turned out that most people did not read the instructions!

Screenshot of the instructions to annotators on Amazon Mechanical Turk


Most did try to annotate airways, but annotated only the inside or only the outside of the airway, not both as we wanted. Nevertheless, it was easy to filter out the annotations that weren’t usable. For the remaining annotations, the measurements correlated well with the expert measurements of the airways – see our paper about it (also on arXiv).
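The kind of check described above can be sketched with a Pearson correlation between crowd and expert measurements. The numbers below are invented for illustration, not from our study.

```python
import math

# Pearson correlation between two lists of paired measurements.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

expert = [2.1, 3.4, 1.8, 4.0, 2.9]  # e.g. expert airway measurements
crowd = [2.3, 3.1, 1.9, 4.2, 2.7]   # corresponding crowd measurements
print(round(pearson(expert, crowd), 3))
```

A correlation close to 1 means the crowd measurements track the expert ones well, even if the absolute values differ slightly.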


Outline of the ValleyWatch game. Image by Dylan Dophemont.


In Dylan’s project the goal was to gamify this process. After creating several game concepts, he settled on ValleyWatch. This is a casual game where the world is generated from the lung image – notice the several round valleys in the screenshot. The players have to take care of the world by sending rangers to put out forest fires. To put out a fire, the player switches to a screen where he or she can outline the valley – and therefore, the airway!



Screenshot of Valleywatch prototype.

Dylan conducted several playtests of this prototype, both with game design students and with medical imaging researchers. Although it was not always clear to the players what to do, the game was received enthusiastically by both groups. Here, too, we saw that reading the instructions paid off – the players who did this created accurate annotations!


Next steps in this project would be to extend the prototype by adding a tutorial, adding functionality to “change the world” (i.e. load a lung image of a different person), investigating how different game elements affect the quality of annotations, and more. If you are interested in a student project on this or related topics, please get in touch!

Stable detection of abnormalities in medical scans

This is my second post about a paper I wrote. This time it is about “Label stability in multiple instance learning”, published at MICCAI 2015. Here you can see a short spotlight presentation from the conference. The paper focuses on a particular type of algorithm, which is able to detect abnormalities in medical scans, and on a potential problem with such algorithms.

What if I told you that I like the words “cat”, “bad” and “that”, but I don’t like the words “dog”, “good” and “this”? Did you spot a pattern? To give you a hint, there is one letter that I like in particular*, and I don’t like any words that don’t have that letter. Note that you are now able to do two things: tell whether I’d like any other word from the dictionary, AND explain why.
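The word puzzle above can be solved the way an MIL algorithm would: find the "instance" (a letter) that appears in every positive bag (liked word) and in no negative bag (disliked word).

```python
# Solve the word puzzle with set operations: the special letter must be
# in every liked word and in no disliked word.

liked = ["cat", "bad", "that"]
disliked = ["dog", "good", "this"]

candidates = set.intersection(*(set(w) for w in liked))
candidates -= set.union(*(set(w) for w in disliked))
print(candidates)
```

Running this yields a single candidate letter, which also tells you *why* any new word would be liked or not, just like locating the abnormality explains the scan's label.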

You are probably asking yourself what this has to do with medical scans. Multiple instance learning (MIL) algorithms are faced with a similar puzzle: each of the scans in group 1 (words that I like) has abnormalities (a particular letter), and each of the scans in group 2 is healthy. How can we tell whether a scan has abnormalities or not, and where those abnormalities are? If the algorithm figures this out, it can detect abnormalities in images it hasn’t seen before. This is a toy example of how the output could look:


Detecting whether there are any abnormalities is slightly easier than finding their locations. For example, imagine that the scan above has the lung disease COPD, and therefore contains abnormalities. Now imagine that our algorithm’s output has changed:


The algorithm still correctly detects that the scan contains abnormalities, but the locations are different. If the locations of the abnormalities were clinically relevant, this would be a problem!

Of course, in the ideal case we would evaluate such algorithms on scans where the regions with abnormalities have been manually annotated by experts. But the problem is that we don’t always have such annotations – otherwise we probably would not need to use MIL algorithms. Therefore, the algorithms are often evaluated on whether they have detected ANY abnormalities or not.

In the paper I examined whether we can say something more about the algorithm’s output without having the ground truth. For example, we would expect a good algorithm to be stable: for one image, slightly different versions of the algorithm should detect the same abnormalities. My experiments showed that an algorithm that is good at detecting whether there are any abnormalities isn’t necessarily stable. Here is an example:

Figure: stability of the detections (x-axis) versus COPD detection performance (y-axis) for different algorithms.

Here I compare different algorithms – represented by differently colored points – on the task of detecting COPD in chest CT scans. The y-axis measures how good an algorithm is at detecting COPD (whether there are any abnormalities) – the higher this value, the better. Typically this is the measure by which researchers in my field would choose the “best” algorithm.

I proposed to also examine the quantity on the x-axis, which measures the stability of the detections: a value of 0.5 means that multiple slightly different versions of the same algorithm only agree on 50% of the abnormalities they detected. Now we can see that the algorithm with the highest performance (green square) isn’t the most stable one. If the locations of the abnormalities are clinically relevant, it might be a good idea to sacrifice a little bit of the performance by choosing a more stable algorithm (blue circle).
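A stability measure like the one on the x-axis can be sketched as the average pairwise agreement between the sets of regions flagged by slightly different versions of the same algorithm. I use intersection over union here for illustration; the exact measure in the paper may differ.

```python
from itertools import combinations

# Agreement between two sets of detected abnormal regions:
# size of the intersection divided by size of the union.
def agreement(det_a, det_b):
    union = det_a | det_b
    return len(det_a & det_b) / len(union) if union else 1.0

# Regions flagged by three versions of the same algorithm (toy data).
versions = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 4, 5}]
pairs = list(combinations(versions, 2))
stability = sum(agreement(a, b) for a, b in pairs) / len(pairs)
print(round(stability, 3))
```

Here every pair of versions agrees on only part of the flagged regions, so the averaged stability is well below 1 even though each version flags the same *number* of regions.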

A more general conclusion is: think carefully whether your algorithm evaluation measure really reflects what you want the algorithm to do.
