Not-so-supervised learning of algorithms

About a month ago I gave a talk at UC Dublin, titled “Not-so-supervised learning of algorithms and academics”. I talked a bit about both my research and the things I’ve learned through Twitter and my blog. The slides are available here, but to give some context to everything, I thought I would write a few things about it. In this post I discuss the first part of the talk – not-so-supervised learning of algorithms. All links below are to publishers’ versions, but on my publications page you can find PDFs.


Machine learning algorithms need examples to learn how to transform inputs (such as images) into outputs (such as categories). For example, an algorithm for detecting abnormalities in a scan would typically need scans where such abnormalities have been annotated. Such annotated datasets are difficult to obtain, so they are often small. As a result, algorithms could learn input-output patterns that only hold for the training examples, but not for future test data (overfitting).

There are various strategies to address this problem. I have worked on three such strategies:

  • multiple instance learning
  • transfer learning
  • crowdsourcing

Multiple instance learning

The idea in multiple instance learning is to learn from groups of examples. If you see two photos and I tell you “Person A is in both photos”, you should be able to figure out who that person is, even if you don’t know who the other people are. Similarly, we can have scans which have abnormalities somewhere (but we are not sure where), and we can figure out what things they have in common, which we cannot find in healthy scans. During my PhD I worked on such algorithms, and on applying them to detecting COPD.
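To make the idea concrete, here is a minimal sketch of learning from groups ("bags") of examples. It is a toy illustration, not the algorithms from my PhD work: the data, the function names and the single-number "abnormality score" per instance are all made up, and it assumes the standard setting where a bag is positive if at least one of its instances is positive.

```python
# Toy multiple instance learning: labels exist only per bag, never per instance.
# Hypothetical data: each instance is one number, e.g. a score for an image patch.

def bag_score(bag):
    """A bag is as abnormal as its most abnormal instance (standard MIL assumption)."""
    return max(bag)

def fit_threshold(bags, labels):
    """Pick the threshold on the bag score that gets the most training bags right."""
    candidates = sorted({x for bag in bags for x in bag})
    best_t, best_acc = None, -1.0
    for t in candidates:
        preds = [bag_score(bag) >= t for bag in bags]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(bags)
        if acc > best_acc:  # keep the first (lowest) threshold with the best accuracy
            best_t, best_acc = t, acc
    return best_t

def predict_instances(bag, t):
    """Apply the bag-level threshold to individual instances to localize positives."""
    return [x >= t for x in bag]

bags = [
    [0.1, 0.2, 0.9],  # scan with an abnormality somewhere
    [0.3, 0.8, 0.1],  # scan with an abnormality somewhere
    [0.2, 0.1, 0.3],  # healthy scan
    [0.4, 0.2, 0.1],  # healthy scan
]
labels = [True, True, False, False]

t = fit_threshold(bags, labels)
print(t)                              # threshold learned from bag labels only
print(predict_instances(bags[0], t))  # which instances of the first scan look abnormal
```

Even though no instance was ever labeled, the threshold learned from the bag labels can be used to point at the instances that positive bags have in common and that healthy bags lack.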

Transfer learning

Another strategy is called transfer learning, where the idea is to learn from a related task. If you are learning to play hockey, perhaps other things you already know, such as playing football, will help you learn. Similarly, we can first train an algorithm on a larger source dataset, like scans from a different hospital, and then further train it on our target problem. Even seemingly unrelated tasks, like recognizing cats, can be a good source task for medical data.
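The "train on the source, then train further on the target" recipe can be sketched with a very simple model. The example below uses a one-dimensional perceptron purely for illustration (in practice this is usually done with neural networks), and the two tiny datasets are invented: the target task has the same kind of decision rule as the source task, but with a shifted boundary.

```python
def train_perceptron(data, w=0, b=0, epochs=20):
    """Perceptron updates on (x, y) pairs with y in {-1, +1}, starting from (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            if y * (w * x + b) <= 0:  # misclassified, or exactly on the boundary
                w += y * x
                b += y
    return w, b

def predict(w, b, x):
    return 1 if w * x + b > 0 else -1

def errors(w, b, data):
    """Number of misclassified examples in data."""
    return sum(predict(w, b, x) != y for x, y in data)

# Hypothetical data: a larger source task and a small, related target task.
source = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
target = [(1, -1), (2, 1)]

# Step 1: train on the source task from scratch.
w_src, b_src = train_perceptron(source)

# Step 2: continue training on the target task, starting from the source weights.
w_ft, b_ft = train_perceptron(target, w=w_src, b=b_src)

print(errors(w_src, b_src, target))  # source model applied directly to the target
print(errors(w_ft, b_ft, target))    # after further training on the target
```

The only difference between the two training calls is the starting point: the second one inherits `w` and `b` from the source task instead of starting from zero.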

There are several relationships between multiple instance learning and transfer learning. To me, it feels like they both constrain what our algorithm can learn, preventing it from overfitting. Multiple instance learning is itself also a type of transfer learning, because we are transferring from the task with global information to the task with local information. You can read more about these connections here.

Crowdsourcing

A different strategy is to gather more labels using crowdsourcing, where people without specific expertise label images. This has been successful in computer vision for recognizing cats, but there are also some promising results in medical imaging. I have had good results with outlining airways in lung scans and describing visual characteristics of skin lesions. Currently we are looking into whether such visual characteristics can improve the performance of machine learning algorithms.
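Because several people typically label the same image, their answers have to be combined somehow. A minimal sketch of one common option, majority voting, is shown below. The image IDs and labels are made up, and this is only an illustration rather than the procedure used in the studies mentioned above.

```python
from collections import Counter

def majority_vote(annotations):
    """Combine crowd labels per image.

    annotations: dict mapping image id -> list of labels from different annotators.
    Returns a dict mapping image id -> (winning label, fraction of annotators agreeing).
    """
    consensus = {}
    for image_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        consensus[image_id] = (label, votes / len(labels))
    return consensus

# Hypothetical annotations of a visual characteristic of skin lesions.
annotations = {
    "lesion_1": ["asymmetric", "asymmetric", "symmetric"],
    "lesion_2": ["symmetric", "symmetric", "symmetric"],
}

consensus = majority_vote(annotations)
for image_id, (label, agreement) in consensus.items():
    print(image_id, label, round(agreement, 2))
```

The agreement fraction gives a rough idea of how much to trust each combined label. With an even number of annotators ties are possible, so an odd number per image keeps the vote unambiguous for two-class questions.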

***

This was an outline of the first part of the talk – stay tuned for the not-so-supervised learning of academics next week!

Choosing machine learning datasets for your experiments

I have another website, miproblems.org, which I started during my PhD on multiple instance learning and where I collected datasets – hence “Multiple Instance problems”. I don’t plan to maintain the website anymore, so I’m in the process of migrating all the datasets to Figshare. But I also saw that I had posted there about “Choosing datasets for your experiments”, so here’s an (updated) version of that post.

Choose diverse datasets?

A lot of papers I read during my PhD were using several datasets to test their classifiers. How to choose which datasets to test on? I have several times seen (and used) the reasoning that datasets are chosen to be “diverse” based on characteristics like sample size or dimensionality. However, it turns out that datasets we consider similar may behave very differently in terms of classifier performances, and vice versa. This is what we showed in our SIMBAD 2015 paper “Characterizing Multiple Instance Datasets” [PDF | Publisher].

Measuring similarity

First we took many of the datasets I have collected, which were typically used in papers about multiple instance learning. Several of these were one-against-all datasets generated from multi-class datasets. In other words, an image dataset with 20 categories like “cat”, “dog”, etc. becomes 20 datasets, the first of which is labeled as “cat” or “not a cat”, and so on. These datasets are identical in terms of their size and dimensionality.
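As a small illustration of how one multi-class dataset turns into several one-against-all datasets, here is a sketch with made-up labels (the function name is just for this example):

```python
def one_vs_rest(labels):
    """Turn one list of multi-class labels into one binary labeling per class."""
    classes = sorted(set(labels))
    return {c: [label == c for label in labels] for c in classes}

# Hypothetical multi-class labels for four images.
labels = ["cat", "dog", "cat", "bird"]

binary_datasets = one_vs_rest(labels)
for name, binary_labels in binary_datasets.items():
    print(name, binary_labels)
```

Only the labels change from one derived dataset to the next. The images and their features stay the same, which is why all these datasets share the same size and dimensionality.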

Then we compared several representations for datasets – describing them by “meta-features” like size, and describing them by classifier performances directly. Based on such a representation, we can create an embedding showing the similarities and differences of the datasets. I even made a basic, non-user-friendly Shiny app where you can select a few datasets and classifiers (from the MIL toolbox) and see how the embedding changes. (Please get in touch if you have more experience with Shiny and want to help out!)
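The two kinds of representation can be sketched as follows. All numbers here are invented for illustration and are not results from the paper; the dataset names, the choice of meta-features and the use of plain Euclidean distance are assumptions of this sketch. An embedding method would take distances like these as its input.

```python
import math

# Hypothetical meta-features per dataset: (number of bags, dimensionality).
meta_features = {
    "A": (100, 10),
    "B": (100, 10),
    "C": (500, 50),
}

# Hypothetical performances of three classifiers on each dataset.
classifier_features = {
    "A": (0.9, 0.6, 0.7),
    "B": (0.6, 0.9, 0.5),
    "C": (0.9, 0.6, 0.7),
}

def pairwise_distances(features):
    """Euclidean distance between every pair of datasets in a representation."""
    names = sorted(features)
    return {
        (a, b): math.dist(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

meta_dist = pairwise_distances(meta_features)
perf_dist = pairwise_distances(classifier_features)

for pair in meta_dist:
    print(pair, round(meta_dist[pair], 2), round(perf_dist[pair], 2))
```

In this toy setup, A and B are indistinguishable by their meta-features but behave differently under the classifiers, while A and C look very different on paper yet behave the same. In a real comparison the meta-features would also need to be put on a common scale first, since raw counts like sample size would otherwise dominate the distance.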

This is one of the figures from the paper where we show the dataset embeddings – with meta-features on the left, and “classifier features” on the right. Each point is a dataset, and it is colored by which “group” of datasets it is from. For example, the shaded black circles (Corel) and empty gray triangles (SIVAL) are in fact several datasets generated from multi-class problems. We also used several artificial datasets, illustrating some toy problems in multiple instance learning.

Take-home message

There were two surprising results here. First of all, datasets we expected to have similar behavior, like SIVAL and Corel, are not that similar in the embedding on the right. Secondly, the artificial datasets were not similar to any of the real datasets they were supposed to illustrate.

The take-away is that, if you are choosing (small) datasets for your machine learning experiments based on your intuition about them, your results might not be as generalizable as you think.

***

Did you like reading about some of my older research? Let me know and I could write more about this in the future!

 

 

How to find medical imaging companies (in the Netherlands)

A slightly different type of post today! Inspired by Dr. Raul Pacheco-Vega, who writes many of his amazing blog posts for his students, I decided to write about a question that has already come up a few times, and will probably come up again. The question is – where can I find companies who do (medical) image analysis (in the Netherlands)? This is important for students looking for internships, graduation projects, and jobs.

In this post I outline my search strategies to find such companies – especially small ones, which are difficult to find otherwise. These strategies might be useful to you even if you are searching for companies in other fields or countries.  The strategies are based on searching online, so they don’t assume you already have a network of people to rely on.

1. Who is advertising for jobs

The most straightforward way is to search for keywords on LinkedIn. If I search for “medical imaging” in the Netherlands, I get a lot of vacancies at Philips and a few at research institutions. There are also several vacancies which do not have a connection to medical imaging.

My intuition is that this type of search would overlook companies that do not have a specific vacancy, but would welcome open applications from people with the right qualifications. The same holds for internships – often these are not advertised on any website, but there might be opportunities if you contact a company directly.

2. Where are alumni working

The next place I’m going to look is where alumni of biomedical engineering at Eindhoven University of Technology (TU/e) are working. Here is a LinkedIn page with alumni of TU/e. I cannot filter by faculty here, but I can enter search terms related to the names of the programs offered, for example, “biomedical”. I can also filter for alumni living in the Netherlands, and filter by date to exclude any current students.

Now I just click on a lot of profiles where the description suggests the person is working in a company, and screen those companies for whether they do (medical) imaging. This is quite a time-intensive process. There are many companies that hire biomedical engineering graduates, but that do not focus on imaging. But I did find many more examples than with the first strategy:

3. Who is sponsoring the conferences

Moving away from LinkedIn, another way that helped me discover several companies is to look at who sponsors academic conferences. The first step is to find out what the main conferences are, either from reading papers or searching online. For medical imaging I’m going to look at MICCAI, which has been running for 21 years, but also at a new conference, MIDL, which is held this year in Amsterdam.

Now simply search for a “Sponsors” page and you are good to go! Some conferences (or rather, professional societies that organize the conference) also have a dedicated job page, for example the MICCAI job board. Here are the results from the sponsor pages (not limited to the Netherlands):

 

4. Who tweets about it

Of course this post would not be complete without Twitter! First I’m going to try searching for keywords, starting with “medical imaging”. If I then click on the tab “People”, I see accounts who have “medical imaging” in their bio. This list already includes several companies, for example:

Another strategy would be to look at who follows medical imaging researchers and companies. The trick is to find an account with not too few, but also not too many followers. In this example I will look at who follows Quantib, a company based in the Netherlands. From the list of followers, I find the following:

These accounts should also give you some ideas of what other keywords or hashtags to search for.

I hope this was useful! Happy internship / job searching, and please comment below if you have other tips!
