Choosing machine learning datasets for your experiments

I have another website, which I started during my PhD on multiple instance learning and where I collected datasets, hence “Multiple Instance problems”. I don’t plan to maintain the website anymore, so I’m in the process of migrating all the datasets to Figshare. But I also saw that I had posted there about “Choosing datasets for your experiments”, so here’s an (updated) version of that post.

Choose diverse datasets?

A lot of the papers I read during my PhD used several datasets to test their classifiers. But how do you choose which datasets to test on? Several times I have seen (and used) the reasoning that datasets are chosen to be “diverse” based on characteristics like sample size or dimensionality. However, it turns out that datasets we consider similar may behave very differently in terms of classifier performance, and vice versa. This is what we showed in our SIMBAD 2015 paper “Characterizing Multiple Instance Datasets” [PDF | Publisher].

Measuring similarity

First we took many of the datasets I had collected, which were typically used in papers about multiple instance learning. Several of these were one-against-all datasets generated from multi-class datasets. In other words, an image dataset with 20 categories like “cat”, “dog”, etc. becomes 20 datasets: the first is labeled as “cat” or “not a cat”, the second as “dog” or “not a dog”, and so on. These datasets are identical in terms of their size and dimensionality.
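To make the one-against-all construction concrete, here is a minimal sketch in Python. The function name `one_vs_all_datasets` and the toy labels are my own illustration, not code from the paper, and for simplicity it assumes one label per item (in multiple instance learning, that would be one label per bag).

```python
def one_vs_all_datasets(labels):
    """Turn one multi-class label list into one binary label list per class.

    Hypothetical helper for illustration: 1 means "this class",
    0 means "not this class".
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Toy multi-class labels (made up for this example).
labels = ["cat", "dog", "cat", "bird", "dog"]
binary = one_vs_all_datasets(labels)

for name, y in binary.items():
    print(name, y)
```

Every generated dataset reuses the same items and features, and only the labels change. That is why size and dimensionality cannot tell these datasets apart.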

Then we compared several representations for datasets: describing them by “meta-features” like size, and describing them by classifier performances directly. Based on such a representation, we can create an embedding showing the similarities and differences of the datasets. I even made a basic, non-user-friendly Shiny app where you can select a few datasets and classifiers (from the MIL toolbox) and see how the embedding changes. (Please get in touch if you have more experience with Shiny and want to help out!)
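As a rough sketch of the idea, the snippet below describes three datasets in both ways and checks which dataset is the nearest neighbour of dataset "A" under each representation. All numbers are made up for illustration and are not results from the paper. The dataset names, the choice of meta-features, the standardization step and the Euclidean distance are my assumptions here, not necessarily what we used. An embedding would then be computed from distances like these.

```python
import math

# Made-up illustrative numbers, NOT results from the paper.
# Meta-features: (number of bags, feature dimensionality).
meta = {
    "A": (100.0, 50.0),
    "B": (100.0, 50.0),
    "C": (2000.0, 10.0),
}
# "Classifier features": hypothetical performance of three classifiers.
perf = {
    "A": (0.90, 0.60, 0.75),
    "B": (0.55, 0.85, 0.60),
    "C": (0.88, 0.62, 0.74),
}


def standardize(rep):
    """Scale every feature to zero mean and unit variance across datasets."""
    names = list(rep)
    scaled_cols = []
    for col in zip(*(rep[n] for n in names)):
        mean = sum(col) / len(col)
        # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        scaled_cols.append([(v - mean) / std for v in col])
    return {n: tuple(c[i] for c in scaled_cols) for i, n in enumerate(names)}


def nearest(rep, name):
    """Return the dataset closest to `name` in Euclidean distance."""
    others = [n for n in rep if n != name]
    return min(others, key=lambda n: math.dist(rep[name], rep[n]))


print("Nearest to A by meta-features:", nearest(standardize(meta), "A"))
print("Nearest to A by classifier features:", nearest(standardize(perf), "A"))
```

In this toy setup, A and B have identical meta-features, yet by classifier performance A is closest to C. The two representations can disagree about which datasets are “similar”.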

This is one of the figures from the paper where we show the dataset embeddings, with meta-features on the left and “classifier features” on the right. Each point is a dataset, and it is colored by which “group” of datasets it is from. For example, the shaded black circles (Corel) and empty gray triangles (SIVAL) are in fact several datasets generated from multi-class problems. We also used several artificial datasets, illustrating some toy problems in multiple instance learning.

Take-home message

There were two surprising results here. First of all, datasets we expected to have similar behavior, like SIVAL and Corel, are not that similar in the embedding on the right. Secondly, the artificial datasets were not similar to any of the real datasets they were supposed to illustrate.

The take-away is that, if you are choosing (small) datasets for your machine learning experiments based on your intuition about them, your results might not be as generalizable as you think.


Did you like reading about some of my older research? Let me know and I could write more about this in the future!


