When I logged into Kaggle after 1.5 years, I was surprised to see that I had earned the rank of 'Expert.' I hadn’t participated in any competitions for over three years. It didn’t make sense—until I remembered a simple script I had written that had quietly worked in the background, earning me this unexpected title.
Background
I was pretty interested in Kaggle competitions years ago. Kaggle is a data science competition platform that allows amateurs to compete against each other on machine learning problems. The winners can earn real money and bragging rights. I was never good enough to win a competition, but I enjoyed my time participating. You learn a lot about machine learning by competing.
Recently, I wanted to see if there was an LLM competition I could participate in. It turns out there was one going on at the time of writing, and I was happy to enter. I logged in and was surprised to see an Expert rank badge around my profile picture. I don't think I ever broke into the top 100 of any competition I participated in. What was going on?
The Expert Rank
Kaggle has rankings for more than just competitions. There are also Expert tiers for other categories like Commenting, Notebooks, and Datasets. More info can be found here.
It turns out a dataset I uploaded to Kaggle years ago had received over 70 upvotes. That was enough to put me in the top 100, earn the dataset a gold medal, and earn me an Expert ranking in Datasets. This dataset wasn't just data, though. Datasets typically have, well, data in them, but there is no restriction on the type of file. So people figured out you can use a dataset as a kind of local package repository. Many competitions don't allow you to have a full internet connection, but you can attach a "dataset" that is actually just a python wheel file. The dataset that earned all the upvotes was a collection of python wheel files.
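To make that concrete, here is roughly what using such a dataset looks like from inside a no-internet Kaggle notebook. The dataset slug below is hypothetical; attached datasets are mounted under /kaggle/input/.

# Install TabNet offline from an attached wheel-file "dataset".
# --no-index keeps pip away from PyPI; the slug is hypothetical.
pip install --no-index --find-links=/kaggle/input/pytorch-tabnet-wheels pytorch-tabnet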
The Script
During a competition I wanted to use a library called TabNet. TabNet is an algorithm for deep learning on tabular data. (Note: I have no involvement with TabNet's actual code on GitHub.) TabNet was relatively new at the time and was getting updates frequently. In the competition I was participating in, many people were using it to get good results on the early leaderboards. I noticed that there were a few different datasets of it being used, and it was hard to find the latest version of the library. I decided to write a script that would check for updates to the library, download the latest version, and push any new versions to Kaggle. That way I could use my own "dataset" to always find the most up to date version of the library.
I came up with this script:
#! /usr/bin/bash
set -e
echo '*******************************'
echo "$(date)"

# The relative paths below assume the script runs from /opt/pytorch-runner,
# which cron won't do by default, so change directory first.
cd /opt/pytorch-runner

# Download the latest wheel (without its dependencies) into bin/
<path_to_python>/pip download pytorch-tabnet --no-deps -d /opt/pytorch-runner/bin/

# Snapshot the directory listing and compare it against the previous run's
ls -tl bin/ > .temp.toc
if ! diff -q .temp.toc .toc &>/dev/null; then
    >&2 echo "different"
    # Push the folder to Kaggle as a new dataset version
    #kaggle version -d bin -m "new version"
    <path_to_python_env>/envs/tabnet_kg/bin/kaggle d version -p bin -m "new version"
    # Remember this listing for next time
    ls -tl bin/ > .toc
else
    echo "no diff"
fi
Let me break this down. pip download downloads a python wheel file to a directory rather than installing it into a venv. --no-deps tells pip not to download the library's dependencies, and the -d flag sets the output directory for the wheel file. So now I have a way to download the latest version of the library.
Next I use ls -tl to list the files in the output directory (the one from the -d flag) and write that listing to a temporary file called .temp.toc. I then compare this to a file called .toc, which holds the listing from the last time the script ran. If the two files differ, I know the library has been updated and it's time to update my dataset. Using the Kaggle API/CLI, I can then upload the folder with all the different TabNet wheels to Kaggle; this folder is my dataset. Last, I overwrite the .toc file so that the next run knows what the dataset looked like after this one.
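One detail the script glosses over: the kaggle CLI expects a dataset-metadata.json file inside the folder it is versioning, and the dataset has to be created once before it can be versioned. A minimal sketch of that one-time setup, with a hypothetical dataset id:

# One-time setup before the script ever runs (the id below is hypothetical).
cat > bin/dataset-metadata.json <<'EOF'
{
  "title": "pytorch-tabnet wheels",
  "id": "<username>/pytorch-tabnet-wheels",
  "licenses": [{"name": "CC0-1.0"}]
}
EOF
# Create the dataset itself; the script's "kaggle d version" handles updates.
<path_to_python_env>/envs/tabnet_kg/bin/kaggle datasets create -p bin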
Forgotten
Automating this script was as easy as setting up a crontab entry to run it every 4 hours. I set the job and remember being happily surprised when I logged in one day and saw the dataset had updated on its own. The competition ended and I didn't win, but that was ok. I moved on to other things and forgot all about this script. It continued quietly running on the old laptop I use as a home server.
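For the curious, the crontab entry was something like the following. The script name and log path here are assumptions for illustration, not copied from my server.

# m h dom mon dow  command
0 */4 * * * /opt/pytorch-runner/update_tabnet.sh >> /opt/pytorch-runner/cron.log 2>&1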
The Discovery
Fast forward to January 2025, and I thought it might be fun to try a new Kaggle competition. I logged in and was surprised to see that not only had the script still been running all these years, but that many people had used it. The dataset was even part of a 4th place finish in a recent competition!
Pretty exciting stuff. The shell script is pretty rough, and I think I am using two different python environments. It works, but I would like to revisit it one day. Maybe it would be worth paying more attention to some Kaggle competitions to identify other hard-to-get libraries and automate mirrors for them too. That could help me earn a Datasets Master badge!