One of the most exciting parts of being a data scientist is the pace of advancement in the field. There are constantly new innovations. New algorithms. New techniques. New stories. It is a battle to stay up to date with all of the changes, but it is always worth it. While those innovations rarely apply immediately to my day-to-day work, they spark ideas and make me a better data scientist in the long run.
While I’m subscribed to many newsletters and listservs that compile recent articles and trends in the data science / machine learning / artificial intelligence space, I have always felt that there was a gap between those articles and the actual work going on in the field. To solve this, I wanted to get closer to the source material. I wanted to read the journal articles themselves.
That sounds obvious, of course, but the problem lies in the sheer volume of articles produced, which has only increased over time. To put that in perspective, let’s look at the usage statistics for one of the best sources of articles: arXiv. One good place to start is the number of submissions arXiv receives per month. Spoiler: it’s a lot of submissions, and the number keeps growing.
Data provided by arXiv.
In January 2019, 11,537 articles were submitted to arXiv. That’s a staggering number. Finding the right article in that proverbial haystack is an imposing challenge. Further, it is extremely difficult to dive straight into reading articles without understanding the broader history of the research. There is a great deal of context to absorb before you can fully grapple with new material. With all of that in mind, I keep up with the latest research in three ways.
Read Seminal Papers
All of your future reading should be rooted firmly in history. The problem is figuring out which papers to read. As a solution, GitHub user Flood Sung compiled a roadmap of key works for deep learning. The repository has been forked thousands of times. According to the repository README, the content is organized by four guidelines:
From outline to detail
From old to state-of-the-art
From generic to specific areas
Focus on state-of-the-art
Over 100 papers are included, organized into convenient categories and sub-categories. Each paper also comes with a brief reason for its inclusion, which provides deeper context on why it matters.
Take Advantage of arXiv Sanity
There are almost too many papers on arXiv to reasonably sort through. However, tools have been built to make that task easier. One such tool is arXiv Sanity, built by Andrej Karpathy.
The tool helps you sort through the vast trove of papers in several ways: most recent, top recent, top hype, and recommended. One of the most interesting features of arXiv Sanity is the ability to find papers similar to the one you’re reading, based on their tf-idf similarity. This makes it extremely simple to find articles that build upon work you’re particularly interested in.
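To make the idea of tf-idf similarity concrete, here is a minimal sketch in plain Python. It is not arXiv Sanity's actual code: the tokenizer, the smoothed idf formula, the function names, and the three sample texts are all assumptions made for illustration. Each text becomes a vector of term weights (terms that are frequent in a document but rare across the collection get the highest weight), and two texts are compared by the cosine of the angle between their vectors.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency
            # (one common variant; other weightings exist).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, docs):
    """Rank every other document by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


# Hypothetical one-line "abstracts", invented for this example.
abstracts = [
    "convolutional networks for image classification",
    "image classification with deep convolutional networks",
    "recurrent networks for machine translation",
]
ranking = most_similar(0, abstracts)
print(ranking)  # list of (score, index) pairs, most similar first
```

In this toy collection the second text ranks above the third as a match for the first, because the two share several terms that are relatively rare across the collection. A real system would work on full abstracts or paper text and would typically rely on a library implementation rather than hand-rolled code.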
Subscribe to Papers with Code
Papers with Code is exactly what it sounds like — a repository of machine learning papers…with code. The team of developers behind the site, Robert Stojnic and Ross Taylor, set out to build a website dedicated to creating “a free and open resource with Machine Learning papers, code and evaluation tables.”
There are quite a few ways to navigate the site and find what you’re looking for. Papers can be sorted by “trending”, “latest”, or “greatest” (and searching is always an option). More recently, Papers with Code got a massive update and launched a “State of the Art” section, which makes it simpler to browse by topic.
Most conveniently, you can also subscribe to a weekly digest sent to your inbox. It includes the top trending papers from the past week.