NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive Pubmed literature search project on a Data Science WhisperStation in hours versus weeks.
Kyle Gallatin, an engineer at Pfizer, has deep data science credentials. He’s been working on projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:
When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?
I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?
Experimentation, Performance, and GPU Accelerated Data Science Tooling
Gallatin used Data Science WhisperStation to rapidly create an accelerated data science workflow for a healthcare—and tell us about his experience. And it was a remarkable one.
Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!
The Data Science Workstation allowed him to design a Pubmed healthcare article search engine where he:
- Ingested a larger database than ever imagined (30,000,000 research article abstracts!)
- Didn’t require massive code changes to GPU accelerate the algorithm
- Used familiar looking tools for his workflow
- Had unsurpassed agility—he could search large portions of the abstract database in .1 seconds!
This last point is really critical and shows why we believe the NVIDIA Data Science Workstation Platform and its RAPIDS tools are so special. As Kyle put it:
Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally-intensive process, only be blindsided by an unforeseen error literal hours into running it. Then you have to correct it and start all over again.
Walkthrough with Step-by-Step Instructions
The new article is available on Medium. It provides a complete step-by-step walkthrough of how NVIDIA Rapids tools and NVIDIA Quadro RTX 6000 with NVLink were utilized to revolutionize this process.
A short set of Kyle’s key findings about the environment and the hardware are below. We’re excited about how this kind of rapid development could change healthcare:
Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run
Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDs, tensorflow, pytorch and everything else installed and configured out-of-the-box saved hours in setup time
With these general-purpose data science libraries offering massive computational enhancements for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc…), the path is paved to entirely new frontier of data science.