written by Eric J. Ma on 2017-09-14
Just a little tip, putting it here for myself and others in case it helps.
Sometimes, you need to visualize a large dataset, but it takes a ton of time to render it or compute the necessary transforms.
If your samples are drawn independently of one another (i.e. not time series data), and your goal is statistical visualization, then it's valid to visualize a random downsample of the dataset instead of the whole thing.
I recently encountered this at work. After running a clustering analysis, I wanted to see a pair plot of the distribution of features in each cluster. However, with cluster sizes ranging from 200 to 2 million, rendering times were unreasonably long (making things non-interactive) for the larger clusters. I thus decided to downsample the large clusters to a maximum of 2,000 data points each. Instantly, render times improved, and I could start interacting with my data again.
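As a minimal sketch of what this can look like with pandas and seaborn (the DataFrame, column names, and per-cluster cap below are illustrative, not my actual data):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)

# Hypothetical stand-in for a clustered dataset: two feature columns
# plus a cluster label, with very unequal cluster sizes.
df = pd.DataFrame({
    "feat_a": rng.normal(size=10_000),
    "feat_b": rng.normal(size=10_000),
    "cluster": rng.choice(["big", "small"], p=[0.95, 0.05], size=10_000),
})

MAX_POINTS = 2_000  # cap on data points per cluster


def downsample(group: pd.DataFrame, n_max: int = MAX_POINTS) -> pd.DataFrame:
    """Return at most n_max rows, sampled uniformly without replacement."""
    return group if len(group) <= n_max else group.sample(n=n_max, random_state=0)


# Apply the cap cluster by cluster; small clusters pass through untouched.
sampled = df.groupby("cluster", group_keys=False).apply(downsample)

# The pair plot now renders quickly, and the per-cluster feature
# distributions are preserved because rows were sampled uniformly.
sns.pairplot(sampled, hue="cluster")
```

Because each cluster is capped independently, the small clusters stay intact while only the huge ones shrink, which is exactly what you want when comparing distributions across clusters.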
Little things matter!