written by Eric J. Ma on 2017-11-30 | tags: pydata conferences python
With that, we’ve finished PyData NYC! Here are some of my highlights from the conference.
There were three keynotes, one each by Kerstin Kleese van Dam, Thomas Sargent, and Andrew Gelman. Interestingly enough, they didn't do what I would expect most academics to do -- give talks highlighting the accomplishments of their research groups. Rather, Kerstin gave a talk highlighting the use of PyData tools at Brookhaven National Laboratory. Thomas Sargent gave a philosophical talk on what economic models really are (they're "games", in a mathematical sense), and I took away the importance of being able to implement models -- otherwise, "you're just bull*****ing".
Andrew Gelman surprised me the most - he gave a wide-ranging talk about the problems we have in statistical analysis workflows. He argued that "robustness checks" are basically scams, because their real purpose is reassurance rather than genuine scrutiny. He had a really cool example showing that we need to understand our models by modifying them, perhaps even using a graph of models to identify perturbations that help us understand the original model. He also peppered his talk with anecdotes about mistakes he had made in his own analysis workflows. I took home a different philosophy of data analysis: when we evaluate how "good" a model is, the operative question is, "compared against what?"
The talks were, for me, the highlight of the conference -- there was a lot of good learning material around. Here are the talks from which I learned actionable new material.
This talk by the prolific PyMC blogger Austin Rochford was one that I really enjoyed. My biggest take-home came towards the end of his talk, where he covered three ways to diagnose probabilistic programming models.
The first was the use of residuals - which I now know can be used for classification problems as well as regression problems.
The second was the use of the energy plots in PyMC3, where if the "energy transition" and "marginal energy distribution" plots match up (especially in the tails), then we know that the NUTS sampler did a great job.
The third was the use of the Gelman-Rubin statistic, which compares within-chain and between-chain variation; values close to 1 are generally considered good.
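Here's a minimal sketch of what the latter two diagnostics look like in code, assuming a PyMC3 version from around the time of the conference (the model and data are made up for illustration):

```python
import numpy as np
import pymc3 as pm

# Toy data and model, purely for illustration.
data = np.random.normal(loc=1.0, scale=2.0, size=100)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sd=10.0)
    sigma = pm.HalfNormal("sigma", sd=10.0)
    pm.Normal("obs", mu=mu, sd=sigma, observed=data)
    # Run two chains so that the Gelman-Rubin statistic can be computed.
    trace = pm.sample(2000, tune=1000, njobs=2)

# Energy plot: the "marginal energy" and "energy transition" distributions
# should match up, especially in the tails, if NUTS sampled well.
pm.energyplot(trace)

# Gelman-Rubin statistic: values close to 1 indicate that within-chain
# and between-chain variation agree.
print(pm.gelman_rubin(trace))
```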
Check out the talk slides here.
scikit-learn-compatible model stacking
This talk was a great one because it showed how to use model stacking (also known as "ensembling", a technique commonly used in Kaggle competitions) to enable better predictions.
Conceptually, model stacking works like this: I train a set of models individually on a problem, then use the predictions from those models as features for a meta-model. The meta-model should perform, at worst, on par with the best model inside the ensemble, but may also perform better. This idea was first explored in Polley and van der Laan's work, available online.
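To make the idea concrete, here's a minimal sketch of stacking using plain scikit-learn (a generic illustration of the concept, not the implementation presented in the talk; the data and models are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Base models, each trained individually on the problem.
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(),
]

# Out-of-fold predicted probabilities from each base model become the
# features for the meta-model (out-of-fold avoids leaking training labels).
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model learns how to combine the base models' predictions.
meta_model = LogisticRegression()
meta_model.fit(meta_features, y)
```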
Civis Analytics has released their implementation of model stacking in their GitHub repository, and it's available on PyPI. The best part? They didn't try inventing a new API; they kept a scikit-learn-compatible one. Kudos to them!
Check out the repository here.
Matthew Rocklin, as usual, gave an entertaining and informative talk on Streamz, a lightweight library he built for exploring the use of Dask in streaming applications. The examples he gave were amazing showcases of the library's capabilities. Given the right project, I'd love to try this out!
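For a flavour of what Streamz looks like, here's a minimal sketch of its core API (a toy pipeline of my own, not one of Matthew's examples):

```python
from streamz import Stream

# Build a pipeline: map a function over incoming events
# and print each result as it arrives.
source = Stream()
source.map(lambda x: x * 2).sink(print)

# Push events into the pipeline one at a time.
for i in range(5):
    source.emit(i)
```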
Check out his slides here.
This talk, delivered by James Cropcho, clarified what asynchronous programming is all about. For me, things finally clicked at the end, when I asked him for an example of how asynchronous programming would be used in data analytics workflows -- to which he responded, "If you're querying for data and then performing a calculation, then async is a good idea."
The idea behind this is as follows: web queries written serially are often "blocking", meaning we can't do anything while we wait for the query to return. If we want to do a calculation on the returned data point, we have to wait for it to arrive first. Written asynchronously, on the other hand, we could do a calculation on the previous data point while waiting for the current result to return, potentially shaving a considerable fraction off the total time.
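Here's a minimal sketch of that pattern with Python's asyncio. The fetch and compute functions are stand-ins I made up; a real version would use an async HTTP client such as aiohttp:

```python
import asyncio

async def fetch(i):
    # Stand-in for a web query: pretend each request takes one second.
    await asyncio.sleep(1)
    return i

def compute(x):
    # Stand-in for a calculation on the returned data point.
    return x ** 2

async def main():
    # Launch all queries concurrently rather than one at a time, and
    # run the calculation on each result as soon as it comes back.
    tasks = [asyncio.ensure_future(fetch(i)) for i in range(5)]
    for finished in asyncio.as_completed(tasks):
        result = await finished
        print(compute(result))

# Total wall time is roughly 1 second, versus roughly 5 seconds serially.
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```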
Turning PyMC3 into scikit-learn
This talk was by Nicole Carlson, and she did a tremendous job delivering it. In it, she walked through how to wrap a PyMC3 model inside a scikit-learn estimator, including details on how to implement the .fit(), .predict(), and .predict_proba() methods. The code in her repository provides a great base example to copy from.
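As a rough illustration of the pattern (this is my own made-up sketch, not Nicole's code), a Bayesian logistic regression wrapped in a scikit-learn-style estimator might look like this:

```python
import numpy as np
import pymc3 as pm
from sklearn.base import BaseEstimator

class BayesianLogisticRegression(BaseEstimator):
    """A hypothetical PyMC3 model wrapped in a scikit-learn-style
    estimator; a sketch of the pattern, not the talk's actual code."""

    def __init__(self, draws=1000):
        self.draws = draws

    def fit(self, X, y):
        with pm.Model() as self.model_:
            coefs = pm.Normal("coefs", mu=0.0, sd=10.0, shape=X.shape[1])
            intercept = pm.Normal("intercept", mu=0.0, sd=10.0)
            p = pm.math.sigmoid(pm.math.dot(X, coefs) + intercept)
            pm.Bernoulli("obs", p=p, observed=y)
            self.trace_ = pm.sample(self.draws, tune=1000)
        return self  # scikit-learn idiom: fit() returns self

    def predict_proba(self, X):
        # Average the predicted probability over the posterior samples.
        coefs = self.trace_["coefs"]          # shape: (draws, n_features)
        intercept = self.trace_["intercept"]  # shape: (draws,)
        logits = X @ coefs.T + intercept      # shape: (n_samples, draws)
        p = 1.0 / (1.0 + np.exp(-logits))
        p_mean = p.mean(axis=1)
        # scikit-learn convention: one column per class.
        return np.column_stack([1 - p_mean, p_mean])

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] > 0.5).astype(int)
```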
One thing I can't emphasize enough is that, from a user experience standpoint, it's super important to follow idioms that people are used to. What Nicole showed in this talk is how we can provide such idioms to end-users, rather than inventing a slightly modified wheel. Props to her for that!
Her slides are online here.
This talk was my own, slotted at the end of the second day. The title definitely contributed to the hype. I popped into the room early, but then left for the restroom; when I got back, there was a lineup at both the front door and the back door. Totally unexpected. That said, big credit goes to the Boston Bayesians organizers, Jordi and Colin, who let me deliver the talk as a rehearsal before PyData, so I felt very grounded.
The talk went mostly smoothly. I think I was channeling my colleague Brant Peterson's sense of humour. There was one really hilarious hiccup: right after mentioning that I wouldn't overdo the "math" and "equations", I accidentally opened an adjacent tab with an alternate version of the slides... with, surprise surprise, a math equation on it! During the Q&A, when I shared the point about not needing to do train/test splits in Bayesian analysis, I could sense jaws dropping and eyes widening in disbelief; more than a handful of people came up and asked for the reference later on.
Having done the talk, I now realize how much people will appreciate a lighthearted and lightweight introduction to a topic that's very dense and filled with jargon. Conference speakers, we need to do more of this!
From an emotional standpoint, many people brought me joy with their positive comments on the visuals and structure of the talk. Others posted positive comments on Twitter, which I collected in a Twitter Moment. It was very encouraging, especially on this deep learning journey that I'm on right now.
Interactive matplotlib Figures
This was something I totally didn't realize was possible before - we can create interactive matplotlib figures very easily! I have cloned the repository, and I think it'll be neat to use this in some projects at work.
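One common way to get this kind of interactivity (I'm not certain this is the approach the tutorial took) is ipywidgets' interact in a Jupyter notebook, which redraws a matplotlib figure as a slider moves:

```python
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import interact

def plot_sine(frequency=1.0):
    # Redrawn every time the slider value changes.
    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(frequency * x))
    plt.show()

# A slider over frequencies from 0.5 to 5.0 in steps of 0.5.
interact(plot_sine, frequency=(0.5, 5.0, 0.5))
```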
The tutorial repository can be found here.
This one was by Colin Carroll, a software engineer at the MIT Media Lab (previously at Kensho). I sat in and learned a good deal of math from him. One new thing I learned was that we can specify a model without any observed variables; any sampling we do will still respect the hierarchical and mathematical relationships we've specified. This makes it neat to implement Bayesian nets and test how things will look under different scenarios!
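Here's a minimal sketch of that idea in PyMC3 (a made-up hierarchical model with no observed variables anywhere; sampling it draws from the prior while respecting the relationships between variables):

```python
import pymc3 as pm

with pm.Model() as model:
    # A hierarchy with no observed variables anywhere.
    group_mu = pm.Normal("group_mu", mu=0.0, sd=1.0)
    group_sd = pm.HalfNormal("group_sd", sd=1.0)
    individuals = pm.Normal("individuals", mu=group_mu,
                            sd=group_sd, shape=5)
    total = pm.Deterministic("total", individuals.sum())
    trace = pm.sample(1000)

# Each draw of `individuals` is consistent with its draw of group_mu
# and group_sd, so downstream quantities like `total` reflect the
# full hierarchy.
print(trace["total"].mean())
```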
His tutorial repository can be found here.
This one was led by the ever-entertaining, ever-surprising James Powell. I wish the tutorial had been recorded, because even though it was billed as a "novice" tutorial, it was nonetheless an eye-opener. Anybody who thinks they know Python should go listen to James' talks whenever he gives them live - they're bound to be entertaining and enlightening!
I'm glad I made it to PyData NYC this year. I made many new friends and connections, and caught up with old friends in the PyData community. As always, I learned a ton as well!
@article{ericmjl-2017-pydata-recap,
author = {Eric J. Ma},
title = {PyData NYC 2017 Recap},
year = {2017},
month = {11},
day = {30},
howpublished = {\url{https://ericmjl.github.io}},
journal = {Eric J. Ma's Blog},
url = {https://ericmjl.github.io/blog/2017/11/30/pydata-nyc-2017-recap},
}