NBA Data Wants To Be Public

The NBA is making my life difficult. Yet again, they’re making it harder to programmatically access their data, rendering Vorped kinda useless.

If you’re curious how everything works, Vorped sources its data from nba.com/stats, periodically scraping data from the site as games complete. Underneath the fancy and unintuitive nba.com site are “links” to raw data files, and that collection of “links” is typically called an API, or application programming interface. APIs are a little more complicated than that, but think of an API as a common language that allows computer programs anywhere in the world to share and update data with each other.
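To make that concrete, here’s a minimal sketch of what asking the API for data looks like. The endpoint, parameters, and response handling below are illustrative (the real API is undocumented and changes without notice), but the shape is roughly right: you request a URL and get back structured JSON instead of a web page.

```python
# Illustrative only: fetch one day's scoreboard from the undocumented stats API.
# The endpoint name and parameters are examples, not a stable, documented contract.
import requests

url = "https://stats.nba.com/stats/scoreboardV2"
params = {"GameDate": "2017-03-15", "LeagueID": "00", "DayOffset": "0"}
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://stats.nba.com/"}

resp = requests.get(url, params=params, headers=headers, timeout=10)
resp.raise_for_status()
payload = resp.json()

# The payload is a list of "result sets": little tables with headers and rows.
for result_set in payload.get("resultSets", []):
    print(result_set["name"], len(result_set.get("rowSet", [])), "rows")
```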

In mid-March, the NBA decided to make it more difficult for people like me to programmatically gather data from these undocumented APIs. I’m not the only one, as you can see from the comments on this conversation on Github.

Why would the NBA do this? I don’t know for sure, but I can speculate on some reasons why:

  • It’s expensive to serve up all that data, specifically paying for the bandwidth. The NBA website is likely accessed by millions of people per day, and paying the bandwidth costs to support all those requests adds up.
  • Perhaps it’s actually not that expensive to serve up the data, but third parties like me comprise the vast majority of requests, making those bandwidth costs a lot more expensive than they should be. Because scripts and bots can make orders of magnitude more requests than a typical human physically clicking around a site, the NBA’s bandwidth costs might be heavily inflated by automated traffic.
  • Protecting ad revenue. Allowing others to replicate data found on nba.com means people can consume that data outside of nba.com. And if you don’t visit nba.com, you don’t notice all the advertisements for SAP, which appears to have paid to sponsor the site. It wouldn’t be great for SAP if it could no longer reach all those eyeballs visiting nba.com.

From a short-term accounting perspective, this all makes sense for the NBA’s business. Reduce unnecessary costs, and protect existing revenue streams. The economist in me applauds the efficiency.

That’s great for the NBA, but terrible for me. Because instead of spending my time immersing myself in the data and understanding what’s going on with the league (aka being a fan of the league), I’m playing pointless cat and mouse games with the nba.com programmers. And it’s getting tiresome.

What’s most frustrating is there seems to be an obvious, straightforward technical solution that would make both me and the NBA happy: publicize the API, and rate limit it. Let people register with nba.com as a consumer of data, and limit how many data requests can be submitted over some length of time, say 10 requests per minute. All of which would help regulate these presumed runaway bandwidth costs while enabling the NBA to partner with entities that can extend the NBA’s reach into nontraditional areas of the population, now and in the future.
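Rate limiting isn’t exotic, either. Here’s a hedged sketch of the idea using a simple token bucket; the 10-requests-per-minute figure comes from the hypothetical above, and none of this reflects anything nba.com actually runs.

```python
# A toy token-bucket rate limiter: each registered API key gets roughly 10
# requests per minute. Purely illustrative.
import time
from collections import defaultdict

RATE = 10 / 60.0   # tokens replenished per second (10 per minute)
CAPACITY = 10      # maximum burst size

buckets = defaultdict(lambda: {"tokens": CAPACITY, "last": time.time()})

def allow_request(api_key):
    bucket = buckets[api_key]
    now = time.time()
    # Refill tokens for the time elapsed since the last request, up to capacity.
    bucket["tokens"] = min(CAPACITY, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False  # the server would respond with HTTP 429 Too Many Requests
```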

APIs are pretty common nowadays. Whatever tech company or product you can think of, they almost certainly have an API. Twitter, Facebook, Slack, Pinterest, Instagram, etc. While there are certainly limitations around how a person can use these APIs, companies understand the real value is in their data, and that these API developers act as partners/champions/ambassadors of their company, mavens who extend the reach and influence of their company’s data far beyond the boundaries of the company’s products, creating virtuous loops back to those products and benefiting companies in the long run.

For example, Twitter allows bloggers to embed tweets in their own blog posts, and it’s really simple. You don’t need a computer science degree to do it. Today, Twitter plays a central role in global public discourse, and their public APIs have no doubt helped them ascend to that standing.

Haven’t you ever wondered why we basketball fans couldn’t do the same simple embedding with box scores, or player stat lines, or shot charts? The nba.com site tries to provide this functionality, but it’s so unintuitive that it FEELS like you need a computer science degree to use it competently.

Twitter understands that you don’t have to be on a Twitter app or website to consume information residing in Twitter. Similarly, you don’t need to visit nba.com to consume NBA data.

And if you think about it, NBA information has never lived in just one place. In the past we would get our NBA fix from ESPN, radio, newspapers, or the local TV station. Today, in addition to those traditional media channels, we can also choose to consume NBA information from blogs like SB Nation, discussion forums like Reddit, or on any of our preferred social networks.

Public APIs are nothing more than a modern, programmatic manifestation of this same idea. While NBA data live in a central place, public APIs allow that data to be consumed not just directly by human beings, but by other applications, websites, and computer programs, all of which ultimately get consumed by even more human beings, who aren’t necessarily the same people as those who directly consume.

Consumers today have so many choices where they consume information, so it makes sense to bring the data where they already are, rather than assume they’ll always come to where you are. Which the NBA has always done with highlights and analysis, just not with their raw data.

Consumers also have so many more choices than before in how they entertain themselves, and a public API allows the NBA to remain adaptable to whatever next disruption occurs with media, whether that be evolutions in consumer tastes or new ways of interacting with content or people. And chances are good that disruption will have something to do with synthesizing large volumes of data, as the rest of the world becomes increasingly inundated with and driven by data and algorithms (in some cases, literally driven). APIs will likely play a key role in such a world.

Like all other consumers, I also have many choices in how I spend my time. Because of this increased, neverending friction in accessing NBA data, my interest in the NBA has waned, and I’m considering spending my time with other data sets in other domains of knowledge, where the utility and opportunity and challenges seem plentiful. It would be great to continue analyzing basketball data along with the rest of this awesome basketball statistics ecosystem, but the NBA’s current choices around data transparency are making that decision to move on from basketball more apparent.

The Future of Vorped

I created the first iteration of Vorped 6 years ago. At the time, basketball data seemed underutilized: shot chart data existed in pockets, and play-by-play text was widely available but lacked real insight about what happened in a game. Units data was also hard to come by.

Nowadays, many resources exist to find this data. The NBA’s stats website now provides all this information (but dear god is it difficult to use). Other awesome efforts, like Nylon Calculus, NBA Savant, and the variety of parsing libraries for the hidden nba.com API, all enable the average NBA fan to be more informed about the league. I began wondering whether Vorped needed to exist anymore.

Within that same time period, my career has evolved, my challenges have changed, and my interests have shifted. And to be honest, Vorped hasn’t been as useful as I imagined, even if it’s just my nightly side project. Feeling a lack of accomplishment, I considered shutting this site down.

But here’s the thing: I still have fun running Vorped, and I’m still learning a TON in the process.

About 4-5 months ago, I decided not to shut the site down, but instead invest a little more effort into it. With that said, I want to share what I’ve been working on, and what I hope to accomplish with this website going forward.

A short history on Vorped, and what’s wrong with it today

I’m a data analyst, not a programmer.

When assessing what’s bad (and good) about Vorped today, I come back to this basic fact. Vorped was optimized for performing data analysis, not for scale or reliability or anything that a competent software developer would care about.

When I first started out, my choices were driven by naive logic. This was the exact logic:

  1. I need data. I’ll alter a bunch of Python scripts I found on the internet, adapt them to scrape basketball data websites, and run them at 11pm Pacific every night.
  2. I need access to data. I know Excel and some PHP, but Excel is kinda annoying. I’ll make a PHP app on my laptop.
  3. Wow, this is sorta useful… I bet other people can use this app. How can I give them access? I’ll buy (cheap) shared web hosting, and port my Python scrapers and PHP app onto those servers.

To my surprise, this worked pretty well. It was a good minimum viable product (or MVP, to annoying Silicon Valley types like me), where a few people told me they found it useful, and where I didn’t invest inordinate amounts of time or effort.

Yet, products rarely remain minimum: people always want more features, me included.

To satisfy that hunger for more, I began tacking additional features onto this simple application. Over time, I built features like feeds, game flows, game loggers, and automated recaps. All interesting ideas. But not many of those ideas proved useful, and all of the code supporting them was terribly written and a huge pain to maintain or improve.

As I struggled to produce features that could keep up with my growing aspirations, my rate of analysis fell off a cliff (as of this writing, I haven’t published a new analysis post in over 3 years). Increasingly, I was solving software problems, not analytical problems. This made me feel conflicted. If I couldn’t produce analysis myself, while also promoting a tool that supposedly enables others to do analysis, would I have any credibility? Was I becoming a fraud of an analyst, the type of faux intellectual that this website aimed to challenge and discredit?

No. The real problem was that I was still building only for me, not for others. I expected that if I repeated the 3-step MVP process over and over again, based solely on what I needed, things would work fine. I began realizing that MVP is necessary but not sufficient, and that there are at least 2 more steps to delivering a solid tool for others:

  1. How do I keep this app running day-to-day?
  2. How quickly can the app recover when something unexpected happens?

Note how none of this concerned basketball analytics. If I wanted to truly create a useful tool, I would need to commit to solving problems I didn’t initially set out to tackle. Software problems, not basketball analytics problems. Others’ problems, our problems, not necessarily my problems. To move forward, I needed to let go of my needs. So I put aside basketball analytics, and decided to directly address these software problems.

Vorped and the cloud

I want to make a better basketball analysis tool. But first, it’s important to define what “better” means. And for me, “better” means:

  1. Able to capture and analyze data, for any basketball league
  2. Able to ask quick, deep, ad hoc questions of the data, and share it easily to the world
  3. Not too expensive
  4. Flexible enough to adapt to product features I haven’t yet imagined
  5. Minimizes ops drudgery (i.e. provisioning machines), while still letting me respond quickly to process failures

The biggest hurdle was point 2. It turns out, cheap $10/month shared web hosting machines weren’t optimized for supporting my ad hoc, memory-intensive SQL queries. If I needed the ability to execute these kinds of queries, I required a different solution.

To achieve this, I decided to move my database to Amazon Web Services (AWS). As a test, I pointed Vorped at AWS’s Relational Database Service (RDS) during the NBA offseason. And to my surprise, it worked admirably. Moving to RDS helped me achieve points 2) and 5) above: RDS minimized ops drudgery (fewer failures, automated backups, scheduled updates), taking care of areas of expertise I had no interest in developing. Ad hoc questions also became less painful to answer. With automatic snapshots of the database, I could quickly create a copy of the database, provision a new database instance, and hammer the living hell out of it.
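As a rough sketch of that “copy it and hammer it” workflow (the identifiers and instance class below are made up, though the boto3 calls are real):

```python
# Spin up a throwaway copy of the database from an automated snapshot, run the
# heavy ad hoc SQL against it, then delete it. Names here are hypothetical.
import boto3

rds = boto3.client("rds")

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="vorped-adhoc",           # hypothetical scratch instance
    DBSnapshotIdentifier="rds:vorped-2017-01-15",  # hypothetical nightly snapshot
    DBInstanceClass="db.m4.large",                 # beefier than the production box
)

# Wait until the copy is available, then point the ad hoc queries at it.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="vorped-adhoc")

# ...run the memory-hungry queries...

# Throw the copy away when done; production never feels a thing.
rds.delete_db_instance(DBInstanceIdentifier="vorped-adhoc", SkipFinalSnapshot=True)
```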

And when I dug deeper into AWS and all the other tools it provides, I realized I could accomplish a lot more with AWS if I re-thought the architecture of my scripts and apps.

After taking a step back and examining the intricate tapestry of interconnected logic and other crap that were my scripts, web apps, and databases, I realized I could break down Vorped into a handful of logical applications, each with clearly-defined purpose and relationships with the other apps.

If you’re curious, the 4 major applications are:

  • League Manager (metadata about leagues, teams, and schedules)
  • Scraper (manages getting and conforming data from external sources)
  • Core (your typical “data warehouse” where aggregates are calculated and against which ad hoc queries are executed)
  • Consumer (web app, the thing you’re probably looking at right now)
Vorped’s new set of logical applications

The past 4 months have been a standard exercise in decoupling logic, and fitting the pieces into the AWS ecosystem. Much of this work relied heavily on Lambda, a relatively new tool that lets you process data in response to an event (like a game ending), without having to worry about managing the machines that run those processes. With this, Vorped now updates game data and shot charts about 30 minutes after a game ends. And the best thing about Lambda: it’s pretty inexpensive (I’m a cheap person).
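To give a flavor of what that looks like, here’s an illustrative game-ended handler. This is not Vorped’s actual code: the event shape, bucket name, and aggregation are all assumptions, but the structure (a small function that reacts to an event, with no servers to manage) is the point.

```python
# Illustrative Lambda handler: fires on a "game ended" event, rolls up that
# game's shots by zone, and writes a summary to S3 for the web app to read.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    game_id = event["game_id"]        # assumed to be supplied by the trigger
    shots = event.get("shots", [])    # assumed: list of {"zone": ..., "made": ...}

    # Aggregate makes and attempts per zone (a small, bursty job Lambda suits well).
    summary = {}
    for shot in shots:
        zone = summary.setdefault(shot["zone"], {"made": 0, "attempts": 0})
        zone["attempts"] += 1
        zone["made"] += 1 if shot["made"] else 0

    s3.put_object(
        Bucket="vorped-example-bucket",           # hypothetical bucket
        Key=f"shotcharts/{game_id}.json",
        Body=json.dumps(summary).encode("utf-8"),
    )
    return {"game_id": game_id, "zones": len(summary)}
```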

This decoupling ultimately helps me achieve points 1) and 4) above. Managing data processing across different leagues (with different rule sets, court dimensions, and period counts) is now much easier.

This flexibility also makes it easier to iterate on new statistics and new visualizations, and even to integrate new datasets into the website. For example, nba.com redesigned their website (again) in early December 2016, changing how they serve certain types of data. I was able to adapt to the new data within a few nights, all in my spare time after my day job.

The one downside: this all costs more than I’d prefer to spend (point 3). It’s not that expensive, really (it costs less than the gym membership that I rarely use), but again, I’m cheap.

Is this better?

Technically- and egotistically-speaking, one could describe this updated system as a service-oriented architecture utilizing serverless computing to process streaming big data in the cloud to enable machine learning scenarios. Sounds impressive, so it has to be objectively better, right?

Well, it’s definitely more complex than before. On the downside, I can’t just run the same scripts on my laptop anymore. I have to think a lot more about the meta-processes that glue everything together. The dev workflow still feels weird. And dealing with Lambda without direct control over the underlying computing has its own quirks. Things are more scalable, more fault-tolerant, and I have more out-of-the-box monitoring, so there are a ton of benefits. But better?

I suppose the definition of “better” depends on what your goals are. If I were trying to perform analysis for myself only, this system would be unnecessarily complex and costly, definitely not “better.” But in letting go of my own analytical aspirations, I hope I’ve set up a system that can be useful for you today, and be adaptable enough to be useful for you months from now. From that perspective, yeah, it’s probably better.

I’ve come to understand that making a tool for yourself is much, much, much simpler than making a tool for everyone, by at least an order of magnitude. In any product, very little effort goes into the things the users see and touch. Rather, most of the effort goes into the sub-tools and sub-systems behind the scenes that ensure the tool can run, and keep running. Put another way, the tools that build the tool are far more important than the tool itself. If you don’t believe me, maybe you’ll believe this guy. He stole that idea from me /s.

I have no delusions about what this website is, and will continue to be: a handy niche tool used by a small set of basketball fanatics who want the power to inform themselves about their favorite teams and players. This isn’t the next Google. But I hope Vorped can be just as useful as Google for these specific use cases, for me and for you.

The Vorped Game Logger – A Path to Better Insights

Data analysis is only as useful as the data you collect. That’s my core belief when it comes to the “field” of data analysis, or data science, or statistics or whatever it’s called now.

The quality of publicly available basketball analysis has always frustrated me, because the underlying data doesn’t capture everything needed to draw solid conclusions. For example, who is the best passer? Use assists. But assists only capture passes that lead to made field goals. What about passes that lead to missed shots, or to fouls? And when we compare across teams, assists potentially bias toward players whose teammates happen to be better scorers.

We currently have numerous “advanced” stats which do fancy weightings of box score stats to estimate the things we care about. These work well as a first-pass analysis, but what if we want to go further?

I believe the answer is not to throw more statistical modeling and/or machine learning techniques at the problem (sorry, RAPM), but to collect more, different, better datapoints.

Turns out, collecting data (“scorekeeping”) in basketball is pretty difficult. To make it less difficult, I started building a tool which I call the Game Logger.

Vorped Game Logger

The Game Logger tracks both play-by-play and location data, and derives a box score based on that data. It differs from existing scorekeeping tools in a few ways:

  • Logs the location of any event (fouls, rebounds, etc.), not just shots
  • Tracks per-possession stats, not just field goal percentages
  • Is flexible enough to let you define your own play-by-play events – you’re not limited to points/rebounds/assists/blocks/steals (see the sketch below for what an event record might look like)
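Here’s that hypothetical event record, purely to illustrate “log any event, with a location, and define your own event types.” The field names are mine, not the Game Logger’s actual schema.

```python
# One logged event, sketched as a plain Python dictionary. All field names are
# hypothetical, purely to illustrate the idea.
event = {
    "period": 2,
    "clock": "07:42",          # game clock when the event happened
    "team": "HOME",
    "player": "23",            # jersey number
    "event_type": "screen",    # user-defined: not limited to box-score stats
    "x": 31.5,                 # court location, e.g. feet from the left sideline
    "y": 18.0,                 # feet from the baseline
    "possession_id": 54,       # ties the event to a possession for per-possession stats
}
```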

If you’ve ever attempted to keep score at a game, you know a basketball game moves very fast, sometimes faster than you can log data. The Game Logger tries to minimize this issue with a few features:

  • Text-based and keyboard-based – no hunting and clicking buttons to log the right play and player
  • Enter multiple events at one time
  • Easy to fix/update past events
Vorped Game Logger, main screen

Using a text input lets you spend more time watching the game instead of your screen. I found click-based systems problematic when scoring a game in person. It also allows you to track as many event types as you wish, now that you’re no longer constrained by your device’s screen space.

You can try it out here: http://vorped.com/1-nba/2014-2015/game_logger. Click Learn More on the front page for a short tutorial on how to score the basics of a game.

You may be wondering, why is this useful at all? After all, SportVu tracks all location data (literally) and Vantage Sports tracks almost every event in an NBA game, so why would anyone use this?

To which I answer, why not track *any* basketball game? Why not track your college or high school basketball team’s games? You can. Many people ask me how they can create their own shot charts, and now, you can use the Game Logger to do so.

Court, NCAA men’s basketball

Court, high school basketball

In fact, Game Logger was designed to be used live at a game. The Game Logger application itself is written as a Chrome App, which you install through the Google Chrome browser and which lets you log games without an internet connection.

As with all things, there are many, many bugs and missing features, so please feel free to give me your feedback here or on Twitter.

Creating a game, from within the Vorped Game Logger Chrome app. Note how you can choose a rule set (NBA, NCAA, or high school).

Who are the WNBA All-Stars? An Introduction, Using Shot Location Data

Knowing nothing about the WNBA, but wanting to learn more about the league, I try to compare the top WNBA players to their NBA counterparts, purely using shot chart data.

I don’t watch the WNBA.  Most people don’t.  I want to believe that I personally don’t watch because the level of play isn’t adequately entertaining.  But maybe (probably) I’m actually prejudiced: they’re women.  They can’t dunk like men, they’re slower than men, and many of them have funny-looking jump shots.  And, sad to say, most WNBA players aren’t as physically attractive as other women you see on television, or even in other women’s sports like tennis.

These are all unfair criticisms.  We should judge WNBA players based on their talent and productivity, not on what our eyeballs see.  After all, there are entertaining NBA players that can’t dunk (Andre Miller), aren’t fast (Andre Miller, again), and have funny-looking jump shots (Joakim Noah, Kevin Martin, and Andre Miller, yet again).  Plus, as a straight male, I don’t watch the NBA for the players’ physical attractiveness (though I suppose the female cheerleaders on the sidelines do help).

So that’s what I wanted to do: to be less prejudiced against WNBA players, and judge them based on their basketball playing abilities.

Again, I turn to data, because data don’t have eyeballs.  In particular, I acquired shot location data for every game in the current 2013 WNBA regular season.  And to start off, I wanted to learn about the best players, so I looked at the 2013 season’s All-Star rosters.

In doing so, I realized I knew only 2 or 3 of the players, and even then, I didn’t know what kind of player each woman was.  To keep it simple, I just wanted to know what kinds of shots each all-star took.  And to further draw the mental image, I wanted to see which NBA player each woman resembles the most.

For each WNBA player, I took the shot frequencies within each of the 14 shot zones you see on vorped.com, and compared them to all NBA players’ shot zone frequencies from the 2012-2013 NBA regular season.  Then I calculated which NBA player had the most similar shot frequency pattern, using the sum of squared deviations from the WNBA player’s shot frequency across all zones.
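In code, that comparison is only a few lines. Here’s a minimal sketch, assuming each player is represented as a list of 14 zone frequencies that sum to 1 (the data structures are illustrative, not Vorped’s actual code):

```python
# Rank NBA players by how closely their 14-zone shot frequency profile matches a
# given WNBA player's, using the sum of squared differences across zones.
def sq_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def most_similar(wnba_freqs, nba_freqs_by_player, top_n=5):
    """wnba_freqs: list of 14 zone frequencies.
    nba_freqs_by_player: dict mapping player name -> list of 14 zone frequencies."""
    ranked = sorted(
        nba_freqs_by_player,
        key=lambda name: sq_distance(wnba_freqs, nba_freqs_by_player[name]),
    )
    return ranked[:top_n]
```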

So here’s what I got.  Let’s start off in the East.

Eastern Conference All-Stars

Cappie Pondexter

Most similar to: Kyrie Irving, DeMar DeRozan, Mo Williams, Jrue Holiday

cappie_pondexter

Epiphanny Prince

Most similar to: Nate Robinson, Deron Williams, Marcus Thornton

Apparently she likes to take last-second half-court shots.

epiphanny_prince

Tina Charles

Most similar to: Carlos Boozer, Elton Brand, Luis Scola, Brandon Bass

tina_charles

Tamika Catchings

Most similar to: Gordon Hayward, Reggie Jackson, Andre Iguodala

tamika_catchings

Angel McCoughtry

Most similar to: Russell Westbrook, Darren Collison, Monta Ellis, Eric Gordon, Lebron James

McCoughtry basically shoots from everywhere, but likes to get to the rim, which is apparently why she resembles many point guard-like shot charts.

angel_mccoughtry

Erika de Souza

Most similar to: DeJuan Blair, Derrick Favors, Larry Sanders, Bismack Biyombo, Nikola Pekovic

erika_desouza

Sylvia Fowles

Most similar to: Tiago Splitter, Javale McGee, Kosta Koufos, Kenneth Faried

Apparently she should play for the Denver Nuggets, because she shoots exactly like all their big men.

sylvia_fowles

Allison Hightower

Most similar to: Beno Udrih, Steve Nash, Andrea Bargnani

Hightower shoots from everywhere, which is why she has similarities to shoot-first types of players, ranging from an NBA point guard to a center.

allison_hightower

Crystal Langhorne

Most similar to: Ian Mahinmi, Jordan Hill, Tyler Hansbrough, Meyers Leonard, DeMarcus Cousins

This is a very interesting shot chart.  Langhorne doesn’t take any high paint shots, only taking the long 2-pointer or the 4-footer near the rim.  I would be interested in further analyzing her shot selection, seeing if this is a natural consequence of the spacing of the offense, or perhaps even her ability to pass up sub-optimal shots.

crystal_langhorne

Ivory Latta

Most similar to: Jimmer Fredette, Vince Carter, Roddy Beaubois, Nate Robinson, Deron Williams

ivory_latta

Shavonte Zellous

Most similar to: J.R. Smith, Marco Belinelli, Brandon Jennings, Danilo Gallinari, Paul George

shavonte_zellous

Elena Della Donne

Most similar to: Arron Afflalo, Rudy Gay, Michael Beasley, O.J. Mayo, Kobe Bryant, Jordan Crawford

As you can see, Della Donne definitely has a swing(wo)man’s type of game, given her NBA comparisons.

elena_delladonne

Western Conference All-Stars

Diana Taurasi

Most similar to: Ryan Anderson, Omri Casspi, CJ Miles

This seems to me the oddest comparison my crappy model spit out.  I would have guessed James Harden.  But you get the idea here: lots of threes, lots of drives to the paint.  I suppose she’s pretty unique in that she shoots more from the baseline than from the long-two wings.

diana_taurasi

Seimone Augustus

Most similar to: Shannon Brown, Monta Ellis, Kevin Durant, Kobe Bryant, Rudy Gay

seimone_augustus

Candace Parker

Most similar to: Dwyane Wade, Josh Smith, Lebron James, Paul Millsap

Lots of action at the rim, but does have some shot frequency from the three-point line, albeit not successfully.

candace_parker

Maya Moore

Most similar to: Nate Robinson, Marco Belinelli, Brandon Jennings, O.J. Mayo, Paul George

Another swingwoman’s/shooting guard type of game, with a lot of similarities to shoot-first types of point guards as well.

maya_moore

Rebekkah Brunson

Most similar to: Luis Scola, Tim Duncan, Jason Maxiell, Glen Davis

rebekkah_brunson

Tina Thompson

Most similar to: Mirza Teletovic, Byron Mullens, A.J. Price, Ryan Anderson

She was all over the place with her comparisons.  A really unique shot chart: lots of threes, but also lots of low block and baseline shots befitting of a true frontcourt player.

tina_thompson

Glory Johnson

Most similar to: Jared Sullinger, Zaza Pachulia, DeJuan Blair, Kendrick Perkins

glory_johnson

Nneka Ogwumike

Most similar to: Larry Sanders, Chris Wilcox, DeAndre Jordan

nneka_ogwumike

Danielle Robinson

Most similar to: Tim Duncan, Jason Maxiell, Tyler Zeller

This has to be the most interesting shot chart of all.  Shots at the rim, and a bunch of 17-footers.  And is it a coincidence that this San Antonio Silver Star’s most similar comparison is the quintessential San Antonio Spur?  They don’t play the same position, but the basketball philosophies around floor spacing and shot selection may be the same.

danielle_robinson

Kristi Toliver

Most similar to: Stephen Curry, Gary Neal, Jose Calderon

kristi_voliver

Lindsay Whalen

Most similar to: Jason Thompson, Jeff Adrien, Serge Ibaka

Yes, Lindsay Whalen, a guard, has power forwards as her closest comparisons.  Her nearest guard comparison is Dwyane Wade.

lindsay_whalen

Brittney Griner

Most similar to: Roy Hibbert, Marcin Gortat, Tristan Thompson

brittney_griner

How accurate was this exercise?  Am I right in the NBA player comparisons?  I would like to extend this analysis to rebounding, assists, and defense too, and also tighten up the comparison calculation.  But for a first analysis, I learned quite a bit about our WNBA All-Stars.

What I love about data is that it can help you see past superficial things like skin color, sexual orientation, and physical attractiveness, all things somewhat associated with the league.  I want to see the WNBA product for what it is: basketball.  And though I find the quality of the WNBA game a little rough and aesthetically/qualitatively disjointed, I’ve learned through this data that the WNBA can spark my curiosity and teach me interesting things about basketball, even if it isn’t the NBA.

Lawler’s Law: How Accurate Is It, And Can It Be Improved?

“The first team to score 100 points wins.” That sounds true, but how true is it?

If you’ve ever watched a Los Angeles Clippers game on NBA League Pass or on local TV, you’ve probably listened to play-by-play man Ralph Lawler.  And if you happened to have stuck around toward the end of that Clipper game (which was an especially painful thing to do during the ’80s and ’90s), you may have learned of Lawler’s Law.

The first team to score 100 points wins.

– Lawler’s Law

Lawler’s Law is presented tongue-in-cheek (“it’s the LAW”), but obviously exists more as a guideline than a law.

I’ve always wondered how accurate Lawler’s Law is, so I did the math on play-by-play data for games over the past 3 seasons.

It turns out, Lawler’s Law is pretty accurate.  Teams that scored 100 points first ultimately won the game around 93.5% of the time.  Doing some Googling, other sources have the accuracy at around 91-92%.  That difference could be statistically significant, but isn’t practically much different, and could probably be attributed to my smaller, recent sample of games.

I don’t know what I expected, but I found that percentage to be high, and surprisingly so.  That led me to ask a follow-up question: does a better Lawler’s Law exist with a score threshold other than 100 points?  After all, 100 points is an arbitrary threshold.  It looks nice because that’s when we move from 2 digits to 3 digits in the base 10 numbering system.  But it’s entirely possible that a less sexy number like 101 points could be a better threshold that predicts the winner of the game.

So I went through the same process for a bunch of different scoring thresholds.  And it turns out that you can do better than 100 points.  The higher threshold you choose, the more likely that the first team to hit that threshold will be predictive of the ultimate winner.  A threshold of 101 points is better than 100, and 119 points is better than 101.

correct_vs_threshold

But that result seems obvious, because at some point it becomes difficult to score a certain number of points.  Looking again at the data, a typical team scores somewhere around 95-97 points a game (the average final team score is 97 points; calculating a different way, roughly 80 shots per game at 0.98 points per shot attempt gives about 78 points, plus 23 free throw attempts at a 75% rate gives about 17 more, for roughly 95 points).  Scoring 100 points isn’t far-fetched, but scoring 115 proves more difficult, usually because the team must have shot very well over the course of the night, which doesn’t happen very often.  But how often does that occur for various scoring thresholds?

As you can see from the graph below, nearly all games end with at least one team breaking the 80 point barrier, and over 90% of games with one team breaking 90.  But that percentage falls precipitously as we move into the 90-something point thresholds.  Only about 62% of games ever reach 100 points, meaning Lawler’s Law applies for about 3 of every 5 games.  Notice how the 100 point threshold falls in the middle of the steeply downward-sloping portion of the graph.

games_vs_threshold

Here we see the downside of Lawler’s Law: it only applies some of the time.  Having a 93.5% predictive accuracy for 3 of every 5 games, but having no opinion on the other 2 games doesn’t seem all that predictive.  In fact, you could argue that the true accuracy of Lawler’s Law can be no higher than 62%.

So what if we judged Lawler’s Law not only on the truthfulness of the Law itself, but also by the number of games to which the Law is applicable?  Or more simply, what if we defined accuracy (“total accuracy”) as the number of games the Law correctly predicts, divided by the total number of games (regardless of whether the Law is applicable)?
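In code, the idea looks something like this (a sketch, assuming we have each game’s scoring progression and eventual winner):

```python
# Total accuracy for a given threshold: games where the first team to reach the
# threshold also won, divided by ALL games, applicable or not.
def first_to_reach(progression, threshold):
    """progression: list of (home_score, away_score) after each scoring play."""
    for home, away in progression:
        if home >= threshold:
            return "HOME"
        if away >= threshold:
            return "AWAY"
    return None  # neither team ever reached the threshold

def total_accuracy(games, threshold):
    """games: list of dicts like {"progression": [...], "winner": "HOME"}."""
    correct = sum(
        1 for g in games if first_to_reach(g["progression"], threshold) == g["winner"]
    )
    return correct / len(games)
```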

By essentially combining the two previous graphs, we get this graph below.  Notice how total accuracy slowly increases as we move from scoring thresholds in the 60s to those in the 70s.  This makes sense: over 99% of games have at least one team hitting at least 79 points, and accuracy of Lawler’s Law increases from 80% accurate at 69 points to 84.4% accurate at 79 points.

new_metric

Moving to scoring thresholds in the 80s, total accuracy flattens out, reaching a high of 84.8% accuracy at 82 points and 87 points.  Practically speaking, total accuracy doesn’t differ much for scoring thresholds in the 80s.  Accuracy continues to get better, with the 89 point threshold being nearly 90% accurate, though only 93% of the games ever get to 89 points.

At this point, losses in applicability begin to dominate gains in accuracy, and you see total accuracy fall when we move into the 90-something point thresholds.  Judged by this new metric, Lawler’s Law at the 100 point threshold has a total accuracy of just 58.9%.  Again, Lawler’s Law is 93.5% accurate, but only becomes applicable for 62% of the games.

Using the total accuracy metric, we find that we can improve Lawler’s Law.  Total accuracy is maximized in the 80-something point thresholds, and because there is very little practical difference in accuracy among them, let’s choose the smallest one and restate the Law.

The first team to score 80 points wins.

– Lawler’s Law, version 2.0

Under the new Law, you’ll be right only around 5 in every 6 games, or stated differently, you’ll be WRONG about 15% of the time, which doesn’t sound nearly as cool as being 93.5% right.

But Lawler’s Law wasn’t designed to be 100% truthful: that’s part of the joke.  The Law is a pretty good approximation for the truth, and for most purposes, that’s good enough.  Statistics don’t always have to be absolutely precise.  I love Lawler’s Law because it isn’t perfect but does the job of predicting winners effectively and cheaply, living comfortably with all its flaws and error rates.  Sometimes, good enough is good enough.

Stephen Curry’s 2013 Playoff Shot Chart

Hi there.  I’m just testing my new embedded shot chart functionality.  You probably don’t care, you probably shouldn’t care, but I think it’ll yield very cool things in the near future.

To make this trip worthwhile for you, here’s Stephen Curry’s playoff shot chart.  Perhaps you’ve heard of him?  As an added bonus, it’ll get automatically updated as the playoffs progress.

Stephen Curry

[player_shotchart player_id=1041 report_type=season season=2012-2013 season_type=POST metric=fg ui=zone14 baseline=vs_league_zone]

Sloan Sports Analytics Conference 2013: My Thoughts and Opinions

I travel to Boston to see what all the hubbub is about this so-called sports nerd convention.

I attended the Sloan conference last week, and I left with a lot of opinions.

My opinions come from an outsider’s perspective, but an outsider who has done analytics in disciplines other than sports, and who works every day on a team of analytics nerds at your typical Silicon Valley tech company.  Also note that I’ve attended the prior two Sloan conferences, so I knew what to expect to hear at the various panels.  Also, I’m a terrible networker.  That’s important to know, because…

1. You don’t go to Sloan to learn how to do analytics

You go to Sloan to network.  Despite its reputation as a geek convention, many of its 2,700 attendees are not actually analytics/Bill James-y types of people.  Many are MBA candidates (Sloan is MIT’s business school, after all), many are high school/college students who want to be close to sports people, some are vendors hawking their products, and many are marketing/sales/business operations people who work for teams, who want to work for teams, or work in the sports industry.  You have a lot of consumers of analytics, but not necessarily many producers of analytics.

I would guess that these people comprise the majority of the attendees.  After them, you have the people that do some form of sports analysis: academic researchers, bloggers, ESPN employees, and the team’s sports analytics employees themselves.  If I were to guess, the count of sports analysts definitely wouldn’t be greater than 700, and is probably no more than 150-200.

The Sloan conference isn’t as geeky as its reputation implies, and it’s questionable whether the Sloan sports conference is a good place to find up-and-coming analytics talent.  I spoke to a few people who volunteered that the level of skill found at this conference wasn’t as high as they expected.  My boss wanted me to keep my eyes out for good analysts, and probably the best unemployed analyst I heard was Stan Van Gundy during the Basketball Analytics panel.

There’s probably good analysis being done by teams.  But you’re not going to learn that at Sloan, because teams keep it secret.  Why should they share, right?

Funny enough, the best analytics at the Sloan conference actually happens on the business panels.  An Orlando Magic executive presented good stuff around sales analytics, in particular the likelihood of selling season tickets as a function of time of day, as well as time of season.  I even heard the words “decision tree” and “linear model” thrown around.

In fact, the only piece of analysis I remember from last year’s conference came from NBA marketing, who shared the effectiveness of making the Boston Celtics’ center court ad display shorter, and putting two small ad displays in the corners near the players bench.  It turns out that television cameras spend more time at each end of the court, and having ad consoles at each end of the court increased exposure time for the advertiser, leading to bigger ad dollars.  It also had the positive side effect of substituting lower-priced, less-desired corner seats for higher-priced, more-desired courtside seats.

You may not like ads or business-y things, but man that’s some interesting analysis.

In my opinion, if you want to learn how to do analytics or learn new methods, go to the business panels.  They’ll actually share information.  Also, go to the research paper presentations.  Most of the other panels aren’t illuminating.

2. Data analysis isn’t the issue, it’s data management

When people think “analytics”, I get the impression that people think it means fancy algorithms and distribution curves and lots of Greek letters.  In my opinion, a big chunk of data analysis is just smart counting, and optionally dividing by something so a number makes sense contextually.

I think the sports analytics community does this fairly well.  But this smart counting assumes that the data collection piece has been taken care of.  In basketball, the box score has typically been the place where analytics begins.

But in the new world of basketball analytics, the starting point will probably be SportVu’s XYZ coordinate data.  The definition of the word “data” will change.  Where “data” used to mean sanitized, summarized, and tractable, “data” will mean messy, overwhelming, error-prone, and frustrating.  Living in a world of data streaming 25 times per second means you won’t be living in an Excel spreadsheet, but in something much, much bigger.  Databases matter more.  Parallel computing might have to be used.  Data cleansing will really, really matter.

I believe too much of sports analytics focuses on stats, and not enough on data management.  To me, stats are the by-product of good data management.  Good data management makes for good analytics, and leads to confidence in your final results.

In fact, Vorped is simply an exercise in data warehousing, of which some components I’ve actually plugged into stuff at my day job.  About 80% of my effort on Vorped is spent on making sure I have clean data, and even then, I know that parts of my data aren’t 100% accurate.  Despite this, I have confidence in the data I present, because I know the underlying data has been processed pretty rigorously.

We need more people who are both willing and able to endure the drudgery of data management. It’s completely unsexy, but absolutely vital to making analytics work.  And in the coming world of streaming sensor data similar to SportVu, having more data won’t make things easier, it’ll make things harder, because you have much more noise to sift through to find the signal, let alone finding the right signal.

At some point in the future, the XYZ coordinate data will likely lead to awesome findings, especially around screening, defense, and off-the-ball movement.  However, I believe that there exist simpler methods and data sources that could answer useful basketball-related questions as good as, and in some cases better than, the current SportVu data.  But if the NBA put cameras in all 30 arenas and disseminated all that data, my opinion would probably change.

3. Communication matters

I found it interesting that so many panels devoted time to discussing how to communicate analytics findings.  Often communication proves to be the most challenging part of analysis, because human beings have emotions and egos that can prevent objectivity from carrying the discussion.  I’ve experienced this countless times myself.  If your listener fundamentally does not believe in data, or in you as an analyst, it usually doesn’t matter how good your models are, because in the end, that knowledge won’t be used.

The big exception is baseball.  Moneyball worked so well because baseball’s rules create situations and data that make statistical analysis very natural.  Assigning credit and blame for an at-bat is relatively straightforward.  You have a batter and a pitcher and sometimes a fielder with an error.  Also in baseball, at-bats are well-defined by the rules of the game, which make counting events pretty easy, which then allow stats to be relatively self-explanatory.

Basketball is so much harder.  Assigning credit and blame gets very complicated when you consider non-box score things like screening, cuts to the basket, missed rotations, and bad spacing.  A player can play an effective 30 minutes without registering a single shot attempt or assist.

I think this is why communicating analytics is much harder in non-baseball sports: collecting the right data to get the right model is hard, so we have to make do with simpler data, which limits the depth of actionable knowledge we can gain from that analysis.

Current basketball statistics do a good job of identifying what teams and lineups are good (i.e. efficient).  But they don’t necessarily tell us why they’re efficient.  Is it because the lineup has better ball movement?  Better screening?  Better shot selection?  Questions that start with “who” and “what” can be answered.  Answering “why” is much, much harder.

I would guess that communication becomes challenging because basketball analytics has a hard time answering “why” questions.  Decision-makers want actionable insights.  In these cases, stats (or metrics) aren’t good enough alone.  You need interpretation, too, which requires contextual knowledge outside of the data.  And in my opinion, this is where the next opportunity lies for sports analysts in the near future, to deftly combine quantitative data with qualitative contextual information to tell a believable and accurate story.

In my experience, I’ve always tried to communicate to decision-makers that data will tell us some things, but won’t explain things fully.  Like Nate Silver says, data analysis tends to be probabilistic.  If you can use data to make a CEO or coach 70% confident instead of 50% confident in using a particular strategy, that’s a win.  Learnings from data are typically incremental, and I think the goal should be to accumulate as many incremental learnings as possible, instead of searching for the silver bullet analysis that explains everything.

The prevalence of this topic makes me believe that the statistical movement hasn’t truly taken hold.  To me, the statistical revolution will have happened when teams operate as data-driven organizations, not just organizations that happen to use data.  Being data-driven means questioning assumptions, measuring the right things, and continually testing those assumptions with the data you’ve collected.  Based on the chatter at the conference, I would guess that not many basketball teams meet these criteria.

Too long; didn’t read (TL;DR)

The Sloan conference isn’t as nerdy as its media coverage implies.  Sports analytics is still in its nascent stages, more evolution than revolution, and still behind business analytics that have been doing this for decades.

While there are plenty of good stats and quality data analysts out there, we need more people involved in the ugly but important work of data collection.  We also need open data, because that’s how we’re going to discover the next generation of sports analysts.

Finally, we need to be comfortable communicating both what data analysis does and doesn’t tell us, because data analysis can’t explain everything.

Overall, the conference was a good experience.  I met many good people doing good things, and yet I didn’t get to meet as many people as I hoped (I’m terrible at networking).  I just wish more actual analytics happened.  It would be awesome if there were a hackathon during next year’s conference.

Lebron James’ Hot Streak: Individual Brilliance Doesn’t Always Translate to Team Dominance

Lebron James is currently playing at ridiculous levels.  Yes, shooting 71.4% FG% over a 5-game span usually qualifies as ridiculous.

Lebron’s not just getting a bunch of dunks, he’s shooting more efficiently.  It’s true that he’s been shooting from better spots on the court (i.e. not long 2-pointers).  See below: in the past 5 games, James has shot from the paint about five percentage points more than normal (50.7% to 55.8%), and about two percentage points more from three-point range.

Lebron James shot distribution – 2012-2013 regular season vs. Feb 3 – Feb 10

But what if we did some basic math to see how many more points you’d expect Lebron to score, solely from better shot locations?  Over the course of the 2012-2013 regular season prior to February 3 (the start of this ridiculous streak), James averaged 1.38 points per shot (pps) in the paint, 0.8pps from the long two, and 1.2pps from three-point range.

Assuming Lebron shoots just as efficiently, but shoots more from the paint and three-point line, we would expect Lebron to overall score about 1.2 points per shot.  Before, we would have expected him to score 1.16pps.  So it’s not a huge difference.  Over 15 shot attempts, that would translate to 18 points vs. 17.4 points, or a measly 0.6 point difference.
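Spelling out that back-of-the-envelope math: expected points per shot is just the shot frequency in each zone times the points per shot in that zone, summed up. The paint frequencies and per-zone points per shot come from above; the long-two and three-point shares below are approximate fill-ins chosen to be consistent with the numbers in this post.

```python
# Expected points per shot = sum over zones of (shot frequency x points per shot).
pps = {"paint": 1.38, "long_two": 0.80, "three": 1.20}  # pre-streak efficiency by zone

old_mix = {"paint": 0.507, "long_two": 0.323, "three": 0.170}  # season shot mix (approx.)
new_mix = {"paint": 0.558, "long_two": 0.252, "three": 0.190}  # streak shot mix (approx.)

expected_old = sum(old_mix[z] * pps[z] for z in pps)  # ~1.16 points per shot
expected_new = sum(new_mix[z] * pps[z] for z in pps)  # ~1.20 points per shot

print(round(expected_old * 15, 1), round(expected_new * 15, 1))  # ~17.4 vs ~18.0 points
```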

Shot location can’t explain everything.  Especially when you realize that Lebron actually has been averaging 1.55 points per shot over the past week.

My shot charts are rarely this green.

So to state the obvious, Lebron isn’t just shooting from better locations, he’s just shooting better.  Over 15 shot attempts, we would expect James to score about 23.25 points (15 × 1.55) at his new shooting efficiency and new shot location distribution, roughly 5 points more than the expected 18 points given James’s prior efficiency level and new shot location distribution.

By the way, we’ve completely disregarded free throws in this analysis.  Chances are, he’s been even more efficient than the +5 points suggests.  He’s shot double-digit free throws in 4 of the 6 games played in February, which he hadn’t done in nearly a month, since January 4 vs. Chicago.

One minor issue: turnovers.  James averages 2.8 turnovers/game, but has averaged 4.2 turnovers in these 5 games.

Yeah, but how did the team do?

Five brilliant individual performances, five Miami Heat wins.  Good, right?

Maybe.  Miami won five games, but didn’t necessarily dominate their opposition.  Looking at the average scoring lead/deficit during each of the five games (and NOT looking at the misleading final margin of victory), you realize that the Heat didn’t outright control those games, save for the pounding of the Clippers on national TV.

  • +1.5 points (vs. Raptors 2/3)
  • +2.2 points (vs. Bobcats 2/4)
  • +5.3 points (vs. Rockets 2/6)
  • +14.6 points (vs. Clippers 2/8)
  • +0.8 points (vs. Lakers 2/10)

Put another way, the Heat on average led the Raps/Bobcats/Lakers by only a basket or less over the course of each of those 3 games.  Note that the Heat achieved better results against the two current playoff teams (Rockets, Clippers), and struggled against the objectively worse, non-playoff teams, of which the Raptors are a little terrible, the Bobcats are very terrible, and the Lakers are probably terrible.

Amid the media exuberance over Lebron’s individual brilliance, it seems to me that team performance kind of got lost, ignored, or even misremembered.  From this game margin data, I believe the Heat should have handled these inferior teams more handily than by just 2 points, ESPECIALLY given Lebron’s level of play.  But Lebron’s freakish field goal percentage and the team’s 5-0 record seem to disguise this assertion.

We should celebrate great individual performances, but we shouldn’t overlook team performance when doing so, because individual domination may not always translate to team domination.

As always, you can play around with a lot of this data yourself.  Lebron James shot charts.

Vorped Shot Charts Updated – Now With More Options

I completed a minor redesign of the shot charts.  The goal was to make the charts easier to comprehend by simplifying the presentation of data.

You’ll find a few new features:

  • A summary text displays above the chart, showing you either the total shot count, field goal percentage, or points per shot of the shots selected.
  • Two new zone options: one for breaking out three pointers, long twos, and twos from the paint, and another one breaking out shots from the left, center, and right side of the court.  The 14-zone visual wasn’t always easy to understand, so hopefully reducing those 14 zones to 3 zones can reduce the mental load.
  • Metrics used to only be shown with the 14-zone visual, but now you can see those metrics on any of the 4 visual options.
  • Team charts and shot chart roulette both have the new shot chart options.

The new shot charts give you quite a bit of flexibility, allowing you to see the player from many different perspectives.  If I were a marketer, I could tell you that there are 9,600 unique ways to cut this shot chart data from the combinations of zone visuals and filter options.  Of course, not all 9,600 ways are useful, but you get the idea.

Please let me know what you think.  For example, check out Kyrie Irving, who apparently scores at least 0.9 points per shot from any distance.

Clippers’ 12-Game Win Streak Not As Impressive as Thunder’s

Not all win streaks are created equally. The Clippers and Thunder have both achieved 12-game win streaks in the early 2012-2013 regular season, and I explain why you should be more impressed with one than the other.

I awoke Saturday morning with Stephen A. Smith screaming at me through the television set, and it bothered me.  Not necessarily because Stephen A. screamed at me (apparently he has his personal volume setting at 11 all the time), but because of what he screamed about: that the Los Angeles Clippers should be considered contenders in the West.

That sentence sounds wrong.  Have the words “Clippers” and “contender” ever been used in the same sentence?  But it’s hard to deny when the Clippers have been on a 12-game win streak, and counting.

Having not seen too many Clipper games lately, I checked the data to see how impressed I should be.  And to put the streak into context, I compared this current 12-game streak to another 12-game streak in the early part of the 2012-2013 regular season by last year’s West champions, the Oklahoma City Thunder.

The Clippers played weaker teams

My first question: how good were the teams that the Clippers beat?  Apparently, not very.  Here are the opponents’ median win percentage as of December 22, 2012:

  • Clippers: 36% (median opponent’s win pct)
  • Thunder: 49%

The Clippers clearly played a lot of bad teams, including the dysfunctional Sacramento Kings twice.  Switching the perspective, we can also say that the Clippers didn’t play any good teams, having only played three teams with an above .500 record.

In comparison, the Thunder played five teams above .500, including San Antonio and Atlanta, both currently above .600.  The remaining games were split evenly between very bad teams (New Orleans, Sacramento, Charlotte) and average teams (Philly, Lakers, Utah).

There’s no doubt the Thunder played, and beat, better teams.

The Thunder had better wins

Though the Clippers’ menu of opponents wasn’t impressive, perhaps how badly they beat those teams could be impressive.  The conventional way to measure this would use margin of victory, and by this metric, the Clippers looked very good.

  • Clippers: +14 points (median margin of victory)
  • Thunder: +10.5 points

And if for some reason, you’re afraid of using medians instead of averages, even the average margin of victory would favor the Clippers:

  • Clippers: +14.8 points (average margin of victory)
  • Thunder: +13.8 points

However, I’ve never loved margin of victory as a metric, because you’re only looking at a single point of the game to judge and analyze a game in its entirety.

Maybe we can find a better metric.  Instead of looking only at the margin of victory occurring at the 48th minute of each game, what if we also looked at the score margin at the 1st, 2nd, and 3rd minute of the game, all the way up to the 48th minute?  Averaging across all 48 scoring margin snapshots within a game, we can capture not only IF the team won, but HOW convincingly the team controlled the game.
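Concretely, here’s a sketch of that calculation, assuming we have each game’s scoring plays with elapsed time (the data shapes are illustrative):

```python
# "Naive Game Margin": sample the score margin at the end of each of the 48
# minutes and average those snapshots, carrying the last known score forward.
def minute_margins(scoring_plays, minutes=48):
    """scoring_plays: ordered list of (elapsed_seconds, team_score, opp_score)."""
    margins, idx, margin = [], 0, 0
    for m in range(1, minutes + 1):
        while idx < len(scoring_plays) and scoring_plays[idx][0] <= m * 60:
            _, team, opp = scoring_plays[idx]
            margin = team - opp
            idx += 1
        margins.append(margin)
    return margins

def naive_game_margin(scoring_plays, minutes=48):
    margins = minute_margins(scoring_plays, minutes)
    return sum(margins) / len(margins)
```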

By extension, we can figure out which 12-game streak was more impressive by comparing each team’s scoring margin over the 12 games.  Here are the median game scoring margins for each streak:

  • Clippers: +4.9 points (median scoring margin)
  • Thunder: +7.8 points

This metric tells us that the Thunder tend to lead their opponents by nearly 8 points at any given point during a game, which is about 3 points better than the Clippers.  Not only did the Thunder play better teams, but they seem to command a game more convincingly too.

This wouldn’t have been apparent if you looked only at margin of victory.  This metric, which I’m internally calling “Naive Game Margin”, does a good job of deemphasizing analytically weird events like when tight games get blown open in the final minutes, or when a team comes back from a huge deficit in garbage time, but never had a realistic chance to win (gamblers like to call some of these situations “backdoor covers”).

The Clips had a couple games where the margin of victory disguised what really happened, like this 19-point win over the Raptors that was pretty close for the first 3 quarters, or this 18-point win over the Suns the night before that followed a similar script.  (Side note: since when did the NBA start scheduling back-to-back home games?)

By throwing out margin of victory, you find that the Clippers actually played in more close games than the Thunder, with 7 of the 12 games having a Naive Game Margin below +5.2 points.  The Thunder only had 3 games like this, meaning they controlled the other 9 games during the streak pretty convincingly.

Don’t call the Clippers contenders… yet

While both teams possess lengthy winning streaks, Oklahoma City’s was more impressive because they beat better teams, and beat them more convincingly.  You probably shouldn’t call the Clippers contenders for the reasons described in this post.  But even more simply put, they haven’t beaten anyone very good over those 12 games.  The Derrick Rose-less Bulls were probably their most formidable opponent.

But that doesn’t mean the Clippers aren’t contenders.  It just means they haven’t proven it yet.

In college football, people had historically discounted 12-game undefeated winning streaks from non-BCS conference teams like Boise St., Hawaii, and TCU for the same reason: they hadn’t beaten anyone good.  Yet in the NBA, it seems we’re quicker to anoint a team as a contender after a long winning streak, without considering who they played, let alone how they won.

I find the double-standard very interesting.  But unlike college football, we’ll get proof at the end of the season if the Clippers actually become contenders.

PS: During the writing of this article, the Clippers pounded the Suns to extend the streak to 13 games.  Some people were not impressed.

Clippers’ win streak

Date         Opponent   Naive Game Margin   Margin of Victory
2012-11-28   vs. MIN    -0.5                 6
2012-12-01   vs. SAC    20.9                35
2012-12-03   @ UTA      -6.0                 1
2012-12-05   vs. DAL    12.8                22
2012-12-08   vs. PHO     3.4                18
2012-12-09   vs. TOR     4.6                19
2012-12-11   @ CHI       3.2                 5
2012-12-12   @ CHA       5.1                 6
2012-12-15   @ MIL      14.9                26
2012-12-17   @ DET       3.3                12
2012-12-19   vs. NO     10.4                16
2012-12-21   vs. SAC     9.4                12

Thunder win streak

Date         Opponent   Naive Game Margin   Margin of Victory
2012-11-24   @ PHI       4.2                 7
2012-11-26   vs. CHA    29.1                45
2012-11-28   vs. HOU    10.2                22
2012-11-30   vs. UTA     8.3                12
2012-12-01   @ NO       13.7                21
2012-12-04   @ BKN       5.5                 6
2012-12-07   vs. LAL     8.7                 6
2012-12-09   vs. IND     2.9                11
2012-12-12   vs. NO     -3.3                 4
2012-12-14   vs. SAC    10.5                10
2012-12-17   vs. SA      6.6                14
2012-12-19   @ ATL       7.3                 8

Sidenote: Yes, you can have a negative Naive Game Margin but still win the game.  For example, you can lose for most of the game, but pull it out in the end, like the Clippers did vs. Utah on Dec. 3.