Baseball Hacks
I’d recommend Baseball Hacks to anyone who has ever hung around here (or other baseball analysis sites) and thought “I wish I could get detailed stats like those” but didn’t know where to start. If you want not just to digest baseball research but check it and tinker with it yourself, and you’re willing to get your hands dirty, this is your book. And the dirtier you’re willing to get, the more you can get out of it.
Here’s my quick-and-dirty summary of the book:
Chapter 1, Basics of Baseball.
Baseball information is on the internet! Whee!
Chapter 2, Baseball Games from Past Years
This is good stuff: getting yourself databases with all kinds of past game stats, hooking them up, querying them… and this is where we start to get into the real work, as Perl makes an appearance. Still, it’s almost all database-and-SQL stuff, and it isn’t that heavy. If you’re not scared of the word ‘database’ you’ll be fine.
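To give a feel for how light the database-and-SQL side is, here’s a rough sketch of the kind of query I mean. It isn’t from the book: I’m using Python’s built-in sqlite3 module instead of the book’s setup, and the `batting` table, its column names, and the sample rows are all invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, made-up batting table.

    The table name, columns, and rows are hypothetical placeholders,
    not the schema of any real baseball database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE batting (player_id TEXT, year INTEGER, hr INTEGER)"
    )
    rows = [
        ("player_a", 2001, 30),
        ("player_b", 2002, 41),
        ("player_c", 2001, 12),
    ]
    conn.executemany("INSERT INTO batting VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def top_home_run_seasons(conn, limit=3):
    """Return (player_id, year, hr) tuples, highest home run totals first."""
    cur = conn.execute(
        "SELECT player_id, year, hr FROM batting "
        "ORDER BY hr DESC, year ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for player_id, year, hr in top_home_run_seasons(conn, limit=2):
        print(player_id, year, hr)
```

Swap in a real historical database and real column names and the query barely changes, which is why this chapter is approachable.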
Chapter 3, Stats from the Current Season
Noooow it starts to get heavy. Hack 25 is “Spider Baseball Sites for Data” for instance. Soon it’s into building and keeping current year stats updated.
Chapter 4, Visualize Baseball Statistics
This is cool stuff, and instead of being programming/technical heavy, it’s much more into statistical analysis and visualization.
Chapter 5, Formulas
How to calculate a bunch of stats.
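For flavor, here’s a minimal sketch of three of the usual suspects in Python. These are the standard definitions of batting average, on-base percentage, and slugging, not necessarily the exact formulas or code the book walks through, and the function and parameter names are my own.

```python
def batting_average(h, ab):
    """Hits divided by at-bats; returns 0.0 when there are no at-bats."""
    return h / ab if ab else 0.0


def on_base_percentage(h, bb, hbp, ab, sf):
    """(H + BB + HBP) / (AB + BB + HBP + SF)."""
    denom = ab + bb + hbp + sf
    return (h + bb + hbp) / denom if denom else 0.0


def slugging(h, doubles, triples, hr, ab):
    """Total bases divided by at-bats.

    Singles are derived as hits minus extra-base hits.
    """
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab if ab else 0.0


if __name__ == "__main__":
    # Made-up season line, purely for illustration.
    print(round(batting_average(h=50, ab=200), 3))
    print(round(on_base_percentage(h=50, bb=20, hbp=5, ab=200, sf=5), 3))
    print(round(slugging(h=50, doubles=10, triples=2, hr=8, ab=200), 3))
```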
Chapter 6, Sabermetric Thinking
This is where you’d think things get interesting, and that’s kinda true. Here it’s about how to use the data you’re getting to look for good stuff. I disagree with how he goes about some of it (Hack 64, on clutch hitting, specifically) but it is good to see what kind of things the data can offer you.
Then there’s some fantasy stuff, which I’m sure would be great if you were interested in using your newfound data to try and find some crazy advantage. I skipped it, because that’s not me at all. And really, when Baseball Prospectus has a pretty good budgeting-and-forecasting thing, it seems a little pointless.
So what can this all get you? If you’re interested in historical baseball stats, and know or are willing to learn a little bit about databases, it’s a nice walkthrough, from getting a freely available database of historical stats (The Baseball Archive) to setting it up nicely so you can do cool stuff. From there, well… even I don’t get into the kind of data-scraping that’s in here: I’d rather put up with ESPN’s ads and use their splits, or build it out of Retrosheet box scores the hard way, or whatever. And I’m fairly technical and willing to tinker with this stuff. Some of the more advanced stuff seems geared towards someone with fair technical skills who wants to tinker with both baseball data and with building their own framework, rather than someone looking to get started in baseball analysis.
I will say that there’s a lot of value in having access to even a nice historical database of raw stats: I find myself pawing around it all the time, looking for interesting stuff that ends up a throwaway reference in a piece here.
So this is a book where if you’re looking to get a lot more technical and want to do a lot more research independently, you’re going to dig it. If you’d just like to be able to do baseball-reference-y things, that part’s fairly easy too, and I’ve found it quite rewarding.
However, it’s not about baseball, or really about baseball statistics, or anything. It’s about (as you’d guess from the title) using computers and freely available data to hack stuff together.
Anyway, I hope this helps determine whether it’ll be a good book for you or not. Check it out if that sounds interesting.
You didn’t happen to see this question at Ask.Metafilter, did you?
http://ask.metafilter.com/mefi/34395
I’ve been considering this book ever since someone mentioned it at BTF a while back, so it’s great to read a useful review of it. Sounds like a keeper. If I can ask a specific question, though, let’s say I want to load every day’s starting pitchers for all games into a database along with their IP and RA on the season. Would this book help me do that?
Does it actually get into piecing together game data where Retrosheet doesn’t have it? While I enjoy the logic puzzle of reconstructing games from the 1930’s and 1940’s from box scores and articles, I’m curious if anyone’s actually put together a public database of that stuff, or a good programmatic approach to deducing it.
But yeah, I’ve been meaning to check this book out too. If nothing else, it’d be amusing to put it with my serious O’Reilly programming books on the shelf at work and see if anyone notices.
Minutes before Barnes and Noble called to tell me that my Baseball Prospectus had indeed made it safely to Medford, I was going to purchase the package deal on Amazon: Baseball Hacks + Baseball Prospectus. So now I’ll need to order Hacks. It’s been on my “Get” list, so I may just order it here soon.
Joe’s a good guy, and has contributed some stuff to the Yahoo egroup baseball-databank. (That group is where baseball-reference and the Lahman database get their data. It’s headed, as you may have guessed, by Sean Lahman and Sean Forman. Don’t you love the internet?)
I’m still waiting for my copy, so I can’t offer a constructive review. However, Joe has been kind enough to send me a few scripts in the past, so I’m sure this would be the perfect book for someone who needs some hand-holding.
I guess maybe I didn’t express this well enough, but it seems to me that there’s a really big jump between the “get historical data and use it for fun stuff” and “keep a frequently updated set of stats on the season-to-date by regularly taking data from other sources” that I think is going to cause a lot of people to hit a wall and limit the value they’ll get out of the book.
[…] My new copy of the O’Reilly book Baseball Hacks came in the mail yesterday. I was excited to get it since it’s been recommended by people whose opinion on baseball I respect as a book that will allow for a better understanding of and appreciation for baseball by using statistical analysis. […]