Baseball Analytics:
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
2. Getting the data and setting up your machine
For this tutorial, we will use Lahman's Baseball Database. This database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. You can download the data from this link. We will be using two files from this dataset: Salaries.csv and Teams.csv. To execute the code from this tutorial, you will need Python 2.7 and the following Python libraries: NumPy, SciPy, pandas, Matplotlib and statsmodels. The code cells below also import requests and scikit-learn.
In [19]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Load the CSV files from Sean Lahman's Baseball Database.
In [2]:
import requests
import zipfile
import os.path
import StringIO
import pandas as pd
In [2]:
request = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip')
zipFile = zipfile.ZipFile(StringIO.StringIO(request.content))
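The cell above relies on the Python 2 `StringIO` module, which does not exist in Python 3. As a minimal sketch, assuming Python 3, the same in-memory unzip step can be written with `io.BytesIO`. The helper name `open_zip_from_bytes` and the tiny demo archive are made up for illustration; in the tutorial the bytes would come from `request.content`.

```python
import io
import zipfile


def open_zip_from_bytes(raw_bytes):
    """Wrap raw zip bytes in a file-like object and open them as a ZipFile."""
    return zipfile.ZipFile(io.BytesIO(raw_bytes))


# Build a tiny in-memory archive so the sketch runs without network access.
# The file content here is invented demo data, not real salary figures.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("Salaries.csv", "yearID,teamID,salary\n1998,OAK,100\n")

archive = open_zip_from_bytes(buffer.getvalue())
member_names = archive.namelist()
first_line = archive.open("Salaries.csv").readline().decode("utf-8").strip()
```

The object returned by `archive.open(...)` is file-like, so it can be passed to `pd.read_csv` in the same way as in the cells that follow.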
In [3]:
salariesDF = pd.read_csv(zipFile.open('Salaries.csv'))
salariesDF.head()
Out[3]:
In [4]:
teamsDF = pd.read_csv(zipFile.open('Teams.csv'))
teamsDF.head()
Out[4]:
Summarize the Salaries DataFrame to show the total salaries for each team for each year. Show the head of the new summarized DataFrame.
In [5]:
sumSalariesDF = salariesDF.groupby(['teamID', 'yearID']).sum()
sumSalariesDF.head()
Out[5]:
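To make the summarizing step concrete, here is a small sketch on invented data (the frame `toy_salaries` and its values are made up, not taken from the Lahman files). It selects the `salary` column before summing, which keeps text columns such as `playerID` out of the aggregation, and uses `as_index=False` so the group keys stay as ordinary columns.

```python
import pandas as pd

# Invented demo rows: two OAK players in 1998, one NYA player, one OAK player in 1999.
toy_salaries = pd.DataFrame({
    "yearID": [1998, 1998, 1998, 1999],
    "teamID": ["OAK", "OAK", "NYA", "OAK"],
    "playerID": ["a", "b", "c", "a"],
    "salary": [100, 200, 500, 150],
})

# One row per (team, year) with the summed salary; keys remain regular columns.
toy_totals = (
    toy_salaries.groupby(["teamID", "yearID"], as_index=False)["salary"].sum()
)
```

Because the keys are already columns, no separate `reset_index()` call is needed before merging.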
Merge the new summarized Salaries DataFrame and the Teams DataFrame to create a new DataFrame showing wins and total salaries for each team for each year.
In [6]:
teamsSelectionDF = teamsDF.loc[:,['yearID', 'teamID', 'W']]
teamsSelectionDF.head()
Out[6]:
In [7]:
salariesSelectionDF = salariesDF.loc[:,['yearID', 'teamID', 'salary']]
salariesSelectionDF.head()
Out[7]:
In [9]:
salariesSummed = salariesSelectionDF.groupby(['teamID', 'yearID']).sum()
salariesSummed.head()
Out[9]:
In [10]:
salariesSummed = salariesSummed.reset_index()
salariesSummed.head()
Out[10]:
In [11]:
join = pd.merge(teamsSelectionDF, salariesSummed, on=['yearID', 'teamID'], how='inner')
join.head()
Out[11]:
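The effect of `how='inner'` is easy to miss: a team-season that appears in only one of the two tables is dropped from the result. The sketch below shows this on invented frames (`toy_teams` and `toy_totals` and all their values are made up for illustration).

```python
import pandas as pd

# Invented demo data: the 1984 row has no matching salary total.
toy_teams = pd.DataFrame({
    "yearID": [1984, 1998, 1998],
    "teamID": ["OAK", "OAK", "NYA"],
    "W": [77, 74, 114],
})
toy_totals = pd.DataFrame({
    "teamID": ["NYA", "OAK"],
    "yearID": [1998, 1998],
    "salary": [500, 300],
})

# Inner join on both keys keeps only team-seasons present in both frames.
toy_join = pd.merge(toy_teams, toy_totals, on=["yearID", "teamID"], how="inner")
```

If the unmatched rows matter, `how='left'` would keep them with a missing `salary` value instead.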
Display the relationship between total wins and total salaries for a given year
In [13]:
grouped = join.groupby(['teamID', 'yearID']).sum()
grouped.head()
Out[13]:
Fit a linear regression to the data from each year and obtain the residuals.
In [14]:
grouped = grouped.reset_index()
grouped.head()
Out[14]:
In [15]:
grouped_98 = grouped[grouped.yearID==1998]
wins_98 = grouped_98[ 'W']
salaries_98 = grouped_98[ 'salary']
teams_98 = grouped_98[ 'teamID']
In [21]:
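# Salaries are plotted in millions of dollars (the x-axis label says M$).
# This constant was not defined in the original cell, so 1e6 is assumed.
salaries_scale_factor = 1e6

def plotYear(year):
    # Select the rows for one season
    grouped_year = grouped[grouped.yearID==year]
    salaries_year = grouped_year['salary'].values
    wins_year = grouped_year['W'].values
    teams_year = grouped_year['teamID'].values
    X = np.divide(salaries_year, salaries_scale_factor)
    Y = wins_year
    # Highlight Oakland in red, all other teams in blue
    colors = ['r' if team=='OAK' else 'b' for team in teams_year]
    plt.scatter(X,Y, c=colors, s=150, alpha=0.6, edgecolors='none')
    plt.title('win/salaries in '+str(year)+' (red=OAK)')
    plt.xlabel('tot salaries (M$)')
    plt.ylabel('wins')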
In [22]:
for year in range(1997,2007):
    plt.figure()  # start a new figure for each season
    plotYear(year)
plt.show()
Fit a linear regression to the data from each year and obtain the residuals
In [23]:
grouped_OAK = grouped[grouped.teamID=='OAK']
years = grouped_OAK.loc[:,[ 'yearID']].values
wins = grouped_OAK.loc[:,[ 'W']].values
salaries = grouped_OAK.loc[:,[ 'salary']].values
In [24]:
from sklearn import linear_model
linReg = linear_model.LinearRegression()
linReg.fit(salaries, wins)
Out[24]:
In [25]:
plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(years, wins)
plt.plot(years, wins)
plt.plot(years, linReg.predict(salaries))
plt.xticks(years, [year[0] for year in years], rotation=45)
plt.title('Oakland wins + linear regression')
plt.ylabel('wins')
plt.grid(True)
plt.show()
In [26]:
plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')
predictions = linReg.predict(salaries)
errors = wins - predictions
plt.scatter(years, errors)
plt.plot(years, errors)
plt.plot(years, [0 for x in range(len(years))])
plt.xticks(years, [year[0] for year in years], rotation=45)
plt.title('Oakland linear regression residuals')
plt.ylabel('wins')
plt.grid(True)
plt.show()
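The cells above fit a single regression of Oakland's wins on its payroll across seasons. A per-season fit, where each year gets its own line through all teams and each team's residual is its wins above or below that line, could look like the sketch below. This is one possible reading of the task, not the notebook's own method: it swaps in `np.polyfit` for scikit-learn, and the frame `toy`, its values, and the helper `add_residuals` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented demo data: four teams in each of two seasons.
toy = pd.DataFrame({
    "yearID": [2000] * 4 + [2001] * 4,
    "teamID": ["A", "B", "C", "D"] * 2,
    "salary": [20e6, 40e6, 60e6, 80e6, 30e6, 50e6, 70e6, 90e6],
    "W": [70, 85, 80, 95, 90, 75, 85, 100],
})


def add_residuals(year_df):
    """Fit wins ~ salary (in M$) for one season and attach the residuals."""
    x = year_df["salary"].values / 1e6
    y = year_df["W"].values
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    out = year_df.copy()
    out["residual"] = y - (slope * x + intercept)
    return out


# Fit each season separately, then stack the results back together.
toy_resid = pd.concat([add_residuals(g) for _, g in toy.groupby("yearID")])
```

A positive residual means a team won more games than its payroll would predict for that season, which is the kind of signal the Moneyball story is about.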