Baseball Analytics:

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

2. Getting the data and setting up your machine

For this tutorial, we will use the Lahman’s Baseball Database. This Database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. You can download the data from this this link. We will be using two files from this dataset:`Salaries.csv` and `Teams.csv`. To execute the code from this tutorial, you will need Python 2.7 and the following Python Libraries: Numpy, Scipy, Pandas and Matplotlib and statsmodels.

In [19]:

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load in these CSV files from the Sean Lahman's Baseball Database. .

In [2]:

import requests
import zipfile
import os.path
import StringIO
import pandas as pd

In [2]:

request = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip')
zipFile = zipfile.ZipFile(StringIO.StringIO(request.content))

In [3]:

salariesDF = pd.read_csv(zipFile.open('Salaries.csv'))
salariesDF.head()

Out[3]:

	yearID	teamID	lgID	playerID	salary
0	1985	BAL	AL	murraed02	1472819
1	1985	BAL	AL	lynnfr01	1090000
2	1985	BAL	AL	ripkeca01	800000
3	1985	BAL	AL	lacyle01	725000
4	1985	BAL	AL	flanami01	641667

In [4]:

teamsDF = pd.read_csv(zipFile.open('Teams.csv')) 
teamsDF.head()

Out[4]:

	yearID	lgID	teamID	franchID	divID	Rank	G	Ghome	W	L	...	DP	FP	name	park	attendance	BPF	PPF	teamIDBR	teamIDlahman45	teamIDretro
0	1871	NaN	PH1	PNA	NaN	1	28	NaN	21	7	...	NaN	0.84	Philadelphia Athletics	Jefferson Street Grounds	NaN	102	98	ATH	PH1	PH1
1	1871	NaN	CH1	CNA	NaN	2	28	NaN	19	9	...	NaN	0.82	Chicago White Stockings	Union Base-Ball Grounds	NaN	104	102	CHI	CH1	CH1
2	1871	NaN	BS1	BNA	NaN	3	31	NaN	20	10	...	NaN	0.83	Boston Red Stockings	South End Grounds I	NaN	103	98	BOS	BS1	BS1
3	1871	NaN	WS3	OLY	NaN	4	32	NaN	15	15	...	NaN	0.85	Washington Olympics	Olympics Grounds	NaN	94	98	OLY	WS3	WS3
4	1871	NaN	NY2	NNA	NaN	5	33	NaN	16	17	...	NaN	0.83	New York Mutuals	Union Grounds (Brooklyn)	NaN	90	88	NYU	NY2	NY2

5 rows × 48 columns

Summarize the Salaries DataFrame to show the total salaries for each team for each year. Show the head of the new summarized DataFrame.

In [5]:

sumSalariesDF = salariesDF.groupby(['teamID', 'yearID']).sum()
sumSalariesDF.head()

Out[5]:

		salary
teamID	yearID
ANA	1997	31135472
	1998	41281000
	1999	55388166
	2000	51464167
	2001	47535167

Merge the new summarized Salaries DataFrame and Teams DataFrame together to create a new DataFrame showing wins and total salaries for each team for each year year.

In [6]:

teamsSelectionDF = teamsDF.loc[:,['yearID', 'teamID', 'W']]
teamsSelectionDF.head()

Out[6]:

	yearID	teamID	W
0	1871	PH1	21
1	1871	CH1	19
2	1871	BS1	20
3	1871	WS3	15
4	1871	NY2	16

In [7]:

salariesSelectionDF = salariesDF.loc[:,['yearID', 'teamID', 'salary']]
salariesSelectionDF.head()

Out[7]:

	yearID	teamID	salary
0	1985	BAL	1472819
1	1985	BAL	1090000
2	1985	BAL	800000
3	1985	BAL	725000
4	1985	BAL	641667

In [9]:

salariesSummed = salariesSelectionDF.groupby(['teamID', 'yearID']).sum()
salariesSummed.head()

Out[9]:

		salary
teamID	yearID
ANA	1997	31135472
	1998	41281000
	1999	55388166
	2000	51464167
	2001	47535167

In [10]:

salariesSummed = salariesSummed.reset_index()
salariesSummed.head()

Out[10]:

	teamID	yearID	salary
0	ANA	1997	31135472
1	ANA	1998	41281000
2	ANA	1999	55388166
3	ANA	2000	51464167
4	ANA	2001	47535167

In [11]:

join = pd.merge(teamsSelectionDF, salariesSummed, on=['yearID', 'teamID'], how='inner')
join.head()

Out[11]:

	yearID	teamID	W	salary
0	1985	TOR	99	8812550
1	1985	NYA	97	14238204
2	1985	DET	84	10348143
3	1985	BAL	83	11560712
4	1985	BOS	81	10897560

Display the relationship between total wins and total salaries for a given year

In [13]:

grouped = join.groupby(['teamID', 'yearID']).sum()
grouped.head()

Out[13]:

		W	salary
teamID	yearID
ANA	1997	84	31135472
	1998	85	41281000
	1999	70	55388166
	2000	82	51464167
	2001	75	47535167

Fit a linear regression to the data from each year and obtain the residuals.

In [14]:

grouped = grouped.reset_index()
grouped.head()

Out[14]:

	teamID	yearID	W	salary
0	ANA	1997	84	31135472
1	ANA	1998	85	41281000
2	ANA	1999	70	55388166
3	ANA	2000	82	51464167
4	ANA	2001	75	47535167

In [15]:

grouped_98 = grouped[grouped.yearID==1998]
wins_98 = grouped_98[ 'W']
salaries_98 = grouped_98[ 'salary']
teams_98 = grouped_98[ 'teamID']

In [21]:

def plotYear(year):
    grouped_year = grouped[grouped.yearID==year]
    salaries_year = grouped_year['salary'].values
    wins_year = grouped_year['W'].values
    teams_year = grouped_year['teamID'].values  
      
    X = np.divide(salaries_year, salaries_scale_factor)
    Y = wins_year
    colors = ['r' if team=='OAK' else 'b' for team in teams_year]
   
    plt.scatter(X,Y, c=colors, s=150, alpha=0.6, edgecolors='none')        
    plt.title('win/salaries in '+str(year)+' (red=OAK)')    
    plt.xlabel('tot salaries (M$)')
    plt.ylabel('wins')

In [22]:

for year in range(1997,2007):
    plotYear(year)
    plt.figure()
plt.show()

<matplotlib.figure.Figure at 0x9e86b70>

Fit a linear regression to the data from each year and obtain the residuals

In [23]:

grouped_OAK = grouped[grouped.teamID=='OAK']
years = grouped_OAK.loc[:,[ 'yearID']].values
wins = grouped_OAK.loc[:,[ 'W']].values
salaries = grouped_OAK.loc[:,[ 'salary']].values

In [24]:

from sklearn import linear_model

linReg = linear_model.LinearRegression()
linReg.fit(salaries, wins)

Out[24]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [25]:

plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')

plt.scatter(years, wins)
plt.plot(years, wins)
plt.plot(years, linReg.predict(salaries))

plt.xticks(years, [year[0] for year in years], rotation=45)
plt.title('Oklahoma Wins + linear regression')
plt.ylabel('wins')
plt.grid(True)
plt.show()

In [26]:

plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')

predictions = linReg.predict(salaries)
errors = wins - predictions

plt.scatter(years, errors)
plt.plot(years, errors)
plt.plot(years, [0 for x in range(len(years))])

plt.xticks(years, [year[0] for year in years], rotation=45)
plt.title('Oklahoma linear regression residuals')
plt.ylabel('wins')
plt.grid(True)
plt.show()

Analytics Classroom

Wednesday, 3 August 2016

Baseball Analytics in Python

Baseball Analytics:

2. Getting the data and setting up your machine

No comments:

Post a Comment

Total Pageviews

Pages