Wednesday, 3 August 2016

Baseball Analytics in Python

Baseball Analytics:


The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.


2. Getting the data and setting up your machine

For this tutorial, we will use the Lahman’s Baseball Database. This Database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. You can download the data from this this link. We will be using two files from this dataset:Salaries.csv and Teams.csv. To execute the code from this tutorial, you will need Python 2.7 and the following Python Libraries: Numpy, Scipy, Pandas and Matplotlib and statsmodels.

In [19]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
import requests
import zipfile
import os.path
import StringIO
import pandas as pd
In [2]:
request = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip')
zipFile = zipfile.ZipFile(StringIO.StringIO(request.content))
In [3]:
salariesDF = pd.read_csv(zipFile.open('Salaries.csv'))
salariesDF.head()
Out[3]:
yearID teamID lgID playerID salary
0 1985 BAL AL murraed02 1472819
1 1985 BAL AL lynnfr01 1090000
2 1985 BAL AL ripkeca01 800000
3 1985 BAL AL lacyle01 725000
4 1985 BAL AL flanami01 641667
In [4]:
teamsDF = pd.read_csv(zipFile.open('Teams.csv')) 
teamsDF.head()
Out[4]:
yearID lgID teamID franchID divID Rank G Ghome W L ... DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
0 1871 NaN PH1 PNA NaN 1 28 NaN 21 7 ... NaN 0.84 Philadelphia Athletics Jefferson Street Grounds NaN 102 98 ATH PH1 PH1
1 1871 NaN CH1 CNA NaN 2 28 NaN 19 9 ... NaN 0.82 Chicago White Stockings Union Base-Ball Grounds NaN 104 102 CHI CH1 CH1
2 1871 NaN BS1 BNA NaN 3 31 NaN 20 10 ... NaN 0.83 Boston Red Stockings South End Grounds I NaN 103 98 BOS BS1 BS1
3 1871 NaN WS3 OLY NaN 4 32 NaN 15 15 ... NaN 0.85 Washington Olympics Olympics Grounds NaN 94 98 OLY WS3 WS3
4 1871 NaN NY2 NNA NaN 5 33 NaN 16 17 ... NaN 0.83 New York Mutuals Union Grounds (Brooklyn) NaN 90 88 NYU NY2 NY2

5 rows × 48 columns

Summarize the Salaries DataFrame to show the total salaries for each team for each year. Show the head of the new summarized DataFrame.

In [5]:
sumSalariesDF = salariesDF.groupby(['teamID', 'yearID']).sum()
sumSalariesDF.head()
Out[5]:
salary
teamID yearID
ANA 1997 31135472
1998 41281000
1999 55388166
2000 51464167
2001 47535167

Merge the new summarized Salaries DataFrame and Teams DataFrame together to create a new DataFrame showing wins and total salaries for each team for each year year.

In [6]:
teamsSelectionDF = teamsDF.loc[:,['yearID', 'teamID', 'W']]
teamsSelectionDF.head()
Out[6]:
yearID teamID W
0 1871 PH1 21
1 1871 CH1 19
2 1871 BS1 20
3 1871 WS3 15
4 1871 NY2 16
In [7]:
salariesSelectionDF = salariesDF.loc[:,['yearID', 'teamID', 'salary']]
salariesSelectionDF.head()
Out[7]:
yearID teamID salary
0 1985 BAL 1472819
1 1985 BAL 1090000
2 1985 BAL 800000
3 1985 BAL 725000
4 1985 BAL 641667
In [9]:
salariesSummed = salariesSelectionDF.groupby(['teamID', 'yearID']).sum()
salariesSummed.head()
Out[9]:
salary
teamID yearID
ANA 1997 31135472
1998 41281000
1999 55388166
2000 51464167
2001 47535167
In [10]:
salariesSummed = salariesSummed.reset_index()
salariesSummed.head()
Out[10]:
teamID yearID salary
0 ANA 1997 31135472
1 ANA 1998 41281000
2 ANA 1999 55388166
3 ANA 2000 51464167
4 ANA 2001 47535167
In [11]:
join = pd.merge(teamsSelectionDF, salariesSummed, on=['yearID', 'teamID'], how='inner')
join.head()
Out[11]:
yearID teamID W salary
0 1985 TOR 99 8812550
1 1985 NYA 97 14238204
2 1985 DET 84 10348143
3 1985 BAL 83 11560712
4 1985 BOS 81 10897560

Display the relationship between total wins and total salaries for a given year

In [13]:
grouped = join.groupby(['teamID', 'yearID']).sum()
grouped.head()
Out[13]:
W salary
teamID yearID
ANA 1997 84 31135472
1998 85 41281000
1999 70 55388166
2000 82 51464167
2001 75 47535167

Fit a linear regression to the data from each year and obtain the residuals.

In [14]:
grouped = grouped.reset_index()
grouped.head()
Out[14]:
teamID yearID W salary
0 ANA 1997 84 31135472
1 ANA 1998 85 41281000
2 ANA 1999 70 55388166
3 ANA 2000 82 51464167
4 ANA 2001 75 47535167
In [15]:
grouped_98 = grouped[grouped.yearID==1998]
wins_98 = grouped_98[ 'W']
salaries_98 = grouped_98[ 'salary']
teams_98 = grouped_98[ 'teamID']
In [21]:
def plotYear(year):
    grouped_year = grouped[grouped.yearID==year]
    salaries_year = grouped_year['salary'].values
    wins_year = grouped_year['W'].values
    teams_year = grouped_year['teamID'].values  
      
    X = np.divide(salaries_year, salaries_scale_factor)
    Y = wins_year
    colors = ['r' if team=='OAK' else 'b' for team in teams_year]
   
    plt.scatter(X,Y, c=colors, s=150, alpha=0.6, edgecolors='none')        
    plt.title('win/salaries in '+str(year)+' (red=OAK)')    
    plt.xlabel('tot salaries (M$)')
    plt.ylabel('wins')
In [22]:
for year in range(1997,2007):
    plotYear(year)
    plt.figure()
plt.show()
<matplotlib.figure.Figure at 0x9e86b70>

Fit a linear regression to the data from each year and obtain the residuals

In [23]:
grouped_OAK = grouped[grouped.teamID=='OAK']
years = grouped_OAK.loc[:,[ 'yearID']].values
wins = grouped_OAK.loc[:,[ 'W']].values
salaries = grouped_OAK.loc[:,[ 'salary']].values
In [24]:
from sklearn import linear_model

linReg = linear_model.LinearRegression()
linReg.fit(salaries, wins)
Out[24]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [25]:
plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')

plt.scatter(years, wins)
plt.plot(years, wins)
plt.plot(years, linReg.predict(salaries))

plt.xticks(years, [year[0] for year in years], rotation=45)
plt.title('Oklahoma Wins + linear regression')
plt.ylabel('wins')
plt.grid(True)
plt.show()
In [26]:
plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')

predictions = linReg.predict(salaries)
errors = wins - predictions

plt.scatter(years, errors)
plt.plot(years, errors)
plt.plot(years, [0 for x in range(len(years))])

plt.xticks(years, [year[0] for year in years], rotation=45)
plt.title('Oklahoma linear regression residuals')
plt.ylabel('wins')
plt.grid(True)
plt.show()

No comments:

Post a Comment