Challenges¶

Working with data frame:¶

Here we consider the iris data to analysis

Read iris data from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data.

In [5]:

  Copied!     
 
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
iris=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
import pandas as pd import warnings warnings.simplefilter(action='ignore', category=FutureWarning) iris=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)

Display the first 6 rows.

In [6]:

  Copied!     
 
iris.head(6)
iris.head(6)

Out[6]:

	0	1	2	3	4
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
5	5.4	3.9	1.7	0.4	Iris-setosa

What is the dimension?

In [7]:

  Copied!     
 
iris.shape
iris.shape

Out[7]:

(150, 5)

Get a summary of data,

In [8]:

  Copied!     
 
iris.describe()
iris.describe()

Out[8]:

	0	1	2	3
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

What are the types of data?

In [9]:

  Copied!     
 
iris.dtypes
iris.dtypes

Out[9]:

0    float64
1    float64
2    float64
3    float64
4     object
dtype: object

Add label to the data: sepallength,sepalwidth,petal length,petal width, class.

In [10]:

  Copied!     
 
iris.columns=['sepallength','sepalwidth','petal length','petalwidth', 'species']
iris.columns=['sepallength','sepalwidth','petal length','petalwidth', 'species']

Change the type of species

In [11]:

  Copied!     
 
iris['species'] = iris['species'].astype('category')
iris['species'] = iris['species'].astype('category')

How many different species are there?

In [12]:

  Copied!     
 
iris.species.unique()
iris.species.unique()

Out[12]:

['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
Categories (3, object): ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

Extract column sepallength

In [13]:

  Copied!     
 
iris['sepallength']
iris.sepallength
iris['sepallength'] iris.sepallength

Out[13]:

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepallength, Length: 150, dtype: float64

Extract the rows 2,5

In [14]:

  Copied!     
 
iris.iloc[[2,5]]
iris.iloc[[2,5]]

Out[14]:

	sepallength	sepalwidth	petal length	petalwidth	species
2	4.7	3.2	1.3	0.2	Iris-setosa
5	5.4	3.9	1.7	0.4	Iris-setosa

Extract the rows 2,5 and columns 1,3

In [15]:

  Copied!     
 
iris.iloc[[2,5],[1,3]]
iris.iloc[[2,5],[1,3]]

Out[15]:

	sepalwidth	petalwidth
2	3.2	0.2
5	3.9	0.4

Using logical indexing select observations with sepallength<5.

In [16]:

  Copied!     
 
iris[iris.sepallength<5]
iris[iris.sepallength<5]

Out[16]:

	sepallength	sepalwidth	petal length	petalwidth	species
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
6	4.6	3.4	1.4	0.3	Iris-setosa
8	4.4	2.9	1.4	0.2	Iris-setosa
9	4.9	3.1	1.5	0.1	Iris-setosa
11	4.8	3.4	1.6	0.2	Iris-setosa
12	4.8	3.0	1.4	0.1	Iris-setosa
13	4.3	3.0	1.1	0.1	Iris-setosa
22	4.6	3.6	1.0	0.2	Iris-setosa
24	4.8	3.4	1.9	0.2	Iris-setosa
29	4.7	3.2	1.6	0.2	Iris-setosa
30	4.8	3.1	1.6	0.2	Iris-setosa
34	4.9	3.1	1.5	0.1	Iris-setosa
37	4.9	3.1	1.5	0.1	Iris-setosa
38	4.4	3.0	1.3	0.2	Iris-setosa
41	4.5	2.3	1.3	0.3	Iris-setosa
42	4.4	3.2	1.3	0.2	Iris-setosa
45	4.8	3.0	1.4	0.3	Iris-setosa
47	4.6	3.2	1.4	0.2	Iris-setosa
57	4.9	2.4	3.3	1.0	Iris-versicolor
106	4.9	2.5	4.5	1.7	Iris-virginica

Sort data based on 'sepallength' and 'sepalwidth'.

In [17]:

  Copied!     
 
iris.sort_values(['sepallength','sepalwidth'])
iris.sort_values(['sepallength','sepalwidth'])

Out[17]:

	sepallength	sepalwidth	petal length	petalwidth	species
13	4.3	3.0	1.1	0.1	Iris-setosa
8	4.4	2.9	1.4	0.2	Iris-setosa
38	4.4	3.0	1.3	0.2	Iris-setosa
42	4.4	3.2	1.3	0.2	Iris-setosa
41	4.5	2.3	1.3	0.3	Iris-setosa
...	...	...	...	...	...
118	7.7	2.6	6.9	2.3	Iris-virginica
122	7.7	2.8	6.7	2.0	Iris-virginica
135	7.7	3.0	6.1	2.3	Iris-virginica
117	7.7	3.8	6.7	2.2	Iris-virginica
131	7.9	3.8	6.4	2.0	Iris-virginica

150 rows × 5 columns

Calculate summary of sepallength and sepalwidth for each class

In [18]:

  Copied!     
 
iris.groupby(['species'])[['sepallength','sepalwidth']].describe()
iris.groupby(['species'])[['sepallength','sepalwidth']].describe()

Out[18]:

	sepallength								sepalwidth
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
species
Iris-setosa	50.0	5.006	0.352490	4.3	4.800	5.0	5.2	5.8	50.0	3.418	0.381024	2.3	3.125	3.4	3.675	4.4
Iris-versicolor	50.0	5.936	0.516171	4.9	5.600	5.9	6.3	7.0	50.0	2.770	0.313798	2.0	2.525	2.8	3.000	3.4
Iris-virginica	50.0	6.588	0.635880	4.9	6.225	6.5	6.9	7.9	50.0	2.974	0.322497	2.2	2.800	3.0	3.175	3.8

Calculate median of sepalwidth for each class

In [19]:

  Copied!     
 
iris.groupby("species")['sepallength'].median()
iris.groupby("species")['sepallength'].median() 

Out[19]:

species
Iris-setosa        5.0
Iris-versicolor    5.9
Iris-virginica     6.5
Name: sepallength, dtype: float64

Calculate summary of 'count', 'min', 'max', 'mean', 'std' for each class

In [21]:

  Copied!     
 
values = ['count', 'min', 'max', 'mean', 'std']
iris.groupby(by='species').agg(values)
values = ['count', 'min', 'max', 'mean', 'std'] iris.groupby(by='species').agg(values)

Out[21]:

	sepallength					sepalwidth					petal length					petalwidth
	count	min	max	mean	std	count	min	max	mean	std	count	min	max	mean	std	count	min	max	mean	std
species
Iris-setosa	50	4.3	5.8	5.006	0.352490	50	2.3	4.4	3.418	0.381024	50	1.0	1.9	1.464	0.173511	50	0.1	0.6	0.244	0.107210
Iris-versicolor	50	4.9	7.0	5.936	0.516171	50	2.0	3.4	2.770	0.313798	50	3.0	5.1	4.260	0.469911	50	1.0	1.8	1.326	0.197753
Iris-virginica	50	4.9	7.9	6.588	0.635880	50	2.2	3.8	2.974	0.322497	50	4.5	6.9	5.552	0.551895	50	1.4	2.5	2.026	0.274650

Save the imported data to CV.

In [22]:

  Copied!     
 
iris.to_csv('iris.csv', index=False)
iris.to_csv('iris.csv', index=False)

working with html¶

We work with NBA games (https://www.basketball-reference.com/leagues/NBA_2020_games.html) here

Load librariess

In [57]:

  Copied!     
 
import numpy as np
import pandas as pd
pd.options.display.max_rows=10
import numpy as np import pandas as pd pd.options.display.max_rows=10

Import data, and show the data

In [63]:

  Copied!     
 
source = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2020_games.html")
NBA_game = source[0]
NBA_game.head()
source = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2020_games.html") NBA_game = source[0] NBA_game.head()

Out[63]:

	Date	Start (ET)	Visitor/Neutral	PTS	Home/Neutral	PTS.1	Unnamed: 6	Unnamed: 7	Attend.	LOG	Arena	Notes
0	Tue, Oct 22, 2019	8:00p	New Orleans Pelicans	122	Toronto Raptors	130	Box Score	OT	20787	2:50	Scotiabank Arena	NaN
1	Tue, Oct 22, 2019	10:30p	Los Angeles Lakers	102	Los Angeles Clippers	112	Box Score	NaN	19068	2:28	STAPLES Center	NaN
2	Wed, Oct 23, 2019	7:00p	Chicago Bulls	125	Charlotte Hornets	126	Box Score	NaN	15424	2:11	Spectrum Center	NaN
3	Wed, Oct 23, 2019	7:00p	Detroit Pistons	119	Indiana Pacers	110	Box Score	NaN	17923	2:22	Bankers Life Fieldhouse	NaN
4	Wed, Oct 23, 2019	7:00p	Cleveland Cavaliers	85	Orlando Magic	94	Box Score	NaN	18846	2:02	Amway Center	NaN

Print the column and type the data

In [64]:

  Copied!     
 
print(NBA_game.columns)
type(NBA_game)
print(NBA_game.columns) type(NBA_game)

Index(['Date', 'Start (ET)', 'Visitor/Neutral', 'PTS', 'Home/Neutral', 'PTS.1',
       'Unnamed: 6', 'Unnamed: 7', 'Attend.', 'LOG', 'Arena', 'Notes'],
      dtype='object')

Out[64]:

pandas.core.frame.DataFrame

Rename the column name of data

In [65]:

  Copied!     
 
column_names = {'Date': 'date', 'Start (ET)': 'start','Visitor/Neutral': 'visitor','PTS': 'visitor_points','Home/Neutral': 'home',
'PTS.1': 'home_points','Unnamed: 6': 'box','Unnamed: 7': 'ot','Attend.': 'attend','Notes': 'notes'}

NBA_game=NBA_game.rename(columns=column_names).dropna(thresh=5)[['date','visitor','visitor_points','home','home_points']].assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y')).set_index('date', append=True).rename_axis(["id", "date"]).sort_index()
NBA_game.head()
column_names = {'Date': 'date', 'Start (ET)': 'start','Visitor/Neutral': 'visitor','PTS': 'visitor_points','Home/Neutral': 'home', 'PTS.1': 'home_points','Unnamed: 6': 'box','Unnamed: 7': 'ot','Attend.': 'attend','Notes': 'notes'} NBA_game=NBA_game.rename(columns=column_names).dropna(thresh=5)[['date','visitor','visitor_points','home','home_points']].assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y')).set_index('date', append=True).rename_axis(["id", "date"]).sort_index() NBA_game.head()

Out[65]:

		visitor	visitor_points	home	home_points
id	date
0	2019-10-22	New Orleans Pelicans	122	Toronto Raptors	130
1	2019-10-22	Los Angeles Lakers	102	Los Angeles Clippers	112
2	2019-10-23	Chicago Bulls	125	Charlotte Hornets	126
3	2019-10-23	Detroit Pistons	119	Indiana Pacers	110
4	2019-10-23	Cleveland Cavaliers	85	Orlando Magic	94

Change the data to wide

In [67]:

  Copied!     
 
NBA_game_w= (NBA_game.reset_index()
    .melt(id_vars=['id', 'date'], value_vars=['visitor', 'home'],
          value_name='team', var_name='home_or_visitor')
)

NBA_game_w.head()
NBA_game_w= (NBA_game.reset_index() .melt(id_vars=['id', 'date'], value_vars=['visitor', 'home'], value_name='team', var_name='home_or_visitor') ) NBA_game_w.head()

Out[67]:

	id	date	home_or_visitor	team
0	0	2019-10-22	visitor	New Orleans Pelicans
1	1	2019-10-22	visitor	Los Angeles Lakers
2	2	2019-10-23	visitor	Chicago Bulls
3	3	2019-10-23	visitor	Detroit Pistons
4	4	2019-10-23	visitor	Cleveland Cavaliers

Caluclate the distance flight¶

This part credit goes for https://github.com/TomAugspurger/pandas-best-practices We us airport data here, https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat,

Load the libraries, and import the data

In [74]:

  Copied!     
 
import numpy as np
import pandas as pd
name_var=["id", "name", "city", "country", "faa", "icao", "lat", "lon", "alt", "tz", "dst", "tzone"]
airports_raw = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat",header=None,names=name_var)
import numpy as np import pandas as pd name_var=["id", "name", "city", "country", "faa", "icao", "lat", "lon", "alt", "tz", "dst", "tzone"] airports_raw = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat",header=None,names=name_var)

Get the head and info of data.

In [75]:

  Copied!     
 
airports_raw.head()
airports_raw.head()

Out[75]:

	id	name	city	country	faa	icao	lat	lon	alt	tz	dst	tzone
0	1	Goroka	Goroka	Papua New Guinea	GKA	AYGA	-6.081689	145.391881	5282	10.0	U	Pacific/Port_Moresby
1	2	Madang	Madang	Papua New Guinea	MAG	AYMD	-5.207083	145.788700	20	10.0	U	Pacific/Port_Moresby
2	3	Mount Hagen	Mount Hagen	Papua New Guinea	HGU	AYMH	-5.826789	144.295861	5388	10.0	U	Pacific/Port_Moresby
3	4	Nadzab	Nadzab	Papua New Guinea	LAE	AYNZ	-6.569828	146.726242	239	10.0	U	Pacific/Port_Moresby
4	5	Port Moresby Jacksons Intl	Port Moresby	Papua New Guinea	POM	AYPY	-9.443383	147.220050	146	10.0	U	Pacific/Port_Moresby

In [76]:

  Copied!     
 
airports_raw.info()
airports_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8107 entries, 0 to 8106
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       8107 non-null   int64  
 1   name     8107 non-null   object 
 2   city     8107 non-null   object 
 3   country  8107 non-null   object 
 4   faa      5880 non-null   object 
 5   icao     8043 non-null   object 
 6   lat      8107 non-null   float64
 7   lon      8107 non-null   float64
 8   alt      8107 non-null   int64  
 9   tz       8107 non-null   float64
 10  dst      8107 non-null   object 
 11  tzone    8107 non-null   object 
dtypes: float64(3), int64(2), object(7)
memory usage: 760.2+ KB

Clean data

Select just country USA and faa is not empty
drop id name icao
Drop duplicate from faa?

In [ ]:

  Copied!     
 
airports_raw.country.unique()
airports_usa=airports_raw.loc[(airports_raw['country']=='United States')&('faa'!='')]
airports_usa=airports_usa.drop(['id','name','icao'],axis=1)
airports_usa=airports_usa.drop_duplicates('faa')
airports_usa=airports_usa.set_index('faa')
airports_usa.head()
airports_raw.country.unique() airports_usa=airports_raw.loc[(airports_raw['country']=='United States')&('faa'!='')] airports_usa=airports_usa.drop(['id','name','icao'],axis=1) airports_usa=airports_usa.drop_duplicates('faa') airports_usa=airports_usa.set_index('faa') airports_usa.head()

Running cells with '/opt/homebrew/bin/python3.10' requires the ipykernel package.

Run the following command to install 'ipykernel' into the Python environment. 

Command: '/opt/homebrew/bin/python3.10 -m pip install ipykernel -U --user --force-reinstall'

Calculate distance

In [78]:

  Copied!     
 
def great_circle_distance(df, to="DSM"):
    # https://www.johndcook.com/blog/python_longitude_latitude/
    df = df.copy()
    lat = np.deg2rad(90 - df['lat'])
    lon = np.deg2rad(90 - df['lon'])
    to_lat, to_lon = df.loc[to, ['lat', 'lon']]
    cos = (np.sin(lat) * np.sin(to_lat) * np.cos(lon - to_lon) +
           np.cos(lat) * np.cos(to_lat))
    arc = np.arccos(cos)
    kilometers = 6373 * cos
    df[f'km_to_{to}'] = kilometers
    return df

great_circle_distance(airports_usa)
def great_circle_distance(df, to="DSM"): # https://www.johndcook.com/blog/python_longitude_latitude/ df = df.copy() lat = np.deg2rad(90 - df['lat']) lon = np.deg2rad(90 - df['lon']) to_lat, to_lon = df.loc[to, ['lat', 'lon']] cos = (np.sin(lat) * np.sin(to_lat) * np.cos(lon - to_lon) + np.cos(lat) * np.cos(to_lat)) arc = np.arccos(cos) kilometers = 6373 * cos df[f'km_to_{to}'] = kilometers return df great_circle_distance(airports_usa)

Out[78]:

	city	country	lat	lon	alt	tz	dst	tzone	km_to_DSM
faa
4I7	Greencastle	United States	39.633556	-86.813806	842	-5.0	U	America/New_York	-611.548347
C91	Dowagiac	United States	41.992934	-86.128012	748	-5.0	U	America/New_York	-874.022782
CDI	Cambridge	United States	39.975028	-81.577583	799	-5.0	U	America/New_York	-827.065264
SUE	Sturgeon Bay	United States	44.843667	-87.421556	725	-6.0	U	America/Chicago	-1122.842947
0P2	Stewartstown	United States	39.794824	-76.647191	1000	-5.0	U	America/New_York	-997.427349
...	...	...	...	...	...	...	...	...	...
UCA	Utica	United States	43.104167	-75.223333	456	-5.0	A	America/New_York	-1370.623054
CVO	Corvallis	United States	44.506700	-123.291500	250	-8.0	A	America/Los_Angeles	-531.679746
CWT	Chatsworth	United States	34.256944	-118.598889	978	-8.0	A	America/Los_Angeles	594.567494
DHB	Deer Harbor	United States	48.618397	-123.005960	0	-8.0	A	America/Los_Angeles	-985.750406
OLT	San Diego	United States	32.755200	-117.199500	0	-8.0	A	America/Los_Angeles	752.239018

1459 rows × 9 columns

Dow Jones Index¶

We are going to use Dow Jones stock Index.

Import the necessary libraries

In [79]:

  Copied!     
 
import numpy as np
import pandas as pd 
import datetime
import numpy as np import pandas as pd import datetime

Import data set

In [130]:

  Copied!     
 
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data')
dow_jones.head(10)
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data') dow_jones.head(10)

Out[130]:

	quarter	stock	date	open	high	low	close	volume	percent_change_price	percent_change_volume_over_last_wk	previous_weeks_volume	next_weeks_open	next_weeks_close	percent_change_next_weeks_price	days_to_next_dividend	percent_return_next_dividend
0	1	AA	1/7/2011	$15.82	$16.72	$15.78	$16.42	239655616	3.792670	NaN	NaN	$16.71	$15.97	-4.428490	26	0.182704
1	1	AA	1/14/2011	$16.71	$16.71	$15.64	$15.97	242963398	-4.428490	1.380223	239655616.0	$16.19	$15.79	-2.470660	19	0.187852
2	1	AA	1/21/2011	$16.19	$16.38	$15.60	$15.79	138428495	-2.470660	-43.024959	242963398.0	$15.87	$16.13	1.638310	12	0.189994
3	1	AA	1/28/2011	$15.87	$16.63	$15.82	$16.13	151379173	1.638310	9.355500	138428495.0	$16.18	$17.14	5.933250	5	0.185989
4	1	AA	2/4/2011	$16.18	$17.39	$16.18	$17.14	154387761	5.933250	1.987452	151379173.0	$17.33	$17.37	0.230814	97	0.175029
5	1	AA	2/11/2011	$17.33	$17.48	$16.97	$17.37	114691279	0.230814	-25.712195	154387761.0	$17.39	$17.28	-0.632547	90	0.172712
6	1	AA	2/18/2011	$17.39	$17.68	$17.28	$17.28	80023895	-0.632547	-30.226696	114691279.0	$16.98	$16.68	-1.766780	83	0.173611
7	1	AA	2/25/2011	$16.98	$17.15	$15.96	$16.68	132981863	-1.766780	66.177694	80023895.0	$16.81	$16.58	-1.368230	76	0.179856
8	1	AA	3/4/2011	$16.81	$16.94	$16.13	$16.58	109493077	-1.368230	-17.663150	132981863.0	$16.58	$16.03	-3.317250	69	0.180941
9	1	AA	3/11/2011	$16.58	$16.75	$15.42	$16.03	114332562	-3.317250	4.419900	109493077.0	$15.95	$16.11	1.003130	62	0.187149

Select 'stock','date','open','high','low' and check the type of them

In [131]:

  Copied!     
 
dow_jones=dow_jones.loc[:,['date','stock','open','high','low']] 
dow_jones.dtypes
 dow_jones=dow_jones.loc[:,['date','stock','open','high','low']] dow_jones.dtypes

Out[131]:

date     object
stock    object
open     object
high     object
low      object
dtype: object

Assign week number as index

In [30]:

  Copied!     
 
def to_week(x):
  return datetime.datetime.strptime(x, '%m/%d/%Y').strftime("%V")
def to_week(x): return datetime.datetime.strptime(x, '%m/%d/%Y').strftime("%V")

Apply the function to_week on the column data

In [132]:

  Copied!     
 
dow_jones['week']=dow_jones['date'].apply(to_week)
dow_jones = dow_jones.set_index(['week','stock'])
dow_jones=dow_jones.drop(['date'], axis=1)
dow_jones.head()
dow_jones['week']=dow_jones['date'].apply(to_week) dow_jones = dow_jones.set_index(['week','stock']) dow_jones=dow_jones.drop(['date'], axis=1) dow_jones.head()

Out[132]:

		open	high	low
week	stock
01	AA	$15.82	$16.72	$15.78
02	AA	$16.71	$16.71	$15.64
03	AA	$16.19	$16.38	$15.60
04	AA	$15.87	$16.63	$15.82
05	AA	$16.18	$17.39	$16.18

Is there any duplicate dates?

In [133]:

  Copied!     
 
dow_jones.index.is_unique
dow_jones.index.is_unique

Out[133]:

True

Drop doller sign from columns and change them to numeric

In [134]:

  Copied!     
 
for i in dow_jones.columns[1:]:
    dow_jones[i]=dow_jones[i].str.strip('$').astype(float)
for i in dow_jones.columns[1:]: dow_jones[i]=dow_jones[i].str.strip('$').astype(float)

Compute the mean for each week:

In [135]:

  Copied!     
 
dow_jones.iloc[:,1:4].groupby(dow_jones.index).mean()
dow_jones.iloc[:,1:4].groupby(dow_jones.index).mean()

Out[135]:

	high	low
(01, AA)	16.72	15.78
(01, AXP)	45.60	43.11
(01, BA)	70.10	66.00
(01, BAC)	14.69	13.80
(01, CAT)	94.81	92.30
...	...	...
(25, TRV)	58.28	56.12
(25, UTX)	86.21	83.69
(25, VZ)	36.17	35.20
(25, WMT)	53.70	52.35
(25, XOM)	81.12	76.78

750 rows × 2 columns

Let reconsider time as index.

In [137]:

  Copied!     
 
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data')
dow_jones=dow_jones.loc[:,['date','stock','open','high','low']] 

def to_datatime(x):
  return datetime.datetime.strptime(x, '%m/%d/%Y')

dow_jones['datatime']=dow_jones['date'].apply(to_datatime)
dow_jones = dow_jones.set_index(['datatime'])
dow_jones=dow_jones.drop(['date'], axis=1)
dow_jones.head()
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data') dow_jones=dow_jones.loc[:,['date','stock','open','high','low']] def to_datatime(x): return datetime.datetime.strptime(x, '%m/%d/%Y') dow_jones['datatime']=dow_jones['date'].apply(to_datatime) dow_jones = dow_jones.set_index(['datatime']) dow_jones=dow_jones.drop(['date'], axis=1) dow_jones.head()

Out[137]:

	stock	open	high	low
datatime
2011-01-07	AA	$15.82	$16.72	$15.78
2011-01-14	AA	$16.71	$16.71	$15.64
2011-01-21	AA	$16.19	$16.38	$15.60
2011-01-28	AA	$15.87	$16.63	$15.82
2011-02-04	AA	$16.18	$17.39	$16.18

In [138]:

  Copied!     
 
for i in dow_jones.columns[1:]:
    dow_jones[i]=dow_jones[i].str.strip('$').astype(float)
dow_jones.head()
for i in dow_jones.columns[1:]: dow_jones[i]=dow_jones[i].str.strip('$').astype(float) dow_jones.head()

Out[138]:

	stock	open	high	low
datatime
2011-01-07	AA	15.82	16.72	15.78
2011-01-14	AA	16.71	16.71	15.64
2011-01-21	AA	16.19	16.38	15.60
2011-01-28	AA	15.87	16.63	15.82
2011-02-04	AA	16.18	17.39	16.18

Get the last business day of each month

In [40]:

  Copied!     
 
dow_jones.iloc[:,1:4].resample('BME').mean()
dow_jones.iloc[:,1:4].resample('BME').mean()

Out[40]:

	open	high	low
datatime
2011-01-31	51.671083	52.839417	50.934167
2011-02-28	53.411250	54.465583	52.577917
2011-03-31	53.237083	54.142083	51.929833
2011-04-29	54.377600	55.574533	53.590400
2011-05-31	55.507417	56.335333	54.373500
2011-06-30	53.525167	54.436833	52.197583

Compute the mean of each month for each index.

In [139]:

  Copied!     
 
dow_jones.groupby(dow_jones.stock).mean().head()
dow_jones.groupby(dow_jones.stock).mean().head() 

Out[139]:

	open	high	low
stock
AA	16.5640	17.0152	16.0636
AXP	46.4880	47.5684	45.5976
BA	73.2972	74.7712	71.6432
BAC	13.1756	13.4548	12.7776
CAT	103.0672	105.5296	99.9932

In [148]:

  Copied!     
 
# dow_jones.groupby([dow_jones.index,dow_jones.stock]).mean()
dow_jones.groupby([dow_jones.index,dow_jones.stock]).resample('BME').mean()
# dow_jones.groupby([dow_jones.index,dow_jones.stock]).mean() dow_jones.groupby([dow_jones.index,dow_jones.stock]).resample('BME').mean()

Out[148]:

			open	high	low
datatime	stock	datatime
2011-01-07	AA	2011-01-31	15.82	16.72	15.78
	AXP	2011-01-31	43.30	45.60	43.11
	BA	2011-01-31	66.15	70.10	66.00
	BAC	2011-01-31	13.85	14.69	13.80
	CAT	2011-01-31	94.38	94.81	92.30
...	...	...	...	...	...
2011-06-24	TRV	2011-06-30	57.57	58.28	56.12
	UTX	2011-06-30	84.36	86.21	83.69
	VZ	2011-06-30	35.35	36.17	35.20
	WMT	2011-06-30	52.70	53.70	52.35
	XOM	2011-06-30	78.65	81.12	76.78

750 rows × 3 columns

In [126]:

  Copied!     
 
dow_jones_df = pd.DataFrame()
for i in list(set(dow_jones['stock'])):
    temp=dow_jones.loc[dow_jones['stock'].isin([i])].iloc[:,1:].resample('BME').mean()
    temp['stock']=i
    dow_jones_df=pd.concat([temp, dow_jones_df])
dow_jones_df = pd.DataFrame() for i in list(set(dow_jones['stock'])): temp=dow_jones.loc[dow_jones['stock'].isin([i])].iloc[:,1:].resample('BME').mean() temp['stock']=i dow_jones_df=pd.concat([temp, dow_jones_df])

In [127]:

  Copied!     
 
dow_jones_df
dow_jones_df

Out[127]:

	open	high	low	stock
datatime
2011-01-31	19.0025	19.5625	18.6225	GE
2011-02-28	20.8225	21.3150	20.5375	GE
2011-03-31	20.2475	20.4950	19.4875	GE
2011-04-29	20.1380	20.6020	19.7200	GE
2011-05-31	19.9600	20.1750	19.4875	GE
...	...	...	...	...
2011-02-28	16.9700	17.4250	16.5975	AA
2011-03-31	16.4300	16.8150	15.8100	AA
2011-04-29	17.1820	17.7340	16.6720	AA
2011-05-31	16.8475	17.3375	16.4175	AA
2011-06-30	15.6525	15.9900	15.0225	AA

180 rows × 4 columns

In [141]:

  Copied!     
 
dow_jones.loc[dow_jones['stock'].isin(['AA'])].head()
dow_jones.loc[dow_jones['stock'].isin(['AA'])].head()

Out[141]:

	stock	open	high	low
datatime
2011-01-07	AA	15.82	16.72	15.78
2011-01-14	AA	16.71	16.71	15.64
2011-01-21	AA	16.19	16.38	15.60
2011-01-28	AA	15.87	16.63	15.82
2011-02-04	AA	16.18	17.39	16.18

In [147]:

  Copied!     
 
dow_jones.loc[dow_jones['stock'].isin(['AA'])].iloc[:,1:].resample('BME').mean()
dow_jones.loc[dow_jones['stock'].isin(['AA'])].iloc[:,1:].resample('BME').mean()

Out[147]:

	open	high	low
datatime
2011-01-31	16.1475	16.6100	15.7100
2011-02-28	16.9700	17.4250	16.5975
2011-03-31	16.4300	16.8150	15.8100
2011-04-29	17.1820	17.7340	16.6720
2011-05-31	16.8475	17.3375	16.4175
2011-06-30	15.6525	15.9900	15.0225

Lowerizer the stock column

In [179]:

  Copied!     
 
lowerizer = lambda x: x.lower()
dow_jones['stock'].apply(lowerizer)
lowerizer = lambda x: x.lower() dow_jones['stock'].apply(lowerizer)

Out[179]:

datatime
2011-01-07     aa
2011-01-14     aa
2011-01-21     aa
2011-01-28     aa
2011-02-04     aa
             ... 
2011-05-27    xom
2011-06-03    xom
2011-06-10    xom
2011-06-17    xom
2011-06-24    xom
Name: stock, Length: 750, dtype: object