Challenges¶
Working with data frame:¶
Here we consider the iris data to analysis
- Read iris data from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data.
In [5]:
Copied!
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
iris=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
import pandas as pd import warnings warnings.simplefilter(action='ignore', category=FutureWarning) iris=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
- Display the first 6 rows.
In [6]:
Copied!
iris.head(6)
iris.head(6)
Out[6]:
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
- What is the dimension?
In [7]:
Copied!
iris.shape
iris.shape
Out[7]:
(150, 5)
- Get a summary of data,
In [8]:
Copied!
iris.describe()
iris.describe()
Out[8]:
| 0 | 1 | 2 | 3 | |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
- What are the types of data?
In [9]:
Copied!
iris.dtypes
iris.dtypes
Out[9]:
0 float64 1 float64 2 float64 3 float64 4 object dtype: object
- Add label to the data: sepallength,sepalwidth,petal length,petal width, class.
In [10]:
Copied!
iris.columns=['sepallength','sepalwidth','petal length','petalwidth', 'species']
iris.columns=['sepallength','sepalwidth','petal length','petalwidth', 'species']
- Change the type of
species
In [11]:
Copied!
iris['species'] = iris['species'].astype('category')
iris['species'] = iris['species'].astype('category')
- How many different species are there?
In [12]:
Copied!
iris.species.unique()
iris.species.unique()
Out[12]:
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] Categories (3, object): ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
- Extract column sepallength
In [13]:
Copied!
iris['sepallength']
iris.sepallength
iris['sepallength'] iris.sepallength
Out[13]:
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
...
145 6.7
146 6.3
147 6.5
148 6.2
149 5.9
Name: sepallength, Length: 150, dtype: float64 - Extract the rows 2,5
In [14]:
Copied!
iris.iloc[[2,5]]
iris.iloc[[2,5]]
Out[14]:
| sepallength | sepalwidth | petal length | petalwidth | species | |
|---|---|---|---|---|---|
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
- Extract the rows 2,5 and columns 1,3
In [15]:
Copied!
iris.iloc[[2,5],[1,3]]
iris.iloc[[2,5],[1,3]]
Out[15]:
| sepalwidth | petalwidth | |
|---|---|---|
| 2 | 3.2 | 0.2 |
| 5 | 3.9 | 0.4 |
- Using logical indexing select observations with sepallength<5.
In [16]:
Copied!
iris[iris.sepallength<5]
iris[iris.sepallength<5]
Out[16]:
| sepallength | sepalwidth | petal length | petalwidth | species | |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 11 | 4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa |
| 12 | 4.8 | 3.0 | 1.4 | 0.1 | Iris-setosa |
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
| 22 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | Iris-setosa |
| 29 | 4.7 | 3.2 | 1.6 | 0.2 | Iris-setosa |
| 30 | 4.8 | 3.1 | 1.6 | 0.2 | Iris-setosa |
| 34 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 37 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 45 | 4.8 | 3.0 | 1.4 | 0.3 | Iris-setosa |
| 47 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica |
Sort data based on 'sepallength' and 'sepalwidth'.
In [17]:
Copied!
iris.sort_values(['sepallength','sepalwidth'])
iris.sort_values(['sepallength','sepalwidth'])
Out[17]:
| sepallength | sepalwidth | petal length | petalwidth | species | |
|---|---|---|---|---|---|
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
150 rows × 5 columns
- Calculate summary of sepallength and sepalwidth for each class
In [18]:
Copied!
iris.groupby(['species'])[['sepallength','sepalwidth']].describe()
iris.groupby(['species'])[['sepallength','sepalwidth']].describe()
Out[18]:
| sepallength | sepalwidth | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| species | ||||||||||||||||
| Iris-setosa | 50.0 | 5.006 | 0.352490 | 4.3 | 4.800 | 5.0 | 5.2 | 5.8 | 50.0 | 3.418 | 0.381024 | 2.3 | 3.125 | 3.4 | 3.675 | 4.4 |
| Iris-versicolor | 50.0 | 5.936 | 0.516171 | 4.9 | 5.600 | 5.9 | 6.3 | 7.0 | 50.0 | 2.770 | 0.313798 | 2.0 | 2.525 | 2.8 | 3.000 | 3.4 |
| Iris-virginica | 50.0 | 6.588 | 0.635880 | 4.9 | 6.225 | 6.5 | 6.9 | 7.9 | 50.0 | 2.974 | 0.322497 | 2.2 | 2.800 | 3.0 | 3.175 | 3.8 |
- Calculate median of sepalwidth for each class
In [19]:
Copied!
iris.groupby("species")['sepallength'].median()
iris.groupby("species")['sepallength'].median()
Out[19]:
species Iris-setosa 5.0 Iris-versicolor 5.9 Iris-virginica 6.5 Name: sepallength, dtype: float64
- Calculate summary of 'count', 'min', 'max', 'mean', 'std' for each class
In [21]:
Copied!
values = ['count', 'min', 'max', 'mean', 'std']
iris.groupby(by='species').agg(values)
values = ['count', 'min', 'max', 'mean', 'std'] iris.groupby(by='species').agg(values)
Out[21]:
| sepallength | sepalwidth | petal length | petalwidth | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | min | max | mean | std | count | min | max | mean | std | count | min | max | mean | std | count | min | max | mean | std | |
| species | ||||||||||||||||||||
| Iris-setosa | 50 | 4.3 | 5.8 | 5.006 | 0.352490 | 50 | 2.3 | 4.4 | 3.418 | 0.381024 | 50 | 1.0 | 1.9 | 1.464 | 0.173511 | 50 | 0.1 | 0.6 | 0.244 | 0.107210 |
| Iris-versicolor | 50 | 4.9 | 7.0 | 5.936 | 0.516171 | 50 | 2.0 | 3.4 | 2.770 | 0.313798 | 50 | 3.0 | 5.1 | 4.260 | 0.469911 | 50 | 1.0 | 1.8 | 1.326 | 0.197753 |
| Iris-virginica | 50 | 4.9 | 7.9 | 6.588 | 0.635880 | 50 | 2.2 | 3.8 | 2.974 | 0.322497 | 50 | 4.5 | 6.9 | 5.552 | 0.551895 | 50 | 1.4 | 2.5 | 2.026 | 0.274650 |
- Save the imported data to CV.
In [22]:
Copied!
iris.to_csv('iris.csv', index=False)
iris.to_csv('iris.csv', index=False)
working with html¶
We work with NBA games (https://www.basketball-reference.com/leagues/NBA_2020_games.html) here
- Load librariess
In [57]:
Copied!
import numpy as np
import pandas as pd
pd.options.display.max_rows=10
import numpy as np import pandas as pd pd.options.display.max_rows=10
- Import data, and show the data
In [63]:
Copied!
source = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2020_games.html")
NBA_game = source[0]
NBA_game.head()
source = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2020_games.html") NBA_game = source[0] NBA_game.head()
Out[63]:
| Date | Start (ET) | Visitor/Neutral | PTS | Home/Neutral | PTS.1 | Unnamed: 6 | Unnamed: 7 | Attend. | LOG | Arena | Notes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue, Oct 22, 2019 | 8:00p | New Orleans Pelicans | 122 | Toronto Raptors | 130 | Box Score | OT | 20787 | 2:50 | Scotiabank Arena | NaN |
| 1 | Tue, Oct 22, 2019 | 10:30p | Los Angeles Lakers | 102 | Los Angeles Clippers | 112 | Box Score | NaN | 19068 | 2:28 | STAPLES Center | NaN |
| 2 | Wed, Oct 23, 2019 | 7:00p | Chicago Bulls | 125 | Charlotte Hornets | 126 | Box Score | NaN | 15424 | 2:11 | Spectrum Center | NaN |
| 3 | Wed, Oct 23, 2019 | 7:00p | Detroit Pistons | 119 | Indiana Pacers | 110 | Box Score | NaN | 17923 | 2:22 | Bankers Life Fieldhouse | NaN |
| 4 | Wed, Oct 23, 2019 | 7:00p | Cleveland Cavaliers | 85 | Orlando Magic | 94 | Box Score | NaN | 18846 | 2:02 | Amway Center | NaN |
Print the column and type the data
In [64]:
Copied!
print(NBA_game.columns)
type(NBA_game)
print(NBA_game.columns) type(NBA_game)
Index(['Date', 'Start (ET)', 'Visitor/Neutral', 'PTS', 'Home/Neutral', 'PTS.1',
'Unnamed: 6', 'Unnamed: 7', 'Attend.', 'LOG', 'Arena', 'Notes'],
dtype='object')
Out[64]:
pandas.core.frame.DataFrame
Rename the column name of data
In [65]:
Copied!
column_names = {'Date': 'date', 'Start (ET)': 'start','Visitor/Neutral': 'visitor','PTS': 'visitor_points','Home/Neutral': 'home',
'PTS.1': 'home_points','Unnamed: 6': 'box','Unnamed: 7': 'ot','Attend.': 'attend','Notes': 'notes'}
NBA_game=NBA_game.rename(columns=column_names).dropna(thresh=5)[['date','visitor','visitor_points','home','home_points']].assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y')).set_index('date', append=True).rename_axis(["id", "date"]).sort_index()
NBA_game.head()
column_names = {'Date': 'date', 'Start (ET)': 'start','Visitor/Neutral': 'visitor','PTS': 'visitor_points','Home/Neutral': 'home', 'PTS.1': 'home_points','Unnamed: 6': 'box','Unnamed: 7': 'ot','Attend.': 'attend','Notes': 'notes'} NBA_game=NBA_game.rename(columns=column_names).dropna(thresh=5)[['date','visitor','visitor_points','home','home_points']].assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y')).set_index('date', append=True).rename_axis(["id", "date"]).sort_index() NBA_game.head()
Out[65]:
| visitor | visitor_points | home | home_points | ||
|---|---|---|---|---|---|
| id | date | ||||
| 0 | 2019-10-22 | New Orleans Pelicans | 122 | Toronto Raptors | 130 |
| 1 | 2019-10-22 | Los Angeles Lakers | 102 | Los Angeles Clippers | 112 |
| 2 | 2019-10-23 | Chicago Bulls | 125 | Charlotte Hornets | 126 |
| 3 | 2019-10-23 | Detroit Pistons | 119 | Indiana Pacers | 110 |
| 4 | 2019-10-23 | Cleveland Cavaliers | 85 | Orlando Magic | 94 |
Change the data to wide
In [67]:
Copied!
NBA_game_w= (NBA_game.reset_index()
.melt(id_vars=['id', 'date'], value_vars=['visitor', 'home'],
value_name='team', var_name='home_or_visitor')
)
NBA_game_w.head()
NBA_game_w= (NBA_game.reset_index() .melt(id_vars=['id', 'date'], value_vars=['visitor', 'home'], value_name='team', var_name='home_or_visitor') ) NBA_game_w.head()
Out[67]:
| id | date | home_or_visitor | team | |
|---|---|---|---|---|
| 0 | 0 | 2019-10-22 | visitor | New Orleans Pelicans |
| 1 | 1 | 2019-10-22 | visitor | Los Angeles Lakers |
| 2 | 2 | 2019-10-23 | visitor | Chicago Bulls |
| 3 | 3 | 2019-10-23 | visitor | Detroit Pistons |
| 4 | 4 | 2019-10-23 | visitor | Cleveland Cavaliers |
Caluclate the distance flight¶
This part credit goes for https://github.com/TomAugspurger/pandas-best-practices We us airport data here, https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat,
- Load the libraries, and import the data
In [74]:
Copied!
import numpy as np
import pandas as pd
name_var=["id", "name", "city", "country", "faa", "icao", "lat", "lon", "alt", "tz", "dst", "tzone"]
airports_raw = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat",header=None,names=name_var)
import numpy as np import pandas as pd name_var=["id", "name", "city", "country", "faa", "icao", "lat", "lon", "alt", "tz", "dst", "tzone"] airports_raw = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat",header=None,names=name_var)
- Get the head and info of data.
In [75]:
Copied!
airports_raw.head()
airports_raw.head()
Out[75]:
| id | name | city | country | faa | icao | lat | lon | alt | tz | dst | tzone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Goroka | Goroka | Papua New Guinea | GKA | AYGA | -6.081689 | 145.391881 | 5282 | 10.0 | U | Pacific/Port_Moresby |
| 1 | 2 | Madang | Madang | Papua New Guinea | MAG | AYMD | -5.207083 | 145.788700 | 20 | 10.0 | U | Pacific/Port_Moresby |
| 2 | 3 | Mount Hagen | Mount Hagen | Papua New Guinea | HGU | AYMH | -5.826789 | 144.295861 | 5388 | 10.0 | U | Pacific/Port_Moresby |
| 3 | 4 | Nadzab | Nadzab | Papua New Guinea | LAE | AYNZ | -6.569828 | 146.726242 | 239 | 10.0 | U | Pacific/Port_Moresby |
| 4 | 5 | Port Moresby Jacksons Intl | Port Moresby | Papua New Guinea | POM | AYPY | -9.443383 | 147.220050 | 146 | 10.0 | U | Pacific/Port_Moresby |
In [76]:
Copied!
airports_raw.info()
airports_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8107 entries, 0 to 8106 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 8107 non-null int64 1 name 8107 non-null object 2 city 8107 non-null object 3 country 8107 non-null object 4 faa 5880 non-null object 5 icao 8043 non-null object 6 lat 8107 non-null float64 7 lon 8107 non-null float64 8 alt 8107 non-null int64 9 tz 8107 non-null float64 10 dst 8107 non-null object 11 tzone 8107 non-null object dtypes: float64(3), int64(2), object(7) memory usage: 760.2+ KB
- Clean data
- Select just country USA and faa is not empty
- drop id name icao
- Drop duplicate from faa?
In [ ]:
Copied!
airports_raw.country.unique()
airports_usa=airports_raw.loc[(airports_raw['country']=='United States')&('faa'!='')]
airports_usa=airports_usa.drop(['id','name','icao'],axis=1)
airports_usa=airports_usa.drop_duplicates('faa')
airports_usa=airports_usa.set_index('faa')
airports_usa.head()
airports_raw.country.unique() airports_usa=airports_raw.loc[(airports_raw['country']=='United States')&('faa'!='')] airports_usa=airports_usa.drop(['id','name','icao'],axis=1) airports_usa=airports_usa.drop_duplicates('faa') airports_usa=airports_usa.set_index('faa') airports_usa.head()
Running cells with '/opt/homebrew/bin/python3.10' requires the ipykernel package. Run the following command to install 'ipykernel' into the Python environment. Command: '/opt/homebrew/bin/python3.10 -m pip install ipykernel -U --user --force-reinstall'
Calculate distance
In [78]:
Copied!
def great_circle_distance(df, to="DSM"):
# https://www.johndcook.com/blog/python_longitude_latitude/
df = df.copy()
lat = np.deg2rad(90 - df['lat'])
lon = np.deg2rad(90 - df['lon'])
to_lat, to_lon = df.loc[to, ['lat', 'lon']]
cos = (np.sin(lat) * np.sin(to_lat) * np.cos(lon - to_lon) +
np.cos(lat) * np.cos(to_lat))
arc = np.arccos(cos)
kilometers = 6373 * cos
df[f'km_to_{to}'] = kilometers
return df
great_circle_distance(airports_usa)
def great_circle_distance(df, to="DSM"): # https://www.johndcook.com/blog/python_longitude_latitude/ df = df.copy() lat = np.deg2rad(90 - df['lat']) lon = np.deg2rad(90 - df['lon']) to_lat, to_lon = df.loc[to, ['lat', 'lon']] cos = (np.sin(lat) * np.sin(to_lat) * np.cos(lon - to_lon) + np.cos(lat) * np.cos(to_lat)) arc = np.arccos(cos) kilometers = 6373 * cos df[f'km_to_{to}'] = kilometers return df great_circle_distance(airports_usa)
Out[78]:
| city | country | lat | lon | alt | tz | dst | tzone | km_to_DSM | |
|---|---|---|---|---|---|---|---|---|---|
| faa | |||||||||
| 4I7 | Greencastle | United States | 39.633556 | -86.813806 | 842 | -5.0 | U | America/New_York | -611.548347 |
| C91 | Dowagiac | United States | 41.992934 | -86.128012 | 748 | -5.0 | U | America/New_York | -874.022782 |
| CDI | Cambridge | United States | 39.975028 | -81.577583 | 799 | -5.0 | U | America/New_York | -827.065264 |
| SUE | Sturgeon Bay | United States | 44.843667 | -87.421556 | 725 | -6.0 | U | America/Chicago | -1122.842947 |
| 0P2 | Stewartstown | United States | 39.794824 | -76.647191 | 1000 | -5.0 | U | America/New_York | -997.427349 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| UCA | Utica | United States | 43.104167 | -75.223333 | 456 | -5.0 | A | America/New_York | -1370.623054 |
| CVO | Corvallis | United States | 44.506700 | -123.291500 | 250 | -8.0 | A | America/Los_Angeles | -531.679746 |
| CWT | Chatsworth | United States | 34.256944 | -118.598889 | 978 | -8.0 | A | America/Los_Angeles | 594.567494 |
| DHB | Deer Harbor | United States | 48.618397 | -123.005960 | 0 | -8.0 | A | America/Los_Angeles | -985.750406 |
| OLT | San Diego | United States | 32.755200 | -117.199500 | 0 | -8.0 | A | America/Los_Angeles | 752.239018 |
1459 rows × 9 columns
Dow Jones Index¶
We are going to use Dow Jones stock Index.
- Import the necessary libraries
In [79]:
Copied!
import numpy as np
import pandas as pd
import datetime
import numpy as np import pandas as pd import datetime
- Import data set
In [130]:
Copied!
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data')
dow_jones.head(10)
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data') dow_jones.head(10)
Out[130]:
| quarter | stock | date | open | high | low | close | volume | percent_change_price | percent_change_volume_over_last_wk | previous_weeks_volume | next_weeks_open | next_weeks_close | percent_change_next_weeks_price | days_to_next_dividend | percent_return_next_dividend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AA | 1/7/2011 | $15.82 | $16.72 | $15.78 | $16.42 | 239655616 | 3.792670 | NaN | NaN | $16.71 | $15.97 | -4.428490 | 26 | 0.182704 |
| 1 | 1 | AA | 1/14/2011 | $16.71 | $16.71 | $15.64 | $15.97 | 242963398 | -4.428490 | 1.380223 | 239655616.0 | $16.19 | $15.79 | -2.470660 | 19 | 0.187852 |
| 2 | 1 | AA | 1/21/2011 | $16.19 | $16.38 | $15.60 | $15.79 | 138428495 | -2.470660 | -43.024959 | 242963398.0 | $15.87 | $16.13 | 1.638310 | 12 | 0.189994 |
| 3 | 1 | AA | 1/28/2011 | $15.87 | $16.63 | $15.82 | $16.13 | 151379173 | 1.638310 | 9.355500 | 138428495.0 | $16.18 | $17.14 | 5.933250 | 5 | 0.185989 |
| 4 | 1 | AA | 2/4/2011 | $16.18 | $17.39 | $16.18 | $17.14 | 154387761 | 5.933250 | 1.987452 | 151379173.0 | $17.33 | $17.37 | 0.230814 | 97 | 0.175029 |
| 5 | 1 | AA | 2/11/2011 | $17.33 | $17.48 | $16.97 | $17.37 | 114691279 | 0.230814 | -25.712195 | 154387761.0 | $17.39 | $17.28 | -0.632547 | 90 | 0.172712 |
| 6 | 1 | AA | 2/18/2011 | $17.39 | $17.68 | $17.28 | $17.28 | 80023895 | -0.632547 | -30.226696 | 114691279.0 | $16.98 | $16.68 | -1.766780 | 83 | 0.173611 |
| 7 | 1 | AA | 2/25/2011 | $16.98 | $17.15 | $15.96 | $16.68 | 132981863 | -1.766780 | 66.177694 | 80023895.0 | $16.81 | $16.58 | -1.368230 | 76 | 0.179856 |
| 8 | 1 | AA | 3/4/2011 | $16.81 | $16.94 | $16.13 | $16.58 | 109493077 | -1.368230 | -17.663150 | 132981863.0 | $16.58 | $16.03 | -3.317250 | 69 | 0.180941 |
| 9 | 1 | AA | 3/11/2011 | $16.58 | $16.75 | $15.42 | $16.03 | 114332562 | -3.317250 | 4.419900 | 109493077.0 | $15.95 | $16.11 | 1.003130 | 62 | 0.187149 |
- Select 'stock','date','open','high','low' and check the type of them
In [131]:
Copied!
dow_jones=dow_jones.loc[:,['date','stock','open','high','low']]
dow_jones.dtypes
dow_jones=dow_jones.loc[:,['date','stock','open','high','low']] dow_jones.dtypes
Out[131]:
date object stock object open object high object low object dtype: object
- Assign week number as index
In [30]:
Copied!
def to_week(x):
return datetime.datetime.strptime(x, '%m/%d/%Y').strftime("%V")
def to_week(x): return datetime.datetime.strptime(x, '%m/%d/%Y').strftime("%V")
- Apply the function to_week on the column data
In [132]:
Copied!
dow_jones['week']=dow_jones['date'].apply(to_week)
dow_jones = dow_jones.set_index(['week','stock'])
dow_jones=dow_jones.drop(['date'], axis=1)
dow_jones.head()
dow_jones['week']=dow_jones['date'].apply(to_week) dow_jones = dow_jones.set_index(['week','stock']) dow_jones=dow_jones.drop(['date'], axis=1) dow_jones.head()
Out[132]:
| open | high | low | ||
|---|---|---|---|---|
| week | stock | |||
| 01 | AA | $15.82 | $16.72 | $15.78 |
| 02 | AA | $16.71 | $16.71 | $15.64 |
| 03 | AA | $16.19 | $16.38 | $15.60 |
| 04 | AA | $15.87 | $16.63 | $15.82 |
| 05 | AA | $16.18 | $17.39 | $16.18 |
- Is there any duplicate dates?
In [133]:
Copied!
dow_jones.index.is_unique
dow_jones.index.is_unique
Out[133]:
True
- Drop doller sign from columns and change them to numeric
In [134]:
Copied!
for i in dow_jones.columns[1:]:
dow_jones[i]=dow_jones[i].str.strip('$').astype(float)
for i in dow_jones.columns[1:]: dow_jones[i]=dow_jones[i].str.strip('$').astype(float)
- Compute the mean for each week:
In [135]:
Copied!
dow_jones.iloc[:,1:4].groupby(dow_jones.index).mean()
dow_jones.iloc[:,1:4].groupby(dow_jones.index).mean()
Out[135]:
| high | low | |
|---|---|---|
| (01, AA) | 16.72 | 15.78 |
| (01, AXP) | 45.60 | 43.11 |
| (01, BA) | 70.10 | 66.00 |
| (01, BAC) | 14.69 | 13.80 |
| (01, CAT) | 94.81 | 92.30 |
| ... | ... | ... |
| (25, TRV) | 58.28 | 56.12 |
| (25, UTX) | 86.21 | 83.69 |
| (25, VZ) | 36.17 | 35.20 |
| (25, WMT) | 53.70 | 52.35 |
| (25, XOM) | 81.12 | 76.78 |
750 rows × 2 columns
- Let reconsider time as index.
In [137]:
Copied!
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data')
dow_jones=dow_jones.loc[:,['date','stock','open','high','low']]
def to_datatime(x):
return datetime.datetime.strptime(x, '%m/%d/%Y')
dow_jones['datatime']=dow_jones['date'].apply(to_datatime)
dow_jones = dow_jones.set_index(['datatime'])
dow_jones=dow_jones.drop(['date'], axis=1)
dow_jones.head()
dow_jones = pd.read_csv('../data/dow_jones_index/dow_jones_index.data') dow_jones=dow_jones.loc[:,['date','stock','open','high','low']] def to_datatime(x): return datetime.datetime.strptime(x, '%m/%d/%Y') dow_jones['datatime']=dow_jones['date'].apply(to_datatime) dow_jones = dow_jones.set_index(['datatime']) dow_jones=dow_jones.drop(['date'], axis=1) dow_jones.head()
Out[137]:
| stock | open | high | low | |
|---|---|---|---|---|
| datatime | ||||
| 2011-01-07 | AA | $15.82 | $16.72 | $15.78 |
| 2011-01-14 | AA | $16.71 | $16.71 | $15.64 |
| 2011-01-21 | AA | $16.19 | $16.38 | $15.60 |
| 2011-01-28 | AA | $15.87 | $16.63 | $15.82 |
| 2011-02-04 | AA | $16.18 | $17.39 | $16.18 |
In [138]:
Copied!
for i in dow_jones.columns[1:]:
dow_jones[i]=dow_jones[i].str.strip('$').astype(float)
dow_jones.head()
for i in dow_jones.columns[1:]: dow_jones[i]=dow_jones[i].str.strip('$').astype(float) dow_jones.head()
Out[138]:
| stock | open | high | low | |
|---|---|---|---|---|
| datatime | ||||
| 2011-01-07 | AA | 15.82 | 16.72 | 15.78 |
| 2011-01-14 | AA | 16.71 | 16.71 | 15.64 |
| 2011-01-21 | AA | 16.19 | 16.38 | 15.60 |
| 2011-01-28 | AA | 15.87 | 16.63 | 15.82 |
| 2011-02-04 | AA | 16.18 | 17.39 | 16.18 |
- Get the last business day of each month
In [40]:
Copied!
dow_jones.iloc[:,1:4].resample('BME').mean()
dow_jones.iloc[:,1:4].resample('BME').mean()
Out[40]:
| open | high | low | |
|---|---|---|---|
| datatime | |||
| 2011-01-31 | 51.671083 | 52.839417 | 50.934167 |
| 2011-02-28 | 53.411250 | 54.465583 | 52.577917 |
| 2011-03-31 | 53.237083 | 54.142083 | 51.929833 |
| 2011-04-29 | 54.377600 | 55.574533 | 53.590400 |
| 2011-05-31 | 55.507417 | 56.335333 | 54.373500 |
| 2011-06-30 | 53.525167 | 54.436833 | 52.197583 |
- Compute the mean of each month for each index.
In [139]:
Copied!
dow_jones.groupby(dow_jones.stock).mean().head()
dow_jones.groupby(dow_jones.stock).mean().head()
Out[139]:
| open | high | low | |
|---|---|---|---|
| stock | |||
| AA | 16.5640 | 17.0152 | 16.0636 |
| AXP | 46.4880 | 47.5684 | 45.5976 |
| BA | 73.2972 | 74.7712 | 71.6432 |
| BAC | 13.1756 | 13.4548 | 12.7776 |
| CAT | 103.0672 | 105.5296 | 99.9932 |
In [148]:
Copied!
# dow_jones.groupby([dow_jones.index,dow_jones.stock]).mean()
dow_jones.groupby([dow_jones.index,dow_jones.stock]).resample('BME').mean()
# dow_jones.groupby([dow_jones.index,dow_jones.stock]).mean() dow_jones.groupby([dow_jones.index,dow_jones.stock]).resample('BME').mean()
Out[148]:
| open | high | low | |||
|---|---|---|---|---|---|
| datatime | stock | datatime | |||
| 2011-01-07 | AA | 2011-01-31 | 15.82 | 16.72 | 15.78 |
| AXP | 2011-01-31 | 43.30 | 45.60 | 43.11 | |
| BA | 2011-01-31 | 66.15 | 70.10 | 66.00 | |
| BAC | 2011-01-31 | 13.85 | 14.69 | 13.80 | |
| CAT | 2011-01-31 | 94.38 | 94.81 | 92.30 | |
| ... | ... | ... | ... | ... | ... |
| 2011-06-24 | TRV | 2011-06-30 | 57.57 | 58.28 | 56.12 |
| UTX | 2011-06-30 | 84.36 | 86.21 | 83.69 | |
| VZ | 2011-06-30 | 35.35 | 36.17 | 35.20 | |
| WMT | 2011-06-30 | 52.70 | 53.70 | 52.35 | |
| XOM | 2011-06-30 | 78.65 | 81.12 | 76.78 |
750 rows × 3 columns
In [126]:
Copied!
dow_jones_df = pd.DataFrame()
for i in list(set(dow_jones['stock'])):
temp=dow_jones.loc[dow_jones['stock'].isin([i])].iloc[:,1:].resample('BME').mean()
temp['stock']=i
dow_jones_df=pd.concat([temp, dow_jones_df])
dow_jones_df = pd.DataFrame() for i in list(set(dow_jones['stock'])): temp=dow_jones.loc[dow_jones['stock'].isin([i])].iloc[:,1:].resample('BME').mean() temp['stock']=i dow_jones_df=pd.concat([temp, dow_jones_df])
In [127]:
Copied!
dow_jones_df
dow_jones_df
Out[127]:
| open | high | low | stock | |
|---|---|---|---|---|
| datatime | ||||
| 2011-01-31 | 19.0025 | 19.5625 | 18.6225 | GE |
| 2011-02-28 | 20.8225 | 21.3150 | 20.5375 | GE |
| 2011-03-31 | 20.2475 | 20.4950 | 19.4875 | GE |
| 2011-04-29 | 20.1380 | 20.6020 | 19.7200 | GE |
| 2011-05-31 | 19.9600 | 20.1750 | 19.4875 | GE |
| ... | ... | ... | ... | ... |
| 2011-02-28 | 16.9700 | 17.4250 | 16.5975 | AA |
| 2011-03-31 | 16.4300 | 16.8150 | 15.8100 | AA |
| 2011-04-29 | 17.1820 | 17.7340 | 16.6720 | AA |
| 2011-05-31 | 16.8475 | 17.3375 | 16.4175 | AA |
| 2011-06-30 | 15.6525 | 15.9900 | 15.0225 | AA |
180 rows × 4 columns
In [141]:
Copied!
dow_jones.loc[dow_jones['stock'].isin(['AA'])].head()
dow_jones.loc[dow_jones['stock'].isin(['AA'])].head()
Out[141]:
| stock | open | high | low | |
|---|---|---|---|---|
| datatime | ||||
| 2011-01-07 | AA | 15.82 | 16.72 | 15.78 |
| 2011-01-14 | AA | 16.71 | 16.71 | 15.64 |
| 2011-01-21 | AA | 16.19 | 16.38 | 15.60 |
| 2011-01-28 | AA | 15.87 | 16.63 | 15.82 |
| 2011-02-04 | AA | 16.18 | 17.39 | 16.18 |
In [147]:
Copied!
dow_jones.loc[dow_jones['stock'].isin(['AA'])].iloc[:,1:].resample('BME').mean()
dow_jones.loc[dow_jones['stock'].isin(['AA'])].iloc[:,1:].resample('BME').mean()
Out[147]:
| open | high | low | |
|---|---|---|---|
| datatime | |||
| 2011-01-31 | 16.1475 | 16.6100 | 15.7100 |
| 2011-02-28 | 16.9700 | 17.4250 | 16.5975 |
| 2011-03-31 | 16.4300 | 16.8150 | 15.8100 |
| 2011-04-29 | 17.1820 | 17.7340 | 16.6720 |
| 2011-05-31 | 16.8475 | 17.3375 | 16.4175 |
| 2011-06-30 | 15.6525 | 15.9900 | 15.0225 |
- Lowerizer the stock column
In [179]:
Copied!
lowerizer = lambda x: x.lower()
dow_jones['stock'].apply(lowerizer)
lowerizer = lambda x: x.lower() dow_jones['stock'].apply(lowerizer)
Out[179]:
datatime
2011-01-07 aa
2011-01-14 aa
2011-01-21 aa
2011-01-28 aa
2011-02-04 aa
...
2011-05-27 xom
2011-06-03 xom
2011-06-10 xom
2011-06-17 xom
2011-06-24 xom
Name: stock, Length: 750, dtype: object