# Project 1

## 1

A package is a collection of Python modules that, when imported into your Python workspace, make various helpful commands available. Larger packages are organized into submodules, each with a specific function relating to the main package. Before you can import a package, make sure it is installed for the Python interpreter your project uses (in PyCharm, this is found under Preferences). If it is not installed, you can find the package by typing in its name and then add it to the interpreter. Once the package is installed, you import it into the workspace with a line of code.

In [3]:
import pandas as pd

This would import the pandas library into your workspace under an alias 'pd'. Using an alias can be helpful in keeping code short and easy to read.

In [4]:
from datetime import datetime

This would import the `datetime` class from the `datetime` module. This is helpful when you only need access to a certain function or class from within a library; rather than importing the entire library, you can import specific names.
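To make the difference between the two import styles concrete, here is a minimal sketch that uses only the standard library. The `math` module and the name `datetime_module` are chosen purely for illustration and are not part of the project code.

```python
import math as m                 # whole module, reachable through the short alias `m`
from datetime import datetime    # only the `datetime` class, usable without a prefix

import datetime as datetime_module  # the full module again, under an illustrative name

# With an alias, every call is prefixed by the alias.
root = m.sqrt(16)

# With a from-import, the name is used directly.
year = datetime(2007, 1, 1).year

# Both styles point at the same object; only the name you type differs.
same_object = datetime is datetime_module.datetime

print(root, year, same_object)
```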

## 2

A data frame is a structure that organizes the entries of a data set into a table of rows and columns. When working with data frames, the pandas library is particularly helpful, as it handles reading the data as well as subsetting and manipulating it. To read a file from its location in your file system, call one of the pandas `read_*()` functions with the path to the file. The pandas library has already been imported under an alias, so the code used would look something like this:

In [5]:
path_to_data = 'gapminder.tsv'

In [6]:
data = pd.read_csv(path_to_data, sep = '\t')

The dataset we're working with is tab-separated; however, `read_csv()` assumes by default that the file is comma-separated. To get around this, pass `sep = '\t'` to `read_csv()` so that pandas splits the columns on tabs instead of commas.
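The effect of the `sep` argument can be seen on a tiny in-memory sample. This is a sketch only: the three-column text below is a cut-down stand-in for the real file, and the names `raw`, `wrong`, and `right` are made up for the example.

```python
import io

import pandas as pd

# A small tab-separated sample in the same layout as the real file (illustrative only).
raw = "country\tcontinent\tyear\nAfghanistan\tAsia\t1952\nZimbabwe\tAfrica\t2007\n"

# Default separator is a comma, so each line is read as one long value.
wrong = pd.read_csv(io.StringIO(raw))

# Telling pandas to split on tabs recovers the three columns.
right = pd.read_csv(io.StringIO(raw), sep="\t")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

When a freshly loaded data frame has only one column, a wrong separator is a likely cause.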

To return a description of the data, use the `describe()` command:

In [22]:
data.describe()

Unnamed: 0,year,lifeExp,pop,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,59.474439,29601210.0,7215.327081
std,17.26533,12.917107,106157900.0,9857.454543
min,1952.0,23.599,60011.0,241.165877
25%,1965.75,48.198,2793664.0,1202.060309
50%,1979.5,60.7125,7023596.0,3531.846989
75%,1993.25,70.8455,19585220.0,9325.462346
max,2007.0,82.603,1318683000.0,113523.1329


To determine how many rows and columns are included in the dataset, use the `.shape` attribute, which holds a `(rows, columns)` tuple.

In [28]:
data.shape[0]

1704

In [29]:
data.shape[1]

6

There are 1704 rows and 6 columns in this dataframe. To get a summary of the columns and their types, use `data.info()`; to get just the column names as a list, use `list(data.columns)`:

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


In [31]:
list(data.columns)

['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap']
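As a compact recap of the inspection steps in this section, the sketch below builds a three-row frame from values that appear in the outputs of this notebook and reads its dimensions and column names. The variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the outputs shown in this notebook.
df = pd.DataFrame({
    "country": ["Zambia", "Zimbabwe", "Zimbabwe"],
    "year": [2007, 1952, 1957],
    "lifeExp": [42.384, 48.451, 50.469],
})

# .shape is an attribute holding a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# .columns is also an attribute; tolist() turns it into a plain Python list.
column_names = df.columns.tolist()

print(n_rows, n_cols, column_names)
```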

## 3

In [32]:
data['year'].unique().tolist()

[1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007]

In [33]:
len(data[(data['year'] == data['year'].max())])

142

In [34]:
len(data[(data['year'] == 2002)])

142

Starting in 1952, the years come in five-year increments. To make the data more current, records for 2012 and 2017 should be included, adding 142 records for each year to the data frame, 284 in total.
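The same two checks (the spacing between years and the number of records per year) can be done in one pass instead of one year at a time. The sketch below uses a made-up six-row frame with placeholder country names `A`, `B`, and `C`; on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical miniature data set: three placeholder countries, two survey years.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2002, 2007] * 3,
})

# Gap between consecutive survey years.
years = sorted(df["year"].unique().tolist())
gaps = [later - earlier for earlier, later in zip(years, years[1:])]

# Number of records for every year at once.
records_per_year = df.groupby("year").size()

print(gaps)
print(records_per_year)
```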

## 4

In [35]:
data[(data['lifeExp'] == data['lifeExp'].min())]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1292,Rwanda,Africa,1992,23.599,7290203,737.068595



In this data frame, the lowest life expectancy occurred in Rwanda in 1992, at about 23.6 years. This was around the time of the Rwandan genocide of 1994, when up to a million people were killed. This, together with the nation's poor public health at the time, was likely the biggest factor in driving down average life expectancy.
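An alternative to the boolean filter above is `idxmin()`, which returns the index label of the smallest value. The sketch below uses three rows copied from outputs in this notebook, including their original index labels; note that `idxmin()` returns only the first match if several rows tie, whereas the filter returns all of them.

```python
import pandas as pd

# Three rows taken from outputs in this notebook, keeping their index labels.
df = pd.DataFrame(
    {
        "country": ["Rwanda", "Zimbabwe", "Zimbabwe"],
        "year": [1992, 1987, 1992],
        "lifeExp": [23.599, 62.351, 60.377],
    },
    index=[1292, 1699, 1700],
)

# Label of the row with the smallest life expectancy.
lowest_label = df["lifeExp"].idxmin()

# Use that label with .loc to pull the whole row.
lowest_row = df.loc[lowest_label]

print(lowest_label)
print(lowest_row)
```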

## 5

In [None]:
# Total GDP = population * GDP per capita, stored as a new column
data['totalGDP'] = data['pop'] * data['gdpPercap']


In 2007, total GDP for Germany, France, Italy, and Spain was as follows:

In [None]:
# Rows for Europe in the most recent year (2007)
data_europe2007 = data[(data['continent'] == 'Europe') & (data['year'] == data['year'].max())]
# Keep only the four countries of interest
data_fgis2007 = data_europe2007[data_europe2007['country'].isin(['Spain', 'France', 'Germany', 'Italy'])]
name_gdp = data_fgis2007[['country', 'totalGDP']]
name_gdp.sort_values(by='totalGDP', ascending=False)

In 2002, total GDP for Germany, France, Italy, and Spain was as follows:

In [None]:
# Rows for Europe in 2002
data_europe2002 = data[(data['continent'] == 'Europe') & (data['year'] == 2002)]
# Keep only the four countries of interest
data_fgis2002 = data_europe2002[data_europe2002['country'].isin(['Spain', 'France', 'Germany', 'Italy'])]
name_gdp = data_fgis2002[['country', 'totalGDP']]
name_gdp.sort_values(by='totalGDP', ascending=False)


Spain experienced the greatest increase in total GDP between 2002 and 2007, increasing from .97 trillion USD to 1.16 trillion USD.
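The comparison between two years can also be computed rather than read off two tables. The sketch below is one possible approach, not the method used above: it pivots total GDP into one column per year and takes the difference. The countries `X` and `Y` and all of their numbers are invented so that the arithmetic is easy to follow, and they show that the largest absolute increase and the largest percentage increase need not belong to the same country.

```python
import pandas as pd

# Invented values for two placeholder countries (not Gapminder data).
df = pd.DataFrame({
    "country": ["X", "X", "Y", "Y"],
    "year": [2002, 2007, 2002, 2007],
    "pop": [10, 10, 100, 100],
    "gdpPercap": [100.0, 130.0, 50.0, 54.0],
})
df["totalGDP"] = df["pop"] * df["gdpPercap"]

# One row per country, one column per year.
wide = df.pivot(index="country", columns="year", values="totalGDP")

# Absolute and percentage change between the two years.
wide["change"] = wide[2007] - wide[2002]
wide["pct_change"] = wide["change"] / wide[2002] * 100

largest_absolute = wide["change"].idxmax()
largest_relative = wide["pct_change"].idxmax()

print(wide)
print(largest_absolute, largest_relative)
```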

## 6

The '&' symbol represents 'and' and is used when you are looking for data that meets both of the given criteria. When filtering a data frame, each condition must be wrapped in parentheses, because '&' binds more tightly than '=='. The '==' symbol checks whether a value is equal to another value in conditional statements; a single equal sign is only appropriate when assigning a value to a variable name. The following code returns the rows where continent is equal to Europe and the year is equal to 2007, the most recent year in the data.

In [7]:
data_europe2007 = data[(data['continent']=='Europe') & (data['year'] == data['year'].max())]
data_europe2007

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
23,Albania,Europe,2007,76.423,3600523,5937.029526
83,Austria,Europe,2007,79.829,8199783,36126.4927
119,Belgium,Europe,2007,79.441,10392226,33692.60508
155,Bosnia and Herzegovina,Europe,2007,74.852,4552198,7446.298803
191,Bulgaria,Europe,2007,73.005,7322858,10680.79282
383,Croatia,Europe,2007,75.748,4493312,14619.22272
407,Czech Republic,Europe,2007,76.486,10228744,22833.30851
419,Denmark,Europe,2007,78.332,5468120,35278.41874
527,Finland,Europe,2007,79.313,5238460,33207.0844
539,France,Europe,2007,80.657,61083916,30470.0167



The '|' symbol is used to represent 'or' and returns True as long as at least one of the given arguments is True. The following code will return True even though 3 is not greater than 4, because it only asks whether at least one of the given arguments is True.

In [8]:
(1+2 == 3) | (3 > 4)

True

The '^' symbol is used to represent an exclusive or. It will return True if one argument is True and the other False, but will return False if both are True or both are False. The following will return True, because the first argument is False and the second is True.

In [9]:
('cat' == 'dog') ^ (2 > 1)

True
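On pandas columns these operators work row by row, which is what makes them useful for filtering. The sketch below applies all three to a made-up four-row frame; the names `is_europe` and `is_2007` are illustrative.

```python
import pandas as pd

# Small illustrative frame: one row for each combination of the two conditions.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Africa", "Asia"],
    "year": [2002, 2007, 2007, 2002],
})

# Each comparison produces one True/False value per row.
is_europe = df["continent"] == "Europe"
is_2007 = df["year"] == 2007

both = (is_europe & is_2007).tolist()         # True only where both conditions hold
either = (is_europe | is_2007).tolist()       # True where at least one holds
exactly_one = (is_europe ^ is_2007).tolist()  # True where exactly one holds

print(both)
print(either)
print(exactly_one)
```

Naming the conditions first, as done here, also avoids the need to remember the parentheses around each comparison.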

## 7

`.loc` is a label-based indexer used to extract rows of data given their index labels. `.iloc` works similarly but selects rows by their integer position in the dataset rather than by index label. The following code will return all rows of data between the integer positions 1691 and 1702, not including position 1702.

In [11]:
data.iloc[1691:1702]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1691,Zambia,Africa,2007,42.384,11746035,1271.211593
1692,Zimbabwe,Africa,1952,48.451,3080907,406.884115
1693,Zimbabwe,Africa,1957,50.469,3646340,518.764268
1694,Zimbabwe,Africa,1962,52.358,4277736,527.272182
1695,Zimbabwe,Africa,1967,53.995,4995432,569.795071
1696,Zimbabwe,Africa,1972,55.635,5861135,799.362176
1697,Zimbabwe,Africa,1977,57.674,6642107,685.587682
1698,Zimbabwe,Africa,1982,60.363,7636524,788.855041
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786


The following code will return all observations from the second, third, and fourth columns. 

In [16]:
data.iloc[:, 1:4]

Unnamed: 0,continent,year,lifeExp
0,Asia,1952,28.801
1,Asia,1957,30.332
2,Asia,1962,31.997
3,Asia,1967,34.020
4,Asia,1972,36.088
5,Asia,1977,38.438
6,Asia,1982,39.854
7,Asia,1987,40.822
8,Asia,1992,41.674
9,Asia,1997,41.763
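In the full data set the index labels happen to equal the integer positions, which hides the difference between `.loc` and `.iloc`. The sketch below uses four Zimbabwe rows from the output above with their original labels, so that labels and positions no longer coincide.

```python
import pandas as pd

# Four rows from the output above; labels start at 1692, positions start at 0.
df = pd.DataFrame(
    {
        "country": ["Zimbabwe"] * 4,
        "year": [1952, 1957, 1962, 1967],
    },
    index=[1692, 1693, 1694, 1695],
)

# .loc slices by label and includes the end label: 1692, 1693, 1694.
by_label = df.loc[1692:1694]

# .iloc slices by position and excludes the end position: positions 0 and 1.
by_position = df.iloc[0:2]

print(by_label)
print(by_position)
```

In this small frame `df.iloc[1692]` would raise an `IndexError`, because there are only four positions, while `df.loc[1692]` returns the first row.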


## 8

An API is an Application Programming Interface: the part of a website's remote server that receives and processes requests. It allows different applications to work together.

To pull data from a remote server, you first have to send a request using the requests library.

In [10]:
import requests

You then must specify where the data is coming from in a url. 

In [11]:
url = "https://api.covidtracking.com/v1/states/daily.csv"

Then you can write it to a local file. First, use the os library to make sure a folder for the data exists on your machine.

In [12]:
import os

data_folder = 'data'
if not os.path.exists(data_folder):
    os.makedirs(data_folder)

After setting a filename for your data, you can retrieve the data from the request and populate your new file.

In [14]:
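from datetime import datetime as dt
import pytz

# Timestamp without spaces or colons, so the name is valid on every file system
timestamp = dt.now(tz = pytz.utc).strftime('%Y-%m-%d_%H-%M-%S')
file_name_short = 'ctp_' + timestamp + '.csv'
file_name = os.path.join(data_folder, file_name_short)

r = requests.get(url)
r.raise_for_status()  # stop here if the server returned an error

with open(file_name, 'wb') as f:
    f.write(r.content)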

Once your file is written, you can import it into your workspace using pandas.

In [15]:
import pandas as pd
df = pd.read_csv(file_name)
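The naming and writing steps can be wrapped in small functions so they can be tried without a network connection. This is a sketch under assumptions: the helper names `timestamped_name` and `save_bytes` are hypothetical, the bytes written are a made-up stand-in for `r.content`, and a temporary folder is used so nothing is left behind.

```python
import os
import tempfile
from datetime import datetime, timezone


def timestamped_name(prefix="ctp_", suffix=".csv", now=None):
    """Build a file name from a UTC timestamp, avoiding spaces and colons."""
    if now is None:
        now = datetime.now(timezone.utc)
    return prefix + now.strftime("%Y-%m-%d_%H-%M-%S") + suffix


def save_bytes(content, folder, file_name):
    """Create the folder if needed, write the bytes, and return the full path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, "wb") as f:
        f.write(content)
    return path


# A fixed timestamp makes the resulting name predictable.
fixed_name = timestamped_name(now=datetime(2020, 3, 1, 12, 30, 0, tzinfo=timezone.utc))

# Made-up bytes standing in for the body of a real response.
fake_content = b"date,state\n20200301,NY\n"

with tempfile.TemporaryDirectory() as tmp:
    demo_path = save_bytes(fake_content, os.path.join(tmp, "data"), timestamped_name())
    with open(demo_path, "rb") as f:
        round_trip = f.read()

print(fixed_name)
```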

## 9

The `apply()` method from the pandas library lets the user take any function and apply it to every value in a given Series, or to every row or column of a DataFrame. For example, `apply()` could be used to convert every value in a particular column to different units or, on a DataFrame, to summarize each column. Using `apply()` is an alternative to writing an explicit loop that runs the function over the values one at a time, which takes more code and increases the chance of making a mistake.
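A minimal sketch of both uses follows. The life expectancy values are taken from outputs in this notebook, while the population figures are rounded placeholders chosen to keep the arithmetic obvious.

```python
import pandas as pd

df = pd.DataFrame({
    "lifeExp": [28.801, 30.332, 31.997],       # values shown earlier in this notebook
    "pop": [8_000_000, 9_000_000, 10_000_000],  # placeholder populations
})

# On a Series, apply() calls the function once per value.
pop_millions = df["pop"].apply(lambda p: p / 1_000_000)

# On a DataFrame, apply() calls the function once per column by default.
column_max = df.apply(max)

# For simple arithmetic, the built-in vectorized form gives the same result.
pop_millions_vectorized = df["pop"] / 1_000_000

print(pop_millions.tolist())
print(column_max)
```

For simple arithmetic like this, the vectorized form is usually the faster choice; `apply()` is most useful when the function has no built-in pandas equivalent.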

## 10 


Instead of using .iloc to filter columns, you could just make a new subset of your data frame. The following two lines of code return the same output; however, using the first option is potentially easier and allows both consecutive and non-consecutive columns to be extracted by name. 

In [16]:
data[["country", "continent", "year", "lifeExp"]]

Unnamed: 0,country,continent,year,lifeExp
0,Afghanistan,Asia,1952,28.801
1,Afghanistan,Asia,1957,30.332
2,Afghanistan,Asia,1962,31.997
3,Afghanistan,Asia,1967,34.020
4,Afghanistan,Asia,1972,36.088
5,Afghanistan,Asia,1977,38.438
6,Afghanistan,Asia,1982,39.854
7,Afghanistan,Asia,1987,40.822
8,Afghanistan,Asia,1992,41.674
9,Afghanistan,Asia,1997,41.763


In [17]:
data.iloc[:, 0:4]

Unnamed: 0,country,continent,year,lifeExp
0,Afghanistan,Asia,1952,28.801
1,Afghanistan,Asia,1957,30.332
2,Afghanistan,Asia,1962,31.997
3,Afghanistan,Asia,1967,34.020
4,Afghanistan,Asia,1972,36.088
5,Afghanistan,Asia,1977,38.438
6,Afghanistan,Asia,1982,39.854
7,Afghanistan,Asia,1987,40.822
8,Afghanistan,Asia,1992,41.674
9,Afghanistan,Asia,1997,41.763
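The point about non-consecutive columns can be checked on a two-row sample taken from the output above. The sketch compares selection by name, selection by position, and a label slice with `.loc`, which (unlike a position slice) includes its end column.

```python
import pandas as pd

# Two rows copied from the output above.
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "continent": ["Asia", "Asia"],
    "year": [1952, 1957],
    "lifeExp": [28.801, 30.332],
})

# Non-consecutive columns by name ...
by_name = df[["country", "year"]]

# ... and the same columns by integer position.
by_position = df.iloc[:, [0, 2]]

# A label slice with .loc includes the end column.
by_label_slice = df.loc[:, "country":"year"]

print(by_name.equals(by_position))
print(list(by_label_slice.columns))
```

Selecting by name keeps working if the column order of the file changes, which is one reason it is often preferred over positions.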
