Useful Notes and Links

Reynier Cruz-Torres, PhD

Pandas

Series

import pandas as pd
mySeries = pd.Series(data=[1,2,3],index=['A','B','C'])

or

mydict = {'A':1,'B':2,'C':3}
mySeries = pd.Series(mydict)
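
As a quick check that the two constructions above are interchangeable, here is a minimal sketch. The variable names (from_lists, from_dict, and so on) are illustrative only.

```python
import pandas as pd

# Build the same Series two ways: explicit data/index, and from a dict.
from_lists = pd.Series(data=[1, 2, 3], index=['A', 'B', 'C'])
from_dict = pd.Series({'A': 1, 'B': 2, 'C': 3})

# Both constructions give the same labels and values.
same = from_lists.equals(from_dict)

# Label-based access, and arithmetic that aligns on the index labels.
value_b = from_lists['B']
doubled = from_lists + from_dict
```

Arithmetic between two Series matches entries by label rather than by position, which is why the index matters.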

Selecting columns by data type

df.dtypes
my_object_df = df.select_dtypes(include=['object'])
my_numeric_df = df.select_dtypes(exclude=['object'])

One-hot encoding categorical data:

df_objects_dummies = pd.get_dummies(my_object_df,drop_first=True)
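
A minimal sketch of how the dtype split and the one-hot encoding might fit together. The toy frame and its column names (price, color) are hypothetical, and the text column is forced to object dtype so the selection behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column.
df = pd.DataFrame({
    'price': [10.0, 12.5, 9.0],
    'color': pd.Series(['red', 'blue', 'red'], dtype='object'),
})

# Split the columns by dtype.
my_object_df = df.select_dtypes(include=['object'])
my_numeric_df = df.select_dtypes(exclude=['object'])

# One-hot encode; drop_first=True drops the first category ('blue')
# so the remaining indicator columns are not redundant.
df_objects_dummies = pd.get_dummies(my_object_df, drop_first=True)

# Put the numeric and encoded columns back side by side.
df_encoded = pd.concat([my_numeric_df, df_objects_dummies], axis=1)
```

Depending on the pandas version, the indicator column may come back as bool or as an integer type; cast with astype(int) if 0/1 values are needed.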

Grabbing a sample of a dataframe

df = df.sample(frac=0.1,random_state=101)
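
A small sketch of what frac and random_state do, using a made-up 100-row frame: frac=0.1 keeps 10% of the rows, and a fixed random_state makes the draw repeatable.

```python
import pandas as pd

# Hypothetical frame with 100 rows.
df = pd.DataFrame({'x': range(100)})

# Two draws with the same seed select the same rows.
sample_a = df.sample(frac=0.1, random_state=101)
sample_b = df.sample(frac=0.1, random_state=101)

n_rows = len(sample_a)
repeatable = sample_a.equals(sample_b)
```

The sampled rows keep their original index labels; chain .reset_index(drop=True) if a fresh 0..n-1 index is wanted.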

Apply and lambda expression:

def multiply(x,y): return x*y
df['A_x_B'] = df.apply(lambda x: multiply(x['A'],x['B']),axis=1)
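
For a simple product like this one, a column-wise expression gives the same result without calling a Python function per row. A minimal sketch on a made-up frame (the column name A_x_B_vec is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def multiply(x, y):
    return x * y

# Row-wise apply: with axis=1 the lambda receives each row as a Series.
df['A_x_B'] = df.apply(lambda row: multiply(row['A'], row['B']), axis=1)

# Vectorized equivalent: operates on whole columns at once.
df['A_x_B_vec'] = df['A'] * df['B']
```

apply with axis=1 is still the flexible choice when the per-row logic cannot be written as column arithmetic.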

DataFrame indices

df = df.set_index('col_name')
df = df.reset_index()
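
A short sketch of the round trip, on a hypothetical two-column frame: set_index moves a column into the index so rows can be looked up by label, and reset_index moves it back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({'col_name': ['x', 'y', 'z'], 'value': [1, 2, 3]})

# Promote 'col_name' to the index and look a row up by its label.
indexed = df.set_index('col_name')
value_y = indexed.loc['y', 'value']

# Move the index back into an ordinary column with a default integer index.
restored = indexed.reset_index()
```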

Useful commands:

df.head()
df.tail()
df.info()
df.describe().transpose()

Handling missing data:

df.dropna()
df.dropna(axis=1)
df.fillna(value=0)
df['A'] = df['A'].fillna(value=0.0)
df['B'] = df['B'].fillna(value=df['B'].mean())
df = df.fillna(value=df.mean(numeric_only=True))
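
A minimal sketch of how dropping and filling behave on a small made-up frame with gaps. Passing numeric_only=True to mean() is an assumption that keeps the call safe if the frame also holds non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'A' and 'B' each have one missing value, 'C' has none.
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, 4.0, 6.0],
    'C': [7.0, 8.0, 9.0],
})

rows_kept = df.dropna()         # drops every row that contains a NaN
cols_kept = df.dropna(axis=1)   # drops every column that contains a NaN

# Fill each column's gaps with that column's own mean (A: 2.0, B: 5.0).
filled = df.fillna(value=df.mean(numeric_only=True))
```

None of these calls modifies df in place; assign the result back if the change should persist.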

groupby:

df.groupby('Year').sum(numeric_only=True).sort_index(ascending=False)
df.groupby(['Year','Sector']).sum(numeric_only=True)
df.groupby('Year').describe()
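
A small worked example of the first two groupby calls. The frame and its Sales column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021],
    'Sector': ['Tech', 'Energy', 'Tech', 'Tech'],
    'Sales': [100, 50, 120, 40],
})

# One row per year, most recent year first.
by_year = df.groupby('Year').sum(numeric_only=True).sort_index(ascending=False)

# Grouping on two keys produces a MultiIndex of (Year, Sector).
by_year_sector = df.groupby(['Year', 'Sector']).sum(numeric_only=True)
tech_2021 = by_year_sector.loc[(2021, 'Tech'), 'Sales']
```

numeric_only=True leaves the text column Sector out of the sum when grouping by Year alone.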

Other operations:

df['Year'].unique()
df['Year'].nunique()
df['Year'].value_counts()
df['col1'].max()
df['col1'].idxmax()
df.sort_values('col1',ascending=False)
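
A quick sketch of what each of these returns, on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2020, 2021, 2021], 'col1': [5, 9, 3]})

years = df['Year'].unique()          # distinct values, in order of appearance
n_years = df['Year'].nunique()       # number of distinct values
counts = df['Year'].value_counts()   # occurrences per value, most frequent first

largest = df['col1'].max()           # the maximum value itself
where_largest = df['col1'].idxmax()  # the index label of the row holding it

sorted_df = df.sort_values('col1', ascending=False)
```

idxmax returns an index label, so it pairs naturally with df.loc to fetch the whole row.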

Joining dataframes:

df_new = pd.concat([df1,df2],axis=1)
df_new = pd.merge(left=df1,right=df2,on='col1',how='inner')
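
To show how the two joins differ, here is a minimal sketch with two made-up frames that share a col1 key. pd.concat with axis=1 places the frames side by side, aligned on the index, while pd.merge matches rows on the key values.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col1': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Side by side on the row index; 'col1' appears twice in the result.
side_by_side = pd.concat([df1, df2], axis=1)

# Inner merge keeps only the keys present in both frames (2 and 3).
merged = pd.merge(left=df1, right=df2, on='col1', how='inner')
```

Other values of how ('left', 'right', 'outer') keep unmatched keys from one or both sides and fill the gaps with NaN.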

Dropping rows where every value is zero:

df.loc[~(df==0).all(axis=1)]
df.loc[(df!=0).any(axis=1)]

see discussion here.
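
A small sketch, on a made-up frame, showing that the two zero-row filters above select the same rows: "not all zero" and "any non-zero" are the same condition.

```python
import pandas as pd

# Hypothetical frame: row 0 is all zeros, rows 1 and 2 are not.
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 0, 2]})

via_all = df.loc[~(df == 0).all(axis=1)]   # drop rows where every value is 0
via_any = df.loc[(df != 0).any(axis=1)]    # keep rows with at least one non-zero

identical = via_all.equals(via_any)
```

Both expressions return a filtered copy and keep the original index labels.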