Learn Pandas in One Article

What is Pandas?

Pandas is one of the most widely used Python packages. It is a simple and powerful tool for data science.


Importing Pandas

import pandas as pd


Pandas Data Structure

Pandas has two main data structures: Series and DataFrame.

Series

Series is a one-dimensional labeled array that can hold any data type.

Example:
pd.Series([1, 2, 3, 4],  index=['a', 'b', 'c', 'd'])
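As a minimal sketch of what the labels are for, the Series above can be read by label, by position, or with a boolean filter. The variable name `s` is only an illustrative choice.

```python
import pandas as pd

# Same Series as in the example above; the name `s` is illustrative.
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

by_label = s['b']          # label-based access
by_position = s.iloc[0]    # position-based access
filtered = s[s > 2]        # boolean filter; matching labels are kept

print(by_label, by_position)
print(filtered)
```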

DataFrame

DataFrame is a two-dimensional, heterogeneous tabular data structure.
Each column of a DataFrame is a Series.

Example:
data_mobile = {'Mobile': ['iPhone', 'Samsung', 'Redmi'], 'Color': ['Red', 'White', 'Black'], 'Price': ['High', 'Medium', 'Low']}

df = pd.DataFrame(data_mobile, columns=['Mobile', 'Color', 'Price'])
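A short sketch, using the mobile data from the example, of how a single column of a DataFrame comes back as a Series. The variable name `color` is an assumption made for illustration.

```python
import pandas as pd

data_mobile = {
    'Mobile': ['iPhone', 'Samsung', 'Redmi'],
    'Color': ['Red', 'White', 'Black'],
    'Price': ['High', 'Medium', 'Low'],
}
df = pd.DataFrame(data_mobile, columns=['Mobile', 'Color', 'Price'])

color = df['Color']           # selecting one column returns a Series
print(type(color).__name__)
print(df.shape)               # (rows, columns)
```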

Reading Files in Pandas

The Pandas library offers a set of functions that can read a wide range of file formats.

pd.read_csv("filename")
pd.read_table("filename")
pd.read_excel("filename")
pd.read_sql(query, connection_object)
pd.read_json(json_string)

Writing Files in Pandas

Similarly, the Pandas library offers functions for writing data to a file.

df.to_csv("filename")
df.to_excel("filename")
df.to_sql(table_name, connection_object)
df.to_json("filename")
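The read and write functions pair up, so a round trip is a quick way to see them working together. The sketch below writes a small DataFrame as CSV and reads it back. It uses an in-memory `io.StringIO` buffer in place of a real file, and the sample data and the names `buffer` and `df_back` are assumptions made for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({'Mobile': ['iPhone', 'Samsung'], 'Price': ['High', 'Medium']})

buffer = io.StringIO()           # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)   # index=False leaves the row labels out of the output
buffer.seek(0)                   # rewind the buffer before reading from it

df_back = pd.read_csv(buffer)
print(df_back)
```

Passing a file name such as "filename" instead of the buffer works the same way and writes to disk.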

Creating Test Data

Pandas also makes it easy to create test data for trying out code segments. The example below needs NumPy.

import numpy as np

pd.DataFrame(np.random.rand(4, 3))  # 4 rows and 3 columns of random floats
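When test data has to be repeatable between runs, one option is a seeded NumPy generator. This is a sketch under assumptions: the seed value, the column names and the variable names `rng` and `test_df` are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # fixed seed so every run produces the same values
test_df = pd.DataFrame(rng.random((4, 3)), columns=['a', 'b', 'c'])

print(test_df.shape)   # 4 rows, 3 columns
print(test_df)
```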

Pandas Operations

Viewing

df.head(n): shows the first n rows of the DataFrame.
df.tail(n): shows the last n rows of the DataFrame.
df.shape: gives the number of rows and columns as a tuple (an attribute, so no parentheses).
df.info(): gives information about the index, data types and memory usage.
df.describe(): gives summary statistics for numerical columns.
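A small sketch of the viewing calls on a toy DataFrame. The `Units` column and its values are hypothetical, added only so that there is a numerical column to summarize.

```python
import pandas as pd

df = pd.DataFrame({
    'Mobile': ['iPhone', 'Samsung', 'Redmi'],
    'Units': [10, 20, 30],     # hypothetical numbers for illustration
})

first_two = df.head(2)     # first 2 rows
last_one = df.tail(1)      # last row
dims = df.shape            # attribute, accessed without parentheses
summary = df.describe()    # statistics for the numerical column only

print(first_two)
print(last_one)
print(dims)
df.info()                  # prints index, dtypes and memory usage
print(summary)
```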

Selecting

There are two ways to select a subset of data from a DataFrame: by position and by label.

  • Selecting by position with iloc:

df.iloc[0]: selects first row of data frame
df.iloc[1]: selects second row of data frame
df.iloc[-1]: selects last row of data frame
df.iloc[:,0]: selects first column of data frame
df.iloc[:,1]: selects second column of data frame

  • Selecting by label using loc:

df.loc[0, 'column1']: selects a single value by row label and column label
df.loc['row1':'row3', 'column1':'column3']: selects a slice by labels (both endpoints are included)
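To make the difference between position and label concrete, here is a minimal sketch on a DataFrame whose row and column labels follow the placeholder names used in this section. The values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'column1': [1, 2, 3], 'column2': [4, 5, 6], 'column3': [7, 8, 9]},
    index=['row1', 'row2', 'row3'],
)

first_row = df.iloc[0]                # by position: first row
first_col = df.iloc[:, 0]             # by position: first column
single = df.loc['row2', 'column3']    # by label: one value
# Label slices with loc include both endpoints.
block = df.loc['row1':'row2', 'column1':'column2']

print(first_row)
print(single)
print(block)
```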

Sorting

df.sort_index(): sorts by labels along an axis
df.sort_values(column1): sorts values by column1 in ascending order
df.sort_values(column2, ascending=False): sorts values by column2 in descending order
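The sorting calls can be sketched on a deliberately shuffled DataFrame. The data and the result names are illustrative assumptions. Each call returns a new DataFrame and leaves `df` unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {'column1': [3, 1, 2], 'column2': ['b', 'c', 'a']},
    index=[2, 0, 1],
)

by_index = df.sort_index()                                   # sort by row labels
by_col1 = df.sort_values('column1')                          # ascending by column1
by_col2_desc = df.sort_values('column2', ascending=False)    # descending by column2

print(by_index)
print(by_col1)
print(by_col2_desc)
```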

Groupby

groupby splits the rows into groups by category so that a function can then be applied to each group.

df.groupby(column): returns a groupby object for values from one column
df.groupby([column1,column2]): returns a groupby object for values from multiple columns
df.groupby(column1)[column2].mean(): returns the mean of the values in column2, grouped by the values in column1
df.groupby(column1)[column2].median(): returns the median of the values in column2, grouped by the values in column1

Calculations

df.mean(): mean
df.median(): median
df.std(): standard deviation
df.max(): max
df.min(): min
df.count(): number of non-null values
df.describe(): summary statistics 
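Two details are worth seeing in action: missing values are skipped by default, and a DataFrame that mixes text and numbers needs `numeric_only=True` for calls like mean() in recent Pandas versions. The sketch below assumes a hypothetical `Units` column with one missing value.

```python
import pandas as pd

df = pd.DataFrame({
    'Mobile': ['iPhone', 'Samsung', 'Redmi'],
    'Units': [10, 20, None],      # hypothetical numbers, one value missing
})

units_mean = df['Units'].mean()           # the missing value is skipped
non_null = df.count()                     # non-null values per column
numeric_mean = df.mean(numeric_only=True) # ignores the text column

print(units_mean)
print(non_null)
print(numeric_mean)
```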

Plotting

df.plot.hist(): histogram
df.plot.scatter(x='column1', y='column2'): scatter plot