Learn Pandas in One Article
What is Pandas?
Pandas is one of the most widely used Python packages. It is a simple yet powerful tool for data analysis and Data Science.
Importing Pandas
import pandas as pd
Pandas Data Structure
Pandas has two main data structures: Series and DataFrame.
Series
Series is a one-dimensional labeled array that can hold any data type.
Example:
pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
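To make the idea of a labeled array concrete, here is a minimal sketch that reuses the Series from the example above. The variable names are chosen only for illustration.

```python
import pandas as pd

# A Series pairs each value with an index label.
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

first_by_label = s['a']        # look up a value by its label
first_by_position = s.iloc[0]  # look up a value by its position
doubled = s * 2                # arithmetic is applied element-wise

print(first_by_label, first_by_position)
print(doubled)
```

The labels stay attached to the values, so the result of s * 2 keeps the index 'a' to 'd'.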
DataFrame
A DataFrame is a two-dimensional, labeled table whose columns can hold different data types.
Each column of a DataFrame is a Series.
Example:
data_mobile = {'Mobile': ['iPhone', 'Samsung', 'Redmi'], 'Color': ['Red', 'White', 'Black'], 'Price': ['High', 'Medium', 'Low']}
df = pd.DataFrame(data_mobile, columns=['Mobile', 'Color', 'Price'])
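As a quick check of the relationship between the two structures, the sketch below rebuilds the DataFrame from the example and pulls out one column. The variable name price is only illustrative.

```python
import pandas as pd

data_mobile = {
    'Mobile': ['iPhone', 'Samsung', 'Redmi'],
    'Color': ['Red', 'White', 'Black'],
    'Price': ['High', 'Medium', 'Low'],
}
df = pd.DataFrame(data_mobile, columns=['Mobile', 'Color', 'Price'])

# Selecting a single column by name returns a Series.
price = df['Price']

print(type(price).__name__)
print(df.shape)  # (number of rows, number of columns)
```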
Reading Files in Pandas
Pandas offers a set of functions that read a wide range of file formats.
pd.read_csv("filename")
pd.read_table("filename")
pd.read_excel("filename")
pd.read_sql(query, connection_object)
pd.read_json("filename")
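The readers can be tried without any files on disk, because they also accept file-like objects. The sketch below wraps small, made-up CSV and JSON strings in io.StringIO. The sample data is an assumption for illustration only.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the contents of a file.
csv_text = (
    "Mobile,Color,Price\n"
    "iPhone,Red,High\n"
    "Samsung,White,Medium\n"
    "Redmi,Black,Low\n"
)
df_csv = pd.read_csv(io.StringIO(csv_text))

# Made-up JSON text: a list of records, one per row.
json_text = '[{"Mobile": "iPhone", "Color": "Red"}, {"Mobile": "Redmi", "Color": "Black"}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.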
Writing Files in Pandas
Similarly, Pandas offers many functions for writing data to a file.
df.to_csv("filename")
df.to_excel("filename")
df.to_sql(table_name, connection_object)
df.to_json("filename")
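A simple way to see a writer in action is a round trip: write a DataFrame out and read it back. This is a minimal sketch that uses a temporary directory. The file name mobiles.csv and the sample data are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Mobile': ['iPhone', 'Samsung'], 'Price': ['High', 'Medium']})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'mobiles.csv')  # hypothetical file name
    df.to_csv(csv_path, index=False)  # index=False leaves the row labels out of the file
    restored = pd.read_csv(csv_path)

# Called without a path, to_json returns the JSON text instead of writing a file.
json_text = df.to_json(orient='records')

print(restored)
print(json_text)
```

Passing index=False is a common choice when the row labels are just the default 0, 1, 2, ... and carry no meaning.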
Creating Test Data
Pandas also makes it easy to create test data for trying out code segments. The example below needs NumPy as well.
import numpy as np
pd.DataFrame(np.random.rand(4, 3)) # 4 rows and 3 columns of random floats
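For repeatable test data, a seeded random generator can be used instead. This sketch assumes NumPy is installed. The seed value and the column names are arbitrary choices.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same numbers on every run.
rng = np.random.default_rng(0)  # the seed 0 is an arbitrary choice

# 4 rows and 3 columns of random floats in [0, 1), with illustrative column names.
df = pd.DataFrame(rng.random((4, 3)), columns=['a', 'b', 'c'])

print(df)
```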
Pandas Operations
Viewing
df.head(n): returns the first n rows of the DataFrame.
df.tail(n): returns the last n rows of the DataFrame.
df.shape: gives the number of rows and columns as a tuple (an attribute, so no parentheses).
df.info(): prints information about the index, data types and memory usage.
df.describe(): gives summary statistics for numerical columns.
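The viewing calls above can be tried together on a small table. The data in this sketch is made up for illustration.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    'Mobile': ['iPhone', 'Samsung', 'Redmi', 'Pixel', 'Nokia'],
    'Price': [999, 799, 199, 699, 399],
})

top2 = df.head(2)        # first two rows
bottom2 = df.tail(2)     # last two rows
n_rows, n_cols = df.shape  # shape is an attribute, not a method
summary = df.describe()  # statistics for the numeric column only

df.info()                # prints index, dtypes and memory usage
print(top2)
print(bottom2)
print(summary)
```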
Selecting
There are two ways to select a subset of data from a DataFrame: by position and by label.
- Selecting by position with iloc:
df.iloc[0]: selects first row of data frame
df.iloc[1]: selects second row of data frame
df.iloc[-1]: selects last row of data frame
df.iloc[:,0]: selects first column of data frame
df.iloc[:,1]: selects second column of data frame
- Selecting by label using loc:
df.loc[row_label, column_label]: selects a single value by row label and column label
df.loc['row1':'row3', 'column1':'column3']: slices rows and columns by label (both endpoints are included)
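The difference between the two selectors is easiest to see on a DataFrame with explicit row labels. In this sketch the labels row1 to row3 and the data are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Mobile': ['iPhone', 'Samsung', 'Redmi'],
        'Color': ['Red', 'White', 'Black'],
        'Price': ['High', 'Medium', 'Low'],
    },
    index=['row1', 'row2', 'row3'],  # illustrative row labels
)

first_row = df.iloc[0]     # by position: first row
first_col = df.iloc[:, 0]  # by position: first column
one_value = df.loc['row2', 'Color']  # by label: a single value

# Label slices include both endpoints, unlike position slices.
block = df.loc['row1':'row2', 'Mobile':'Color']

print(one_value)
print(block)
```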
Sorting
df.sort_index(): sorts by labels along an axis
df.sort_values(column1): sorts values by column1 in ascending order
df.sort_values(column2, ascending=False): sorts values by column2 in descending order
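The sorting calls can be sketched as follows. The data and the shuffled index are made up so that each sort visibly changes the row order.

```python
import pandas as pd

# Made-up data with a deliberately unordered index.
df = pd.DataFrame(
    {'Mobile': ['iPhone', 'Samsung', 'Redmi'], 'Price': [999, 799, 199]},
    index=[2, 0, 1],
)

by_index = df.sort_index()                                 # rows ordered by index label
cheapest_first = df.sort_values('Price')                   # ascending by Price
priciest_first = df.sort_values('Price', ascending=False)  # descending by Price

print(by_index)
print(cheapest_first)
print(priciest_first)
```

Each call returns a new DataFrame and leaves df itself unchanged.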
Groupby
groupby splits the rows into groups that share the same value in one or more columns, so that a function can then be applied to each group.
df.groupby(column): returns a groupby object for values from one column
df.groupby([column1, column2]): returns a groupby object for values from multiple columns
df.groupby(column1)[column2].mean(): returns the mean of the values in column2, grouped by the values in column1
df.groupby(column1)[column2].median(): returns the median of the values in column2, grouped by the values in column1
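A small worked example of grouping and aggregating, assuming a made-up table with a Brand column and a numeric Price column:

```python
import pandas as pd

# Made-up data: two brands with several prices each.
df = pd.DataFrame({
    'Brand': ['Apple', 'Apple', 'Samsung', 'Samsung', 'Samsung'],
    'Price': [900, 1100, 500, 700, 1200],
})

# One result row per distinct Brand value.
mean_price = df.groupby('Brand')['Price'].mean()
median_price = df.groupby('Brand')['Price'].median()

print(mean_price)
print(median_price)
```

The result is a Series indexed by the group keys, so mean_price['Apple'] gives the mean price for that brand.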
Calculations
df.mean(): mean
df.median(): median
df.std(): standard deviation
df.max(): Max
df.min(): Min
df.count(): number of non-null values in each column
df.describe(): summary statistics
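The sketch below applies a few of these calculations to a made-up numeric table with one missing value, to show that each result is computed per column and that missing values are skipped by default.

```python
import pandas as pd

# Made-up numeric data; None becomes a missing value (NaN) in the Price column.
df = pd.DataFrame({
    'Price': [100.0, 200.0, None, 400.0],
    'Stock': [1, 2, 3, 4],
})

means = df.mean()    # one mean per column, missing values skipped
maxima = df.max()    # one maximum per column
counts = df.count()  # non-null values per column

print(means)
print(maxima)
print(counts)
```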
Plotting
df.plot.hist(): histogram
df.plot.scatter(x='column1', y='column2'): scatter plot