All about Pandas
Intro
- Pandas is a powerful open-source library for data analysis and manipulation in Python. It's essentially a toolbox providing data structures and functions to work with data efficiently.
- Data Structures:
- Series: A one-dimensional labeled array that can hold any data type like integers, strings, etc. Think of it as a list with labels.
- DataFrame: A two-dimensional labeled data structure with rows and columns, similar to a spreadsheet.
- Functionality:
- Loading data from various sources like CSV, Excel files, databases.
- Data cleaning and manipulation tasks like filtering, sorting, grouping.
- Performing data analysis with descriptive statistics and time series analysis.
- Data visualization by integrating with libraries like Matplotlib.
- Pandas is often used alongside other data science libraries like NumPy (numerical computing) and Scikit-learn (machine learning).
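The two data structures described above can be sketched as follows. This is a minimal illustration, not the only way to build them; the index labels `a`, `b`, `c` and the variable names are assumptions chosen for the example.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([25, 30, 22], index=["a", "b", "c"], name="Age")

# A DataFrame: a table whose columns are Series sharing one index
people = pd.DataFrame(
    {"Name": ["chaitu-ycr", "Bob", "Charlie"], "Age": [25, 30, 22]},
    index=["a", "b", "c"],
)

print(ages["b"])                     # label-based lookup on a Series
print(people.shape)                  # (number of rows, number of columns)
print(type(people["Age"]).__name__)  # a single column is itself a Series
```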
- Key Components
Data
- The heart of a DataFrame lies in the two-dimensional array of data.
- Each element within the array occupies a specific row and column, creating a structured representation of your data.
Rows
- Each horizontal line in the DataFrame is a row. Selecting a single row (for example with .loc) returns it as a Series (another fundamental pandas structure).
- Rows hold collections of data associated with a particular record or observation.
Columns
- Vertical sections in the DataFrame are columns, each holding data of a specific type (e.g., integers, strings, dates).
- Columns represent attributes or features you're analyzing.
Index (Row Labels)
- Every DataFrame has an index that labels each row. If you don't supply one, pandas uses a default integer range (0, 1, 2, ...); custom labels make it easier to identify and access specific rows.
- These labels can be integers, strings, or any other hashable data type.
Column Labels
- Similar to row labels, DataFrames can have labels assigned to each column, providing descriptive names for the data being held.
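The components listed above can be inspected directly on a small DataFrame. The sketch below assumes hypothetical row labels `r1` to `r3`; the attribute names (`index`, `columns`, `dtypes`) are standard pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["chaitu-ycr", "Bob", "Charlie"], "Age": [25, 30, 22]},
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.dtypes)            # one dtype per column
print(df.to_numpy().shape)  # the underlying two-dimensional data

row = df.loc["r2"]          # selecting one row returns a Series
print(type(row).__name__, row["Age"])
```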
- Essential Operations
- Once you have a DataFrame, pandas equips you with a rich set of tools to manage, analyze, and transform your data.
- Indexing and Selection
- Access specific rows or columns using labels, positions, Boolean criteria (filtering), or powerful indexing techniques like .loc and .iloc.
- Data Cleaning
- Address missing values, outliers, or inconsistencies in your data using methods like .fillna(), .dropna(), or custom logic.
- Data Manipulation
- Reshape your data using .pivot_table(), .groupby(), or other methods to create summaries, aggregations, or pivot tables.
- Sorting and Ranking
- Order your data based on column values using .sort_values(), allowing you to analyze trends and patterns.
- Calculations and Aggregations
- Perform computations on entire columns or groups of rows using vectorized operations, aggregation functions (e.g., .mean(), .sum()), or apply custom functions.
- Merging and Joining
- Combine DataFrames from different sources using operations like pd.concat(), .merge(), or .join() to create comprehensive datasets.
- Data Visualization
- Pandas offers basic built-in plotting through .plot(), which is built on Matplotlib, and it integrates with libraries like Matplotlib and Seaborn to create insightful charts and graphs.
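As a rough sketch of the selection, cleaning, and sorting operations listed above, the example below works on a small made-up table. The fourth row ("Dana") and the missing age are assumptions added only to give the cleaning steps something to act on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["chaitu-ycr", "Bob", "Charlie", "Dana"],  # "Dana" is hypothetical
        "Age": [25, 30, np.nan, 22],                       # one missing value
    }
)

# Indexing and selection
by_label = df.loc[1, "Name"]   # row label 1, column "Name"
by_position = df.iloc[0, 1]    # first row, second column
over_24 = df[df["Age"] > 24]   # Boolean filtering (missing ages never match)

# Data cleaning: two alternative treatments of the missing age
filled = df.fillna({"Age": df["Age"].mean()})  # replace it with the column mean
dropped = df.dropna(subset=["Age"])            # or drop the affected row

# Sorting
youngest_first = dropped.sort_values("Age")
print(youngest_first)
```

Note that `.fillna()`, `.dropna()`, and `.sort_values()` return new objects here, so `df` itself is left unchanged.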
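Grouping, reshaping, and combining can be sketched in a similar way. The `sales` and `managers` tables, their column names, and their values are all invented for illustration.

```python
import pandas as pd

# Hypothetical example data
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "product": ["a", "b", "a", "b"],
        "units": [10, 5, 7, 3],
    }
)
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Raj"]})

# Aggregation: total units per region
units_per_region = sales.groupby("region")["units"].sum()

# Reshaping: regions as rows, products as columns
pivot = sales.pivot_table(index="region", columns="product", values="units", aggfunc="sum")

# Merging: attach each region's manager to the sales rows
merged = sales.merge(managers, on="region", how="left")

# Concatenation: stack two frames that share the same columns
more_sales = pd.DataFrame({"region": ["east"], "product": ["a"], "units": [4]})
all_sales = pd.concat([sales, more_sales], ignore_index=True)

print(units_per_region)
print(pivot)
```

`pd.concat()` is a top-level function that takes a list of objects, while `.merge()` and `.join()` are called on a DataFrame.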
- Beyond the Basics
- The power of pandas extends beyond these fundamental operations. Here's a glimpse into more advanced features.
- Handling Time Series Data
- Pandas offers specialized data types for dates and times, along with functionality for working with time series data (e.g., resampling, date/time manipulation).
- Working with Hierarchical Data
- Data can have hierarchical structures (e.g., multi-index DataFrames). Pandas provides advanced indexing and selection techniques to navigate these complexities.
- Performance Optimization
- For large datasets, pandas offers tools like vectorized operations and the option to work with memory-mapped files to accelerate computations.
- Integration with Other Libraries
- Pandas plays well with other scientific Python libraries like NumPy, SciPy, and Matplotlib, forming a powerful data science ecosystem.
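Two of the advanced features mentioned above, resampling a time series and selecting from a multi-index DataFrame, might look like the following. The dates, readings, and sales figures are made up for the example.

```python
import pandas as pd

# Time series: six daily readings resampled into 3-day totals
days = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([1, 2, 3, 4, 5, 6], index=days)
three_day_totals = readings.resample("3D").sum()

# Hierarchical data: a two-level (region, product) row index
sales = pd.DataFrame(
    {"units": [10, 5, 7, 3]},
    index=pd.MultiIndex.from_tuples(
        [("north", "a"), ("north", "b"), ("south", "a"), ("south", "b")],
        names=["region", "product"],
    ),
)
north = sales.loc["north"]                    # all rows under one outer label
south_b = sales.loc[("south", "b"), "units"]  # one fully specified row

print(three_day_totals)
print(north)
```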
- Install Pandas
pip install pandas
create dataframe
- Pandas empowers you to create DataFrames in various ways, catering to different data sources and preferences.
- Create a DataFrame using the DataFrame() constructor.
- This method is flexible: you specify the data and column names directly, which gives you granular control over the creation process.
import pandas as pd
# Create an empty DataFrame
df = pd.DataFrame()
# Create a DataFrame with data directly
data = {'Name': ['chaitu-ycr', 'Bob', 'Charlie'], 'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print(df)
create dataframe from Lists/NumPy Arrays
- This method is useful when you have your data in a list format.
- Convert existing list or NumPy array structures to DataFrames using the DataFrame() function.
import pandas as pd
data = [['chaitu-ycr', 25], ['Bob', 30], ['Charlie', 22]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
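The same constructor also accepts a two-dimensional NumPy array. A minimal sketch, assuming made-up scores, column names, and row labels:

```python
import numpy as np
import pandas as pd

# A 3x2 NumPy array: each row of the array becomes one DataFrame row
scores = np.array([[90, 85], [72, 60], [88, 95]])

df = pd.DataFrame(
    scores,
    columns=["math", "physics"],  # hypothetical column names
    index=["s1", "s2", "s3"],     # hypothetical row labels
)
print(df)
```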
create dataframe from Dictionaries
- If your data resides in dictionaries, pandas can seamlessly transform it into a DataFrame.
- The dictionary keys become column names, and values are used to populate the respective columns.
import pandas as pd
data = {'Name': ['chaitu-ycr', 'Bob', 'Charlie'], 'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print(df)
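A related layout is a list of dictionaries, one per row. In this sketch the keys still become column names, and a key missing from a record (here, the assumed gap in Charlie's record) is filled with NaN.

```python
import pandas as pd

records = [
    {"Name": "chaitu-ycr", "Age": 25},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie"},  # no "Age" key: pandas fills the cell with NaN
]

df = pd.DataFrame(records)
print(df)
```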
create dataframe from csv/excel files
- Data often resides in CSV, Excel, or other file formats.
- Pandas provides functions like read_csv() and read_excel() to import this data directly into DataFrames.
import pandas as pd
# Replace this path with the actual path to your CSV file
data_path = '_data/random_data_from_web.csv'
# Read the CSV file using pandas.read_csv()
df = pd.read_csv(data_path)
# Print the first few rows of the DataFrame (optional)
print(df.head())
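If no CSV file is at hand, read_csv() can be tried on an in-memory buffer instead. The sketch below uses io.StringIO with made-up CSV text; `usecols` is a standard read_csv() parameter for loading only selected columns.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """Name,Age
chaitu-ycr,25
Bob,30
Charlie,22
"""

# read_csv() accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Load only the columns you need
ages_only = pd.read_csv(io.StringIO(csv_text), usecols=["Age"])

print(df)
print(ages_only)
```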
create dataframe from SQL databases
- Leverage pandas' integration with database libraries (e.g., SQLAlchemy) to fetch data from relational databases and construct DataFrames.
import pandas as pd
from sqlalchemy import create_engine
# Replace these details with your actual database credentials
DATABASE_USER = 'your_username'
DATABASE_PASSWORD = 'your_password'
DATABASE_HOST = 'your_host'
DATABASE_NAME = 'your_database'
# Define the SQL query (replace with your specific query)
sql_query = "SELECT * FROM your_table_name"
# Construct the connection string (adjust the scheme for your database type)
engine_string = (
    f"postgresql://{DATABASE_USER}:{DATABASE_PASSWORD}@"
    f"{DATABASE_HOST}/{DATABASE_NAME}"
)
# Create a SQLAlchemy engine
engine = create_engine(engine_string)
# Read the data from the database using pandas.read_sql_query()
df = pd.read_sql_query(sql_query, engine)
# Print the first few rows of the DataFrame (optional)
print(df.head())
# Release the engine's pooled connections (optional)
engine.dispose()
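To try read_sql_query() without a database server, one option is an in-memory SQLite database from Python's standard library, which pandas accepts as a connection directly. The `people` table and its rows are assumptions made up for this sketch.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a small hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("chaitu-ycr", 25), ("Bob", 30), ("Charlie", 22)],
)
conn.commit()

# Parameterized query: the ? placeholder is filled from params
df = pd.read_sql_query(
    "SELECT name, age FROM people WHERE age > ? ORDER BY age",
    conn,
    params=(24,),
)
conn.close()

print(df)
```

Passing values through `params` rather than formatting them into the SQL string avoids quoting mistakes and SQL injection.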