All about Pandas

Intro

Pandas is a powerful open-source library for data analysis and manipulation in Python. It's essentially a toolbox providing data structures and functions to work with data efficiently.
Data Structures:
- Series: A one-dimensional labeled array that can hold any data type like integers, strings, etc. Think of it as a list with labels.
- DataFrame: A two-dimensional labeled data structure with rows and columns, similar to a spreadsheet.
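As a minimal sketch of the two structures described above (the names, ages, and cities are made-up sample values, not data from this guide):

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([25, 30, 22], index=["alice", "bob", "charlie"], name="age")
print(ages["bob"])  # label-based access to a single value

# A DataFrame: several aligned columns that share one index
people = pd.DataFrame(
    {"age": [25, 30, 22], "city": ["Pune", "Oslo", "Lima"]},
    index=["alice", "bob", "charlie"],
)
print(people)

# Each column of a DataFrame is itself a Series
print(type(people["age"]))
```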
Functionality:
- Loading data from various sources like CSV, Excel files, databases.
- Data cleaning and manipulation tasks like filtering, sorting, grouping.
- Performing data analysis with descriptive statistics and time series analysis.
- Data visualization by integrating with libraries like Matplotlib.
Pandas is often used alongside other data science libraries like NumPy (numerical computing) and Scikit-learn (machine learning).
Key Components of a DataFrame
- Data
  - The heart of a DataFrame lies in the two-dimensional array of data.
  - Each element within the array occupies a specific row and column, creating a structured representation of your data.
- Rows
  - Each horizontal line in the DataFrame represents a row. Selecting a single row (for example with .loc or .iloc) returns it as a Series, another fundamental pandas structure.
  - Rows hold collections of data associated with a particular record or observation.
- Columns
  - Vertical sections in the DataFrame are columns. Each column is a Series with a single data type (e.g., integers, strings, dates).
  - Columns represent attributes or features you're analyzing.
- Index (Row Labels)
  - Every DataFrame has an index of row labels. If you don't supply one, pandas assigns default integer labels (0, 1, 2, ...). Labels make it easier to identify and access specific rows.
  - These labels can be integers, strings, or any other hashable data type.
- Column Labels
  - Similar to row labels, DataFrames can have labels assigned to each column, providing descriptive names for the data being held.
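The components listed above can be inspected directly on a small DataFrame. This is an illustrative sketch; the row labels "r1" to "r3" and the sample values are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [25, 30, 22]},
    index=["r1", "r2", "r3"],  # explicit row labels
)

print(df.index)       # row labels
print(df.columns)     # column labels
print(df.dtypes)      # one data type per column
print(df.to_numpy())  # the underlying two-dimensional array of values

# Selecting a single row by label returns a Series
row = df.loc["r2"]
print(type(row), row["Age"])
```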
Essential Operations
- Once you have a DataFrame, pandas equips you with a rich set of tools to manage, analyze, and transform your data.
  - Indexing and Selection
    - Access specific rows or columns using labels, positions, Boolean criteria (filtering), or powerful indexing techniques like .loc and .iloc.
  - Data Cleaning
    - Address missing values, outliers, or inconsistencies in your data using methods like .fillna(), .dropna(), or custom logic.
  - Data Manipulation
    - Reshape your data using .pivot_table(), .groupby(), or other methods to create summaries, aggregations, or pivot tables.
  - Sorting and Ranking
    - Order your data based on column values using .sort_values(), allowing you to analyze trends and patterns.
  - Calculations and Aggregations
    - Perform computations on entire columns or groups of rows using vectorized operations, aggregation functions (e.g., .mean(), .sum()), or apply custom functions.
  - Merging and Joining
    - Combine DataFrames from different sources using pd.concat(), .merge(), or .join() to create comprehensive datasets.
  - Data Visualization
    - Pandas provides a built-in .plot() method that uses Matplotlib under the hood, and DataFrames can be passed directly to libraries like Matplotlib and Seaborn to create charts and graphs.
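The operations above can be chained into a short workflow. The following is a sketch on a made-up `sales` table (all column names and values are hypothetical). It covers cleaning, selection, a vectorized calculation, grouping, sorting, and merging; plotting is left out so the example needs only pandas.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "units": [10, None, 7, 3, 5],
    "price": [2.0, 3.0, 2.0, 3.0, 4.0],
})

# Data cleaning: treat missing unit counts as zero
sales["units"] = sales["units"].fillna(0)

# Indexing and selection: by position, and by Boolean filter plus column labels
first_row = sales.iloc[0]
north = sales.loc[sales["region"] == "north", ["units", "price"]]

# Vectorized calculation on whole columns
sales["revenue"] = sales["units"] * sales["price"]

# Grouping and aggregation, then sorting the result
revenue_by_region = sales.groupby("region")["revenue"].sum()
ranked = revenue_by_region.sort_values(ascending=False)

# Merging with a second table on the shared "region" column
managers = pd.DataFrame({
    "region": ["north", "south", "east"],
    "manager": ["Ann", "Bob", "Cy"],
})
report = ranked.reset_index().merge(managers, on="region", how="left")
print(report)
```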
Beyond the Basics
- The power of pandas extends beyond these fundamental operations. Here's a glimpse into more advanced features.
  - Handling Time Series Data
    - Pandas offers specialized data types for dates and times, along with functionality for working with time series data (e.g., resampling, date/time manipulation).
  - Working with Hierarchical Data
    - Data can be indexed hierarchically with a MultiIndex on rows or columns. Pandas provides advanced indexing and selection techniques to navigate these structures.
  - Performance Optimization
    - For large datasets, vectorized operations avoid slow Python-level loops, and read_csv() accepts options such as memory_map and chunksize to control how files are read.
  - Integration with Other Libraries
    - Pandas plays well with other scientific Python libraries like NumPy, SciPy, and Matplotlib, forming a powerful data science ecosystem.
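Two of the advanced features above, resampling a time series and selecting from a hierarchical index, might look like the sketch below. The dates, subjects, and scores are invented for illustration.

```python
import pandas as pd

# Time series: 60 daily observations starting 2024-01-01, summed per month
days = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.Series(1.0, index=days)
monthly = daily.resample("MS").sum()  # "MS" groups by month start
print(monthly)

# Hierarchical data: a two-level row index (subject, student)
scores = pd.DataFrame(
    {"score": [90, 85, 70, 95]},
    index=pd.MultiIndex.from_tuples(
        [("math", "ann"), ("math", "bob"), ("art", "ann"), ("art", "bob")],
        names=["subject", "student"],
    ),
)
math_scores = scores.loc["math"]                # select by the outer level
ann_scores = scores.xs("ann", level="student")  # cross-section on the inner level
print(math_scores)
print(ann_scores)
```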
Install Pandas

pip install pandas

create dataframe

Pandas can build a DataFrame from in-memory Python objects, from files, or from database queries.

create dataframe using the DataFrame() function
- This method is flexible: you pass the data directly and can specify column names, row labels, and data types yourself.

import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()

# Create a DataFrame with data directly
data = {'Name': ['chaitu-ycr', 'Bob', 'Charlie'], 'Age': [25, 30, 22]}
df = pd.DataFrame(data)

print(df)
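
To illustrate the extra control the DataFrame() constructor offers, here is a small sketch that sets the row labels, column names, and data type explicitly. The "Height" column and its values are hypothetical additions for the example.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[25, 1.70], [30, 1.82], [22, 1.65]],
    index=["chaitu-ycr", "Bob", "Charlie"],  # row labels
    columns=["Age", "Height"],               # column labels
    dtype="float64",                         # force one dtype for all columns
)

print(df)
print(df.dtypes)
```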

create dataframe from Lists/NumPy Arrays

This method is useful when you have your data in a list format.
Convert existing list or NumPy array structures to DataFrames using the DataFrame() function.

import pandas as pd

data = [['chaitu-ycr', 25], ['Bob', 30], ['Charlie', 22]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

print(df)
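
The same constructor accepts a two-dimensional NumPy array. A minimal sketch, assuming a hypothetical "HeightCm" column alongside the ages used above:

```python
import numpy as np
import pandas as pd

# Each row of the array becomes a row of the DataFrame
arr = np.array([[25, 170], [30, 182], [22, 165]])
df = pd.DataFrame(
    arr,
    columns=["Age", "HeightCm"],
    index=["chaitu-ycr", "Bob", "Charlie"],
)

print(df)
```

Because a NumPy array has a single dtype, every column of the resulting DataFrame starts with that same dtype.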

create dataframe from Dictionaries

If your data resides in dictionaries, pandas can seamlessly transform it into a DataFrame.
The dictionary keys become column names, and values are used to populate the respective columns.

import pandas as pd

data = {'Name': ['chaitu-ycr', 'Bob', 'Charlie'], 'Age': [25, 30, 22]}
df = pd.DataFrame(data)

print(df)
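
Dictionaries can also arrive one per record rather than one per column. The sketch below shows that variant, along with DataFrame.from_dict() for a dictionary keyed by row label; the sample records are illustrative.

```python
import pandas as pd

# A list of dictionaries: each dictionary is one row
records = [
    {"Name": "chaitu-ycr", "Age": 25},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie"},  # no "Age" key: the cell becomes NaN
]
df = pd.DataFrame(records)
print(df)

# A dictionary keyed by row label: orient="index" treats keys as rows
by_name = pd.DataFrame.from_dict(
    {"Bob": {"Age": 30}, "Charlie": {"Age": 22}},
    orient="index",
)
print(by_name)
```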

create dataframe from csv/excel files

Data often resides in CSV, Excel, or other file formats.
Pandas provides functions like read_csv() and read_excel() to import this data directly into DataFrames.

import pandas as pd

# Replace this with the actual path to your CSV file
data_path = '_data/random_data_from_web.csv'

# Read the CSV file using pandas.read_csv()
df = pd.read_csv(data_path)

# Print the first few rows of the DataFrame (optional)
print(df.head())
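
To try read_csv() without a file on disk, the sketch below reads CSV text from an in-memory buffer and uses two common options, usecols and parse_dates. The column names and rows are made up for the example. read_excel() is called in a similar way but needs an optional engine such as openpyxl installed, so it is not shown here.

```python
import io

import pandas as pd

# Inline CSV text stands in for a real file
csv_text = """name,age,joined
chaitu-ycr,25,2021-03-01
Bob,30,2020-11-15
Charlie,22,2023-07-09
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "joined"],  # load only these columns
    parse_dates=["joined"],      # convert this column to datetime
)

print(df)
print(df.dtypes)
```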

create dataframe from SQL databases

Use pandas' integration with database libraries (e.g., SQLAlchemy) to fetch data from relational databases and construct DataFrames. The example below assumes a PostgreSQL database and an installed driver such as psycopg2.

import pandas as pd
from sqlalchemy import create_engine

# Replace these details with your actual database credentials
DATABASE_USER = 'your_username'
DATABASE_PASSWORD = 'your_password'
DATABASE_HOST = 'your_host'
DATABASE_NAME = 'your_database'

# Define the SQL query (replace with your specific query)
sql_query = "SELECT * FROM your_table_name"  # Change this to your desired query

# Construct the connection string using SQLAlchemy
engine_string = f"postgresql://{DATABASE_USER}:{DATABASE_PASSWORD}@" \
                f"{DATABASE_HOST}/{DATABASE_NAME}"  # Adjust for your database type

# Create a SQLAlchemy engine
engine = create_engine(engine_string)

# Read the data from the database using pandas.read_sql_query()
df = pd.read_sql_query(sql_query, engine)

# Print the first few rows of the DataFrame (optional)
print(df.head())

# Release the engine's connection pool (optional)
engine.dispose()
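
For a version that runs without a database server, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The `people` table and its rows are hypothetical. pandas accepts a sqlite3 connection directly, so SQLAlchemy is not required in this case.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with one small table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("chaitu-ycr", 25), ("Bob", 30), ("Charlie", 22)],
)
conn.commit()

# Pass values through params instead of formatting them into the SQL string
df = pd.read_sql_query(
    "SELECT name, age FROM people WHERE age > ? ORDER BY age",
    conn,
    params=(24,),
)
conn.close()

print(df)
```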
