Add Data Sources

Release v3 is currently in beta. This documentation reflects the features and functionality in progress and may change before the final release.

What type of data does PandasAI support?

PandasAI mission is to make data analysis and manipulation more efficient and accessible to everyone. You can work with data in various ways:

CSV and Excel Files: Load data directly from files using simple Python functions
SQL Databases: Connect to various SQL databases using our extensions
Cloud Data: Work with enterprise-scale data using our specialized extensions (requires Enterprise License)

Let’s start with the basics of loading CSV files, and then we’ll explore the different extensions available.

How to work with CSV files in PandasAI?

Loading data from CSV files is straightforward with PandasAI:

import pandasai as pai

# Basic CSV loading
file = pai.read_csv("data.csv")

# Use the semantic layer on CSV
df = pai.create(
    path="company/sales-data",
    df = file,
    description="Sales data from our retail stores",
    columns={
        "transaction_id": {"type": "string", "description": "Unique identifier for each sale"},
        "sale_date": {"type": "datetime", "description": "Date and time of the sale"},
        "product_id": {"type": "string", "description": "Product identifier"},
        "quantity": {"type": "integer", "description": "Number of units sold"},
        "price": {"type": "float", "description": "Price per unit"}
    },
)

# Chat with the dataframe
response = df.chat("Which product has the highest sales?")

How to work with SQL in PandasAI?

PandasAI provides a sql extension for you to work with SQL, PostgreSQL, MySQL, and CockroachDB databases. To make the library lightweight and easy to use, the basic installation of the library does not include this extension. It can be easily installed using pip with the specific database you want to use:

pip install pandasai-sql[postgres]
pip install pandasai-sql[mysql]
pip install pandasai-sql[cockroachdb]

Once you have installed the extension, you can use the semantic data layer and perform data transformations.

sql_table = pai.create(
    path="example/mysql-dataset",
    description="Heart disease dataset from MySQL database",
    source={
        "type": "mysql",
        "connection": {
            "host": "database.example.com",
            "port": 3306,
            "user": "${DB_USER}",
            "password": "${DB_PASSWORD}",
            "database": "medical_data"
        },
        "table": "heart_data",
        "columns": [
            {"name": "Age", "type": "integer", "description": "Age of the patient in years"},
            {"name": "Sex", "type": "string", "description": "Gender of the patient (M = male, F = female)"},
            {"name": "ChestPainType", "type": "string", "description": "Type of chest pain (ATA, NAP, ASY, TA)"},
            {"name": "RestingBP", "type": "integer", "description": "Resting blood pressure in mm Hg"},
            {"name": "Cholesterol", "type": "integer", "description": "Serum cholesterol in mg/dl"},
            {"name": "FastingBS", "type": "integer", "description": "Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)"},
            {"name": "RestingECG", "type": "string", "description": "Resting electrocardiogram results (Normal, ST, LVH)"},
            {"name": "MaxHR", "type": "integer", "description": "Maximum heart rate achieved"},
            {"name": "ExerciseAngina", "type": "string", "description": "Exercise-induced angina (Y = yes, N = no)"},
            {"name": "Oldpeak", "type": "float", "description": "ST depression induced by exercise relative to rest"},
            {"name": "ST_Slope", "type": "string", "description": "Slope of the peak exercise ST segment (Up, Flat, Down)"},
            {"name": "HeartDisease", "type": "integer", "description": "Heart disease diagnosis (1 = present, 0 = absent)"}
        ]
    }
)

How to work with Enterprise Cloud Data in PandasAI?

PandasAI provides Enterprise Edition extensions for connecting to cloud data. These extensions require an Enterprise License or Data Platform team plan. Once you have installed a enterprise cloud data extension, you can use it to connect to your cloud data.

Snowflake extension (ee)

First, install the extension:

poetry add pandasai-snowflake
# or
pip install pandasai-snowflake

Then use it:

name: sales_data

source:
  type: snowflake
  connection:
    account: your-account
    warehouse: your-warehouse
    database: your-database
    schema: your-schema
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
  table: sales_data

destination:
  type: local
  format: parquet
  path: company/snowflake-sales

columns:
  - name: transaction_id
    type: string
    description: Unique identifier for each sale
  - name: sale_date
    type: datetime
    description: Date and time of the sale
  - name: product_id
    type: string
    description: Product identifier
  - name: quantity
    type: integer
    description: Number of units sold
  - name: price
    type: float
    description: Price per unit

transformations:
  - type: convert_timezone
    params:
      column: sale_date
      from: UTC
      to: America/Chicago
  - type: calculate
    params:
      column: revenue
      formula: quantity * price
  - type: round
    params:
      column: revenue
      decimals: 2

update_frequency: daily

order_by:
  - sale_date DESC

limit: 100000

Databricks extension (ee)

First, install the extension:

poetry add pandasai-databricks
# or
pip install pandasai-databricks

Then use it:

name: customer_data

source:
  type: databricks
  connection:
    host: your-workspace-url
    token: ${DATABRICKS_TOKEN}
  table: customers

destination:
  type: local
  format: parquet
  path: company/databricks-customers

columns:
  - name: customer_id
    type: string
    description: Unique identifier for each customer
  - name: name
    type: string
    description: Customer's full name
  - name: email
    type: string
    description: Customer's email address
  - name: join_date
    type: datetime
    description: Date when customer joined
  - name: total_purchases
    type: integer
    description: Total number of purchases made

transformations:
  - type: anonymize
    params:
      columns: [email, name]
  - type: convert_timezone
    params:
      column: join_date
      from: UTC
      to: Europe/London
  - type: calculate
    params:
      column: customer_tier
      formula: "CASE WHEN total_purchases > 100 THEN 'Gold' WHEN total_purchases > 50 THEN 'Silver' ELSE 'Bronze' END"

update_frequency: daily

order_by:
  - join_date DESC

limit: 100000

BigQuery extension (ee)

First, install the extension:

poetry add pandasai-bigquery
# or
pip install pandasai-bigquery

Then use it:

name: inventory_data

source:
  type: bigquery
  connection:
    project_id: your-project-id
    credentials: ${GOOGLE_APPLICATION_CREDENTIALS}
  table: inventory

destination:
  type: local
  format: parquet
  path: company/bigquery-inventory

columns:
  - name: product_id
    type: string
    description: Unique identifier for each product
  - name: product_name
    type: string
    description: Name of the product
  - name: category
    type: string
    description: Product category
  - name: stock_level
    type: integer
    description: Current quantity in stock
  - name: last_updated
    type: datetime
    description: Last inventory update timestamp

transformations:
  - type: categorize
    params:
      column: stock_level
      bins: [0, 20, 100, 500]
      labels: ["Low", "Medium", "High"]
  - type: extract
    params:
      column: product_name
      pattern: "(.*?)\\s*-\\s*(.*)"
      into: [brand, model]
  - type: convert_timezone
    params:
      column: last_updated
      from: UTC
      to: Asia/Tokyo

update_frequency: hourly

order_by:
  - last_updated DESC

limit: 50000

Oracle extension (ee)

First, install the extension:

poetry add pandasai-oracle
# or
pip install pandasai-oracle

Then use it:

name: sales_data

source:
  type: oracle
  connection:
    host: your-host
    port: 1521
    service_name: your-service
    user: ${ORACLE_USER}
    password: ${ORACLE_PASSWORD}
  table: sales_data

destination:
  type: local
  format: parquet
  path: company/oracle-sales

columns:
  - name: transaction_id
    type: string
    description: Unique identifier for each sale
  - name: sale_date
    type: datetime
    description: Date and time of the sale
  - name: product_id
    type: string
    description: Product identifier
  - name: quantity
    type: integer
    description: Number of units sold
  - name: price
    type: float
    description: Price per unit

transformations:
  - type: convert_timezone
    params:
      column: sale_date
      from: UTC
      to: Australia/Sydney
  - type: calculate
    params:
      column: total_amount
      formula: quantity * price
  - type: round
    params:
      column: total_amount
      decimals: 2
  - type: calculate
    params:
      column: discount
      formula: "CASE WHEN quantity > 10 THEN 0.1 WHEN quantity > 5 THEN 0.05 ELSE 0 END"

update_frequency: daily

order_by:
  - sale_date DESC

limit: 100000

Yahoo Finance extension

First, install the extension:

poetry add pandasai-yfinance
# or
pip install pandasai-yfinance

Then use it:

name: stock_data

source:
  type: yahoo_finance
  symbols: 
    - GOOG
    - MSFT
    - AAPL
  start_date: 2023-01-01
  end_date: 2023-12-31

destination:
  type: local
  format: parquet
  path: company/market-data

columns:
  - name: date
    type: datetime
    description: Date of the trading day
  - name: open
    type: float
    description: Opening price of the stock
  - name: high
    type: float
    description: Highest price of the stock during the day
  - name: low
    type: float
    description: Lowest price of the stock during the day
  - name: close
    type: float
    description: Closing price of the stock
  - name: volume
    type: integer
    description: Number of shares traded during the day

transformations:
  - type: calculate
    params:
      column: daily_return
      formula: (close - open) / open * 100
  - type: calculate
    params:
      column: price_range
      formula: high - low
  - type: round
    params:
      columns: [daily_return, price_range]
      decimals: 2
  - type: convert_timezone
    params:
      column: date
      from: UTC
      to: America/New_York

update_frequency: daily

order_by:
  - date DESC

limit: 100000

All data extensions

extension	install with poetry	install with pip	need ee license?
pandasai_sql	`poetry add pandasai-sql[postgres]`	`pip install pandasai-sql[postgres]`	No
pandasai_yfinance	`poetry add pandasai-yfinance`	`pip install pandasai-yfinance`	No
pandasai_snowflake	`poetry add pandasai-snowflake`	`pip install pandasai-snowflake`	Yes
pandasai_databricks	`poetry add pandasai-databricks`	`pip install pandasai-databricks`	Yes
pandasai_bigquery	`poetry add pandasai-bigquery`	`pip install pandasai-bigquery`	Yes
pandasai_oracle	`poetry add pandasai-oracle`	`pip install pandasai-oracle`	Yes

Overview

Data layer

Natural Language

Data Platform

Advanced Usage

Backwards Compatibility

About

What type of data does PandasAI support?

How to work with CSV files in PandasAI?

How to work with SQL in PandasAI?

How to work with Enterprise Cloud Data in PandasAI?

Snowflake extension (ee)

Databricks extension (ee)

BigQuery extension (ee)

Oracle extension (ee)

Yahoo Finance extension

All data extensions

Overview

Data layer

Natural Language

Data Platform

Advanced Usage

Backwards Compatibility

About

​What type of data does PandasAI support?

​How to work with CSV files in PandasAI?

​How to work with SQL in PandasAI?

​How to work with Enterprise Cloud Data in PandasAI?

​Snowflake extension (ee)

​Databricks extension (ee)

​BigQuery extension (ee)

​Oracle extension (ee)

​Yahoo Finance extension

​All data extensions

What type of data does PandasAI support?

How to work with CSV files in PandasAI?

How to work with SQL in PandasAI?

How to work with Enterprise Cloud Data in PandasAI?

Snowflake extension (ee)

Databricks extension (ee)

BigQuery extension (ee)

Oracle extension (ee)

Yahoo Finance extension

All data extensions