Introduction to the Semantic Agent

The SemanticAgent (currently in beta) extends the capabilities of the PandasAI library by adding a semantic layer to its results. Unlike the standard Agent, the SemanticAgent generates a JSON query, which can then be used to produce Python or SQL code. This approach ensures more accurate and interpretable outputs.

Note: Usage of the Semantic Agent in production is subject to a license. For more details, refer to the license documentation. If you plan to use it in production, contact us.

Instantiating the Semantic Agent

Creating an instance of the SemanticAgent is similar to creating an instance of an Agent.

from pandasai.ee.agents.semantic_agent import SemanticAgent
import pandas as pd

df = pd.read_csv('revenue.csv')

agent = SemanticAgent(df, config=config)
agent.chat("What are the top 5 revenue streams?")

How the Semantic Agent Works

The Semantic Agent operates in two main steps:

  1. Schema generation
  2. JSON query generation

Schema Generation

The first step is schema generation, which structures the data into a schema that the Semantic Agent can use to generate JSON queries. By default, this schema is automatically created, but you can also provide a custom schema if necessary.

Automatic Schema Generation

By default, the SemanticAgent considers all dataframes passed to it and generates an appropriate schema.

Custom Schema

To provide a custom schema, pass a schema parameter during the instantiation of the SemanticAgent.

salaries_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Salary": [5000, 6000, 4500, 7000, 5500],
    }
)

employees_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Name": ["John", "Emma", "Liam", "Olivia", "William"],
        "Department": ["HR", "Marketing", "IT", "Marketing", "Finance"],
    }
)

schema = [
    {
        "name": "Employees",
        "table": "Employees",
        "measures": [
            {
                "name": "count",
                "type": "count",
                "sql": "EmployeeID"
            }
        ],
        "dimensions": [
            {
                "name": "EmployeeID",
                "type": "string",
                "sql": "EmployeeID"
            },
            {
                "name": "Department",
                "type": "string",
                "sql": "Department"
            }
        ],
        "joins": [
            {
                "name": "Salaries",
                "join_type":"left",
                "sql": "Employees.EmployeeID = Salaries.EmployeeID"
            }
        ]
    },
    {
        "name": "Salaries",
        "table": "Salaries",
        "measures": [
            {
                "name": "count",
                "type": "count",
                "sql": "EmployeeID"
            },
            {
                "name": "avg_salary",
                "type": "avg",
                "sql": "Salary"
            },
            {
                "name": "max_salary",
                "type": "max",
                "sql": "Salary"
            }
        ],
        "dimensions": [
            {
                "name": "EmployeeID",
                "type": "string",
                "sql": "EmployeeID"
            },
            {
                "name": "Salary",
                "type": "string",
                "sql": "Salary"
            }
        ],
        "joins": [
            {
                "name": "Employees",
                "join_type":"left",
                "sql": "Contracts.contract_code = Fees.contract_id"
            }
        ]
    }
]

agent = SemanticAgent([employees_df, salaries_df], schema=schema)

JSON Query Generation

The second step involves generating a JSON query based on the schema. This query is then used to produce the Python or SQL code required for execution.

Example JSON Query

Here’s an example of a JSON query generated by the SemanticAgent:

{
  "type": "number",
  "dimensions": [],
  "measures": ["Salaries.avg_salary"],
  "timeDimensions": [],
  "filters": [],
  "order": []
}

This query is interpreted by the Semantic Agent and converted into executable Python or SQL code.

Deep Dive into the Schema and the Query

Understanding the Schema Structure

A schema in the SemanticAgent is a comprehensive representation of the data, including tables, columns, measures, dimensions, and relationships between tables. Here’s a breakdown of its components:

Measures

Measures are the quantitative metrics used in the analysis, such as sums, averages, counts, etc.

  • name: The identifier for the measure.
  • type: The type of aggregation (e.g., count, avg, sum, max, min).
  • sql: The column or expression in SQL to compute the measure.

Example:

{
  "name": "avg_salary",
  "type": "avg",
  "sql": "Salary"
}

Dimensions

Dimensions are the categorical variables used to slice and dice the data.

  • name: The identifier for the dimension.
  • type: The data type (e.g., string, date).
  • sql: The column or expression in SQL to reference the dimension.

Example:

{
  "name": "Department",
  "type": "string",
  "sql": "Department"
}

Joins

Joins define the relationships between tables, specifying how they should be connected in queries.

  • name: The name of the related table.
  • join_type: The type of join (e.g., left, right, inner).
  • sql: The SQL expression to perform the join.

Example:

{
  "name": "Salaries",
  "join_type": "left",
  "sql": "Employees.EmployeeID = Salaries.EmployeeID"
}

Understanding the Query Structure

The JSON query is a structured representation of the request, specifying what data to retrieve and how to process it. Here’s a detailed look at its fields:

Type

The type of query determines the format of the result, such as a single number, a table, or a chart.

  • type: Can be “number”, “pie”, “bar”, “line”.

Example:

{
  "type": "number",
  ...
}

Dimensions

Columns used to group the data. In an SQL GROUP BY clause, these would be the columns listed.

  • dimensions: An array of dimension identifiers.

Example:

{
  ...,
  "dimensions": ["Department"]
}

Measures

Columns used to calculate data, typically involving aggregate functions like sum, average, count, etc.

  • measures: An array of measure identifiers.

Example:

{
  ...,
  "measures": ["Salaries.avg_salary"]
}

Time Dimensions

Columns used to group the data by time, often involving date functions. Each timeDimensions entry specifies a time period and its granularity. The dateRange field allows various formats, including specific dates such as ["2022-01-01", "2023-03-31"], relative periods like “last week”, “last month”, “this month”, “this week”, “today”, “this year”, and “last year”.

Example:

{
  ...,
  "timeDimensions": [
    {
      "dimension": "Sales.time_period",
      "dateRange": ["2023-01-01", "2023-03-31"],
      "granularity": "day"
    }
  ]
}

Filters

Conditions to filter the data, equivalent to SQL WHERE clauses. Each filter specifies a member, an operator, and a set of values. The operators allowed include: “equals”, “notEquals”, “contains”, “notContains”, “startsWith”, “endsWith”, “gt” (greater than), “gte” (greater than or equal to), “lt” (less than), “lte” (less than or equal to), “set”, “notSet”, “inDateRange”, “notInDateRange”, “beforeDate”, and “afterDate”.

  • filters: An array of filter conditions.

Example:

{
  ...,
  "filters": [
    {
      "member": "Ticket.category",
      "operator": "notEquals",
      "values": ["null"]
    }
  ]
}

Order

Columns used to order the data, equivalent to SQL ORDER BY clauses. Each entry in the order array specifies an identifier and the direction of sorting. The direction can be either “asc” for ascending or “desc” for descending order.

  • order: An array of ordering specifications.

Example:

{
  ...,
  "order": [
    {
      "id": "Contratti.contract_count",
      "direction": "asc"
    }
  ]
}

Combining the Components

When these components come together, they form a complete query that the Semantic Agent can interpret and execute. Here’s an example that combines all elements:

{
  "type": "table",
  "dimensions": ["Department"],
  "measures": ["Salaries.avg_salary"],
  "timeDimensions": [],
  "filters": [
    {
      "member": "Department",
      "operator": "equals",
      "values": ["Marketing", "IT"]
    }
  ],
  "order": [
    {
      "measure": "Salaries.avg_salary",
      "direction": "desc"
    }
  ]
}

This query translates to an SQL statement like:

SELECT Department, AVG(Salary) AS avg_salary,
FROM Employees
JOIN Salaries ON Employees.EmployeeID = Salaries.EmployeeID
WHERE Department IN ('Marketing', 'IT')
GROUP BY Department
ORDER BY avg_salary DESC;