Semantic Agent
Enhance the PandasAI library with the Semantic Agent for more accurate and interpretable results.
Introduction to the Semantic Agent
The SemanticAgent
(currently in beta) extends the capabilities of the PandasAI library by adding a semantic layer to its results. Unlike the standard Agent
, the SemanticAgent
generates a JSON query, which can then be used to produce Python or SQL code. This approach ensures more accurate and interpretable outputs.
Note: Usage of the Semantic Agent in production is subject to a license. For more details, refer to the license documentation. If you plan to use it in production, contact us.
Instantiating the Semantic Agent
Creating an instance of the SemanticAgent
is similar to creating an instance of an Agent
.
from pandasai.ee.agents.semantic_agent import SemanticAgent
import pandas as pd
df = pd.read_csv('revenue.csv')
agent = SemanticAgent(df, config=config)
agent.chat("What are the top 5 revenue streams?")
How the Semantic Agent Works
The Semantic Agent operates in two main steps:
- Schema generation
- JSON query generation
Schema Generation
The first step is schema generation, which structures the data into a schema that the Semantic Agent can use to generate JSON queries. By default, this schema is automatically created, but you can also provide a custom schema if necessary.
Automatic Schema Generation
By default, the SemanticAgent
considers all dataframes passed to it and generates an appropriate schema.
Custom Schema
To provide a custom schema, pass a schema
parameter during the instantiation of the SemanticAgent
.
salaries_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Salary": [5000, 6000, 4500, 7000, 5500],
}
)
employees_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Name": ["John", "Emma", "Liam", "Olivia", "William"],
"Department": ["HR", "Marketing", "IT", "Marketing", "Finance"],
}
)
schema = [
{
"name": "Employees",
"table": "Employees",
"measures": [
{
"name": "count",
"type": "count",
"sql": "EmployeeID"
}
],
"dimensions": [
{
"name": "EmployeeID",
"type": "string",
"sql": "EmployeeID"
},
{
"name": "Department",
"type": "string",
"sql": "Department"
}
],
"joins": [
{
"name": "Salaries",
"join_type":"left",
"sql": "Employees.EmployeeID = Salaries.EmployeeID"
}
]
},
{
"name": "Salaries",
"table": "Salaries",
"measures": [
{
"name": "count",
"type": "count",
"sql": "EmployeeID"
},
{
"name": "avg_salary",
"type": "avg",
"sql": "Salary"
},
{
"name": "max_salary",
"type": "max",
"sql": "Salary"
}
],
"dimensions": [
{
"name": "EmployeeID",
"type": "string",
"sql": "EmployeeID"
},
{
"name": "Salary",
"type": "string",
"sql": "Salary"
}
],
"joins": [
{
"name": "Employees",
"join_type":"left",
"sql": "Contracts.contract_code = Fees.contract_id"
}
]
}
]
agent = SemanticAgent([employees_df, salaries_df], schema=schema)
JSON Query Generation
The second step involves generating a JSON query based on the schema. This query is then used to produce the Python or SQL code required for execution.
Example JSON Query
Here’s an example of a JSON query generated by the SemanticAgent
:
{
"type": "number",
"dimensions": [],
"measures": ["Salaries.avg_salary"],
"timeDimensions": [],
"filters": [],
"order": []
}
This query is interpreted by the Semantic Agent and converted into executable Python or SQL code.
Deep Dive into the Schema and the Query
Understanding the Schema Structure
A schema in the SemanticAgent
is a comprehensive representation of the data, including tables, columns, measures, dimensions, and relationships between tables. Here’s a breakdown of its components:
Measures
Measures are the quantitative metrics used in the analysis, such as sums, averages, counts, etc.
- name: The identifier for the measure.
- type: The type of aggregation (e.g.,
count
,avg
,sum
,max
,min
). - sql: The column or expression in SQL to compute the measure.
Example:
{
"name": "avg_salary",
"type": "avg",
"sql": "Salary"
}
Dimensions
Dimensions are the categorical variables used to slice and dice the data.
- name: The identifier for the dimension.
- type: The data type (e.g., string, date).
- sql: The column or expression in SQL to reference the dimension.
Example:
{
"name": "Department",
"type": "string",
"sql": "Department"
}
Joins
Joins define the relationships between tables, specifying how they should be connected in queries.
- name: The name of the related table.
- join_type: The type of join (e.g.,
left
,right
,inner
). - sql: The SQL expression to perform the join.
Example:
{
"name": "Salaries",
"join_type": "left",
"sql": "Employees.EmployeeID = Salaries.EmployeeID"
}
Understanding the Query Structure
The JSON query is a structured representation of the request, specifying what data to retrieve and how to process it. Here’s a detailed look at its fields:
Type
The type of query determines the format of the result, such as a single number, a table, or a chart.
- type: Can be “number”, “pie”, “bar”, “line”.
Example:
{
"type": "number",
...
}
Dimensions
Columns used to group the data. In an SQL GROUP BY
clause, these would be the columns listed.
- dimensions: An array of dimension identifiers.
Example:
{
...,
"dimensions": ["Department"]
}
Measures
Columns used to calculate data, typically involving aggregate functions like sum, average, count, etc.
- measures: An array of measure identifiers.
Example:
{
...,
"measures": ["Salaries.avg_salary"]
}
Time Dimensions
Columns used to group the data by time, often involving date functions. Each timeDimensions
entry specifies a time period and its granularity. The dateRange
field allows various formats, including specific dates such as ["2022-01-01", "2023-03-31"]
, relative periods like “last week”, “last month”, “this month”, “this week”, “today”, “this year”, and “last year”.
Example:
{
...,
"timeDimensions": [
{
"dimension": "Sales.time_period",
"dateRange": ["2023-01-01", "2023-03-31"],
"granularity": "day"
}
]
}
Filters
Conditions to filter the data, equivalent to SQL WHERE
clauses. Each filter specifies a member, an operator, and a set of values. The operators allowed include: “equals”, “notEquals”, “contains”, “notContains”, “startsWith”, “endsWith”, “gt” (greater than), “gte” (greater than or equal to), “lt” (less than), “lte” (less than or equal to), “set”, “notSet”, “inDateRange”, “notInDateRange”, “beforeDate”, and “afterDate”.
- filters: An array of filter conditions.
Example:
{
...,
"filters": [
{
"member": "Ticket.category",
"operator": "notEquals",
"values": ["null"]
}
]
}
Order
Columns used to order the data, equivalent to SQL ORDER BY
clauses. Each entry in the order
array specifies an identifier and the direction of sorting. The direction can be either “asc” for ascending or “desc” for descending order.
- order: An array of ordering specifications.
Example:
{
...,
"order": [
{
"id": "Contratti.contract_count",
"direction": "asc"
}
]
}
Combining the Components
When these components come together, they form a complete query that the Semantic Agent can interpret and execute. Here’s an example that combines all elements:
{
"type": "table",
"dimensions": ["Department"],
"measures": ["Salaries.avg_salary"],
"timeDimensions": [],
"filters": [
{
"member": "Department",
"operator": "equals",
"values": ["Marketing", "IT"]
}
],
"order": [
{
"measure": "Salaries.avg_salary",
"direction": "desc"
}
]
}
This query translates to an SQL statement like:
SELECT Department, AVG(Salary) AS avg_salary,
FROM Employees
JOIN Salaries ON Employees.EmployeeID = Salaries.EmployeeID
WHERE Department IN ('Marketing', 'IT')
GROUP BY Department
ORDER BY avg_salary DESC;
Was this page helpful?