In this post, originally written by Glenn Paulley and posted to sybase.com in April of 2009, Glenn describes how recursive queries work in general, and in SQL Anywhere.
Recursive SQL queries--also known as bill-of-materials or transitive closure queries--have been supported in SQL Anywhere since Version 9. There is a wide variety of resources available online and in book form about SQL recursion, including reference [1] below; what I want to discuss in particular are
- the (minor) differences that exist between SQL Anywhere's implementation and the recursive query syntax in the ANSI/ISO SQL:2008 standard;
- three connection option settings that one should consider when utilizing recursive queries in an application; and
- a brief description of the graphical plan for a recursive query.
In a subsequent post, I'll describe how to use recursive queries with Hibernate, since Hibernate's HQL language does not directly support recursion.
Example
A simple, but illustrative, example of a recursive query is the following: over the SQL Anywhere example database, demo.db, produce a list of all managers and the number of employees managed by each. A glance at the Employees table in demo.db indicates the bill-of-materials nature of the problem: each employee has a self-referencing foreign key to their manager, and most managers, though not all, themselves also have managers. Here's a straightforward listing from the Employees table of immediate managers:
The problem is that this simple result set---ignore the redundant ManagerID column for the moment---doesn't show that employees 501 and 1293 have no managers (they "manage themselves"), 902 works for 1293, and 703 and 1576 work for 902--which means that their respective counts have to be aggregated with their managers. The result set we want is:
Manager ID | count(*) |
---|---|
501 | 22 |
703 | 8 |
1576 | 15 |
902 | 43 |
1293 | 53 |
The issue, of course, is how to conjure a recursive query to compute it.
SQL Standard recursion fundamentals
Recursive queries in SQL are a very specific form of recursion; Jim Melton and Alan Simon present a very useful description of the SQL Standard's implementation in reference [1] (section 9.13). In a nutshell, one forms a recursive SQL query by:
- Constructing an initial result set with a seed query; and then
- UNIONing additional rows to that result set with a second query specification, using the recursive intermediate result as one of the table references in that SELECT statement.
Recursive execution is halted once a fixpoint is reached. Fixpoint semantics are more commonly used in logic programming languages such as Datalog, upon which Semmle's .QL query language is based. In ANSI standard SQL [1], fixpoint semantics are loosely defined as the situation
...when further efforts to identify more rows to insert into the result find no such additional rows.
In ANSI-standard SQL, recursive queries are formed using common table expressions that contain the WITH RECURSIVE clause. Below is a modified subset of SQL grammar that is pertinent to recursive queries:
<query expression> ::= [ <with clause> ] <query expression body> [ <order by clause> ]
<with clause> ::= WITH [ RECURSIVE ] <with list>
<with list> ::= <with list element> [ { <comma> <with list element> }... ]
<with list element> ::= <query name> [ <left paren> <with column list> <right paren> ] AS <table subquery> [ <search or cycle clause> ]
<with column list> ::= <column name list>
<query expression body> ::= <query term> | <query expression body> UNION ALL <query term>
<query term> ::= <query primary>
<query primary> ::= <simple table> | <left paren> <query expression body> [ <order by clause> ] <right paren>
<simple table> ::= <query specification>
<order by clause> ::= ORDER BY <sort specification list>
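To make the grammar concrete, here is a minimal sketch of a query of that shape: a seed query, UNION ALL, and a recursive term that references the virtual table once. It runs against SQLite through Python's sqlite3 module as a stand-in for SQL Anywhere; the miniature Employees table, the Chain name and the management_chain helper are all hypothetical and are not taken from demo.db.

```python
import sqlite3

# Hypothetical miniature hierarchy (NOT demo.db): (EmployeeID, ManagerID).
# Top-level managers refer to themselves, as in the article's example.
ROWS = [(10, 10), (20, 10), (30, 20), (31, 20),
        (40, 30), (41, 30), (50, 50), (51, 50)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, ManagerID INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO Employees VALUES (?, ?)", ROWS)

CHAIN_SQL = """
WITH RECURSIVE Chain (EmployeeID, ManagerID) AS (
    -- seed query: the employee we start from
    SELECT EmployeeID, ManagerID FROM Employees WHERE EmployeeID = ?
    UNION ALL
    -- recursive term: step from the newest row to that employee's manager;
    -- a self-managing row produces nothing further, so a fixpoint is reached
    SELECT e.EmployeeID, e.ManagerID
    FROM Employees e JOIN Chain c ON e.EmployeeID = c.ManagerID
    WHERE c.ManagerID <> c.EmployeeID
)
SELECT EmployeeID FROM Chain
"""


def management_chain(employee_id):
    """Return the employee plus every manager above them (hypothetical helper)."""
    return [row[0] for row in conn.execute(CHAIN_SQL, (employee_id,))]
```

Calling management_chain(40) on this data returns employee 40 together with managers 30, 20 and 10; the recursion stops at 10 because that row manages itself.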
Recursive queries are restricted as follows:
- In Standard SQL, recursive query expressions are restricted to UNION ALL; moreover, each virtual recursive table can be referenced at most once in the definition of that table.
- ANSI-standard SQL permits only monotonic progressions; that is, as the recursion continues the (virtual) recursive table is only permitted to grow in size.
- A consequence of monotonicity is that rows from the (virtual) recursive table cannot be removed (via SELECT DISTINCT, for example), nor can already-present rows in the (virtual) recursive table be modified on-the-fly.
The SQL Standard includes additional syntax to (1) specify whether or not traversal through the search order is DEPTH FIRST or BREADTH FIRST, and (2) specify a limit on recursive computation to prevent runaway queries. This additional syntax from the SQL Standard is as follows:
<search or cycle clause> ::= <search clause> | <cycle clause> | <search clause> <cycle clause>
<search clause> ::= SEARCH <recursive search order> SET <sequence column>
<recursive search order> ::= DEPTH FIRST BY <column name list> | BREADTH FIRST BY <column name list>
<sequence column> ::= <column name>
<cycle clause> ::= CYCLE <cycle column list> SET <cycle mark column> TO <cycle mark value> DEFAULT <non-cycle mark value> USING <path column>
<cycle column list> ::= <cycle column> [ { <comma> <cycle column> }... ]
<cycle column> ::= <column name>
<cycle mark column> ::= <column name>
<path column> ::= <column name>
<cycle mark value> ::= <value expression>
<non-cycle mark value> ::= <value expression>
Neither the SEARCH nor the CYCLE clause is supported by SQL Anywhere, though they may be in future releases.
Finally, note that recursive queries permit a non-procedural computation of a bill-of-materials problem. Nothing prevents one from conjuring a recursive result set using, for example, a stored procedure that loops, adding rows to a result set until the result set no longer grows. In fact, that is precisely how earlier versions of Semmle's .QL language handled recursive queries with a Microsoft SQL Server database on the backend. However, computing a result through a recursive UNION query may be both simpler and more efficient.
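The procedural alternative can be sketched outside the database in a few lines. This is only an illustration of the "loop until the result stops growing" idea applied to the manager-count problem, not a stored procedure and not how any product implements it; the EMPLOYEES mapping is a hypothetical miniature hierarchy in which top-level managers map to themselves.

```python
from collections import Counter

# Hypothetical miniature hierarchy (NOT demo.db): EmployeeID -> ManagerID.
EMPLOYEES = {10: 10, 20: 10, 30: 20, 31: 20, 40: 30, 41: 30, 50: 50, 51: 50}


def totals_by_manager(employees):
    """Accumulate rows pass by pass until a pass adds nothing, then sum them."""
    # Seed: one (manager, count) row per immediate manager.
    direct = Counter(employees.values())
    result = list(direct.items())
    newest = list(result)
    while newest:  # an empty pass means the fixpoint has been reached
        # Push each count from the previous pass one level up the hierarchy,
        # skipping managers who manage themselves.
        newest = [(employees[mgr], n) for mgr, n in newest if employees[mgr] != mgr]
        result.extend(newest)  # rows are only ever added, never removed
    totals = Counter()
    for mgr, n in result:
        totals[mgr] += n
    return dict(totals)
```

On this data totals_by_manager(EMPLOYEES) gives 6 for manager 10, 4 for manager 20, and 2 each for managers 30 and 50. Duplicate rows are kept on purpose, mirroring UNION ALL: manager 10 receives the same count of 2 from two different levels below.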
Back to the example
It should be clear, then, that given the restrictions on recursion (monotonicity and non-negation) imposed by the ANSI SQL model, generating the intended result (the aggregated employee count) with a single recursive query per se cannot be accomplished directly. What we need to do is to construct an intermediate result using recursion that, once constructed, can be used as input to another query that can generate the required result. In the description of ANSI recursive SQL queries above, I mentioned the need to develop a seed query, and then conjure a recursive query expression that UNIONs its rows with the (virtual) recursive intermediate result. In our case, the seed query is the simple GROUP BY query illustrated above that computes employee counts for immediate managers. But now what? What we can do is generate additional intermediate result rows that reflect the management hierarchy: hence the need for the two attributes (EmployeeID and ManagerID) in the virtual recursive result. Starting with those immediate managers, we can add additional rows that contain the same employee counts, but for the next manager in the management chain. Here's the complete query and the generated result set:
Here are some specific points about this intermediate result:
- The table expression FROM Employees e JOIN EmpsByManager em ON (e.EmployeeID = em.ManagerID) joins the Employees table with the (virtual) table of managers to produce the additional rows that document those employees that work for more senior managers.
- The restriction condition WHERE e.ManagerID <> e.EmployeeID prevents the re-inclusion of rows where managers manage themselves. In the demo database, managers that manage themselves are identified using a foreign key value (ManagerID) to the same row. As a counterexample, if rows representing top-level managers used NULL values, then this additional clause would not be required.
Now that this intermediate result has been computed, producing the final result desired is straightforward: we simply replace the simple SELECT * FROM EmpsByManager with:
SELECT ManagerID, SUM(TotalEmps) FROM EmpsByManager GROUP BY ManagerID
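For readers who want to run the whole thing end to end, the following sketch assembles a query from the fragments quoted above (the join, the restriction condition and the final GROUP BY) and executes it with SQLite via Python's sqlite3 module. Several things here are assumptions rather than the original listing: the column list of EmpsByManager, the exact seed query, the use of SQLite instead of SQL Anywhere, and the miniature hierarchy, which is hypothetical and much smaller than demo.db.

```python
import sqlite3

# Hypothetical miniature hierarchy (NOT demo.db): (EmployeeID, ManagerID).
ROWS = [(10, 10), (20, 10), (30, 20), (31, 20),
        (40, 30), (41, 30), (50, 50), (51, 50)]

# Assumed reconstruction; only the JOIN, the WHERE clause and the final
# SELECT are quoted in the text.
EMPS_BY_MANAGER = """
WITH RECURSIVE EmpsByManager (EmployeeID, ManagerID, TotalEmps) AS (
    -- seed: employee counts for immediate managers
    SELECT ManagerID, ManagerID, COUNT(*) FROM Employees GROUP BY ManagerID
    UNION ALL
    -- recursive term: credit the same count to the next manager up the chain
    SELECT e.EmployeeID, e.ManagerID, em.TotalEmps
    FROM Employees e JOIN EmpsByManager em ON (e.EmployeeID = em.ManagerID)
    WHERE e.ManagerID <> e.EmployeeID
)
SELECT ManagerID, SUM(TotalEmps) FROM EmpsByManager GROUP BY ManagerID
"""


def totals_by_manager(rows):
    """Load rows into an in-memory table and return {ManagerID: total}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, ManagerID INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO Employees VALUES (?, ?)", rows)
    totals = dict(conn.execute(EMPS_BY_MANAGER).fetchall())
    conn.close()
    return totals
```

With the eight rows above, totals_by_manager(ROWS) reports 6 employees under manager 10, 4 under 20, and 2 each under 30 and 50. As in the article's result, self-managing managers are counted among their own employees.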
SQL Anywhere considerations
With SQL Anywhere there are three connection options pertinent to recursive queries that deserve attention:
- While SQL Anywhere does not support the CYCLE clause, SQL Anywhere provides a connection option, MAX_RECURSIVE_ITERATIONS, that will halt runaway recursive queries with a SQL error if the number of recursive iterations is exceeded; its default value is 100. If you think about it, the number of recursions required in a particular query should be equivalent to the depth of the recursive hierarchy; in most situations this is almost certainly going to be considerably less than 100. Setting this connection option to a smaller value will prevent extraordinary resource consumption of ill-conceived, runaway queries.
- SQL Anywhere provides two additional connection options, MAX_TEMP_SPACE and TEMP_SPACE_LIMIT_CHECK, that can greatly assist in avoiding catastrophic server failure in cases of erroneous recursive queries. The former option prevents the server from utilizing an unlimited amount of space for the TEMP file; the latter option, TEMP_SPACE_LIMIT_CHECK, is a parameter to the server governor that limits temporary space usage on a connection-by-connection basis.
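To see why an iteration cap matters, here is a small model of the idea in Python. It is not SQL Anywhere code and makes no claim about how the server counts iterations: recursive_union_all, walk_up, RecursionLimitExceeded and both data sets are hypothetical names invented for this sketch, and only the default of 100 is borrowed from the option described above.

```python
class RecursionLimitExceeded(RuntimeError):
    """Raised by this model when the iteration cap is hit."""


def recursive_union_all(seed, step, max_iterations=100):
    """Apply step() to the newest rows until it yields nothing (the fixpoint)."""
    result = list(seed)
    newest = list(seed)
    iterations = 0
    while newest:
        if iterations >= max_iterations:
            # A cyclic hierarchy never reaches a fixpoint, so fail loudly
            # instead of consuming memory and temporary space without bound.
            raise RecursionLimitExceeded(
                f"stopped after {iterations} recursive iterations"
            )
        newest = step(newest)
        result.extend(newest)
        iterations += 1
    return result


def walk_up(manager_of, start, max_iterations=100):
    """Collect an employee and every manager above them."""
    def step(rows):
        # Self-managing rows end the chain, as in the article's example.
        return [manager_of[r] for r in rows if manager_of[r] != r]
    return recursive_union_all([start], step, max_iterations)


# Hypothetical data: a well-formed chain, and a broken one where 1 and 2
# manage each other and nobody manages themselves.
ACYCLIC = {10: 10, 20: 10, 30: 20}
CYCLIC = {1: 2, 2: 1}
```

walk_up(ACYCLIC, 30) finishes after a handful of passes, roughly the depth of the hierarchy, whereas walk_up(CYCLIC, 1, max_iterations=5) raises RecursionLimitExceeded. A cap set just above the expected depth catches the broken data early.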
SQL Anywhere computes recursive queries using two specialized query execution operators: Recursive Hash [Outer] Join and Recursive Nested-Loop [Outer] Join. The graphical plan for the recursive query above is as follows:
In this plan, the "seed" query is on the left-hand-side of the recursive UNION ALL operator (RU). RT signifies the (virtual) recursive table that contains the recursive intermediate result, "EmpsByManager". JHR signifies the recursive hash join operator. Like all other memory-intensive operators in the SQL Anywhere server, JHR has a low-memory alternative strategy (recursive [outer] nested-loop join) that can be chosen if either memory becomes scarce at query execution time, or the JHR operator determines that the cardinality of the inputs for this join instance warrants a cheaper, nested-loop execution strategy. This change in join execution strategy is performed on-the-fly.
Figuring it out
When you're struggling with conjuring a recursive SQL query, a useful technique is to hand-code the first iteration of recursion, and from there determine what attributes and result rows need to be added to the (virtual) recursive intermediate result. In our example above, the first iteration is as follows:
Here, we can see from the three rows that join that we can UNION additional rows containing row counts previously generated by the seed query (for immediate managers) with the senior managers as identified by the join with the Employees table. Additional examples of recursive queries can be found in the SQL Anywhere server documentation or online on the Sybase DocCommentXchange.
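As a rough illustration of the hand-coding technique, the sketch below writes the first iteration as an ordinary, non-recursive join between the seed query and the Employees table. It uses SQLite through Python's sqlite3 module on a hypothetical miniature hierarchy, so the rows it returns are not the three rows from demo.db; the seed alias, the SeniorManagerID column name and the first_iteration helper are likewise made up for the example.

```python
import sqlite3

# Hypothetical miniature hierarchy (NOT demo.db): (EmployeeID, ManagerID).
ROWS = [(10, 10), (20, 10), (30, 20), (31, 20),
        (40, 30), (41, 30), (50, 50), (51, 50)]

# First iteration by hand: the seed query appears as a derived table in the
# place where the recursive query would reference the virtual table.
FIRST_ITERATION = """
SELECT e.ManagerID AS SeniorManagerID, seed.ManagerID, seed.TotalEmps
FROM Employees e
JOIN (SELECT ManagerID, COUNT(*) AS TotalEmps
      FROM Employees
      GROUP BY ManagerID) seed
  ON (e.EmployeeID = seed.ManagerID)
WHERE e.ManagerID <> e.EmployeeID
ORDER BY seed.ManagerID
"""


def first_iteration(rows):
    """Return (senior manager, immediate manager, count) rows for one pass."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, ManagerID INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO Employees VALUES (?, ?)", rows)
    result = conn.execute(FIRST_ITERATION).fetchall()
    conn.close()
    return result
```

On this data the join yields two rows: manager 20's count of 2 credited to senior manager 10, and manager 30's count of 2 credited to senior manager 20. Seeing which columns those rows need is what tells you the shape of the recursive term.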